AnhHoang0529/archive_smart_search

Archive System Metadata Structure

This document outlines the structure of the metadata used in our archive system. Each document in the archive is enriched with AI-generated metadata that enhances search and retrieval functionalities. The metadata provides both physical and contextual attributes for various types of files, including text documents, images, videos, and audio files.

Usage

This metadata structure is used by our archive system to enrich the files stored in the database. It allows for more precise search queries, better retrieval of relevant documents, and a more detailed overview of the stored files.

By using AI to generate detailed metadata, the system helps users find what they are looking for across different types of media, including text, images, videos, and audio files.

The table below shows the number of items in the dataset for each topic and media type (audio, document, image, video).

| Topic | Audio | Document | Image | Video |
| --- | --- | --- | --- | --- |
| Animals in the wild | 8 | 6 | 10 | 8 |
| Celebrities | 8 | 4 | 9 | 10 |
| Cultural celebrations and traditions | 7 | 10 | 12 | 9 |
| Gourmet dishes and culinary arts | 7 | 7 | 14 | 8 |
| Majestic landscapes | 10 | 6 | 14 | 12 |
| Political events | 7 | 9 | 10 | 7 |
| Scene of movies | 9 | 6 | 10 | 10 |
| Sport events | 11 | 7 | 14 | 10 |
| Study and work | 7 | 8 | 12 | 11 |
| Urban cityscapes | 12 | 9 | 15 | 11 |
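
As a quick check on the dataset size, the per-type totals can be computed from the counts above. This is a minimal sketch; the names `COUNTS`, `MEDIA_TYPES`, `totals`, and `grand_total` are illustrative and not part of the archive system.

```python
# Item counts per topic, in the order (audio, document, image, video),
# copied from the table above.
MEDIA_TYPES = ("audio", "document", "image", "video")
COUNTS = {
    "Animals in the wild": (8, 6, 10, 8),
    "Celebrities": (8, 4, 9, 10),
    "Cultural celebrations and traditions": (7, 10, 12, 9),
    "Gourmet dishes and culinary arts": (7, 7, 14, 8),
    "Majestic landscapes": (10, 6, 14, 12),
    "Political events": (7, 9, 10, 7),
    "Scene of movies": (9, 6, 10, 10),
    "Sport events": (11, 7, 14, 10),
    "Study and work": (7, 8, 12, 11),
    "Urban cityscapes": (12, 9, 15, 11),
}

# Sum each column to get the number of items per media type.
totals = {
    media: sum(row[i] for row in COUNTS.values())
    for i, media in enumerate(MEDIA_TYPES)
}
grand_total = sum(totals.values())

print(totals)       # {'audio': 86, 'document': 72, 'image': 120, 'video': 96}
print(grand_total)  # 374
```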

Metadata Attributes

The following table lists the attributes used to describe documents in the archive system, along with descriptions and examples:

| Attribute | Description | Example | Metadata Type |
| --- | --- | --- | --- |
| id | A unique identifier for the document. | 58677 | Physical metadata |
| md5 | The MD5 hash of the document, used for ensuring data integrity. | "460d9e165fc61630fd62a" | Physical metadata |
| extension | The file extension of the document, indicating the document type (e.g., docx, pdf, txt for documents; jpeg, mp4 for images/videos). | "pdf" | Physical metadata |
| file_path | The file path where the document is stored. | "/workspace/custom_llm_exp/database/data/ai_data/audio_1715782411.json" | Physical metadata |
| creation_date | File creation date in YYYY-MM-DD format. | "2024-08-28" | Physical metadata |
| last_modified_date | Last modified date in YYYY-MM-DD format. | "2024-08-30" | Physical metadata |
| size | The size of the document in bytes. | 119436 | Physical metadata |
| height | The height of the document in pixels (applicable for images or videos; set to 0 for other types). | 1184 (for images/videos), 0 (for others) | Physical metadata |
| width | The width of the document in pixels (applicable for images or videos; set to 0 for other types). | 800 (for images/videos), 0 (for others) | Physical metadata |
| duration | The duration of the document in seconds (applicable for audio or video files; set to 0 for other types). | 120 (for audio/video), 0 (for others) | Physical metadata |
| density | The density of the document (e.g., dots per inch for images; set to 0 for other types). | 300 (for images), 0 (for others) | Physical metadata |
| channels | The number of channels (e.g., color channels in images or audio channels; set to 0 for documents). | 3 (for images), 0 (for documents) | Physical metadata |
| displayRotate | The display rotation of the document in degrees (applicable for images/videos; set to 0 for other types). | 90 (for images/videos), 0 (for others) | Physical metadata |
| originalName | The original name of the document. | "research_paper_computing.pdf" | Custom metadata |
| category | The category of the document, including the ID and title of the folder where it is stored. | { "id": 42, "title": "Research Papers" } | Custom metadata |
| desc | A brief description of the document content. | "A comprehensive research paper on quantum computing." | AI metadata |
| textData | The content of the document. It should be detailed and relevant to the topic, with a minimum of 300 words (only applicable for documents). | "This document discusses advancements in artificial intelligence and machine learning, focusing on..." | AI metadata |
| stt | A transcript of spoken text (if applicable), including timestamps and speaker information. | List of SttData objects (for audio/video) | AI metadata |
| narrationStt | An object containing structured data for narration transcripts (applicable for audio/video; set to None for other types). | { "sttData": [...], "id": 12345 } (for audio/video), None (for others) | AI metadata |
| people | A list of people mentioned in the document metadata, with detailed information. | { "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" } | AI metadata |
| organizations | A list of organizations mentioned in the document metadata, with detailed information. | { "id": 1, "name": "OpenAI" } | AI metadata |
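
The "set to 0 for other types" rules above can be checked mechanically. The following is a minimal sketch, not the archive system's own validation: the `MUST_BE_ZERO` mapping, the `zero_violations` helper, and the media kind labels ("document", "image", "video", "audio") are assumptions introduced here for illustration. The table does not say what `channels` should be for videos, so the sketch only enforces 0 for documents.

```python
# For each physical attribute, the media kinds for which the table
# says the value must be 0. The kind labels are hypothetical.
MUST_BE_ZERO = {
    "height": {"document", "audio"},
    "width": {"document", "audio"},
    "duration": {"document", "image"},
    "density": {"document", "audio", "video"},
    "channels": {"document"},  # table only specifies documents
    "displayRotate": {"document", "audio"},
}


def zero_violations(record: dict, kind: str) -> list[str]:
    """Return the attributes that should be 0 for this kind but are not."""
    return [
        name
        for name, kinds in MUST_BE_ZERO.items()
        if kind in kinds and record.get(name, 0) != 0
    ]


pdf_record = {"extension": "pdf", "height": 0, "width": 0, "duration": 0,
              "density": 0, "channels": 0, "displayRotate": 0}
image_record = {"extension": "jpeg", "height": 1184, "width": 800,
                "duration": 0, "density": 300, "channels": 3,
                "displayRotate": 90}

print(zero_violations(pdf_record, "document"))   # []
print(zero_violations(image_record, "image"))    # []
print(zero_violations(image_record, "document"))
# ['height', 'width', 'density', 'channels', 'displayRotate']
```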

Metadata Example

Here is an example JSON object representing the metadata for a sample document:

{
  "id": 58677,
  "md5": "460d9e165fc61630fd62a",
  "extension": "pdf",
  "size": 119436,
  "height": 0,
  "width": 0,
  "duration": 0,
  "density": 0,
  "channels": 0,
  "displayRotate": 0,
  "originalName": "research_paper_computing.pdf",
  "desc": "A comprehensive research paper on quantum computing.",
  "textData": "This document discusses advancements in artificial intelligence and machine learning, focusing on...",
  "stt": null,
  "narrationStt": null,
  "category": { "id": 42, "title": "Research Papers" },
  "people": [{ "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" }],
  "organizations": [{ "id": 1, "name": "OpenAI" }]
}
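
To illustrate how such a record could support search, here is a minimal sketch of keyword matching over a few of the text-bearing fields. It is an assumption-laden example rather than the archive system's actual search: the `searchable_text` and `matches` helpers are hypothetical, and `SAMPLE` is a trimmed copy of the record above.

```python
import json

# A trimmed copy of the example record; only fields used for matching.
SAMPLE = """
{
  "id": 58677,
  "extension": "pdf",
  "originalName": "research_paper_computing.pdf",
  "desc": "A comprehensive research paper on quantum computing.",
  "category": { "id": 42, "title": "Research Papers" },
  "people": [{ "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" }],
  "organizations": [{ "id": 1, "name": "OpenAI" }]
}
"""


def searchable_text(record: dict) -> str:
    """Join the text-bearing metadata fields into one lowercase string."""
    parts = [record.get("originalName", ""), record.get("desc", "")]
    parts.append((record.get("category") or {}).get("title", ""))
    # "people" and "organizations" may be missing or null.
    parts += [p.get("name", "") for p in record.get("people") or []]
    parts += [o.get("name", "") for o in record.get("organizations") or []]
    return " ".join(parts).lower()


def matches(record: dict, query: str) -> bool:
    """True if every whitespace-separated query term occurs in the record."""
    text = searchable_text(record)
    return all(term in text for term in query.lower().split())


record = json.loads(SAMPLE)
print(matches(record, "quantum openai"))  # True
print(matches(record, "john doe"))        # True
print(matches(record, "landscape"))       # False
```

Plain substring matching is used here only to keep the sketch short; a real deployment would more likely rely on a search index or embedding-based retrieval.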
