Archive System Metadata Structure

This document outlines the structure of the metadata used in our archive system. Each document in the archive is enriched with AI-generated metadata that improves search and retrieval. The metadata covers both physical and contextual attributes of text documents, images, videos, and audio files.

Usage

The archive system uses this metadata structure to enrich the files stored in its database. This allows more precise search queries, better retrieval of relevant documents, and a more detailed overview of the stored files.

Because the metadata is generated by AI and describes the content of each file, users can find what they are looking for across different types of media: text, images, videos, and audio files.

The table below breaks down the dataset by topic and media type (audio, document, image, video). Each value is the number of items available in the dataset for that topic and type.

Topic	Audio	Document	Image	Video
Animals in the wild	8	6	10	8
Celebrities	8	4	9	10
Cultural celebrations and traditions	7	10	12	9
Gourmet dishes and culinary arts	7	7	14	8
Majestic landscapes	10	6	14	12
Political events	7	9	10	7
Scene of movies	9	6	10	10
Sport events	11	7	14	10
Study and work	7	8	12	11
Urban cityscapes	12	9	15	11
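
As a quick sanity check on the table above, the per-type totals can be computed with a few lines of Python. This is a minimal sketch: the `TABLE` string simply copies the rows above as tab-separated text, and `totals_by_media_type` is a hypothetical helper name, not part of the archive system.

```python
import csv
import io

# The topic table from this README, copied as tab-separated text.
TABLE = """Topic\tAudio\tDocument\tImage\tVideo
Animals in the wild\t8\t6\t10\t8
Celebrities\t8\t4\t9\t10
Cultural celebrations and traditions\t7\t10\t12\t9
Gourmet dishes and culinary arts\t7\t7\t14\t8
Majestic landscapes\t10\t6\t14\t12
Political events\t7\t9\t10\t7
Scene of movies\t9\t6\t10\t10
Sport events\t11\t7\t14\t10
Study and work\t7\t8\t12\t11
Urban cityscapes\t12\t9\t15\t11
"""


def totals_by_media_type(table_text):
    """Sum every numeric column of the tab-separated table, keyed by header."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    totals = {}
    for row in reader:
        for column, value in row.items():
            if column == "Topic":
                continue  # the first column holds the topic name, not a count
            totals[column] = totals.get(column, 0) + int(value)
    return totals


if __name__ == "__main__":
    totals = totals_by_media_type(TABLE)
    print(totals)
    print("All items:", sum(totals.values()))
```

Summing the columns this way gives 86 audio, 72 document, 120 image, and 96 video items, 374 in total.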

Metadata Attributes

The following table lists the attributes used to describe documents in the archive system, along with a description, an example, and the metadata type of each:

Attribute	Description	Example	Metadata Type
`id`	A unique identifier for the document.	58677	Physical metadata
`md5`	The MD5 hash of the document, used for ensuring data integrity.	"460d9e165fc61630fd62a"
`extension`	The file extension of the document, indicating the document type (e.g., docx, pdf, txt for documents; jpeg, mp4 for images/videos).	"pdf"
`file_path`	The file path where the document is stored.	"/workspace/custom_llm_exp/database/data/ai_data/audio_1715782411.json"
`creation_date`	The file creation date in YYYY-MM-DD format.	"2024-08-28"
`last_modified_date`	The last modified date in YYYY-MM-DD format.	"2024-08-30"
`size`	The size of the document in bytes.	119436
`height`	The height of the document in pixels (applicable for images or videos; set to 0 for other types).	1184 (for images/videos), 0 (for others)
`width`	The width of the document in pixels (applicable for images or videos; set to 0 for other types).	800 (for images/videos), 0 (for others)
`duration`	The duration of the document in seconds (applicable for audio or video files; set to 0 for other types).	120 (for audio/video), 0 (for others)
`density`	The density of the document (e.g., dots per inch for images; set to 0 for other types).	300 (for images), 0 (for others)
`channels`	The number of channels (e.g., color channels in images or audio channels; set to 0 for documents).	3 (for images), 0 (for documents)
`displayRotate`	The display rotation of the document (degrees of rotation, applicable for images/videos; set to 0 for other types).	90 (for images/videos), 0 (for others)
`originalName`	The original name of the document.	"research_paper_computing.pdf"	Custom metadata
`category`	The category of the document, including folder ID and title where it is stored.	{ "id": 42, "title": "Research Papers" }	Custom metadata
`desc`	A brief description of the document content.	"A comprehensive research paper on quantum computing."	AI metadata
`textData`	The content of the document. It should be detailed and relevant to the topic, with a minimum of 300 words (only applicable for documents).	"This document discusses advancements in artificial intelligence and machine learning, focusing on..."
`stt`	A transcript of spoken text (if applicable), including timestamps and speaker information.	List of `SttData` objects (for audio/video), null (for others)
`narrationStt`	An object containing structured data for narration transcripts (applicable for audio/video; set to null for other types).	{ "sttData": [...], "id": 12345 } (for audio/video), null (for others)
`people`	A list of people mentioned in the document metadata, with detailed information.	[{ "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" }]
`organizations`	A list of organizations mentioned in the document metadata, with detailed information.	[{ "id": 1, "name": "OpenAI" }]
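
The "set to 0 for other types" rules in the table can be expressed as a small lookup. The sketch below is illustrative only: the media type names (`audio`, `document`, `image`, `video`) are borrowed from the dataset table, `APPLIES_TO` and `normalize_physical` are hypothetical names, and treating `channels` as applicable to video is an assumption, since the table only says it is 0 for documents.

```python
# Media types each physical attribute applies to, read off the attribute table.
# Assumption: "channels" also applies to video; the table only rules out documents.
APPLIES_TO = {
    "height": {"image", "video"},
    "width": {"image", "video"},
    "duration": {"audio", "video"},
    "density": {"image"},
    "channels": {"image", "audio", "video"},
    "displayRotate": {"image", "video"},
}


def normalize_physical(record, media_type):
    """Return a copy of record with inapplicable physical attributes set to 0."""
    result = dict(record)
    for field, media_types in APPLIES_TO.items():
        if media_type not in media_types:
            result[field] = 0
    return result


if __name__ == "__main__":
    raw = {"height": 1184, "width": 800, "duration": 120,
           "density": 300, "channels": 3, "displayRotate": 90}
    # For a document, every physical attribute in APPLIES_TO becomes 0.
    print(normalize_physical(raw, "document"))
    # For an image, only duration is reset.
    print(normalize_physical(raw, "image"))
```

Keeping the rules in one dictionary means a new attribute or media type only needs a change in a single place.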

Metadata Example

Here is an example JSON object representing the metadata for a sample document:

{
  "id": 58677,
  "md5": "460d9e165fc61630fd62a",
  "extension": "pdf",
  "size": 119436,
  "height": 0,
  "width": 0,
  "duration": 0,
  "density": 0,
  "channels": 0,
  "displayRotate": 0,
  "originalName": "research_paper_computing.pdf",
  "desc": "A comprehensive research paper on quantum computing.",
  "textData": "This document discusses advancements in artificial intelligence and machine learning, focusing on...",
  "stt": null,
  "narrationStt": null,
  "category": { "id": 42, "title": "Research Papers" },
  "people": [{ "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" }],
  "organizations": [{ "id": 1, "name": "OpenAI" }]
}
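
A record like this can be checked with Python's standard `json` module before it is stored. The following is a minimal sketch, not the archive system's own validation: `EXPECTED_TYPES` and `find_problems` are hypothetical names, the list of checked keys is an assumption (the README does not say which keys are mandatory), and `SAMPLE` is a shortened copy of the example record.

```python
import json

# Expected Python type per key. Which keys are mandatory is an assumption.
EXPECTED_TYPES = {
    "id": int,
    "md5": str,
    "extension": str,
    "size": int,
    "height": int,
    "width": int,
    "duration": (int, float),
    "density": (int, float),
    "channels": int,
    "displayRotate": int,
    "originalName": str,
    "desc": str,
    "category": dict,
    "people": list,
    "organizations": list,
}

# Shortened copy of the example record.
SAMPLE = """
{
  "id": 58677,
  "md5": "460d9e165fc61630fd62a",
  "extension": "pdf",
  "size": 119436,
  "height": 0, "width": 0, "duration": 0,
  "density": 0, "channels": 0, "displayRotate": 0,
  "originalName": "research_paper_computing.pdf",
  "desc": "A comprehensive research paper on quantum computing.",
  "stt": null,
  "narrationStt": null,
  "category": { "id": 42, "title": "Research Papers" },
  "people": [{ "id": 1, "name": "John Doe", "dateOfBirth": "1990-01-01" }],
  "organizations": [{ "id": 1, "name": "OpenAI" }]
}
"""


def find_problems(raw_json):
    """Parse a metadata record and return a list of human-readable problems."""
    record = json.loads(raw_json)
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"wrong type for {key}: {type(record[key]).__name__}")
    return problems


if __name__ == "__main__":
    print(find_problems(SAMPLE))             # no problems expected for the sample
    print(find_problems('{"id": "58677"}'))  # string id plus many missing keys
```

The function returns a list of problems instead of raising on the first one, so a single pass reports everything that needs fixing in a record.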

AnhHoang0529/archive_smart_search

Folders and files

Latest commit

History

Repository files navigation

Archive System Metadata Structure

Usage

Metadata Attributes

Metadata Example

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages