diff --git a/open-source/ingestion/supported-file-types.mdx b/open-source/ingestion/supported-file-types.mdx
index 78f10082..18cbcd6e 100644
--- a/open-source/ingestion/supported-file-types.mdx
+++ b/open-source/ingestion/supported-file-types.mdx
@@ -4,6 +4,6 @@ title: Supported file types
 The Unstructured Ingest CLI and Unstructured Ingest Python library support processing of the following file types:
-import SupportedFileTypesPlatform from '/snippets/general-shared-text/supported-file-types-platform.mdx';
+import SupportedFileTypesPlatform from '/snippets/general-shared-text/supported-file-types-open-source.mdx';
\ No newline at end of file
diff --git a/snippets/general-shared-text/platform-partitioning-strategies.mdx b/snippets/general-shared-text/platform-partitioning-strategies.mdx
index 8aeca25a..2a3ae5a0 100644
--- a/snippets/general-shared-text/platform-partitioning-strategies.mdx
+++ b/snippets/general-shared-text/platform-partitioning-strategies.mdx
@@ -7,13 +7,14 @@ strategies other than **Auto** for sets of documents of different types could pr
 including reduction in transformation quality.
 - **VLM**: For the highest-quality transformation of these file types: `.bmp`, `.gif`, `.heic`, `.jpeg`, `.jpg`, `.pdf`, `.png`, `.tiff`, and `.webp`.
-- **High Res**: For all other [supported file types](/ui/supported-file-types), and for the generation of bounding box coordinates.
+- **High Res**: For all other [supported file types](/ui/supported-file-types) except video and audio files, and for the generation of bounding box coordinates.
 - **Fast**: For text-only documents.
+- **Multimedia**: For video and audio files.
-The **Auto** partitioning strategy routes each file as a complete unit to the appropriate partitioning strategy (**VLM**, **High Res**, or **Fast**)
+The **Auto** partitioning strategy routes each file as a complete unit to the appropriate partitioning strategy (**VLM**, **High Res**, **Fast**, or **Multimedia**)
 based on the preceding file types.
 Additionally, for `.pdf` files, the **Auto** partitioning strategy routes these files' pages on a page-by-page basis, as follows:
 - A page is routed to **Fast** when it contains only embedded text and no images or tables are detected.
 - All other kinds of pages are routed to **VLM** or **High Res**, depending on the complexity of a page's
- content. Unstructured constantly optimizes its proprietary algorithm for routing to **VLM** or **High Res** in these cases.
\ No newline at end of file
+ content. Unstructured constantly optimizes its proprietary algorithm for routing to **VLM** or **High Res** in these cases.
diff --git a/snippets/general-shared-text/supported-file-types-platform.mdx b/snippets/general-shared-text/supported-file-types-platform.mdx
index d466082a..af124bb2 100644
--- a/snippets/general-shared-text/supported-file-types-platform.mdx
+++ b/snippets/general-shared-text/supported-file-types-platform.mdx
@@ -2,6 +2,8 @@ By file extension:
 | File extension |
 | --- |
+| `.3gp` |
+| `.aac` |
 | `.abw` |
 | `.bmp` |
 | `.csv` |
@@ -16,21 +18,32 @@ By file extension:
 | `.epub` |
 | `.et` |
 | `.eth` |
+| `.flac` |
+| `.flv` |
 | `.fods` |
 | `.heic` |
-| `.htm` |
+| `.htm` |
 | `.html` |
 | `.hwp` |
 | `.jpeg` |
 | `.jpg` |
+| `.m4a` |
 | `.md` |
 | `.mcw` |
+| `.mov` |
+| `.mp3` |
+| `.mp4` |
+| `.mpeg` |
+| `.mpg` |
 | `.msg` |
 | `.mw` |
 | `.odt` |
+| `.ogg` |
+| `.opus` |
 | `.org` |
 | `.p7s` |
 | `.pbd` |
+| `.pcm` |
 | `.pdf` |
 | `.png` |
 | `.pot` |
@@ -45,6 +58,9 @@ By file extension:
 | `.tiff` |
 | `.txt` |
 | `.tsv` |
+| `.wav` |
+| `.webm` |
+| `.wmv` |
 | `.xls` |
 | `.xlsx` |
 | `.xml` |
@@ -54,6 +70,7 @@ By file type:
 | Category | File types |
 | --- | --- |
+| Audio | `.aac`, `.flac`, `.m4a`, `.mp3`, `.mp4`, `.ogg`, `.opus`, `.pcm`, `.wav`, `.webm` |
 | Apple | `.cwk`, `.mcw` | CSV | `.csv` |
 | Data Interchange | `.dif`* |
@@ -74,6 +91,7 @@ By file type:
 | Spreadsheet | `.et`, `.fods`, `.mw`, `.xls`, `.xlsx` |
 | StarOffice | `.sxg` |
 | TSV | `.tsv` |
+| Video | `.3gp`, `.flv`, `.mov`, `.mp4`, `.mpeg`, `.mpg`, `.webm`, `.wmv` |
 | Word processing | `.abw`, `.doc`, `.docx`, `.dot`, `.dotm`, `.hwp`, `.zabw` |
 | XML | `.xml` |
diff --git a/ui/document-elements.mdx b/ui/document-elements.mdx
index 44d93dbe..9331c96b 100644
--- a/ui/document-elements.mdx
+++ b/ui/document-elements.mdx
@@ -52,23 +52,29 @@ of the file and not care about its headers and footers. You can easily filter ou
 Here are some examples of the element types your file might contain:
 | Element type | Description |
-|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `Address` | A text element for capturing physical addresses. |
-| `CodeSnippet` | A text element for capturing code snippets. |
-| `EmailAddress` | A text element for capturing email addresses. |
-| `FigureCaption` | An element for capturing text associated with figure captions. |
-| `Footer` | An element for capturing document footers. |
-| `FormKeysValues` | An element for capturing key-value pairs in a form. |
-| `Formula` | An element containing formulas in a file. |
-| `Header` | An element for capturing document headers. |
-| `Image` | A text element for capturing image metadata. |
-| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
-| `NarrativeText` | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions. |
-| `PageBreak` | An element for capturing page breaks. |
-| `PageNumber` | An element for capturing page numbers. |
-| `Table` | An element for capturing tables. |
-| `Title` | A text element for capturing titles. |
-| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |
+|--------------------- |------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `Address` | A text element for capturing physical addresses. |
+| `CodeSnippet` | A text element for capturing code snippets. |
+| `EmailAddress` | A text element for capturing email addresses. |
+| `FigureCaption` | An element for capturing text associated with figure captions. |
+| `Footer` | An element for capturing document footers. |
+| `FormKeysValues` | An element for capturing key-value pairs in a form. |
+| `Formula` | An element containing formulas in a file. |
+| `Header` | An element for capturing document headers. |
+| `Image` | A text element for capturing image metadata. |
+| `ListItem` | `ListItem` is a `NarrativeText` element that is part of a list. |
+| `NarrativeText` | `NarrativeText` is an element consisting of multiple, well-formulated sentences. This excludes elements such titles, headers, footers, and captions. |
+| `PageBreak` | An element for capturing page breaks. |
+| `PageNumber` | An element for capturing page numbers. |
+| `SceneDescription` | An element for capturing scene descriptions, for example a description of a scene in a video. |
+| `Table` | An element for capturing tables. |
+| `Title` | A text element for capturing titles. |
+| `TranscriptFragment` | An element for capturing transcription of speech, for example a speaker's words in an audio clip or video. |
+| `UncategorizedText` | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |
+
+
+ `SceneDescription` and `TranscriptFragment` are specific to video and audio file processing, which is available only for [self-hosted](/self-hosted/overview) deployments of Unstructured.
+
 If you apply chunking, you will also see the `CompositeElement` type.
 `CompositeElement` is a chunk formed from text (non-`Table`) elements.
@@ -187,6 +193,19 @@ file.
 Headers and footers in Word files include a `header_footer_type` indicating which page a header or footer applies to. Valid values are `"primary"`, `"even_only"`, and `"first_page"`.
+#### Video files
+
+Elements for video files include a `start_time` and `end_time`, representing the start and end times of a clip of video
+from the parent video file to which this element belongs. Also included are the `model_version` representing the model that was used to
+generate the element, and the `average_log_probability` representing the model's overall average confidence level for the model's output across the document, with values closer to
+zero indicating higher confidence.
+
+#### Audio files
+
+Elements for audio files include a `start_time`, `end_time`, and `speaker`, representing the start and end times of a clip of audio
+made by a specific speaker, as part of the parent audio file to which this element belongs.
+If the speaker cannot be determined, `speaker` is set to `0` or `unknown`.
+
 ### Table-specific metadata
 For `Table` elements, the raw text of the table will be stored in the `text` attribute for the element, and HTML representation
diff --git a/ui/workflows.mdx b/ui/workflows.mdx
index 814be027..890340af 100644
--- a/ui/workflows.mdx
+++ b/ui/workflows.mdx
@@ -69,6 +69,7 @@ By default, this workflow partitions, chunks, and generates embeddings as follow
  - If the page or document has no images and likely does not have tables, **Fast** partitioning is used, and the page or document is billed at the **Fast** rate for processing.
  - If the page or document has only a few tables or images with standard layouts and languages, **High Res** partitioning is used, and the page or document is billed at the **High Res** rate for processing.
  - If the page or document has more than a few tables or images, **VLM** partitioning is used, and the page or document is billed at the **VLM** rate for processing.
+ - If the page or document is a video or audio file, **Multimedia** partitioning is used.
 [Learn about partitioning strategies](/ui/partitioning).
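
Reviewer note: the file-level routing this change documents can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Unstructured's implementation: the function name `suggest_strategy` and the `text_only` flag are hypothetical, the extension sets are copied from the lists in this diff, and the page-by-page routing of `.pdf` files (described as proprietary) is not modeled.

```python
from pathlib import PurePath

# Extension sets copied from the strategy list and file-type tables in this diff.
VLM_EXTENSIONS = {".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp"}
AUDIO_EXTENSIONS = {".aac", ".flac", ".m4a", ".mp3", ".mp4", ".ogg", ".opus", ".pcm", ".wav", ".webm"}
VIDEO_EXTENSIONS = {".3gp", ".flv", ".mov", ".mp4", ".mpeg", ".mpg", ".webm", ".wmv"}
MULTIMEDIA_EXTENSIONS = AUDIO_EXTENSIONS | VIDEO_EXTENSIONS


def suggest_strategy(filename: str, text_only: bool = False) -> str:
    """Return the strategy name the documented rules suggest for a whole file.

    Hypothetical helper: `text_only` stands in for whatever content check
    decides that a document contains only text, which an extension cannot show.
    """
    ext = PurePath(filename).suffix.lower()
    if ext in MULTIMEDIA_EXTENSIONS:
        return "Multimedia"  # video and audio files
    if text_only:
        return "Fast"  # text-only documents
    if ext in VLM_EXTENSIONS:
        return "VLM"  # image types and .pdf
    return "High Res"  # all other supported file types


if __name__ == "__main__":
    for name in ("talk.mp3", "scan.pdf", "report.docx"):
        print(name, "->", suggest_strategy(name))
```

Note that `.mp4` and `.webm` appear in both the Audio and Video rows of the table; the sketch treats either case as **Multimedia**, so the overlap does not change the result.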