diff --git a/api-reference/workflow/workflows.mdx b/api-reference/workflow/workflows.mdx
index aedd4d97..65e7df53 100644
--- a/api-reference/workflow/workflows.mdx
+++ b/api-reference/workflow/workflows.mdx
@@ -2232,6 +2232,74 @@ Allowed values for `subtype` and `model_name` include the following:

 - `"model_name": "voyage-code-2"`
 - `"model_name": "voyage-multimodal-3"`

+### Extract node
+
+An **Extract** node has a `type` of `document_data_extractor` and a `subtype` of `llm`.
+
+  ```python
+  extractor_workflow_node = WorkflowNode(
+      name="Extractor",
+      subtype="llm",
+      type="document_data_extractor",
+      settings={
+          "schema_to_extract": {
+              "json_schema": "",
+              "extraction_guidance": ""
+          },
+          "provider": "",
+          "model": ""
+      }
+  )
+  ```
+
+  ```json
+  {
+      "name": "Extractor",
+      "type": "document_data_extractor",
+      "subtype": "llm",
+      "settings": {
+          "schema_to_extract": {
+              "json_schema": "",
+              "extraction_guidance": ""
+          },
+          "provider": "",
+          "model": ""
+      }
+  }
+  ```
+
+Fields for `settings` include:
+
+- `schema_to_extract`: _Required_. The schema or guidance for the structured data that you want to extract. One (and only one) of the following must also be specified:
+
+  - `json_schema`: The extraction schema for the structured data that you want to extract, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, expressed as a single string.
+  - `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.
+
+- `provider` and `model`: The LLM provider and model to use for the extraction. Allowed values include the following:
+
+  - `"provider": "anthropic"`
+
+    - `"model": "..."`
+
+  - `"provider": "azure_openai"`
+
+    - `"model": "..."`
+
+  - `"provider": "bedrock"`
+
+    - `"model": "..."`
+
+  - `"provider": "openai"`
+
+    - `"model": "..."`
+
+[Learn more](/ui/data-extractor).
+
 ## List templates

 To list templates, use the `UnstructuredClient` object's `templates.list_templates` function (for the Python SDK) or the `GET` method to call the `/templates` endpoint (for `curl` or Postman).
diff --git a/docs.json b/docs.json index 560fc477..a7437fe8 100644 --- a/docs.json +++ b/docs.json @@ -120,6 +120,7 @@ "pages": [ "ui/document-elements", "ui/partitioning", + "ui/data-extractor", "ui/chunking", { "group": "Enriching", diff --git a/img/ui/data-extractor/house-plant-care.png b/img/ui/data-extractor/house-plant-care.png new file mode 100644 index 00000000..b23a8356 Binary files /dev/null and b/img/ui/data-extractor/house-plant-care.png differ diff --git a/img/ui/data-extractor/medical-invoice.png b/img/ui/data-extractor/medical-invoice.png new file mode 100644 index 00000000..b632da26 Binary files /dev/null and b/img/ui/data-extractor/medical-invoice.png differ diff --git a/img/ui/data-extractor/real-estate-listing.png b/img/ui/data-extractor/real-estate-listing.png new file mode 100644 index 00000000..df7fc475 Binary files /dev/null and b/img/ui/data-extractor/real-estate-listing.png differ diff --git a/img/ui/data-extractor/schema-builder.png b/img/ui/data-extractor/schema-builder.png new file mode 100644 index 00000000..d7296640 Binary files /dev/null and b/img/ui/data-extractor/schema-builder.png differ diff --git a/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf b/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf new file mode 100644 index 00000000..3655cce4 Binary files /dev/null and b/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf differ diff --git a/snippets/general-shared-text/data-extractor.mdx b/snippets/general-shared-text/data-extractor.mdx new file mode 100644 index 00000000..ae074cac --- /dev/null +++ b/snippets/general-shared-text/data-extractor.mdx @@ -0,0 +1,30 @@ +1. In the **Schema settings** pane, for **Method**, choose one of the following to extract the source's data into a custom-defined, structured output format: + + - Choose **LLM** to use a large language model (LLM). + - Choose **Regex** (or **JSONPath**) to use regular expressions (or JSONPath expressions). + +2. If you chose **LLM** under **Method**, then continue with step 3 in this procedure. + + If you chose **Regex** (or **JSONPath**) under **Method** instead, then skip ahead to step 6 in this procedure. + +3. If you chose **LLM** under **Method**, then in the **Provider** and **Model** drop-down lists, choose the LLM provider and model that you want to use for the data extraction. +4. For **Extraction fields**, do one of the following: + + - Choose **Suggested** to start with a set of fields that the selected LLM has suggested for the data extraction. + As needed, you can add, change, or delete any of these suggested fields' names, data types, descriptions, or their relationships to other fields within the same schema. + - Choose **Prompt** to provide an AI prompt to the selected LLM to use to generate a set of suggested fields for the data extraction. + To generate the list of suggested fields, click **Generate schema** next to **Prompt**. + As needed, you can add, change, or delete any of these suggested fields' names, data types, descriptions, or their relationships to other fields within the same schema. + - Choose **Create** to manually specify the set of fields for the selected LLM to use for the data extraction. You can specify each field's name, data type, description, and its relationships to other fields within the same schema. + +5. Skip ahead to step 7 in this procedure. +6. 
If you chose **Regex** (or **JSONPath**) under **Method**, then do one of the following: + + - Choose **Suggested** to start with a set of fields that the default LLM has suggested for the data extraction. + As needed, you can add, change, or delete any of these suggested fields' names, regular expressions (or JSONPath expressions), or their relationships to other fields within the same schema. + - Choose **Prompt** to provide an AI prompt to the default LLM to use to generate a set of suggested fields for the data extraction. + To generate the list of suggested fields, click **Generate schema** next to **Prompt**. + As needed, you can add, change, or delete any of these suggested fields' names, regular expressions (or JSONPath expressions), or their relationships to other fields within the same schema. + - Choose **Create** to manually specify the set of fields for the default LLM to use for the data extraction. You can specify each field's name, regular expression (or JSONPath expression), and its relationships to other fields within the same schema. + +7. Click **Run** to extract the source's data into the custom-defined, structured output format. \ No newline at end of file diff --git a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx index f91c9345..2178874a 100644 --- a/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx +++ b/snippets/general-shared-text/get-started-single-file-ui-part-2.mdx @@ -465,9 +465,252 @@ embedding model that is provided by an embedding provider. For the best embeddin 6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to the workflow designer so that you can continue designing things later as you see fit. +## Step 7: Experiment with structured data extraction + +In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process where Unstructured +automatically extracts the data from your source documents into a format that you define up front. For example, in addition to Unstructured +partitioning your source documents into elements with types such as `NarrativeText`, `UncategorizedText`, and so on, you can have Unstructured +output key information from the source documents in a custom structured data format, appearing within a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, `email`, and so on. + +1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**. + + ![Adding an extract node](/img/ui/walkthrough/AddExtract.png) + +2. 
In the node's settings pane's **Details** tab, in the **Schema** box, enter the following JSON schema: + + ```json + { + "type": "object", + "properties": { + "title": { + "type": "string", + "description": "Full title of the research paper" + }, + "authors": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "Author's full name" + }, + "affiliation": { + "type": "string", + "description": "Author's institutional affiliation" + }, + "email": { + "type": "string", + "description": "Author's email address" + } + }, + "required": [ + "name", + "affiliation", + "email" + ], + "additionalProperties": false + }, + "description": "List of paper authors with their affiliations" + }, + "abstract": { + "type": "string", + "description": "Paper abstract summarizing the research" + }, + "introduction": { + "type": "string", + "description": "Introduction section describing the problem and motivation" + }, + "methodology": { + "type": "object", + "properties": { + "approach_name": { + "type": "string", + "description": "Name of the proposed method (e.g., StrokeNet)" + }, + "description": { + "type": "string", + "description": "Detailed description of the methodology" + }, + "key_techniques": { + "type": "array", + "items": { + "type": "string" + }, + "description": "List of key techniques used in the approach" + } + }, + "required": [ + "approach_name", + "description", + "key_techniques" + ], + "additionalProperties": false + }, + "experiments": { + "type": "object", + "properties": { + "datasets": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "Dataset name" + }, + "description": { + "type": "string", + "description": "Dataset description" + }, + "size": { + "type": "string", + "description": "Dataset size (e.g., number of sentence pairs)" + } + }, + "required": [ + "name", + "description", + "size" + ], + "additionalProperties": false + }, + "description": "Datasets used for evaluation" + }, + "baselines": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Baseline methods compared against" + }, + "evaluation_metrics": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Metrics used for evaluation" + }, + "experimental_setup": { + "type": "string", + "description": "Description of experimental configuration and hyperparameters" + } + }, + "required": [ + "datasets", + "baselines", + "evaluation_metrics", + "experimental_setup" + ], + "additionalProperties": false + }, + "results": { + "type": "object", + "properties": { + "main_findings": { + "type": "string", + "description": "Summary of main experimental findings" + }, + "performance_improvements": { + "type": "array", + "items": { + "type": "object", + "properties": { + "dataset": { + "type": "string", + "description": "Dataset name" + }, + "metric": { + "type": "string", + "description": "Evaluation metric (e.g., BLEU)" + }, + "baseline_score": { + "type": "number", + "description": "Baseline method score" + }, + "proposed_score": { + "type": "number", + "description": "Proposed method score" + }, + "improvement": { + "type": "number", + "description": "Improvement over baseline" + } + }, + "required": [ + "dataset", + "metric", + "baseline_score", + "proposed_score", + "improvement" + ], + "additionalProperties": false + }, + "description": "Performance improvements over baselines" + }, + "parameter_reduction": { + "type": "string", + "description": 
"Description of parameter reduction achieved" + } + }, + "required": [ + "main_findings", + "performance_improvements", + "parameter_reduction" + ], + "additionalProperties": false + }, + "related_work": { + "type": "string", + "description": "Summary of related work and prior research" + }, + "conclusion": { + "type": "string", + "description": "Conclusion section summarizing contributions and findings" + }, + "limitations": { + "type": "string", + "description": "Limitations and challenges discussed in the paper" + }, + "acknowledgments": { + "type": "string", + "description": "Acknowledgments section" + }, + "references": { + "type": "array", + "items": { + "type": "string" + }, + "description": "List of cited references" + } + }, + "additionalProperties": false, + "required": [ + "title", + "authors", + "abstract", + "introduction", + "methodology", + "experiments", + "results", + "related_work", + "conclusion", + "limitations", + "acknowledgments", + "references" + ] + } + ``` + +3. Immediately above the **Source** node, click **Test**. +4. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears, which will show the output from the last node in the workflow. +5. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks). +6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to + the workflow designer so that you can continue designing things later as you see fit. + ## Next steps -Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing +Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning. Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file. diff --git a/snippets/general-shared-text/get-started-single-file-ui.mdx b/snippets/general-shared-text/get-started-single-file-ui.mdx index 6ea82008..3c7c6f54 100644 --- a/snippets/general-shared-text/get-started-single-file-ui.mdx +++ b/snippets/general-shared-text/get-started-single-file-ui.mdx @@ -116,6 +116,7 @@ You can also do the following: What's next? --   [Learn how to add chunking, embeddings, and additional enrichments to your local file results](/ui/walkthrough-2). +-   [Learn how to extract structured data in a custom format from your local file](/ui/data-extractor#use-the-structured-data-extractor-from-the-start-page). +-   [Learn how to add chunking, embeddings, custom structured data extraction, and additional enrichments to your local file results](/ui/walkthrough-2). -   [Learn how to do large-scale batch processing of multiple files and semi-structured data that are stored in remote locations instead](/ui/quickstart#remote-quickstart). -   [Learn more about the Unstructured user interface](/ui/overview). 
\ No newline at end of file
diff --git a/ui/data-extractor.mdx b/ui/data-extractor.mdx
new file mode 100644
index 00000000..5c3e3bed
--- /dev/null
+++ b/ui/data-extractor.mdx
@@ -0,0 +1,746 @@
+---
+title: Structured data extraction
+---
+
+  To begin using the structured data extractor right away, skip ahead to the how-to [procedures](#using-the-structured-data-extractor).
+
+## Overview
+
+When Unstructured [partitions](/ui/partitioning) your source documents, the default result is a list of Unstructured
+[document elements](/ui/document-elements). These document elements are expressed in Unstructured's format, which includes elements such as
+`Title`, `NarrativeText`, `UncategorizedText`, `Table`, `Image`, `List`, and so on. For example, you could have
+Unstructured ingest a stack of customer order forms in PDF format, where the PDF files' layout is identical, but the
+content differs per individual PDF by customer order number. For each PDF, Unstructured might output elements such as
+a `List` element that contains details about the customer who placed the order, a `Table` element
+that contains the customer's order details, `NarrativeText` or `UncategorizedText` elements that contain special
+instructions for the order, and so on. You might then use custom logic that you write yourself to parse those elements further in an attempt to
+extract information that you're particularly interested in, such as customer IDs, item quantities, order totals, and so on.
+
+Unstructured's _structured data extractor_ simplifies this kind of scenario by allowing Unstructured to automatically extract the data from your source documents
+into a format that you define up front. For example, you could have Unstructured ingest that same stack of customer order form PDFs and
+then output a series of customer records, one record per order form. Each record could include data, with associated field labels, such as the customer's ID; a series of order line items with descriptions, quantities, and prices;
+the order's total amount; and any other available details that matter to you.
+This information is extracted in a consistent JSON format that is ready for you to use in your own applications.
+
+To show how the structured data extractor works, take a look at the following real estate listing PDF. This file is one of the
+sample files that are available directly from the **Start** page and the workflow editor's **Source** node in the Unstructured user interface (UI). The file's
+content is as follows:
+
+![Sample real estate listing PDF](/img/ui/data-extractor/real-estate-listing.png)
+
+Without the structured data extractor, if you run a workflow that references this file, Unstructured extracts the listing's data in a default format similar to the following
+(note that the ellipses in this output indicate omitted fields for brevity):
+
+```json
+[
+  {
+    "type": "Title",
+    "element_id": "3f1ad705648037cf65e4d029d834a0de",
+    "text": "HOME FOR FUTURE",
+    "metadata": {
+      "...": "..."
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "320ca4f48e63d8bcfba56ec54c9be9af",
+    "text": "221 Queen Street, Melbourne VIC 3000",
+    "metadata": {
+      "...": "..."
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "05f648e815e73fe5140f203a62d8a3cc",
+    "text": "2,800 sq. ft living space",
+    "metadata": {
+      "...": "..."
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "27a9ded56b42f559999e48d1dcd76c9e",
+    "text": "Recently renovated kitchen",
+    "metadata": {
+      "...": "..."
+ } + }, + { + "...": "..." + } +] +``` + +In the preceding output, the `text` fields contain information about the listing, such as the street address, +the square footage, one of the listing's features, and so on. However, +you might want the information presented as `street_address`, `square_footage`, `features`, and so on. + +By using the structured data extractor in your Unstructured workflows, you could have Unstructured extract the listing's data in a custom-defined output format similar to the following (ellipses indicate omitted fields for brevity): + +```json +[ + { + "type": "DocumentData", + "element_id": "f2ee7334-c00a-4fc0-babc-2fcea28c1fb6", + "text": "", + "metadata": { + "...": "...", + "extracted_data": { + "street_address": "221 Queen Street, Melbourne VIC 3000", + "square_footage": 2800, + "price": 1000000, + "features": [ + "Recently renovated kitchen", + "Smart home automation system", + "2-car garage with storage space", + "Spacious open-plan layout with natural lighting", + "Designer kitchen with quartz countertops and built-in appliances", + "Master suite with walk-in closet and en-suite bath", + "Covered patio and landscaped backyard garden" + ], + "agent_contact": { + "phone": "+01 555 123456" + } + } + } + }, + { + "type": "Title", + "element_id": "3f1ad705648037cf65e4d029d834a0de", + "text": "HOME FOR FUTURE", + "metadata": { + "...": "..." + } + }, + { + "...": "..." + } +] +``` + +In the preceding output, the first document element, of type `DocumentData`, has an `extracted_data` field within `metadata` +that contains a representation of the document's data in the custom output format that you specify. Beginning with the second document element and continuing +until the end of the document, Unstructured also outputs the document's data as a series of Unstructured's document elements and metadata as it normally would. + +To use the structured data extractor, you can provide Unstructured with an _extraction schema_, which defines the structure of the data for Unstructured to extract. +Or you can specify an _extraction prompt_ that guides Unstructured on how to extract the data from the source documents, in the format that you want. + +An extraction prompt is like a prompt that you would give to a chatbot or AI agent. This prompt guides Unstructured on how to extract the data from the source documents. For this real estate listing example, the +prompt might look like the following: + +```text +Extract the following information from the listing, and present it in the following format: + +- street_address: The full street address of the property including street number, street name, city, state, and postal code. +- square_footage: The total living space area of the property, in square feet. +- price: The listed selling price of the property, in local currency. +- features: A list of property features and highlights. +- agent_contact: Contact information for the real estate agent. + + - phone: The agent's contact phone number. +``` + +An extraction schema is a JSON-formatted schema that defines the structure of the data that Unstructured extracts. The schema must +conform to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines, +which are a subset of the [JSON Schema](https://json-schema.org/docs) language. 
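+
+The Structured Outputs subset is stricter than general JSON Schema. For example, every object in the schema is expected to set `additionalProperties` to `false` and to list all of its properties under `required`, as the example schemas later on this page do. If you maintain your schemas in code, a quick pre-check can catch violations before you paste a schema into a workflow. The following Python sketch is illustrative only (the `check_schema` helper is hypothetical, not part of any Unstructured SDK) and covers just those two rules:
+
+```python
+def check_schema(schema: dict, path: str = "$") -> list[str]:
+    """Recursively flag object schemas that break two common
+    Structured Outputs rules. Illustrative, not exhaustive."""
+    problems = []
+    if schema.get("type") == "object":
+        props = schema.get("properties", {})
+        # Rule 1: objects must disallow unknown properties.
+        if schema.get("additionalProperties") is not False:
+            problems.append(f"{path}: set 'additionalProperties' to false")
+        # Rule 2: every declared property must be listed as required.
+        if set(schema.get("required", [])) != set(props):
+            problems.append(f"{path}: list every property under 'required'")
+        for name, subschema in props.items():
+            problems.extend(check_schema(subschema, f"{path}.{name}"))
+    elif schema.get("type") == "array":
+        problems.extend(check_schema(schema.get("items", {}), f"{path}[]"))
+    return problems
+
+# Example: this object schema omits 'additionalProperties'.
+print(check_schema({
+    "type": "object",
+    "properties": {"street_address": {"type": "string"}},
+    "required": ["street_address"]
+}))
+# ["$: set 'additionalProperties' to false"]
+```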
+
+For this real estate listing example, the schema might look like the following:
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "property_listing": {
+      "type": "object",
+      "properties": {
+        "street_address": {
+          "type": "string",
+          "description": "The full street address of the property including street number, street name, city, state, and postal code."
+        },
+        "square_footage": {
+          "type": "integer",
+          "description": "The total living space area of the property, in square feet."
+        },
+        "price": {
+          "type": "number",
+          "description": "The listed selling price of the property, in local currency."
+        },
+        "features": {
+          "type": "array",
+          "description": "A list of property features and highlights.",
+          "items": {
+            "type": "string",
+            "description": "A single property feature or highlight."
+          }
+        },
+        "agent_contact": {
+          "type": "object",
+          "description": "Contact information for the real estate agent.",
+          "properties": {
+            "phone": {
+              "type": "string",
+              "description": "The agent's contact phone number."
+            }
+          },
+          "required": ["phone"],
+          "additionalProperties": false
+        }
+      },
+      "required": ["street_address", "square_footage", "price", "features", "agent_contact"],
+      "additionalProperties": false
+    }
+  },
+  "required": ["property_listing"],
+  "additionalProperties": false
+}
+```
+
+You can also use a visual schema builder to define the schema, like this:
+
+![Visual schema builder](/img/ui/data-extractor/schema-builder.png)
+
+## Using the structured data extractor
+
+There are two ways to use the [structured data extractor](#overview) in your Unstructured workflows:
+
+- From the **Start** page of your Unstructured account. This approach works
+  only with a single file that is stored on your local machine. [Learn how](#use-the-structured-data-extractor-from-the-start-page).
+- From the Unstructured workflow editor. This approach works with a single file that is stored on your local machine, or with any
+  number of files that are stored in remote locations. [Learn how](#use-the-structured-data-extractor-from-the-workflow-editor).
+
+### Use the structured data extractor from the Start page
+
+To have Unstructured [extract the data in a custom-defined format](#overview) for a single file that is stored on your local machine, do the following from the **Start** page:
+
+1. Sign in to your Unstructured account, if you are not already signed in.
+2. On the sidebar, click **Start**, if the **Start** page is not already showing.
+3. In the **Welcome, get started right away!** tile, do one of the following:
+
+   - To use a file on your local machine, click **Browse files** and then select the file, or drag and drop the file onto **Drop file to test**.
+
+     If you use a local file, the file must be 10 MB or less in size.
+
+   - To use a sample file provided by Unstructured, click one of the sample files that are shown, such as **realestate.pdf**.
+
+4. After Unstructured partitions the selected file into Unstructured's document element format, click **Update results** to
+   have Unstructured apply generative enrichments, such as [image descriptions](/ui/enriching/image-descriptions) and
+   [generative OCR](/ui/enriching/generative-ocr), to those document elements.
+5. In the title bar, next to **Transform**, click **Extract**.
+6. In the **Define Schema** pane, do one of the following to extract the data from the selected file by using a custom-defined format:
+
+   - To use a schema that Unstructured suggests after analyzing the selected file, click **Run Schema**.
+   - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+     click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; click **Use this Schema**; and then click **Run Schema**.
+     [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+   - To use a visual editor to define the schema, click the ellipses (three dots) icon; click **Reset form**; enter your own custom schema objects and their properties;
+     and then click **Run Schema**. [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+   - To use a plain-language prompt to guide Unstructured on how to extract the data, click **Suggest**; enter your prompt in the
+     dialog; click **Generate schema**; make any changes to the suggested schema as needed; and then click **Run Schema**.
+
+7. The extracted data appears in the **Extract results** pane. You can then do any of the following:
+
+   - To view a human-readable representation of the extracted data, click **Formatted**.
+   - To view the JSON representation of the extracted data, click **JSON**.
+   - To download the JSON representation of the extracted data as a local JSON file, click the download icon next to **Formatted** and **JSON**.
+   - To change the schema and then re-run the extraction, click the back arrow next to **Extract results**, and then skip back to step 6 in this procedure.
+
+### Use the structured data extractor from the workflow editor
+
+To have Unstructured [extract the data in a custom-defined format](#overview) for a single file that is stored on your local machine, or with any
+number of files that are stored in remote locations, do the following from the workflow editor:
+
+1. If you already have an Unstructured workflow that you want to use, open it to show the workflow editor. Otherwise, create a new
+   workflow as follows:
+
+   a. Sign in to your Unstructured account, if you are not already signed in.
+ b. On the sidebar, click **Workflows**.
+ c. Click **New Workflow +**.
+ d. With **Build it Myself** already selected, click **Continue**. The workflow editor appears.
+
+2. Add an **Extract** node to your existing Unstructured workflow. This node must be added right before the workflow's **Destination** node.
+   To add this node, in the workflow designer, click the **+** (add node) button immediately before the **Destination** node, and then click **Enrich > Extract**.
+3. Click the newly added **Extract** node to select it.
+4. In the node's settings pane, on the **Details** tab, do one of the following:
+
+   - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+     click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; click **Use this Schema**; and then click **Run Schema**.
+     [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+   - To use a visual editor to define the schema, click the ellipses (three dots) icon; click **Reset form**; enter your own custom schema objects and their properties;
+     and then click **Run Schema**. [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+
+5. Continue building your workflow as desired.
+6. To see the results of the structured data extractor, do one of the following:
+
+   - If you have already selected a local file as input to your workflow, click **Test** immediately above the **Source** node. The results will be displayed on-screen
+     in the **Test output** pane.
+   - If you are using source and destination connectors for your workflow, [run the workflow as a job](/ui/jobs#run-a-job),
+     [monitor the job](/ui/jobs#monitor-a-job), and then examine the job's results in your destination location.
+
+## Limitations
+
+The structured data extractor is not guaranteed to work with the [Pinecone destination connector](/ui/destinations/pinecone).
+This is because Pinecone has strict limits on the amount of metadata that it can manage, and these limits are typically
+lower than the amount of metadata that the structured data extractor needs to store.
+
+## Saving the extracted data separately
+
+There might be cases where you want to save the contents of the `extracted_data` field separately from the rest of Unstructured's JSON output.
+To do this, you could use a Python script such as the following. This script works with one or more Unstructured JSON output files that you already have stored
+on the same machine as this script. Before you run this script, do the following:
+
+- To process all Unstructured JSON files within a directory, change `None` for `input_dir` to a string that contains the path to the directory. This can be a relative or absolute path.
+- To process specific Unstructured JSON files within a directory or across multiple directories, change `None` for `input_files` to a string that contains a comma-separated list of filepaths on your local machine, for example `"./input/2507.13305v1.pdf.json,./input2/table-multi-row-column-cells.pdf.json"`. These filepaths can be relative or absolute.
+
+  If `input_dir` and `input_files` are both set to something other than `None`, then the `input_dir` setting takes precedence, and the `input_files` setting is ignored.
+
+- For the `output_dir` variable, specify a string that contains the path to the directory on your local machine to which you want to send the `extracted_data` JSON. 
If the specified directory does not exist at that location, the code will create the missing directory for you. This path can be relative or absolute.
+
+```python
+import asyncio
+import json
+import os
+
+async def process_file_and_save_result(input_filename, output_dir):
+    # Read one Unstructured JSON output file.
+    with open(input_filename, "r") as f:
+        input_data = json.load(f)
+
+    # The extracted data, if present, is on the file's first element,
+    # which has a 'type' of 'DocumentData'.
+    if input_data and input_data[0].get("type") == "DocumentData":
+        if "extracted_data" in input_data[0]["metadata"]:
+            extracted_data = input_data[0]["metadata"]["extracted_data"]
+
+            results_name = os.path.basename(input_filename)
+            output_filename = os.path.join(output_dir, results_name)
+
+            try:
+                with open(output_filename, "w") as f:
+                    json.dump(extracted_data, f)
+                print(f"Successfully wrote 'metadata.extracted_data' to '{output_filename}'.")
+            except Exception as e:
+                print(f"Error: Failed to write 'metadata.extracted_data' to '{output_filename}': {e}")
+        else:
+            print(f"Error: Cannot find 'metadata.extracted_data' field in '{input_filename}'.")
+    else:
+        print(f"Error: '{input_filename}' is empty, or its first element does not have 'type' set to 'DocumentData'.")
+
+
+def load_filenames_in_directory(input_dir):
+    # Collect the paths of all JSON files in the directory tree.
+    filenames = []
+    for root, _, files in os.walk(input_dir):
+        for file in files:
+            if file.endswith('.json'):
+                filenames.append(os.path.join(root, file))
+                print(f"Found JSON file '{file}'.")
+            else:
+                print(f"Skipping non-JSON file '{file}'.")
+
+    return filenames
+
+async def process_files():
+    # Initialize with either a directory name, to process everything in the dir,
+    # or a comma-separated list of filepaths.
+    input_dir = None # "path/to/input/directory"
+    input_files = None # "path/to/file,path/to/file,path/to/file"
+
+    # Set to the directory for output json files. This dir
+    # will be created if needed.
+    output_dir = "./extracted_data/"
+
+    if input_dir:
+        filenames = load_filenames_in_directory(input_dir)
+    elif input_files:
+        filenames = input_files.split(",")
+    else:
+        raise ValueError("Set 'input_dir' or 'input_files' before running this script.")
+
+    os.makedirs(output_dir, exist_ok=True)
+
+    tasks = []
+    for filename in filenames:
+        tasks.append(
+            process_file_and_save_result(filename, output_dir)
+        )
+
+    await asyncio.gather(*tasks)
+
+if __name__ == "__main__":
+    asyncio.run(process_files())
+```
+
+## Additional examples
+
+In addition to the preceding real estate listing example, here are some more examples that you can adapt for your own use.
+
+### Caring for houseplants
+
+Using the following image file ([download this file](https://raw.githubusercontent.com/Unstructured-IO/docs/main/img/ui/data-extractor/house-plant-care.png)):
+
+![Caring for houseplants](/img/ui/data-extractor/house-plant-care.png)
+
+An extraction schema for this file might look like the following:
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "plants": {
+      "type": "array",
+      "items": {
+        "type": "object",
+        "properties": {
+          "name": {
+            "type": "string",
+            "description": "The name of the plant."
+          },
+          "sunlight": {
+            "type": "string",
+            "description": "The sunlight requirements for the plant (for example: 'Direct', 'Bright Indirect - Some direct')."
+          },
+          "water": {
+            "type": "string",
+            "description": "The watering instructions for the plant (for example: 'Let dry between thorough watering', 'Water when 50-60% dry')."
+          },
+          "humidity": {
+            "type": "string",
+            "description": "The humidity requirements for the plant (for example: 'Low', 'Medium', 'High')."
+          }
+        },
+        "required": ["name", "sunlight", "water", "humidity"],
+        "additionalProperties": false
+      }
+    }
+  },
+  "required": ["plants"],
+  "additionalProperties": false
+}
+```
+
+An extraction guidance prompt for this file might look like the following:
+
+```text
+Extract the plant information for each of the plants in this document, and present it in the following format:
+
+- plants: A list of plants.
+
+  - name: The name of the plant.
+  - sunlight: The sunlight requirements for the plant (for example: 'Direct', 'Bright Indirect - Some direct').
+  - water: The watering instructions for the plant (for example: 'Let dry between thorough watering', 'Water when 50-60% dry').
+  - humidity: The humidity requirements for the plant (for example: 'Low', 'Medium', 'High').
+```
+
+And Unstructured's output would look like the following:
+
+```json
+[
+  {
+    "type": "DocumentData",
+    "element_id": "3be179f1-e1e5-4dde-a66b-9c370b6d23e8",
+    "text": "",
+    "metadata": {
+      "...": "...",
+      "extracted_data": {
+        "plants": [
+          {
+            "name": "Krimson Queen",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Let dry between thorough watering",
+            "humidity": "Low"
+          },
+          {
+            "name": "Chinese Money Plant",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Let dry between thorough watering",
+            "humidity": "Low - Medium"
+          },
+          {
+            "name": "String of Hearts",
+            "sunlight": "Direct - Bright Indirect",
+            "water": "Let dry between thorough watering",
+            "humidity": "Low"
+          },
+          {
+            "name": "Marble Queen",
+            "sunlight": "Low- High Indirect",
+            "water": "Water when 50 - 80% dry",
+            "humidity": "Low - Medium"
+          },
+          {
+            "name": "Sansevieria Whitney",
+            "sunlight": "Direct - Low Direct",
+            "water": "Let dry between thorough watering",
+            "humidity": "Low"
+          },
+          {
+            "name": "Prayer Plant",
+            "sunlight": "Medium - Bright Indirect",
+            "water": "Keep soil moist",
+            "humidity": "Medium - High"
+          },
+          {
+            "name": "Aloe Vera",
+            "sunlight": "Direct - Bright Indirect",
+            "water": "Water when dry",
+            "humidity": "Low"
+          },
+          {
+            "name": "Philodendron Brasil",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Water when 80% dry",
+            "humidity": "Low - Medium"
+          },
+          {
+            "name": "Pink Princess",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Water when 50 - 80% dry",
+            "humidity": "Medium"
+          },
+          {
+            "name": "Stromanthe Triostar",
+            "sunlight": "Bright Indirect",
+            "water": "Keep soil moist",
+            "humidity": "Medium - High"
+          },
+          {
+            "name": "Rubber Plant",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Let dry between thorough watering",
+            "humidity": "Low - Medium"
+          },
+          {
+            "name": "Monstera Deliciosa",
+            "sunlight": "Bright Indirect - Some direct",
+            "water": "Water when 80% dry",
+            "humidity": "Low - Medium"
+          }
+        ]
+      }
+    }
+  },
+  {
+    "...": "..."
+  }
+]
+```
+
+### Medical invoicing
+
+Using the following PDF file ([download this file](https://raw.githubusercontent.com/Unstructured-IO/docs/main/img/ui/data-extractor/spinalogic-bone-growth-stimulator-form.pdf)):
+
+![Medical invoice](/img/ui/data-extractor/medical-invoice.png)
+
+An extraction schema for this file might look like the following:
+
+```json
+{
+  "type": "object",
+  "properties": {
+    "patient": {
+      "type": "object",
+      "properties": {
+        "name": {
+          "type": "string",
+          "description": "Full name of the patient."
+        },
+        "birth_date": {
+          "type": "string",
+          "description": "Patient's date of birth."
+        },
+        "sex": {
+          "type": "string",
+          "enum": ["M", "F", "Other"],
+          "description": "Patient's biological sex."
+        }
+      },
+      "required": ["name", "birth_date", "sex"],
+      "additionalProperties": false
+    },
+    "medical_summary": {
+      "type": "object",
+      "properties": {
+        "prior_procedures": {
+          "type": "array",
+          "items": {
+            "type": "object",
+            "properties": {
+              "procedure": {
+                "type": "string",
+                "description": "Name or type of the medical procedure."
+              },
+              "date": {
+                "type": "string",
+                "description": "Date when the procedure was performed."
+              },
+              "levels": {
+                "type": "string",
+                "description": "Anatomical levels or location of the procedure."
+              }
+            },
+            "required": ["procedure", "date", "levels"],
+            "additionalProperties": false
+          },
+          "description": "List of prior medical procedures."
+        },
+        "diagnoses": {
+          "type": "array",
+          "items": {
+            "type": "string"
+          },
+          "description": "List of medical diagnoses."
+        },
+        "comorbidities": {
+          "type": "array",
+          "items": {
+            "type": "string"
+          },
+          "description": "List of comorbid conditions."
+        }
+      },
+      "required": ["prior_procedures", "diagnoses", "comorbidities"],
+      "additionalProperties": false
+    }
+  },
+  "required": ["patient", "medical_summary"],
+  "additionalProperties": false
+}
+```
+
+An extraction guidance prompt for this file might look like the following:
+
+```text
+Extract the medical information from this record, and present it in the following format:
+
+- patient
+
+  - name: Full name of the patient.
+  - birth_date: Patient's date of birth.
+  - sex: Patient's biological sex.
+
+- medical_summary
+
+  - prior_procedures
+
+    - procedure: Name or type of the medical procedure.
+    - date: Date when the procedure was performed.
+    - levels: Anatomical levels or location of the procedure.
+
+  - diagnoses: List of medical diagnoses.
+  - comorbidities: List of comorbid conditions.
+
+Additional extraction guidance:
+
+- name: Extract the full legal name as it appears in the document. Use proper capitalization (for example: "Marissa K. Donovan").
+- birth_date: Convert to format "MM/DD/YYYY" (for example: "03/28/1976").
+
+  - Accept variations: MM/DD/YYYY, MM-DD-YYYY, YYYY-MM-DD, Month DD, YYYY.
+  - If only age is given, do not infer the birth date; mark it as null.
+
+- sex: Extract biological sex as "M" (Male), "F" (Female), or "Other".
+
+  - Map variations: Male/Man → "M", Female/Woman → "F", anything else → "Other".
+
+- prior_procedures:
+
+  Extract all surgical and major medical procedures, including:
+
+  - procedure: Use standard medical terminology when possible.
+  - date: Format as "MM/DD/YYYY". If only year/month available, use "01" for missing day.
+  - levels: Include anatomical locations, vertebral levels, or affected areas.
+
+    - For spine procedures: Use format like "L4 to L5" or "L4-L5".
+    - Include laterality when specified (left, right, bilateral).
+
+- diagnoses:
+
+  Extract all current and historical diagnoses:
+
+  - Include both primary and secondary diagnoses.
+  - Preserve medical terminology and ICD-10 descriptions if provided.
+  - Include location/region specifications (for example: "radiculopathy — lumbar region").
+  - Do not include procedure names unless they represent a diagnostic condition.
+
+- comorbidities:
+
+  Extract all coexisting medical conditions that may impact treatment:
+
+  - Include chronic conditions (for example: "diabetes", "hypertension").
+  - Include relevant surgical history that affects current state (for example: Failed Fusion, Multi-Level Fusion).
+  - Include structural abnormalities (for example: Spondylolisthesis, Stenosis).
+  - Do not duplicate items already listed in primary diagnoses.
+
+Data quality rules:
+
+1. Completeness: Only include fields where data is explicitly stated or clearly indicated.
+2. No inference: Do not infer or assume information not present in the source.
+3. Preserve specificity: Maintain medical terminology and specificity from source.
+4. Handle missing data: Return empty arrays [] for sections with no data, never null.
+5. Date validation: Ensure all dates are realistic and properly formatted.
+6. Deduplication: Avoid listing the same condition in multiple sections.
+
+Common variations to handle:
+
+- Operative reports: Focus on procedure details, dates, and levels.
+- H&P (history & physical): Rich source for all sections.
+- Progress notes: May contain updates to diagnoses and new procedures.
+- Discharge summaries: Comprehensive source for all data points.
+- Consultation notes: Often contain detailed comorbidity lists.
+- Spinal levels: C1-C7 (Cervical), T1-T12 (Thoracic), L1-L5 (Lumbar), S1-S5 (Sacral).
+- Use "fusion surgery" not "fusion" alone when referring to procedures.
+- Preserve specificity: "Type 2 Diabetes" not just "Diabetes" when specified.
+- Multiple procedures on the same date: List as separate objects in the array.
+- Revised procedures: Include both original and revision as separate entries.
+- Bilateral procedures: Note as single procedure with "bilateral" in levels.
+- Uncertain dates: If date is approximate (for example, "Spring 2023"), use "04/01/2023" for Spring, "07/01/2023" for Summer, and so on.
+- Name variations: Use the most complete version found in the document.
+- Conflicting information: Use the most recent or most authoritative source.
+
+Output validation:
+
+Before returning the extraction:
+
+1. Verify all required fields are present.
+2. Check date formats are consistent.
+3. Ensure no duplicate entries within arrays.
+4. Confirm sex field contains only "M", "F", or "Other".
+5. Validate that procedures have all three required fields.
+6. Ensure diagnoses and comorbidities are non-overlapping.
+```
+
+And Unstructured's output would look like the following:
+
+```json
+[
+  {
+    "type": "DocumentData",
+    "element_id": "e8f09cb1-1439-4e89-af18-b6285aef5d37",
+    "text": "",
+    "metadata": {
+      "...": "...",
+      "extracted_data": {
+        "patient": {
+          "name": "Ms. Daovan",
+          "birth_date": "01/01/1974",
+          "sex": "F"
+        },
+        "medical_summary": {
+          "prior_procedures": [],
+          "diagnoses": [
+            "Radiculopathy — lumbar region"
+          ],
+          "comorbidities": [
+            "Diabetes",
+            "Multi-Level Fusion",
+            "Failed Fusion",
+            "Spondylolisthesis"
+          ]
+        }
+      }
+    }
+  },
+  {
+    "...": "..."
+  }
+]
+```
\ No newline at end of file
diff --git a/ui/walkthrough.mdx b/ui/walkthrough.mdx
index 22221f56..83c24c3b 100644
--- a/ui/walkthrough.mdx
+++ b/ui/walkthrough.mdx
@@ -4,7 +4,7 @@ sidebarTitle: Walkthrough
 ---

 This walkthrough provides you with deep, hands-on experience with the [Unstructured user interface (UI)](/ui/overview). As you follow along, you will learn how to use many of Unstructured's
-features for [partitioning](/ui/partitioning), [enriching](/ui/enriching/overview), [chunking](/ui/chunking), and [embedding](/ui/embedding). 
These features are optimized for turning
+features for [partitioning](/ui/partitioning), [enriching](/ui/enriching/overview), [chunking](/ui/chunking), [embedding](/ui/embedding), and [structured data extraction](/ui/data-extractor). These features are optimized for turning
 your source documents and data into information that is well-tuned for [retrieval-augmented generation (RAG)](https://unstructured.io/blog/rag-whitepaper),
 [agentic AI](https://unstructured.io/problems-we-solve#powering-agentic-ai),
@@ -557,9 +557,252 @@ embedding model that is provided by an embedding provider. For the best embeddin
 6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
    the workflow designer so that you can continue designing things later as you see fit.

+## Step 7: Experiment with structured data extraction
+
+In this step, you apply custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process where Unstructured
+automatically extracts the data from your source documents into a format that you define up front. For example, in addition to Unstructured
+partitioning your source documents into elements with types such as `NarrativeText`, `UncategorizedText`, and so on, you can have Unstructured
+output key information from the source documents in a custom structured data format, within a `DocumentData` element containing a JSON object with custom fields such as `name`, `address`, `phone`, `email`, and so on.
+
+1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.
+
+   ![Adding an extract node](/img/ui/walkthrough/AddExtract.png)
+
+2. 
In the node's settings pane's **Details** tab, in the **Schema** box, enter the following JSON schema: + + ```json + { + "type": "object", + "properties": { + "title": { + "type": "string", + "description": "Full title of the research paper" + }, + "authors": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "Author's full name" + }, + "affiliation": { + "type": "string", + "description": "Author's institutional affiliation" + }, + "email": { + "type": "string", + "description": "Author's email address" + } + }, + "required": [ + "name", + "affiliation", + "email" + ], + "additionalProperties": false + }, + "description": "List of paper authors with their affiliations" + }, + "abstract": { + "type": "string", + "description": "Paper abstract summarizing the research" + }, + "introduction": { + "type": "string", + "description": "Introduction section describing the problem and motivation" + }, + "methodology": { + "type": "object", + "properties": { + "approach_name": { + "type": "string", + "description": "Name of the proposed method (e.g., StrokeNet)" + }, + "description": { + "type": "string", + "description": "Detailed description of the methodology" + }, + "key_techniques": { + "type": "array", + "items": { + "type": "string" + }, + "description": "List of key techniques used in the approach" + } + }, + "required": [ + "approach_name", + "description", + "key_techniques" + ], + "additionalProperties": false + }, + "experiments": { + "type": "object", + "properties": { + "datasets": { + "type": "array", + "items": { + "type": "object", + "properties": { + "name": { + "type": "string", + "description": "Dataset name" + }, + "description": { + "type": "string", + "description": "Dataset description" + }, + "size": { + "type": "string", + "description": "Dataset size (e.g., number of sentence pairs)" + } + }, + "required": [ + "name", + "description", + "size" + ], + "additionalProperties": false + }, + "description": "Datasets used for evaluation" + }, + "baselines": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Baseline methods compared against" + }, + "evaluation_metrics": { + "type": "array", + "items": { + "type": "string" + }, + "description": "Metrics used for evaluation" + }, + "experimental_setup": { + "type": "string", + "description": "Description of experimental configuration and hyperparameters" + } + }, + "required": [ + "datasets", + "baselines", + "evaluation_metrics", + "experimental_setup" + ], + "additionalProperties": false + }, + "results": { + "type": "object", + "properties": { + "main_findings": { + "type": "string", + "description": "Summary of main experimental findings" + }, + "performance_improvements": { + "type": "array", + "items": { + "type": "object", + "properties": { + "dataset": { + "type": "string", + "description": "Dataset name" + }, + "metric": { + "type": "string", + "description": "Evaluation metric (e.g., BLEU)" + }, + "baseline_score": { + "type": "number", + "description": "Baseline method score" + }, + "proposed_score": { + "type": "number", + "description": "Proposed method score" + }, + "improvement": { + "type": "number", + "description": "Improvement over baseline" + } + }, + "required": [ + "dataset", + "metric", + "baseline_score", + "proposed_score", + "improvement" + ], + "additionalProperties": false + }, + "description": "Performance improvements over baselines" + }, + "parameter_reduction": { + "type": "string", + "description": 
"Description of parameter reduction achieved" + } + }, + "required": [ + "main_findings", + "performance_improvements", + "parameter_reduction" + ], + "additionalProperties": false + }, + "related_work": { + "type": "string", + "description": "Summary of related work and prior research" + }, + "conclusion": { + "type": "string", + "description": "Conclusion section summarizing contributions and findings" + }, + "limitations": { + "type": "string", + "description": "Limitations and challenges discussed in the paper" + }, + "acknowledgments": { + "type": "string", + "description": "Acknowledgments section" + }, + "references": { + "type": "array", + "items": { + "type": "string" + }, + "description": "List of cited references" + } + }, + "additionalProperties": false, + "required": [ + "title", + "authors", + "abstract", + "introduction", + "methodology", + "experiments", + "results", + "related_work", + "conclusion", + "limitations", + "acknowledgments", + "references" + ] + } + ``` + +3. With the "Chinese Characters" PDF file still selected in the **Source** node, click **Test**. +4. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears, which will show the output from the last node in the workflow. +5. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks). +6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to + the workflow designer so that you can continue designing things later as you see fit. + ## Next steps -Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, and embeds your source documents, producing +Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning. Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file. diff --git a/ui/workflows.mdx b/ui/workflows.mdx index d62968d7..4a22a936 100644 --- a/ui/workflows.mdx +++ b/ui/workflows.mdx @@ -178,6 +178,26 @@ If you did not previously set the workflow to run on a schedule, you can [run th flowchart LR Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Destination ``` + ```mermaid + flowchart LR + Source-->Partitioner-->Extract-->Destination + ``` + ```mermaid + flowchart LR + Source-->Partitioner-->Chunker-->Extract-->Destination + ``` + ```mermaid + flowchart LR + Source-->Partitioner-->Chunker-->Embedder-->Extract-->Destination + ``` + ```mermaid + flowchart LR + Source-->Partitioner-->Enrichment-->Chunker-->Extract-->Destination + ``` + ```mermaid + flowchart LR + Source-->Partitioner-->Enrichment-->Chunker-->Embedder-->Extract-->Destination + ``` For workflows that use **Chunker** and **Enrichment** nodes together, the **Chunker** node should be placed after all **Enrichment** nodes. Placing the @@ -212,7 +232,7 @@ If you did not previously set the workflow to run on a schedule, you can [run th ![Add node to workflow](/img/ui/Workflow-Add-Node.png) - Click **Connect** to add another **Source** or **Destination** node. You can add multiple source and destination locations. 
Files will be ingested from all of the source locations, and the processed data will be delivered to all of the destination locations. [Learn more](#custom-workflow-node-types).
-   - Click **Enrich** to add a **Chunker** or **Enrichment** node. [Learn more](#custom-workflow-node-types).
+   - Click **Enrich** to add a **Chunker**, **Enrichment**, or **Extract** node. [Learn more](#custom-workflow-node-types).

@@ -389,6 +409,17 @@ import DeprecatedModelsUI from '/snippets/general-shared-text/deprecated-models-

   - [Embedding overview](/ui/embedding)
   - [Understanding embedding models: make an informed choice for your RAG](https://unstructured.io/blog/understanding-embedding-models-make-an-informed-choice-for-your-rag).

+  Do one of the following to define the custom schema for the structured data that you want to extract:
+
+  - To use a custom schema that conforms to the [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) guidelines,
+    click **Upload JSON**; enter your own custom schema or upload a JSON file that contains your custom schema; click **Use this Schema**; and then click **Run Schema**.
+    [Learn about the OpenAI Structured Outputs format](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+  - To use a visual editor to define the schema, click the ellipses (three dots) icon; click **Reset form**; enter your own custom schema objects and their properties;
+    and then click **Run Schema**. [Learn about OpenAI Structured Outputs data types](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
+
+  [Learn more](/ui/data-extractor).
+
 ## Edit, delete, or run a workflow