68 changes: 68 additions & 0 deletions api-reference/workflow/workflows.mdx
@@ -2232,6 +2232,74 @@ Allowed values for `subtype` and `model_name` include the following:
- `"model_name": "voyage-code-2"`
- `"model_name": "voyage-multimodal-3"`

### Extract node

An **Extract** node has a `type` of `document_data_extractor` and a `subtype` of `llm`.

<AccordionGroup>
<Accordion title="Python SDK">
```python
extractor_workflow_node = WorkflowNode(
name="Extractor",
subtype="llm",
type="document_data_extractor",
settings={
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
)
```
</Accordion>
<Accordion title="curl, Postman">
```json
{
"name": "Extractor",
"type": "document_data_extractor",
"subtype": "llm",
"settings": {
"schema_to_extract": {
"json_schema": "<json-schema>",
"extraction_guidance": "<extraction-guidance>"
},
"provider": "<provider>",
"model": "<model>"
}
}
```
</Accordion>
</AccordionGroup>

Fields for `settings` include:

- `schema_to_extract`: _Required_. The schema or guidance for the structured data that you want to extract. One (and only one) of the following must also be specified:

- `json_schema`: The extraction schema, in [OpenAI Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas) format, for the structured data that you want to extract, expressed as a single string.
- `extraction_guidance`: The extraction prompt for the structured data that you want to extract, expressed as a single string.

- Allowed values for `provider` and `model` include the following:

- `"provider": "anthropic"`

- `"model": "..."`

- `"provider": "azure_openai"`

- `"model": "..."`

- `"provider": "bedrock"`

- `"model": "..."`

- `"provider": "openai"`

- `"model": "..."`

[Learn more](/ui/data-extractor).
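
For example, here is a minimal Python sketch that creates an Extract node with a small inline schema. The schema contents and the provider and model values are illustrative placeholders only, and the import path is an assumption that may vary by SDK version; note that `json_schema` must be passed as a single string:

```python
import json

# Import path assumed; adjust to your unstructured-client SDK version.
from unstructured_client.models.shared import WorkflowNode

# A small illustrative extraction schema. It must be serialized
# to a single string before being assigned to json_schema.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string", "description": "Vendor name"},
        "total": {"type": "number", "description": "Invoice total"}
    },
    "required": ["vendor", "total"],
    "additionalProperties": False
}

extractor_workflow_node = WorkflowNode(
    name="Extractor",
    subtype="llm",
    type="document_data_extractor",
    settings={
        "schema_to_extract": {
            # Specify json_schema or extraction_guidance, but not both.
            "json_schema": json.dumps(invoice_schema)
        },
        "provider": "openai",  # One of the allowed providers listed earlier.
        "model": "<model>"     # Substitute an allowed model name.
    }
)
```

To use a prompt instead of a schema, replace the `json_schema` entry with `extraction_guidance` set to a single prompt string.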

## List templates

To list templates, use the `UnstructuredClient` object's `templates.list_templates` function (for the Python SDK) or the `GET` method to call the `/templates` endpoint (for `curl` or Postman).
1 change: 1 addition & 0 deletions docs.json
@@ -120,6 +120,7 @@
"pages": [
"ui/document-elements",
"ui/partitioning",
"ui/data-extractor",
"ui/chunking",
{
"group": "Enriching",
Binary file added img/ui/data-extractor/house-plant-care.png
Binary file added img/ui/data-extractor/medical-invoice.png
Binary file added img/ui/data-extractor/real-estate-listing.png
Binary file added img/ui/data-extractor/schema-builder.png
30 changes: 30 additions & 0 deletions snippets/general-shared-text/data-extractor.mdx
@@ -0,0 +1,30 @@
1. In the **Schema settings** pane, for **Method**, choose one of the following to extract the source's data into a custom-defined, structured output format:

- Choose **LLM** to use a large language model (LLM).
- Choose **Regex** (or **JSONPath**) to use regular expressions (or JSONPath expressions).

2. If you chose **LLM** under **Method**, then continue with step 3 in this procedure.

If you chose **Regex** (or **JSONPath**) under **Method** instead, then skip ahead to step 6 in this procedure.

3. If you chose **LLM** under **Method**, then in the **Provider** and **Model** drop-down lists, choose the LLM provider and model that you want to use for the data extraction.
4. For **Extraction fields**, do one of the following:

- Choose **Suggested** to start with a set of fields that the selected LLM has suggested for the data extraction.
As needed, you can add, change, or delete any of these suggested fields, including their names, data types, descriptions, and relationships to other fields within the same schema.
- Choose **Prompt** to provide an AI prompt that the selected LLM uses to generate a set of suggested fields for the data extraction.
To generate the list of suggested fields, click **Generate schema** next to **Prompt**.
As needed, you can add, change, or delete any of these suggested fields, including their names, data types, descriptions, and relationships to other fields within the same schema.
- Choose **Create** to manually specify the set of fields for the selected LLM to use for the data extraction. You can specify each field's name, data type, description, and its relationships to other fields within the same schema.

5. Skip ahead to step 7 in this procedure.
6. If you chose **Regex** (or **JSONPath**) under **Method**, then do one of the following (a brief sketch of what these expressions match appears after this procedure):

- Choose **Suggested** to start with a set of fields that the default LLM has suggested for the data extraction.
As needed, you can add, change, or delete any of these suggested fields, including their names, regular expressions (or JSONPath expressions), and relationships to other fields within the same schema.
- Choose **Prompt** to provide an AI prompt that the default LLM uses to generate a set of suggested fields for the data extraction.
To generate the list of suggested fields, click **Generate schema** next to **Prompt**.
As needed, you can add, change, or delete any of these suggested fields, including their names, regular expressions (or JSONPath expressions), and relationships to other fields within the same schema.
- Choose **Create** to manually specify the set of fields for the data extraction. You can specify each field's name, regular expression (or JSONPath expression), and its relationships to other fields within the same schema.

7. Click **Run** to extract the source's data into the custom-defined, structured output format.
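
If you are new to regular expressions or JSONPath, the following standalone Python sketch illustrates the kinds of values such expressions match. The sample text, patterns, and the third-party `jsonpath-ng` package are illustrative assumptions only, unrelated to how the UI evaluates these expressions internally:

```python
import re

from jsonpath_ng import parse  # Third-party package: pip install jsonpath-ng

# A regex-based field might capture a phone number from raw text:
text = "Contact: (555) 123-4567"
match = re.search(r"\(\d{3}\) \d{3}-\d{4}", text)
print(match.group())  # (555) 123-4567

# A JSONPath-based field might pull a value out of structured JSON:
record = {"invoice": {"vendor": "Acme Medical", "total": 250.0}}
totals = [m.value for m in parse("$.invoice.total").find(record)]
print(totals)  # [250.0]
```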
245 changes: 244 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui-part-2.mdx
@@ -465,9 +465,252 @@ embedding model that is provided by an embedding provider. For the best embedding
6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
the workflow designer so that you can continue designing things later as you see fit.

## Step 7: Experiment with structured data extraction

In this step, you add custom [structured data extraction](/ui/data-extractor) to your workflow. Structured data extraction is the process by which Unstructured
automatically extracts data from your source documents into a format that you define up front. For example, in addition to partitioning
your source documents into elements with types such as `NarrativeText` and `UncategorizedText`, Unstructured can
output key information from those documents in a custom structured format: a `DocumentData` element that contains a JSON object with custom fields such as `name`, `address`, `phone`, and `email`.
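
As a rough, hypothetical illustration only (the exact element layout in your output may differ), the extracted values surface along these lines:

```python
# Hypothetical sketch of a DocumentData element; field names other than
# "extracted_data" are illustrative and may not match your actual output.
document_data_element = {
    "type": "DocumentData",
    "extracted_data": {
        "name": "Jane Doe",
        "address": "123 Main St.",
        "phone": "(555) 123-4567",
        "email": "jane.doe@example.com"
    }
}
```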

1. With the workflow designer active from the previous step, just before the **Destination** node, click the add (**+**) icon, and then click **Enrich > Extract**.

![Adding an extract node](/img/ui/walkthrough/AddExtract.png)

2. On the **Details** tab of the node's settings pane, in the **Schema** box, enter the following JSON schema (for a way to check a schema locally before you paste it, see the sketch after this procedure):

```json
{
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Full title of the research paper"
},
"authors": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Author's full name"
},
"affiliation": {
"type": "string",
"description": "Author's institutional affiliation"
},
"email": {
"type": "string",
"description": "Author's email address"
}
},
"required": [
"name",
"affiliation",
"email"
],
"additionalProperties": false
},
"description": "List of paper authors with their affiliations"
},
"abstract": {
"type": "string",
"description": "Paper abstract summarizing the research"
},
"introduction": {
"type": "string",
"description": "Introduction section describing the problem and motivation"
},
"methodology": {
"type": "object",
"properties": {
"approach_name": {
"type": "string",
"description": "Name of the proposed method (e.g., StrokeNet)"
},
"description": {
"type": "string",
"description": "Detailed description of the methodology"
},
"key_techniques": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of key techniques used in the approach"
}
},
"required": [
"approach_name",
"description",
"key_techniques"
],
"additionalProperties": false
},
"experiments": {
"type": "object",
"properties": {
"datasets": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Dataset name"
},
"description": {
"type": "string",
"description": "Dataset description"
},
"size": {
"type": "string",
"description": "Dataset size (e.g., number of sentence pairs)"
}
},
"required": [
"name",
"description",
"size"
],
"additionalProperties": false
},
"description": "Datasets used for evaluation"
},
"baselines": {
"type": "array",
"items": {
"type": "string"
},
"description": "Baseline methods compared against"
},
"evaluation_metrics": {
"type": "array",
"items": {
"type": "string"
},
"description": "Metrics used for evaluation"
},
"experimental_setup": {
"type": "string",
"description": "Description of experimental configuration and hyperparameters"
}
},
"required": [
"datasets",
"baselines",
"evaluation_metrics",
"experimental_setup"
],
"additionalProperties": false
},
"results": {
"type": "object",
"properties": {
"main_findings": {
"type": "string",
"description": "Summary of main experimental findings"
},
"performance_improvements": {
"type": "array",
"items": {
"type": "object",
"properties": {
"dataset": {
"type": "string",
"description": "Dataset name"
},
"metric": {
"type": "string",
"description": "Evaluation metric (e.g., BLEU)"
},
"baseline_score": {
"type": "number",
"description": "Baseline method score"
},
"proposed_score": {
"type": "number",
"description": "Proposed method score"
},
"improvement": {
"type": "number",
"description": "Improvement over baseline"
}
},
"required": [
"dataset",
"metric",
"baseline_score",
"proposed_score",
"improvement"
],
"additionalProperties": false
},
"description": "Performance improvements over baselines"
},
"parameter_reduction": {
"type": "string",
"description": "Description of parameter reduction achieved"
}
},
"required": [
"main_findings",
"performance_improvements",
"parameter_reduction"
],
"additionalProperties": false
},
"related_work": {
"type": "string",
"description": "Summary of related work and prior research"
},
"conclusion": {
"type": "string",
"description": "Conclusion section summarizing contributions and findings"
},
"limitations": {
"type": "string",
"description": "Limitations and challenges discussed in the paper"
},
"acknowledgments": {
"type": "string",
"description": "Acknowledgments section"
},
"references": {
"type": "array",
"items": {
"type": "string"
},
"description": "List of cited references"
}
},
"additionalProperties": false,
"required": [
"title",
"authors",
"abstract",
"introduction",
"methodology",
"experiments",
"results",
"related_work",
"conclusion",
"limitations",
"acknowledgments",
"references"
]
}
```

3. Immediately above the **Source** node, click **Test**.
4. In the **Test output** pane, make sure that **Extract (9 of 9)** is showing. If not, click the right arrow (**>**) until **Extract (9 of 9)** appears, which will show the output from the last node in the workflow.
5. To explore the structured data extraction, search for the text `"extracted_data"` (including the quotation marks).
6. When you are done, be sure to click the close (**X**) button above the output on the right side of the screen, to return to
the workflow designer so that you can continue designing things later as you see fit.
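
Before pasting a long schema like the one above, it can help to confirm locally that it is valid JSON and a well-formed JSON Schema. The following sketch assumes the third-party `jsonschema` package and a placeholder file name; note that it checks general JSON Schema validity, not the additional restrictions of the OpenAI Structured Outputs subset:

```python
import json

from jsonschema import Draft202012Validator  # pip install jsonschema

# Load the schema you plan to paste into the Schema box.
with open("paper_schema.json", "r", encoding="utf-8") as f:
    schema = json.load(f)  # Raises an error if the file is not valid JSON.

# Raises jsonschema.SchemaError if the schema itself is malformed.
Draft202012Validator.check_schema(schema)
print("Schema is valid JSON and a well-formed JSON Schema.")
```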

## Next steps

Congratulations! You now have an Unstructured workflow that partitions, enriches, chunks, embeds, and extracts structured data from your source documents, producing
context-rich data that is ready for retrieval-augmented generation (RAG), agentic AI, and model fine-tuning.

Right now, your workflow only accepts one local file at a time for input. Your workflow also only sends Unstructured's processed data to your screen or to be saved locally as a JSON file.
3 changes: 2 additions & 1 deletion snippets/general-shared-text/get-started-single-file-ui.mdx
@@ -116,6 +116,7 @@ You can also do the following:

What's next?

- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="code" />&nbsp;&nbsp;[Learn how to extract structured data in a custom format from your local file](/ui/data-extractor#use-the-structured-data-extractor-from-the-start-page).
- <Icon icon="plus" />&nbsp;&nbsp;[Learn how to add chunking, embeddings, custom structured data extraction, and additional enrichments to your local file results](/ui/walkthrough-2).
- <Icon icon="database" />&nbsp;&nbsp;[Learn how to do large-scale batch processing of multiple files and semi-structured data that are stored in remote locations instead](/ui/quickstart#remote-quickstart).
- <Icon icon="desktop" />&nbsp;&nbsp;[Learn more about the Unstructured user interface](/ui/overview).