Merged · 28 commits
10 changes: 5 additions & 5 deletions README.md
@@ -13,16 +13,16 @@

## 📖 Description

With `extrai`, you can extract data from text documents with LLMs, which will be formatted into a given `SQLModel` and registered in your database.
`extrai` extracts data from text documents using LLMs, formatting the output into a given `SQLModel` and registering it in a database.

The core of the library is its [Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html). We make the same request multiple times, using the same or different providers, and then select the values that meet a certain threshold.
The library utilizes a [Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html) to ensure accuracy. It makes the same request multiple times, using the same or different providers, and then selects the values that meet a configured threshold.

`extrai` also has other features, like [generating `SQLModel`s](https://docs.extrai.xyz/how_to/generate_sql_model.html) from a prompt and documents, and [generating few-shot examples](https://docs.extrai.xyz/how_to/generate_example_json.html). For complex, nested data, the library offers [Hierarchical Extraction](https://docs.extrai.xyz/how_to/handle_complex_data_with_hierarchical_extraction.html), breaking down the extraction into manageable, hierarchical steps. It also includes [built-in analytics](https://docs.extrai.xyz/analytics_collector.html) to monitor performance and output quality.

## ✨ Key Features

- **[Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html)**: Improves extraction accuracy by consolidating multiple LLM outputs.
- **[Dynamic SQLModel Generation](https://docs.extrai.xyz/sqlmodel_generator.html)**: Generate `SQLModel` schemas from natural language descriptions.
- **[Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html)**: Consolidates multiple LLM outputs to improve extraction accuracy.
- **[Dynamic SQLModel Generation](https://docs.extrai.xyz/sqlmodel_generator.html)**: Generates `SQLModel` schemas from natural language descriptions.
- **[Hierarchical Extraction](https://docs.extrai.xyz/how_to/handle_complex_data_with_hierarchical_extraction.html)**: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.
- **[Extensible LLM Support](https://docs.extrai.xyz/llm_providers.html)**: Integrates with various LLM providers through a client interface.
- **[Built-in Analytics](https://docs.extrai.xyz/analytics_collector.html)**: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.
@@ -59,7 +59,7 @@ For a complete guide, please see the full documentation. Here are the key sectio
- **Community**
- [Contributing Guide](https://docs.extrai.xyz/contributing.html)

## ⚙️ Worflow Overview
## ⚙️ Workflow Overview

The library is built around a few key components that work together to manage the extraction workflow. The following diagram illustrates the high-level workflow (see [Architecture Overview](https://docs.extrai.xyz/concepts/architecture_overview.html)):

120 changes: 120 additions & 0 deletions docs/advanced/batch_deep_dive.md
@@ -0,0 +1,120 @@
# Batch Processing Deep Dive

Batch processing in `extrai` is designed for high-volume extraction tasks where immediate results are not required and cost optimization is a priority. It leverages the "Batch API" features of LLM providers (like OpenAI's Batch API) to process requests asynchronously at a lower cost (often 50% cheaper).

## The Batch State Machine

The batch pipeline is managed by a robust state machine that transitions a job through various phases. This ensures that even long-running jobs can be tracked, resumed, and recovered in case of failures.

### States

The `BatchJobStatus` enum defines the possible states:

* **SUBMITTED**: The initial extraction request has been sent to the LLM provider.
* **PROCESSING**: The provider is currently running the batch.
* **READY_TO_PROCESS**: The provider has finished, and the results are downloaded but not yet processed by `extrai`.
* **COMPLETED**: All results have been processed, consensus run, and objects hydrated.
* **FAILED**: The batch job failed at the provider or during local processing.
* **CANCELLED**: The job was manually cancelled.

#### Counting Phase States

If entity counting is enabled, the job goes through a "pre-flight" counting phase:

* **COUNTING_SUBMITTED**: The counting request is with the provider.
* **COUNTING_PROCESSING**: Counting is running.
* **COUNTING_READY_TO_PROCESS**: Counts are ready to be used for generating the main extraction prompts.
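
Taken together, the states above could be rendered as a Python enum roughly like the following (a sketch for illustration; the actual `BatchJobStatus` definition and its values live in `extrai` and may differ):

```python
from enum import Enum

class BatchJobStatus(str, Enum):
    # Counting "pre-flight" phase (only when entity counting is enabled)
    COUNTING_SUBMITTED = "counting_submitted"
    COUNTING_PROCESSING = "counting_processing"
    COUNTING_READY_TO_PROCESS = "counting_ready_to_process"
    # Main extraction phase
    SUBMITTED = "submitted"
    PROCESSING = "processing"
    READY_TO_PROCESS = "ready_to_process"
    # Terminal states
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
```

Subclassing `str` keeps the states comparable to plain strings, which is convenient when status values round-trip through a database column.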

### State Transition Diagram

```mermaid
stateDiagram-v2
    [*] --> COUNTING_SUBMITTED: count_entities=True
    [*] --> SUBMITTED: count_entities=False

    COUNTING_SUBMITTED --> COUNTING_PROCESSING
    COUNTING_PROCESSING --> COUNTING_READY_TO_PROCESS
    COUNTING_READY_TO_PROCESS --> SUBMITTED: Generate Extraction Prompts

    SUBMITTED --> PROCESSING
    PROCESSING --> READY_TO_PROCESS
    READY_TO_PROCESS --> COMPLETED: Consensus & Hydration

    PROCESSING --> FAILED
    SUBMITTED --> FAILED
    COUNTING_PROCESSING --> FAILED

    FAILED --> [*]
    COMPLETED --> [*]
```

## Production Workflows

### Submission & Polling

The `WorkflowOrchestrator.synthesize_batch` method handles the lifecycle. You can use it in a blocking or non-blocking way.

**Blocking (Simplest):**
```python
results = await orchestrator.synthesize_batch(
    input_strings=docs,
    wait_for_completion=True
)
```
This will poll the provider internally, handle transitions from counting to extraction, and return the final objects.

**Non-Blocking (Async):**
```python
# Submit
batch_id = await orchestrator.synthesize_batch(
    input_strings=docs,
    wait_for_completion=False
)

# Later... check status
current_status = await orchestrator.get_batch_status(batch_id, db_session)
if current_status.status == BatchJobStatus.COMPLETED:
    results = await orchestrator.get_batch_results(batch_id)
```

### Error Recovery & Resuming

If your application crashes while a batch is running, you don't lose the job. The `root_batch_id` is the key to recovery.

**Resuming Monitoring (e.g. after script restart)**

If the batch job is still active or completed at the provider, but your script stopped monitoring it, simply call `monitor_batch_job` again:

```python
# Resume monitoring
results = await orchestrator.monitor_batch_job(
    root_batch_id="batch_123_abc",
    db_session=session,
    poll_interval=60
)
```
This method inspects the current state (e.g., `COUNTING_SUBMITTED`, `READY_TO_PROCESS`) and automatically picks up where it left off, handling transitions between phases.

**Retrying or Extending a Batch**

If a batch failed or if you want to extend a completed workflow (e.g., adding more hierarchical steps), use `create_continuation_batch`. This creates a *new* batch job that copies the successful steps from the old one, saving time and money.

```python
# Continue from step 2
new_batch_id = await orchestrator.create_continuation_batch(
    original_batch_id="failed_batch_id",
    db_session=session,
    start_from_step_index=2,  # Copy steps 0 and 1, restart from 2
    wait_for_completion=True
)
```

## Hierarchical Batches & Shallow Schemas

When using `use_hierarchical_extraction=True` with batches, the process involves multiple "hops".

1. **Level 1 Extraction**: The root object is extracted.
2. **Shallow Schema Generation**: For lists of children, `extrai` generates "Shallow Schemas" (just IDs and essential fields) to keep the context window small.
3. **Child Batches**: New batch jobs are spawned for each child entity to extract full details.

This complex coordination is handled automatically by the `BatchPipeline`.
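
To illustrate the shallow-schema idea, here is a minimal sketch of a full child model next to its shallow counterpart. All model and field names are invented for the example, and plain dataclasses stand in for `SQLModel` classes to keep the sketch dependency-free:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Invoice:
    """Hypothetical full child model with all extracted fields."""
    id: Optional[int]
    number: str
    total: float
    notes: str

@dataclass
class InvoiceShallow:
    """Shallow counterpart: just the ID and essential fields."""
    id: Optional[int]
    number: str

def to_shallow(invoice: Invoice) -> InvoiceShallow:
    # Project the full record down to its shallow form so that
    # Level 1 prompts stay small; full details are extracted later
    # in the per-child batches.
    return InvoiceShallow(id=invoice.id, number=invoice.number)
```

The shallow form gives the parent extraction enough information to link children by ID without paying the context-window cost of every field.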
15 changes: 15 additions & 0 deletions docs/analytics_collector.rst
@@ -57,6 +57,19 @@ These methods track the health of LLM API calls and output processing.
   # To record an error when the LLM's JSON output fails schema validation
   analytics_collector.record_llm_output_validation_error()

**Token Usage Tracking**

The collector automatically aggregates token usage from supported LLM providers (e.g., OpenAI, Gemini).

.. code-block:: python

   # This is typically called internally by the LLM client
   analytics_collector.record_llm_usage(input_tokens=150, output_tokens=50)

   # Access totals
   print(f"Total Input: {analytics_collector.total_input_tokens}")
   print(f"Total Output: {analytics_collector.total_output_tokens}")

**Consensus Run Details**

This method is used to log the outcome of a consensus process. It takes a dictionary of metrics, usually generated by the ``JSONConsensus`` component.
@@ -137,6 +150,8 @@ A report provides a detailed summary of the entire workflow.
       "llm_output_parse_errors": 2,
       "llm_output_validation_errors": 1,
       "total_invalid_parsing_errors": 3,
       "total_input_tokens": 1500,
       "total_output_tokens": 450,
       "number_of_consensus_runs": 1,
       "hydrated_objects_successes": 10,
       "hydration_failures": 0,