2 changes: 1 addition & 1 deletion docs/docs/extraction/audio.md
@@ -27,7 +27,7 @@ to transcribe speech to text, which is then embedded by using the Nemotron embed

!!! important

-Due to limitations in available VRAM controls in the current release, the RIVA ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). For the full list of requirements, refer to [Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html).
+Due to limitations in available VRAM controls in the current release, the RIVA ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). For the full list of requirements, refer to [Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix/support-matrix.html).

This pipeline enables users to retrieve speech files at the segment level.

74 changes: 37 additions & 37 deletions docs/docs/extraction/benchmarking.md
@@ -35,20 +35,20 @@ Before you use this documentation, you need the following:
### Run Your First Test

```bash
-# 1. Navigate to the nemo-retriever-bench directory
+# 1. Navigate to the harness directory
cd tools/harness

# 2. Install dependencies
uv sync

# 3. Run with a pre-configured dataset (assumes services are running)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Or use a custom path that uses the "active" configuration
-uv run nemo-retriever-bench --case=e2e --dataset=/path/to/your/data
+uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data

# With managed infrastructure (starts/stops services)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```

## Configuration System
@@ -144,13 +144,13 @@ datasets:
**Usage:**
```bash
# Single dataset - configs applied automatically
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Multiple datasets (sweeping) - each gets its own config
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Custom path still works (uses active section config)
-uv run nemo-retriever-bench --case=e2e --dataset=/custom/path
+uv run nv-ingest-harness-run --case=e2e --dataset=/custom/path
```

**Dataset Extraction Settings:**
@@ -176,7 +176,7 @@ Example:
# YAML active section has api_version: v1
# Dataset bo767 has extract_images: false
# Override via environment variable (highest priority)
-EXTRACT_IMAGES=true API_VERSION=v2 uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_IMAGES=true API_VERSION=v2 uv run nv-ingest-harness-run --case=e2e --dataset=bo767
# Result: Uses bo767 path, but extract_images=true (env override) and api_version=v2 (env override)
```
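The three-layer precedence described above (environment variable over dataset config over the `active` YAML section) can be sketched as a simple dictionary merge. This is an illustrative reconstruction, not the harness's actual implementation; the setting names and env-var spellings are taken from the examples above:

```python
import os

def resolve_config(active: dict, dataset: dict, env_map: dict) -> dict:
    """Merge config layers: env var > dataset config > active YAML section."""
    merged = dict(active)            # lowest priority: active YAML section
    merged.update(dataset)           # dataset-specific settings override it
    for key, env_name in env_map.items():
        raw = os.environ.get(env_name)
        if raw is not None:          # highest priority: environment variables
            merged[key] = {"true": True, "false": False}.get(raw.lower(), raw)
    return merged

# Mirrors the example: active has api_version=v1, bo767 sets extract_images=false,
# and both settings are overridden via environment variables.
os.environ["EXTRACT_IMAGES"] = "true"
os.environ["API_VERSION"] = "v2"
cfg = resolve_config(
    {"api_version": "v1", "extract_images": True},    # active YAML section
    {"extract_images": False},                        # bo767 dataset config
    {"extract_images": "EXTRACT_IMAGES", "api_version": "API_VERSION"},
)
# cfg == {"api_version": "v2", "extract_images": True}
```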

@@ -240,13 +240,13 @@ Configuration is validated on load with helpful error messages.

```bash
# Run with default YAML configuration (assumes services are running)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# With document-level analysis
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --doc-analysis
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --doc-analysis

# With managed infrastructure (starts/stops services)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```

### Dataset Sweeping
@@ -255,21 +255,21 @@ Run multiple datasets in a single command - each dataset automatically gets its

```bash
# Sweep multiple datasets
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Each dataset runs sequentially with its own:
# - Extraction settings (from dataset config)
# - Artifact directory (timestamped per dataset)
# - Results summary at the end

# With managed infrastructure (services start once, shared across all datasets)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20 --managed

# E2E+Recall sweep (each dataset ingests then evaluates recall)
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings

# Recall-only sweep (evaluates existing collections)
-uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
```

**Sweep Behavior:**
@@ -283,10 +283,10 @@ uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings

```bash
# Override via environment (useful for CI/CD)
-API_VERSION=v2 EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e
+API_VERSION=v2 EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e

# Temporary changes without editing YAML
-DATASET_DIR=/custom/path uv run nemo-retriever-bench --case=e2e
+DATASET_DIR=/custom/path uv run nv-ingest-harness-run --case=e2e
```

## Test Scenarios
@@ -472,23 +472,23 @@ recall:
```bash
# Evaluate existing bo767 collections (no reranker)
# recall_dataset automatically set from dataset config
-uv run nemo-retriever-bench --case=recall --dataset=bo767
+uv run nv-ingest-harness-run --case=recall --dataset=bo767

# With reranker only (set reranker_mode in YAML recall section)
-uv run nemo-retriever-bench --case=recall --dataset=bo767
+uv run nv-ingest-harness-run --case=recall --dataset=bo767

# Sweep multiple datasets for recall evaluation
-uv run nemo-retriever-bench --case=recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
```

**E2E + Recall (fresh ingestion):**
```bash
# Fresh ingestion with recall evaluation
# recall_dataset automatically set from dataset config
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767

# Sweep multiple datasets (each ingests then evaluates)
-uv run nemo-retriever-bench --case=e2e_recall --dataset=bo767,earnings
+uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings
```

**Dataset configuration:**
@@ -536,7 +536,7 @@ The easiest way to test multiple datasets is using dataset sweeping:

```bash
# Test multiple datasets - each gets its native config automatically
-uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Each dataset runs with its pre-configured extraction settings
# Results are organized in separate artifact directories
@@ -547,26 +547,26 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767,earnings,bo20
To sweep through different parameter values:

1. **Edit** `test_configs.yaml` - Update values in the `active` section
-2. **Run** the test: `uv run nemo-retriever-bench --case=e2e --dataset=<name>`
+2. **Run** the test: `uv run nv-ingest-harness-run --case=e2e --dataset=<name>`
3. **Analyze** results in `artifacts/<test_name>_<timestamp>/`
4. **Repeat** steps 1-3 for next parameter combination

Example parameter sweep workflow:
```bash
# Test 1: Baseline V1
vim test_configs.yaml # Set: api_version=v1, extract_tables=true
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 2: V2 with 32-page splitting
vim test_configs.yaml # Set: api_version=v2, pdf_split_page_count=32
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 3: V2 with 8-page splitting
vim test_configs.yaml # Set: pdf_split_page_count=8
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 4: Tables disabled (override via env var)
-EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Note**: Each test run creates a new timestamped artifact directory, so you can compare results across sweeps.
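The edit-and-rerun loop above can also be scripted rather than driven by hand-editing the YAML. A minimal sketch, assuming the env-var override names shown earlier; building the `(env, argv)` pairs is shown here, and each pair could then be passed to `subprocess.run(argv, env=env, check=True)`:

```python
import os

def sweep_runs(dataset: str, sweeps: list) -> list:
    """Build one (env, argv) pair per parameter combination for a sweep."""
    argv = ["uv", "run", "nv-ingest-harness-run", "--case=e2e", f"--dataset={dataset}"]
    # Env overrides take highest priority, so no YAML edits are needed between runs.
    return [({**os.environ, **overrides}, argv) for overrides in sweeps]

runs = sweep_runs("bo767", [
    {"API_VERSION": "v1", "EXTRACT_TABLES": "true"},   # baseline v1
    {"API_VERSION": "v2", "EXTRACT_TABLES": "false"},  # v2, tables disabled
])
```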
@@ -576,7 +576,7 @@ EXTRACT_TABLES=false uv run nemo-retriever-bench --case=e2e --dataset=bo767
### Attach Mode (Default)

```bash
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

- **Default behavior**: Assumes services are already running
@@ -588,7 +588,7 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767
### Managed Mode

```bash
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```

- Starts Docker services automatically
@@ -600,10 +600,10 @@ uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed
**Managed mode options:**
```bash
# Skip Docker image rebuild (faster startup)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed --no-build
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --no-build

# Keep services running after test (useful for multi-test scenarios)
-uv run nemo-retriever-bench --case=e2e --dataset=bo767 --managed --keep-up
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --keep-up
```
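Managed-mode semantics (start services, run the case, tear down unless `--keep-up`) map naturally onto a context manager. The sketch below is purely illustrative of that lifecycle, with list appends standing in for the actual Docker operations:

```python
from contextlib import contextmanager

events = []  # records the service lifecycle for illustration

@contextmanager
def managed_services(keep_up: bool = False):
    """Sketch of managed mode: start services, yield to the test, stop unless keep_up."""
    events.append("compose up")            # stands in for starting Docker services
    try:
        yield
    finally:
        if not keep_up:
            events.append("compose down")  # teardown is skipped when --keep-up is passed

with managed_services(keep_up=False):
    events.append("run e2e case")          # the benchmark runs while services are up
```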

## Artifacts and Logging
@@ -631,7 +631,7 @@ tools/harness/artifacts/<test_name>_<timestamp>_UTC/
Enable per-document element breakdown:

```bash
-uv run nemo-retriever-bench --case=e2e --doc-analysis
+uv run nv-ingest-harness-run --case=e2e --doc-analysis
```

**Sample Output:**
@@ -812,7 +812,7 @@ The framework is dataset-agnostic and supports multiple approaches:
**Option 1: Use pre-configured dataset (Recommended)**
```bash
# Dataset configs automatically applied
-uv run nemo-retriever-bench --case=e2e --dataset=bo767
+uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Option 2: Add new dataset to YAML**
@@ -827,17 +827,17 @@ datasets:
extract_infographics: false
recall_dataset: null # or set to evaluator name if applicable
```
-Then use: `uv run nemo-retriever-bench --case=e2e --dataset=my_dataset`
+Then use: `uv run nv-ingest-harness-run --case=e2e --dataset=my_dataset`

**Option 3: Use custom path (uses active section config)**
```bash
-uv run nemo-retriever-bench --case=e2e --dataset=/path/to/your/dataset
+uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/dataset
```

**Option 4: Environment variable override**
```bash
# Override specific settings via env vars
-EXTRACT_IMAGES=true uv run nemo-retriever-bench --case=e2e --dataset=bo767
+EXTRACT_IMAGES=true uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Best Practice**: For repeated testing, add your dataset to the `datasets` section with its native extraction settings. This ensures consistent configuration and enables dataset sweeping.
2 changes: 1 addition & 1 deletion docs/docs/extraction/cli-reference.md
@@ -203,7 +203,7 @@ nemo-retriever \
To submit a .pdf file with both a splitting task and an extraction task, run the following code.

!!! note
-    Currently, `split` only works for pdfium, nemotron-parse, and Unstructured.io.
+    Currently, `split` only works for pdfium and nemotron-parse.

```bash
nemo-retriever \
6 changes: 3 additions & 3 deletions docs/docs/extraction/content-metadata.md
@@ -164,7 +164,7 @@ Describes the structural location of content within a document.
| `span` | `int` | `-1` | Span identifier within a line, for finer granularity. |
| `nearby_objects` | `NearbyObjectsSchema` | `NearbyObjectsSchema()` | Information about objects (text, images, structured data) near the current content. See [NearbyObjectsSchema](#nearbyobjectsschema). |

-### `NearbyObjectsSchema` (Currently Unused)
+### `NearbyObjectsSchema` (Currently Unused) {#nearbyobjectsschema}
Container for different types of nearby objects.

| Field | Type | Default Value | Description |
@@ -243,7 +243,7 @@ Specific metadata for audio content.
| `audio_transcript` | `str` | `""` | Transcript of the audio content. |
| `audio_type` | `str` | `""` | Type or format of the audio (e.g., `mp3`, `wav`). |

-### `ErrorMetadataSchema` (Currently Unused)
+### `ErrorMetadataSchema` (Currently Unused) {#errormetadataschema}
Metadata describing errors encountered during processing.

| Field | Type | Default Value | Description |
@@ -253,7 +253,7 @@ Metadata describing errors encountered during processing.
| `source_id` | `str` | `""` | Identifier of the source item that caused the error, if applicable. |
| `error_msg` | `str` | *Required* | The error message. |

-### `InfoMessageMetadataSchema` (Currently Unused)
+### `InfoMessageMetadataSchema` (Currently Unused) {#infomessagemetadataschema}
Informational messages related to processing.

| Field | Type | Default Value | Description |
22 changes: 11 additions & 11 deletions docs/docs/extraction/custom-metadata.md
@@ -60,7 +60,7 @@ For more information about the `Ingestor` class, see [Use the NeMo Retriever Lib
For more information about the `vdb_upload` method, see [Upload Data](data-store.md).

```python
-from nemo_retriever.client import Ingestor
+from nv_ingest_client.client.interface import Ingestor

hostname="localhost"
collection_name = "nemo_retriever_collection"
@@ -142,7 +142,7 @@ you can use the `content_metadata` field to filter search results.
The following example uses a filter expression to narrow results by department.

```python
-from nemo_retriever.util.milvus import query
+from nv_ingest_client.util.vdb.milvus import nvingest_retrieval

hostname="localhost"
collection_name = "nemo_retriever_collection"
@@ -156,15 +156,15 @@ queries = ["this is expensive"]
q_results = []
for que in queries:
q_results.append(
-query(
-[que],
-collection_name,
-milvus_uri=f"http://{hostname}:19530",
-embedding_endpoint=f"http://{hostname}:8012/v1",
-hybrid=sparse,
-top_k=top_k,
-model_name=model_name,
-gpu_search=False,
+nvingest_retrieval(
+[que],
+collection_name=collection_name,
+milvus_uri=f"http://{hostname}:19530",
+embedding_endpoint=f"http://{hostname}:8012/v1",
+hybrid=sparse,
+top_k=top_k,
+model_name=model_name,
+gpu_search=False,
_filter=filter_expr
)
)
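The `filter_expr` passed to the retrieval call above is a Milvus boolean expression over the `content_metadata` JSON field. A minimal sketch of building one; the helper name, field path, and values are illustrative, not part of the client library:

```python
def metadata_filter(field: str, value: str) -> str:
    """Build a Milvus boolean filter over a key in the content_metadata JSON field."""
    escaped = value.replace('"', '\\"')  # naive escaping, sufficient for this sketch
    return f'content_metadata["{field}"] == "{escaped}"'

# For example, narrowing results by department as in the surrounding docs:
filter_expr = metadata_filter("department", "finance")
# filter_expr == 'content_metadata["department"] == "finance"'
```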
3 changes: 1 addition & 2 deletions docs/docs/extraction/faq.md
@@ -76,12 +76,11 @@ For more information, refer to [Extract Specific Elements from PDFs](python-api-
```python
Ingestor(client=client)
.files("data/multimodal_test.pdf")
-    .extract(
+.extract(
extract_text=True,
extract_tables=True,
extract_charts=True,
extract_images=True,
-paddle_output_format="markdown",
extract_infographics=True,
text_depth="page"
)