72 changes: 54 additions & 18 deletions docs/lexical-graph/configuration.md
The lexical-graph also allows you to set the logging level and apply logging filters.

### GraphRAGConfig

`GraphRAGConfig` is a module-level singleton (not a class to instantiate). It is created once at import time ([`config.py`](../../lexical-graph/src/graphrag_toolkit/lexical_graph/config.py#L1171)) and shared across the process. Set attributes directly on the imported object:

```python
from graphrag_toolkit.lexical_graph import GraphRAGConfig

GraphRAGConfig.aws_region = 'eu-west-1'
GraphRAGConfig.extraction_llm = 'anthropic.claude-3-5-sonnet-20241022-v2:0'
```

Setting `aws_profile` or `aws_region` automatically clears all cached boto3 clients.

**Important**: Change configuration values early in your code, before creating any graph store or vector store.

The configuration includes the following parameters:

| Parameter | Description | Default | Environment Variable |
| ------------- | ------------- | ------------- | ------------- |
| `build_batch_size` | The number of input nodes to be processed in parallel across all workers in the build stage | `4` | `BUILD_BATCH_SIZE` |
| `build_batch_write_size` | The number of elements to be written in a bulk operation to the graph and vector stores (see [Batch writes](#batch-writes)) | `25` | `BUILD_BATCH_WRITE_SIZE` |
| `batch_writes_enabled` | Determines whether, on a per-worker basis, to write all elements (nodes and edges, or vectors) emitted by a batch of input nodes as a bulk operation, or singly, to the graph and vector stores (see [Batch writes](#batch-writes)) | `True` | `BATCH_WRITES_ENABLED` |
| `include_domain_labels` | Determines whether entities will have a domain-specific label (e.g. `Company`) as well as the [graph model's](./graph-model.md#entity-relationship-tier) `__Entity__` label | `False` | `INCLUDE_DOMAIN_LABELS` |
| `include_local_entities` | Whether to include local-context entities in the graph | `False` | `INCLUDE_LOCAL_ENTITIES` |
| `include_classification_in_entity_id` | Whether to include an entity's classification in its graph node id | `True` | `INCLUDE_CLASSIFICATION_IN_ENTITY_ID` |
| `enable_versioning` | Whether to enable versioned updates (see [Versioned Updates](./versioned-updates.md)) | `False` | `ENABLE_VERSIONING` |
| `enable_cache` | Determines whether the results of LLM calls to models on Amazon Bedrock are cached to the local filesystem (see [Caching Amazon Bedrock LLM responses](#caching-amazon-bedrock-llm-responses)) | `False` | `ENABLE_CACHE` |
| `aws_profile` | AWS CLI named profile used to authenticate requests to Bedrock and other services | *None* | `AWS_PROFILE` |
| `aws_region` | AWS region used to scope Bedrock service calls | *Default boto3 session region* | `AWS_REGION` |

The following parameters configure the rerankers used by query retrievers:

| Parameter | Description | Default | Environment Variable |
| ------------- | ------------- | ------------- | ------------- |
| `reranking_model` | Local reranker model (mixedbread-ai) | `mixedbread-ai/mxbai-rerank-xsmall-v1` | `RERANKING_MODEL` |
| `bedrock_reranking_model` | Amazon Bedrock reranker model | `cohere.rerank-v3-5:0` | `BEDROCK_RERANKING_MODEL` |

The following parameter applies only when using Amazon OpenSearch Serverless as a vector store:

| Parameter | Description | Default | Environment Variable |
| ------------- | ------------- | ------------- | ------------- |
| `opensearch_engine` | OpenSearch kNN engine | `nmslib` | `OPENSEARCH_ENGINE` |

To set a configuration parameter in your application code:

```python
from graphrag_toolkit.lexical_graph import GraphRAGConfig

GraphRAGConfig.response_llm = 'anthropic.claude-3-haiku-20240307-v1:0'
GraphRAGConfig.extraction_num_workers = 4
```

You can also set any of these via environment variables using the variable names in the tables above.
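As a sketch, setting the variables programmatically before the toolkit reads its configuration has the same effect (the specific variables used here are illustrative):

```python
import os

# Environment variables must be in place before graphrag_toolkit initialises
# its configuration, so set them at process start (or in the launching shell).
os.environ['EXTRACTION_NUM_WORKERS'] = '4'
os.environ['BUILD_BATCH_WRITE_SIZE'] = '50'
os.environ['AWS_REGION'] = 'us-east-1'
```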

#### LLM configuration

The `extraction_llm` and `response_llm` configuration parameters accept three different types of value. For example, you can pass a JSON string representation of the model configuration:

```
{
"model": "anthropic.claude-3-7-sonnet-20250219-v1:0",
"temperature": 0.0,
"max_tokens": 4096
}
```
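Rather than writing the JSON by hand, you can build it as a dictionary and serialise it before assigning it (a sketch; the assignment to `GraphRAGConfig` is shown as a comment):

```python
import json

# Build the model configuration programmatically, then serialise it to the
# JSON string form that the extraction_llm parameter accepts.
llm_config = {
    "model": "anthropic.claude-3-7-sonnet-20250219-v1:0",
    "temperature": 0.0,
    "max_tokens": 4096
}
llm_config_str = json.dumps(llm_config)

# GraphRAGConfig.extraction_llm = llm_config_str
```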

The `embed_model` configuration parameter accepts three different types of value:

- You can pass an instance of a LlamaIndex `BaseEmbedding` object. This allows you to configure the lexical-graph for embedding backends other than Amazon Bedrock.
- You can pass the model name of an Amazon Bedrock model. For example: `amazon.titan-embed-text-v1`.
- You can pass a JSON string representation of a LlamaIndex `BedrockEmbedding` instance. For example:

```
{
"model_name": "amazon.titan-embed-text-v2:0"
}
```


When configuring an embedding model, you must also set the `embed_dimensions` configuration parameter to match the model's output dimensions. For example:

```python
GraphRAGConfig.embed_model = '{"model_name": "amazon.titan-embed-text-v2:0"}'
GraphRAGConfig.embed_dimensions = 512
```

#### Batch writes

The `graphrag_toolkit` provides two methods for configuring logging in your applications.

#### set_logging_config

The `set_logging_config` method allows you to configure logging with a basic set of options, such as logging level and module filters. Wildcards are supported for module names, and you can pass either a single string or a list of strings for included or excluded modules. You can optionally provide a `filename` to write log output to a file in addition to stdout. For example:

```python
from graphrag_toolkit.lexical_graph import set_logging_config

set_logging_config(
logging_level='DEBUG', # or logging.DEBUG
debug_include_modules='graphrag_toolkit.lexical_graph.storage', # single string or list of strings
debug_exclude_modules=['opensearch', 'boto'], # single string or list of strings
filename='output.log' # optional: also write logs to a file
)
```

#### set_advanced_logging_config

The `set_advanced_logging_config` method provides more advanced logging configuration options, including the ability to specify filters for included and excluded modules or messages based on logging levels. Wildcards are supported for module names and included messages, and you can pass either a single string or a list of strings for modules or messages. This method offers greater flexibility and control over the logging behavior.

##### Parameters

| Parameter | Type | Description | Default |
| ------------- | ------------- | ------------- | ------------- |
| `excluded_modules` | `dict[int, str \| list[str]]` | Modules to exclude from logging, grouped by logging level. Wildcards are supported. | `None` |
| `included_messages` | `dict[int, str \| list[str]]` | Specific messages to include in logging, grouped by logging level. Wildcards are supported. | `None` |
| `excluded_messages` | `dict[int, str \| list[str]]` | Specific messages to exclude from logging, grouped by logging level. | `None` |
| `filename` | `str` | If provided, log output is also written to this file in addition to stdout. | `None` |

##### Example Usage

```
export AWS_PROFILE=padmin
export AWS_REGION=us-east-1
```

If no profile or region is set explicitly, the system falls back to environment variables or the default AWS CLI configuration.

See [Using AWS Profiles in `GraphRAGConfig`](./aws-profile.md) for more details on configuring and using AWS named profiles.

#### Resilient clients and SSO token refresh

All boto3 clients created by `GraphRAGConfig` are wrapped in a `ResilientClient` ([`config.py:94`](https://github.com/awslabs/graphrag-toolkit/blob/main/lexical-graph/src/graphrag_toolkit/lexical_graph/config.py#L94)). On `ExpiredToken`, `RequestExpired`, or `InvalidClientTokenId` errors, the client is refreshed automatically and the call is retried.
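The behaviour can be pictured with the following sketch (illustrative only — the toolkit's actual wrapper in `config.py` handles real botocore exceptions and client construction):

```python
# Sketch of the refresh-and-retry pattern: on a recognised credential error,
# rebuild the underlying client and retry the call once.
RETRYABLE_CODES = {'ExpiredToken', 'RequestExpired', 'InvalidClientTokenId'}

class CredentialError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.code = code

class ResilientClientSketch:
    def __init__(self, client_factory):
        self._factory = client_factory
        self._client = client_factory()

    def call(self, method_name, *args, **kwargs):
        try:
            return getattr(self._client, method_name)(*args, **kwargs)
        except CredentialError as e:
            if e.code not in RETRYABLE_CODES:
                raise
            self._client = self._factory()  # refresh the client
            return getattr(self._client, method_name)(*args, **kwargs)

# A fake client that fails once with an expired token, then succeeds,
# to demonstrate the transparent refresh-and-retry behaviour.
class FakeBedrockClient:
    def __init__(self, state):
        self._state = state

    def list_models(self):
        if not self._state['expired_seen']:
            self._state['expired_seen'] = True
            raise CredentialError('ExpiredToken')
        return ['model-a']

state = {'expired_seen': False}
client = ResilientClientSketch(lambda: FakeBedrockClient(state))
result = client.call('list_models')  # fails, refreshes, then succeeds
```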

When an AWS SSO profile is in use, the client wrapper also validates the SSO token age. If the token is more than one hour old, it runs `aws sso login` automatically before retrying. This is relevant for long-running indexing jobs and any environment where SSO sessions can expire mid-run.
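A sketch of the staleness check (the token cache path and the exact refresh command shown are assumptions — the real logic lives in `config.py`):

```python
import os
import time

MAX_TOKEN_AGE_SECONDS = 3600  # SSO tokens older than one hour trigger a refresh

def sso_token_is_stale(token_path, now=None):
    """Return True if the cached SSO token file is more than an hour old."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(token_path)) > MAX_TOKEN_AGE_SECONDS

# When the token is stale, the toolkit re-authenticates before retrying,
# conceptually:
#   subprocess.run(['aws', 'sso', 'login', '--profile', profile], check=True)
```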
6 changes: 6 additions & 0 deletions docs/lexical-graph/faq.md

---

#### Importing the package patches llama_index async internals

When you import `graphrag_toolkit.lexical_graph`, the package patches `llama_index.core.async_utils.asyncio_run` unconditionally ([`__init__.py:34`](https://github.com/awslabs/graphrag-toolkit/blob/main/lexical-graph/src/graphrag_toolkit/lexical_graph/__init__.py#L34)). The patch makes LlamaIndex's internal async runner work inside Jupyter notebooks by re-using the existing event loop instead of creating a new one. If no running loop is found, it falls back to `asyncio.run()`. This can interact unexpectedly with other code using LlamaIndex in the same process, particularly if that code relies on `asyncio_run` starting a clean event loop. There is currently no opt-out.
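The patch's behaviour is roughly the following (an illustrative sketch, not the actual patch code):

```python
import asyncio

def run_sketch(coro):
    """Prefer an already-running loop (as in Jupyter); otherwise start one."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain scripts, worker processes):
        # safe to create a fresh one.
        return asyncio.run(coro)
    # A loop is already running (e.g. a notebook cell). Calling asyncio.run()
    # here would raise RuntimeError; the real patch instead drives the existing
    # loop to completion. Scheduling the coroutine keeps this sketch short.
    return asyncio.ensure_future(coro)

async def sample():
    await asyncio.sleep(0)
    return 42

result = run_sketch(sample())  # outside a loop, equivalent to asyncio.run()
```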

---

#### WARNING:graph_store:Retrying query in x seconds because it raised ConcurrentModificationException

While indexing data in Amazon Neptune Database, Neptune can sometimes issue a `ConcurrentModificationException`. This occurs because multiple workers are attempting to [update the same set of vertices](https://docs.aws.amazon.com/neptune/latest/userguide/transactions-exceptions.html). The GraphRAG Toolkit automatically retries transactions that are cancelled because of a `ConcurrentModificationException`. If the maximum number of retries is exceeded and the indexing fails, consider reducing the number of workers in the build stage using [`GraphRAGConfig.build_num_workers`](./configuration.md#graphragconfig).
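For example, to serialise build writes entirely (a configuration sketch):

```python
from graphrag_toolkit.lexical_graph import GraphRAGConfig

# Fewer build workers means fewer concurrent updates to the same vertices.
GraphRAGConfig.build_num_workers = 1
```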
Expand Down
33 changes: 30 additions & 3 deletions docs/lexical-graph/indexing.md
The `IndexingConfig` object has the following parameters:

| Parameter | Description | Default Value |
| ------------- | ------------- | ------------- |
| `chunking` | A list of node parsers (e.g. LlamaIndex `SentenceSplitter`) to be used for chunking source documents. Set `chunking` to `None` to skip chunking. | `SentenceSplitter` with `chunk_size=256` and `chunk_overlap=25` |
| `extraction` | An `ExtractionConfig` object specifying extraction options | `ExtractionConfig` with default values |
| `build` | A `BuildConfig` object specifying build options | `BuildConfig` with default values |
| `batch_config` | Batch configuration to be used if performing [batch extraction](./batch-extraction.md). If `batch_config` is `None`, the toolkit will perform chunk-by-chunk extraction. | `None` |

The `ExtractionConfig` object has the following parameters:

| Parameter | Description | Default Value |
| ------------- | ------------- | ------------- |
| `enable_proposition_extraction` | Perform proposition extraction before extracting topics, statements, facts and entities | `True` |
| `preferred_entity_classifications` | Comma-separated list of preferred entity classifications used to seed the entity extraction | `DEFAULT_ENTITY_CLASSIFICATIONS` |
| `preferred_topics` | List of preferred topic names (or a callable that returns them) supplied to the LLM to seed topic extraction. Accepts the same type as `preferred_entity_classifications`. | `[]` |
| `infer_entity_classifications` | Determines whether to pre-process documents to identify significant domain entity classifications. Supply either `True` or `False`, or an `InferClassificationsConfig` object. When `True`, an `InferClassifications` step runs as a **pre-processor** before the main extraction loop — one extra LLM round-trip per batch, not per document. | `False` |
| `extract_propositions_prompt_template` | Prompt used to extract propositions from chunks. If `None`, the [default extract propositions template](https://github.com/awslabs/graphrag-toolkit/blob/main/lexical-graph/src/graphrag_toolkit/lexical_graph/indexing/prompts.py#L29-L72) is used. See [Custom prompts](#custom-prompts) below. | `None` |
| `extract_topics_prompt_template` | Prompt used to extract topics, statements and entities from chunks. If `None`, the [default extract topics template](https://github.com/awslabs/graphrag-toolkit/blob/main/lexical-graph/src/graphrag_toolkit/lexical_graph/indexing/prompts.py#L74-L191) is used. See [Custom prompts](#custom-prompts) below. | `None` |
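As a sketch (the import path and the classification values are assumptions — confirm them against your installed version):

```python
# Hypothetical import path; check your version of the toolkit.
from graphrag_toolkit.lexical_graph.indexing import ExtractionConfig, IndexingConfig

extraction_config = ExtractionConfig(
    enable_proposition_extraction=True,
    preferred_entity_classifications='Company,Person,Location',  # illustrative
    infer_entity_classifications=False
)

indexing_config = IndexingConfig(extraction=extraction_config)
```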


The `BuildConfig` object has the following parameters:

| Parameter | Description | Default Value |
| ------------- | ------------- | ------------- |
| `build_filters` | A `BuildFilters` object to include or exclude specific node types during the build stage | `BuildFilters()` |
| `include_domain_labels` | Whether to add a domain-specific label (e.g. `Company`) to entity nodes in addition to `__Entity__` | `None` (falls back to `GraphRAGConfig.include_domain_labels`) |
| `include_local_entities` | Whether to include local-context entities in the graph | `None` (falls back to `GraphRAGConfig.include_local_entities`) |
| `source_metadata_formatter` | A `SourceMetadataFormatter` instance for customising source metadata written to the graph | `DefaultSourceMetadataFormatter()` |
| `enable_versioning` | Whether to enable versioned updates. Overrides `GraphRAGConfig.enable_versioning` when set. | `None` |

The `InferClassificationsConfig` object has the following parameters:

| Parameter | Description | Default Value |

You can use [Amazon Bedrock batch inference](https://docs.aws.amazon.com/bedrock/latest/userguide/batch-inference.html) with the extract stage of the indexing process. See [Batch Extraction](./batch-extraction.md) for more details.

`BatchConfig` ([`indexing/extract/batch_config.py`](https://github.com/awslabs/graphrag-toolkit/blob/main/lexical-graph/src/graphrag_toolkit/lexical_graph/indexing/extract/batch_config.py)) accepts the following parameters:

| Parameter | Description | Required |
| ------------- | ------------- | ------------- |
| `role_arn` | ARN of the IAM role Bedrock will assume to run batch jobs | Yes |
| `region` | AWS region where batch jobs will run | Yes |
| `bucket_name` | S3 bucket for batch job input/output | Yes |
| `key_prefix` | S3 key prefix for job files | No |
| `s3_encryption_key_id` | KMS key ID for S3 object encryption | No |
| `subnet_ids` | VPC subnet IDs for the batch job network configuration | No |
| `security_group_ids` | VPC security group IDs | No |
| `max_batch_size` | Maximum records per batch job (Bedrock limit: 50,000; jobs under 100 records are skipped and processed inline) | `25000` |
| `max_num_concurrent_batches` | Maximum concurrent batch jobs per worker | `3` |
| `delete_on_success` | Whether to delete S3 job files after a successful run | `True` |
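A configuration sketch (all resource identifiers below are placeholders, and the import path should be confirmed against your installed version):

```python
# Hypothetical import path; check your version of the toolkit.
from graphrag_toolkit.lexical_graph.indexing.extract import BatchConfig

batch_config = BatchConfig(
    role_arn='arn:aws:iam::111122223333:role/bedrock-batch-inference',  # placeholder
    region='us-east-1',
    bucket_name='my-graphrag-batch-bucket',  # placeholder
    key_prefix='batch-extract',
    max_batch_size=25000,
    max_num_concurrent_batches=3
)
```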

#### Metadata filtering

You can add metadata to source documents on ingest, and then use this metadata to filter documents during the extract and build stages. Source metadata is also used for metadata filtering when querying a lexical graph. See the [Metadata Filtering](./metadata-filtering.md) section for more details.