Commit 6b78ccc

feat: add LLM enrichment for model name extraction (#36)

* feat: new llm-enrichment for ai-bom
* fix: fixing tests

1 parent: 4665274

File tree: 14 files changed, +985 -4 lines

README.md (2 additions & 0 deletions)

```diff
@@ -128,6 +128,8 @@ The image is published to `ghcr.io/trusera/ai-bom` on every tagged release.
 
 **25+ AI SDKs detected** across Python, JavaScript, TypeScript, Java, Go, Rust, and Ruby.
 
+**Optional LLM enrichment** — use `--llm-enrich` to extract specific model names (e.g., gpt-4o, claude-3-opus) from code via OpenAI, Anthropic, or local Ollama models. See [docs/enrichment.md](docs/enrichment.md).
+
 ---
 
 ## Agent SDKs
```

docs/comparison.md (3 additions & 2 deletions)

```diff
@@ -15,13 +15,13 @@ The goal is to help users understand feature differences and choose the right to
 | Scanners | 13+ (code, cloud, Docker, GitHub Actions, Jupyter, MCP, n8n, etc.) | 1 (Python-focused) | Unknown |
 | Output Formats | 9 (Table, JSON, SARIF, SPDX, CycloneDX, CSV, HTML, Markdown, JUnit) | JSON, CSV | Unknown |
 | CI/CD Integration | GitHub Action, GitLab CI | No | Yes |
-| LLM Enrichment | No | Yes | Early access / limited preview |
+| LLM Enrichment | Yes | Yes | Early access / limited preview |
 | n8n Scanning | Yes | No | No |
 | MCP / A2A Detection | Yes | No | No |
 | Agent Framework Detection | LangChain, CrewAI, AutoGen, LlamaIndex, Semantic Kernel | Limited | Unknown |
 | Binary Model Detection | Yes (.onnx, .pt, .safetensors, etc.) | No | Unknown |
 | Policy Enforcement | Cedar policy gate | No | Yes |
-| Best For | Multi-framework projects needing multiple formats | Python projects needing LLM enrichment | Existing Snyk customers |
+| Best For | Multi-framework projects needing multiple formats and optional LLM enrichment | Python projects needing LLM enrichment | Existing Snyk customers |
 
 ---
 
@@ -31,6 +31,7 @@ The goal is to help users understand feature differences and choose the right to
 
 - Open-source AI Bill of Materials scanner focused on discovering AI/LLM usage across codebases and infrastructure.
 - Supports multiple scanners, formats, and compliance mappings (OWASP Agentic Top 10, EU AI Act).
+- LLM enrichment (`--llm-enrich`) uses litellm to extract specific model names from code, supporting OpenAI, Anthropic, Ollama, and 100+ providers.
 - Designed for developer workflows with CLI, CI/CD, and dashboard support.
 
 ### Cisco AIBOM
```

docs/enrichment.md (new file, 115 additions)

# LLM Enrichment

AI-BOM can optionally use an LLM to analyze code snippets around detected AI components and extract the specific model names being used (e.g., `gpt-4o`, `claude-3-opus-20240229`, `llama3`).

This fills the `model_name` field that static pattern matching may leave empty, particularly when model names are passed as variables or constructed dynamically.

---

## Installation

LLM enrichment requires the `litellm` package:

```bash
pip install ai-bom[enrich]
```

---

## Usage

### Basic

```bash
ai-bom scan . --llm-enrich
```

This uses `gpt-4o-mini` by default (requires the `OPENAI_API_KEY` environment variable).

### With a specific model

```bash
# OpenAI
ai-bom scan . --llm-enrich --llm-model gpt-4o

# Anthropic
ai-bom scan . --llm-enrich --llm-model anthropic/claude-3-haiku-20240307

# Local Ollama (no API key needed)
ai-bom scan . --llm-enrich --llm-model ollama/llama3 --llm-base-url http://localhost:11434
```

### With an explicit API key

```bash
ai-bom scan . --llm-enrich --llm-api-key sk-your-key-here
```

If `--llm-api-key` is not provided, litellm falls back to standard environment variables (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.).
---

## CLI Options

| Flag | Default | Description |
|------|---------|-------------|
| `--llm-enrich` | `False` | Enable LLM enrichment |
| `--llm-model` | `gpt-4o-mini` | litellm model identifier |
| `--llm-api-key` | None | API key (falls back to env vars) |
| `--llm-base-url` | None | Custom API base URL (e.g., Ollama) |

---

## How It Works

1. After all scanners run, components with type `llm_provider` or `model` that have an empty `model_name` are selected for enrichment.
2. For each eligible component, ~20 lines of code around the detection site are read from the source file.
3. The code snippet is sent to the configured LLM with a prompt asking it to extract the model identifier.
4. The response is parsed and cross-referenced with AI-BOM's built-in model registry to validate the name and fill in provider/deprecation metadata.
5. If the LLM call fails or returns no model name, the component is left unchanged.

Components that already have a `model_name` (from static detection) are skipped. Non-model component types (containers, tools, MCP servers, workflows) are never sent to the LLM.
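The five steps above can be sketched as a per-component pass. This is a hedged illustration, not ai-bom's actual code: the component attributes (`model_name`, `file_path`, `line`), the prompt wording, and the `_window` helper are assumptions, and step 4's registry cross-check is omitted.

```python
from pathlib import Path

def _window(path: str, line: int, radius: int = 10) -> str:
    """Read ~20 lines of source centered on a 1-based detection line (step 2)."""
    src = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    return "\n".join(src[max(0, line - 1 - radius): line - 1 + radius])

def enrich_one(component, model="gpt-4o-mini", api_key=None, base_url=None) -> bool:
    """Try to fill component.model_name; return True if it was enriched."""
    if component.model_name:          # already resolved statically -> skipped
        return False
    snippet = _window(component.file_path, component.line)
    import litellm                    # deferred: the skip path needs no extra deps
    try:
        resp = litellm.completion(    # step 3: ask the configured LLM
            model=model,
            messages=[{
                "role": "user",
                "content": "Which model identifier does this code use? "
                           f"Reply with the identifier only, or NONE.\n\n{snippet}",
            }],
            api_key=api_key,
            api_base=base_url,        # litellm's kwarg for custom endpoints like Ollama
        )
        name = resp.choices[0].message.content.strip()
    except Exception:
        return False                  # step 5: failures leave the component unchanged
    if name and name.upper() != "NONE":
        component.model_name = name   # step 4's registry validation omitted here
        return True
    return False
```

In the real flow this runs only for `llm_provider` and `model` components, and responses are validated against the model registry before being accepted.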
---

## Privacy and Security

**Code snippets are sent to the LLM provider.** When using cloud-hosted models (OpenAI, Anthropic, etc.), approximately 20 lines of source code around each detected AI import or usage site are transmitted to the provider's API.

Recommendations:

- **For sensitive or proprietary codebases**, use a local model via Ollama (`--llm-model ollama/llama3`). No data leaves your machine.
- **Before using cloud APIs**, ensure you have organizational approval to send source code excerpts to the provider.
- **Only code around detected AI components** is sent — not entire files, not the full repository.
- AI-BOM does not intentionally include secrets in snippets, but if API keys are hard-coded near import statements, they may be included in the context window. Use `--deep` scanning to detect and remediate hard-coded keys separately.
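The last bullet is worth guarding against on the user side. A rough pre-send check for key-shaped strings (an illustrative heuristic only, not ai-bom's `--deep` scanner) might look like:

```python
import re

# Rough heuristic for common API-key shapes (OpenAI-style `sk-...`,
# AWS-style `AKIA...`); illustrative only, will miss many formats.
KEY_PATTERN = re.compile(r"\b(sk-[A-Za-z0-9_-]{16,}|AKIA[0-9A-Z]{16})\b")

def looks_sensitive(snippet: str) -> bool:
    """Return True if the snippet appears to contain a hard-coded API key."""
    return bool(KEY_PATTERN.search(snippet))
```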
A warning is printed when using non-local models:

```
Warning: LLM enrichment sends code snippets to an external API.
Use ollama/* models for local-only processing.
```
---

## Cost

Each eligible component triggers one or more LLM API calls. For projects with many detected AI components, this can result in non-trivial API costs when using paid providers.

- Components are batched (default: 5 per call) to reduce the number of API requests.
- Use a low-cost model like `gpt-4o-mini` for bulk enrichment.
- **Ollama is free** — run models locally with zero API cost.
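The batching described above (default size 5, per this doc) cuts round-trips roughly fivefold; a generic chunking helper looks like:

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

def batches(items: Sequence[T], size: int = 5) -> Iterator[Sequence[T]]:
    """Yield fixed-size chunks of items; the last chunk may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 12 eligible components and the default size, this yields 3 API calls instead of 12.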
---

## Supported Providers

LLM enrichment uses [litellm](https://docs.litellm.ai/) as its backend, which supports 100+ LLM providers including:

- OpenAI (`gpt-4o`, `gpt-4o-mini`, etc.)
- Anthropic (`anthropic/claude-3-haiku-20240307`, etc.)
- Ollama (`ollama/llama3`, `ollama/mistral`, etc.)
- Azure OpenAI, AWS Bedrock, Google Vertex AI
- Mistral, Cohere, and many more

See the [litellm provider list](https://docs.litellm.ai/docs/providers) for the full list.
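A note on the identifiers above: litellm routes on the model string itself, with a `provider/` prefix selecting the backend and bare names such as `gpt-4o` treated as OpenAI models. A simplified sketch of that convention (litellm's real resolution logic is more involved):

```python
def provider_of(model: str) -> str:
    """Simplified litellm-style routing: 'provider/model' prefix, else OpenAI."""
    return model.split("/", 1)[0] if "/" in model else "openai"
```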

mkdocs.yml (1 addition & 0 deletions)

```diff
@@ -67,6 +67,7 @@ nav:
 - CSV: outputs/csv.md
 - JUnit: outputs/junit.md
 - Markdown: outputs/markdown.md
+- LLM Enrichment: enrichment.md
 - CI/CD Integration: ci-integration.md
 - Policy Enforcement: policy.md
 - Compliance:
```

pyproject.toml (2 additions & 1 deletion)

```diff
@@ -77,6 +77,7 @@ aws = ["boto3>=1.26.0,<2.0"]
 gcp = ["google-cloud-aiplatform>=1.38.0,<2.0"]
 azure = ["azure-ai-ml>=1.11.0,<2.0", "azure-identity>=1.12.0,<2.0"]
 cloud-live = ["ai-bom[aws,gcp,azure]"]
+enrich = ["litellm>=1.40.0,<3.0"]
 callable = [] # base callable module — no SDKs required
 callable-openai = ["openai>=1.0.0,<3.0"]
 callable-anthropic = ["anthropic>=0.30.0,<2.0"]
@@ -88,7 +89,7 @@ callable-cohere = ["cohere>=5.0.0,<7.0"]
 callable-all = [
     "ai-bom[callable-openai,callable-anthropic,callable-google,callable-bedrock,callable-ollama,callable-mistral,callable-cohere]",
 ]
-all = ["ai-bom[dashboard,docs,server,watch,cloud-live,callable-all]"]
+all = ["ai-bom[dashboard,docs,server,watch,cloud-live,callable-all,enrich]"]
 
 [project.scripts]
 ai-bom = "ai_bom.cli:app"
```

src/ai_bom/cli.py (57 additions & 0 deletions)

```diff
@@ -403,12 +403,48 @@ def scan(
         "--telemetry/--no-telemetry",
         help="Enable/disable anonymous telemetry (overrides AI_BOM_TELEMETRY env var)",
     ),
+    llm_enrich: bool = typer.Option(
+        False,
+        "--llm-enrich",
+        help="Use an LLM to extract model names from code snippets (requires ai-bom[enrich])",
+    ),
+    llm_model: str = typer.Option(
+        "gpt-4o-mini",
+        "--llm-model",
+        help="LLM model for enrichment (e.g. gpt-4o-mini, anthropic/claude-3-haiku, ollama/llama3)",
+    ),
+    llm_api_key: Optional[str] = typer.Option(
+        None,
+        "--llm-api-key",
+        help="API key for the LLM provider (falls back to provider env vars like OPENAI_API_KEY)",
+    ),
+    llm_base_url: Optional[str] = typer.Option(
+        None,
+        "--llm-base-url",
+        help="Custom base URL for LLM API (e.g. http://localhost:11434 for Ollama)",
+    ),
 ) -> None:
     """Scan a directory or repository for AI/LLM components."""
     # --json / -j overrides --format
     if json_output:
         format = "json"
 
+    # Validate --llm-enrich dependency early
+    if llm_enrich:
+        try:
+            import litellm  # noqa: F401
+        except ImportError:
+            console.print(
+                "[red]LLM enrichment requires litellm. "
+                "Install with: pip install ai-bom[enrich][/red]"
+            )
+            raise typer.Exit(EXIT_ERROR) from None
+        if not quiet and not llm_model.startswith("ollama/"):
+            console.print(
+                "[yellow]Warning: LLM enrichment sends code snippets to an external API. "
+                "Use ollama/* models for local-only processing.[/yellow]"
+            )
+
     # Setup logging
     _setup_logging(verbose=verbose, debug=debug)
@@ -582,6 +618,23 @@ def scan(
     end_time = time.time()
     result.summary.scan_duration_seconds = end_time - start_time
 
+    # LLM enrichment (optional post-processing)
+    if llm_enrich and result.components:
+        from ai_bom.enrichment import enrich_components
+
+        if format == "table" and not quiet:
+            console.print("[cyan]Running LLM enrichment...[/cyan]")
+        enriched = enrich_components(
+            result.components,
+            scan_path=scan_path,
+            model=llm_model,
+            api_key=llm_api_key,
+            base_url=llm_base_url,
+            quiet=quiet,
+        )
+        if format == "table" and not quiet:
+            console.print(f"[green]Enriched {enriched} component(s) with model names[/green]")
+
     # Build summary
     result.build_summary()
@@ -901,6 +954,10 @@ def demo() -> None:
         validate_schema=False,
         json_output=False,
         telemetry=None,
+        llm_enrich=False,
+        llm_model="gpt-4o-mini",
+        llm_api_key=None,
+        llm_base_url=None,
     )
```

src/ai_bom/enrichment/__init__.py (new file, 5 additions)

```python
"""LLM-based enrichment for AI-BOM components."""

from ai_bom.enrichment.llm_enricher import enrich_components

__all__ = ["enrich_components"]
```
