# ragtag-crew

Tools for creating and maintaining Pinecone vector DB lookups for up‑to‑date documentation and code examples.


> [!NOTE]
> This repository currently focuses on ingest tooling: splitting docs, building records, and upserting to a Pinecone index. Agent access (MCP tools) is planned.

## Overview

AI coding agents work best with fresh, searchable technical context. ragtag-crew provides a small, scriptable toolchain to:

- Create a Pinecone index configured for text embedding
- Split markdown docs into header-aware, token-friendly chunks
- Split PDFs by extracting text (images discarded), then chunking as plain text
- Split JSON using `RecursiveJsonSplitter` (chunks are JSON strings)
- Parse YAML and split as JSON (via `yaml.safe_load`)
- Upsert rich records with document metadata into Pinecone namespaces
- Delete all records in a Pinecone namespace when you want to fully refresh it

The goal is a reliable pipeline to collect, update, and access documentation and example code as vector search context.

## Features

- Header-aware markdown parsing using LangChain text splitters (see the sketch after this list)
- PDF-to-text extraction using `pypdf` (no OCR)
- Tunable chunk size and overlap (defaults: 1792 / 128)
- Record schema optimized for retrieval with clear metadata fields
- Batch upserts to Pinecone with simple logging and a dry-run mode
- Idempotent index-creation helper
- Namespace delete utility to cleanly reset a namespace
- Document downloader that fetches sources into `docs/` via the `scripts/docs_config.py` helpers, backed by `scripts/docs_config_data.yaml`
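For illustration, the header-aware markdown path might look roughly like this, pairing LangChain's `MarkdownHeaderTextSplitter` with a size-capped character splitter. The header names mirror the `Header_1|Header_2|…` section paths in the record schema below, and the 1792 / 128 defaults come from the feature list; everything else is an assumption, not the repo's actual code:

```python
# A minimal sketch of header-aware markdown chunking, assuming the
# LangChain splitters named in the feature list. Header levels and
# helper names are illustrative.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

headers_to_split_on = [
    ("#", "Header_1"),
    ("##", "Header_2"),
    ("###", "Header_3"),
]

md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1792, chunk_overlap=128)

def split_markdown(text: str):
    """Split by headers first, then enforce the chunk-size cap."""
    header_docs = md_splitter.split_text(text)  # one Document per header section
    return char_splitter.split_documents(header_docs)
```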

> [!TIP]
> Keep each documentation source in its own Pinecone namespace to simplify updates and deletions without cross-talk.

## How it works

Four scripts power the pipeline today:

- `scripts/db_create.py` – Creates the Pinecone index `ragtag-db` with an integrated embedding model and the field map `{"text": "chunk_content"}`.
- `scripts/split_text.py` – Splits documents (markdown, text, pdf, json, yaml) into chunks and either upserts records to Pinecone or writes them to JSON (dry run). Also supports configuration-driven execution via `--process <doc_id>`, tied to `scripts/docs_config_data.yaml`.
- `scripts/ns_delete.py` – Deletes all records from a specified Pinecone namespace on the configured index host.
- `scripts/doc_dwnld.py` – Downloads configured documents to the local `docs/` folder using the entries in `scripts/docs_config.py`.

## Data model (record schema)

Each chunk becomes a record shaped like:

```jsonc
{
  "_id": "<document_id>:chunk<idx>",
  "document_id": "<your-doc-id>",
  "document_url": "https://…",
  "document_date": "YYYY-MM-DD",            // run date
  "chunk_content": "",                      // text used for embedding (per index field map)
  "chunk_section_id": "Header_1|Header_2|…" // joined header path (present for markdown; omitted for non-markdown)
}
```

> [!IMPORTANT]
> The index created by `db_create.py` maps the embedding model’s text field to `chunk_content`. If you change field names, update the index configuration and the ingestion code together.
>
> For non-markdown inputs (text, pdf, json, yaml), `chunk_section_id` is not included in records.
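For illustration, here is a hedged sketch of how chunks could be mapped onto this schema. The `build_records` helper and its LangChain `Document` inputs are assumptions; only the field names come from the schema above:

```python
# A minimal sketch of turning split chunks into records matching the
# documented schema. Helper name and date handling are assumptions.
from datetime import date

def build_records(document_id: str, document_url: str, chunks: list) -> list[dict]:
    records = []
    run_date = date.today().isoformat()  # "document_date" is the run date
    for idx, chunk in enumerate(chunks):
        record = {
            "_id": f"{document_id}:chunk{idx}",
            "document_id": document_id,
            "document_url": document_url,
            "document_date": run_date,
            "chunk_content": chunk.page_content,
        }
        # Markdown chunks carry header metadata; join it into a section path.
        headers = [v for k, v in sorted(chunk.metadata.items()) if k.startswith("Header_")]
        if headers:
            record["chunk_section_id"] = "|".join(headers)
        records.append(record)
    return records
```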

## Prerequisites

- Python 3.13 or newer (see `pyproject.toml`)
- A Pinecone account and API key

Environment variable:

- `PINECONE_API_KEY` – required for any operation that talks to Pinecone

## Installation

Install minimal dependencies into a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install langchain-text-splitters pinecone pypdf PyYAML

# Alternatively, install from this repo (uses pyproject.toml dependencies):
pip install -e .
```

## Configuration

- Set your API key in the environment: `export PINECONE_API_KEY=…`
- The Pinecone index name used by `db_create.py` is `ragtag-db` (AWS / us-east-1).
- Provide your Pinecone index host in one of two ways:
  - Preferred: set an environment variable: `export PINECONE_HOST=https://<index>.svc.<project>.pinecone.io`
  - Or pass `--host https://<index>.svc.<project>.pinecone.io` to the CLI.
- The docs configuration enforces file suffixes; when `input-format` is `text`, the `document-path` must end with `.txt`.

> [!IMPORTANT]
> All scripts accept the Pinecone host via `--host` or fall back to `PINECONE_HOST`. This lets the tools target the correct index across projects and regions without editing code.
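For illustration, that host resolution might look like this. The function name is hypothetical; the error text mirrors the message quoted in Troubleshooting below:

```python
# A minimal sketch of --host / PINECONE_HOST resolution. The helper
# name and exact error message are assumptions.
import os

def resolve_host(cli_host: str | None) -> str:
    host = cli_host or os.environ.get("PINECONE_HOST")
    if not host:
        raise SystemExit("--host is required (or set PINECONE_HOST)")
    return host
```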

## Create the index

This is idempotent; it’s safe to re-run.

```bash
export PINECONE_API_KEY=…
python -m scripts.db_create
# or
python scripts/db_create.py
```
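Under the hood, the helper can be approximated with Pinecone's `create_index_for_model` API. The index name, cloud/region, and field map come from this README; the embedding model shown here is an assumption:

```python
# A rough sketch of idempotent index creation with an integrated
# embedding model, not the repo's verified internals.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

if "ragtag-db" not in pc.list_indexes().names():
    pc.create_index_for_model(
        name="ragtag-db",
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",          # assumed model choice
            "field_map": {"text": "chunk_content"},  # per the README
        },
    )
```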

## Ingest a document (split and upsert)

`scripts/split_text.py` operates in two modes: manual (explicit CLI flags) and config-driven (based on `scripts/docs_config_data.yaml`).

List configured documents:

```bash
python -m scripts.split_text --list
```

Process a configured document (downloads if missing, then splits using config metadata):

```bash
python -m scripts.split_text --process fastmcp-docs --dry-run --host https://your-index.svc.your-project.pinecone.io
```

The `--process` command pulls the document settings from `scripts/docs_config_data.yaml`, ensures the local file exists (triggering `scripts.doc_dwnld` when necessary), and then forwards the run to the splitter with the resolved parameters.

Manual mode remains available when you want to run the splitter against ad-hoc inputs.

Manual-mode CLI flags for `scripts/split_text.py`:

- `--document-id` (required)
- `--document-url` (required)
- `--document-path` (required, path to a file)
- `--input-format` (optional; one of `markdown`, `text`, `pdf`, `json`, `yaml`; default `markdown`)
- `--pinecone-namespace` or `--namespace` (required)
- `--host` (required unless `PINECONE_HOST` is set) – Pinecone index host URL
- `--dry-run` (optional) to write JSON instead of upserting
- `--output` (optional; JSON path for `--dry-run`; default `logs/document_chunks.json`)

Example (manual dry run, markdown):

```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl

# Inspect output
wc -l logs/document_chunks.json
```

Or pass host explicitly:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Example (manual upsert to Pinecone):

```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.split_text \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl
```
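Because the index uses an integrated embedding model, the upsert path can be sketched with Pinecone's `upsert_records`, which embeds `chunk_content` server-side per the index field map. The helper name and batch size here are assumptions:

```python
# A minimal sketch of batched upserts to an index with integrated
# embeddings. Batch size and helper name are assumptions.
import os
from pinecone import Pinecone

def upsert_in_batches(host: str, namespace: str, records: list[dict], batch_size: int = 96):
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(host=host)
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # The index field map embeds "chunk_content" on the server side.
        index.upsert_records(namespace, batch)
```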

## Ingest a PDF

PDFs are converted to text first (no OCR; images are ignored), then split as plain text.
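With `pypdf`, that extraction step looks roughly like this (a sketch, not the repo's exact code):

```python
# A minimal sketch of PDF-to-text extraction with pypdf, as described
# above: text only, no OCR, images ignored.
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    # extract_text() returns None for pages with no extractable text
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```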

Dry run:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-edge-pdf \
  --document-url https://example.com/edge-pdf \
  --document-path docs/cribl-edge-docs-4.13.3.pdf \
  --input-format pdf \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Upsert:

```bash
python -m scripts.split_text \
  --document-id cribl-edge-pdf \
  --document-url https://example.com/edge-pdf \
  --document-path docs/cribl-edge-docs-4.13.3.pdf \
  --input-format pdf \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

> [!TIP]
> Check logs at `logs/split_text.<YYYY-MM-DD>.log` for per-run details and a compact summary of the upsert response.

## Download documents

Use the downloader to fetch the sources defined in `scripts/docs_config_data.yaml` (loaded through `scripts/docs_config.py`) into your local `docs/` directory.

List available document ids:

```bash
python -m scripts.doc_dwnld --list
```

The `--list` output is tab-separated, with three columns per line:

- `<id>`
- `<input-format>` (one of `markdown`, `text`, `pdf`, `json`, `yaml`)
- `<document-url>`

Download a single document by id:

```bash
python -m scripts.doc_dwnld --id fastmcp-docs
```

Download all configured documents:

```bash
python -m scripts.doc_dwnld --all
```

Notes:

- If an entry has `input-format` set to `text` and its `document-url` does not end with `.txt`, the page is fetched as HTML and converted to plain text by stripping tags, then saved as `docs/<document-id>.txt` (for example, `cribl-api` becomes `docs/cribl-api.txt`). A sketch of this conversion follows these notes.
- Validation also expects `document-path` for `input-format: text` entries to already point to a `.txt` file, matching the downloader's output.
- For other formats, the file is saved to the configured `document-path` (typically under `docs/`).
- Existing files are overwritten when downloads occur.
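The tag-stripping conversion can be sketched with the standard library's `HTMLParser`; this is an assumed approach, and the downloader's actual implementation may differ:

```python
# A minimal sketch of HTML-to-plain-text conversion by stripping tags,
# using only the standard library. An assumed approach, not the repo's
# verified code.
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```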

## Delete a namespace

Use this when you want to fully refresh a namespace before re-importing.

Using environment variables for host:

```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.ns_delete --namespace cribl
```

Or pass host explicitly:

```bash
python -m scripts.ns_delete \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Programmatic usage:

```python
from scripts.ns_delete import delete_namespace_records

delete_namespace_records(
    namespace="cribl",
    host="https://your-index.svc.your-project.pinecone.io",
)
```
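Internally, deleting everything in a namespace maps onto a single Pinecone call. A plausible sketch, assuming the helper wraps `Index.delete`:

```python
# A plausible sketch of what delete_namespace_records wraps: Pinecone's
# delete-all operation scoped to one namespace. An assumption, not the
# repo's verified internals.
import os
from pinecone import Pinecone

def delete_all_in_namespace(host: str, namespace: str) -> None:
    index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(host=host)
    index.delete(delete_all=True, namespace=namespace)
```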

## Ingest JSON

Dry run:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id my-json \
  --document-url https://example.com/my-json \
  --document-path docs/sample.json \
  --input-format json \
  --namespace myns \
  --host https://your-index.svc.your-project.pinecone.io
```
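Under the hood, the JSON path relies on LangChain's `RecursiveJsonSplitter`, whose chunks are JSON strings. A minimal sketch, where the `max_chunk_size` value and helper name are assumptions:

```python
# A minimal sketch of JSON splitting with RecursiveJsonSplitter, which
# returns chunks as JSON strings. The max_chunk_size value is assumed.
import json
from langchain_text_splitters import RecursiveJsonSplitter

def split_json_file(path: str) -> list[str]:
    with open(path) as f:
        data = json.load(f)
    splitter = RecursiveJsonSplitter(max_chunk_size=1792)
    return splitter.split_text(json_data=data)  # list of JSON strings
```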

## Ingest YAML

Dry run:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id my-yaml \
  --document-url https://example.com/my-yaml \
  --document-path docs/sample.yaml \
  --input-format yaml \
  --namespace myns \
  --host https://your-index.svc.your-project.pinecone.io
```
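The YAML path parses the file with `yaml.safe_load` and then splits the result as JSON, roughly:

```python
# A minimal sketch of the YAML path: parse with yaml.safe_load, then
# split as JSON. Reuses the assumed splitter settings from the JSON sketch.
import yaml
from langchain_text_splitters import RecursiveJsonSplitter

def split_yaml_file(path: str) -> list[str]:
    with open(path) as f:
        data = yaml.safe_load(f)
    splitter = RecursiveJsonSplitter(max_chunk_size=1792)
    return splitter.split_text(json_data=data)
```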

## Updating or re-importing docs

Use a consistent `document_id` and the same namespace. Because records use `_id = "<document_id>:chunk<idx>"`, re-running ingestion with the updated document overwrites per-chunk records. For large structural changes, consider clearing the namespace first with `scripts/ns_delete.py`.

## Development

Run the scripts in place; no build step is required.

## Testing

The repository includes a pytest suite covering the document downloader, configuration helpers, namespace deletion, and text splitting utilities. Execute the tests from the project root:

```bash
pytest
```

Use `pytest -k <name>` to focus on specific modules under `tests/scripts/` when iterating on a single helper.

Optional linting (if you have `ruff` installed):

```bash
ruff check .
```

## Troubleshooting

- Missing `PINECONE_API_KEY`
  - Symptom: `RuntimeError("PINECONE_API_KEY is not set in the environment")`
  - Fix: `export PINECONE_API_KEY=…` and retry.
- Missing or wrong Pinecone host
  - Symptom: the CLI exits with `--host is required (or set PINECONE_HOST…)`, or connection errors to Pinecone.
  - Fix: set `export PINECONE_HOST=…` or pass `--host …` to the scripts. Double-check the host URL for your index.
- Empty or tiny chunk count
  - Symptom: few records produced.
  - Fix: adjust `chunk_size` / `chunk_overlap` in `scripts/split_text.py`.

Status: early-stage; interfaces may change. Feedback and issues welcome.
