Tools for creating and maintaining Pinecone vector DB lookups for up‑to‑date documentation and code examples.
> [!NOTE]
> This repository currently focuses on ingest tooling: splitting docs, building records, and upserting to a Pinecone index. Agent access (MCP tools) is planned.
AI coding agents work best with fresh, searchable technical context. ragtag-crew provides a small, scriptable toolchain to:
- Create a Pinecone index configured for text embedding
- Split markdown docs into header-aware, token-friendly chunks
- Split PDFs by extracting text (images discarded), then chunking as plain text
- Split JSON using `RecursiveJsonSplitter` (chunks are JSON strings)
- Parse YAML and split as JSON (via `yaml.safe_load`)
- Upsert rich records with document metadata into Pinecone namespaces
- Delete all records in a Pinecone namespace when you want to fully refresh it
The goal is a reliable pipeline to collect, update, and access documentation and example code as vector search context.
- Header-aware markdown parsing using LangChain text splitters
- PDF-to-text extraction using pypdf (no OCR)
- Tunable chunk size and overlap (defaults: 1792 / 128)
- Record schema optimized for retrieval with clear metadata fields
- Batch upserts to Pinecone with simple logging and dry-run mode
- Idempotent index creation helper
- Namespace delete utility to cleanly reset a namespace
- Document downloader to fetch sources into `docs/` via the `scripts/docs_config.py` helpers backed by `scripts/docs_config_data.yaml`
> [!TIP]
> Keep each documentation source in its own Pinecone namespace to simplify updates and deletions without cross-talk.
Four scripts power the pipeline today:
- `scripts/db_create.py` – Creates the Pinecone index `ragtag-db` with an integrated embedding model and field map `{"text": "chunk_content"}`.
- `scripts/split_text.py` – Splits documents (markdown, text, pdf, json, yaml) into chunks and either upserts records to Pinecone or writes them to JSON (dry run). Also supports configuration-driven execution via `--process <doc_id>` tied to `scripts/docs_config_data.yaml`.
- `scripts/ns_delete.py` – Deletes all records from a specified Pinecone namespace on the configured index host.
- `scripts/doc_dwnld.py` – Downloads configured documents to the local `docs/` folder using the entries in `scripts/docs_config.py`.
Each chunk becomes a record shaped like:

```jsonc
{
  "_id": "<document_id>:chunk<idx>",
  "document_id": "<your-doc-id>",
  "document_url": "https://…",
  "document_date": "YYYY-MM-DD",            // run date
  "chunk_content": "…",                     // text used for embedding (per index field map)
  "chunk_section_id": "Header_1|Header_2|…" // joined header path (present for markdown; omitted for non-markdown)
}
```
> [!IMPORTANT]
> The index created by `db_create.py` maps the embedding model's text field to `chunk_content`. If you change field names, update the index configuration and the ingestion code together.
>
> For non-markdown inputs (text, pdf, json, yaml), `chunk_section_id` is not included in records.
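To make the schema concrete, here is a minimal sketch of how a markdown chunk could be shaped into such a record. It leans on LangChain's `MarkdownHeaderTextSplitter` (which the feature list above points at), but it is illustrative, not the actual code in `scripts/split_text.py`; the header depth and the `build_records` helper name are assumptions.

```python
# Illustrative sketch only: shape one markdown document into chunk records.
# The real scripts/split_text.py also applies the 1792/128 token-size pass.
from datetime import date
from langchain_text_splitters import MarkdownHeaderTextSplitter

HEADERS = [("#", "Header_1"), ("##", "Header_2"), ("###", "Header_3")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=HEADERS)

def build_records(document_id: str, document_url: str, markdown: str) -> list[dict]:
    records = []
    for idx, chunk in enumerate(splitter.split_text(markdown)):
        record = {
            "_id": f"{document_id}:chunk{idx}",
            "document_id": document_id,
            "document_url": document_url,
            "document_date": date.today().isoformat(),
            "chunk_content": chunk.page_content,
        }
        # Join the header path; only markdown inputs carry this metadata.
        section = "|".join(chunk.metadata[h] for _, h in HEADERS if h in chunk.metadata)
        if section:
            record["chunk_section_id"] = section
        records.append(record)
    return records
```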
- Python 3.13 or newer (see `pyproject.toml`)
- A Pinecone account and API key
Environment variable:

- `PINECONE_API_KEY` – required for any operation that talks to Pinecone
Install minimal dependencies into a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install langchain-text-splitters pinecone pypdf PyYAML

# Alternatively, install from this repo (uses pyproject.toml dependencies):
pip install -e .
```

- Set your API key in the environment: `export PINECONE_API_KEY=…`
- The Pinecone index name used by `db_create.py` is `ragtag-db` (AWS / us-east-1).
- Provide your Pinecone index host in one of two ways:
  - Preferred: set an environment variable: `export PINECONE_HOST=https://<index>.svc.<project>.pinecone.io`
  - Or pass `--host https://<index>.svc.<project>.pinecone.io` to the CLI
- The docs configuration enforces file suffixes; when `input-format` is `text`, the `document-path` must end with `.txt`.
> [!IMPORTANT]
> All scripts accept or default to the Pinecone host via `--host` or `PINECONE_HOST`. This ensures the tools can target the correct index across projects and regions without editing code.
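The pattern behind that is a plain argparse-plus-environment fallback; a minimal sketch (the flag and variable names match this README, the rest is illustrative):

```python
# Sketch of the --host / PINECONE_HOST resolution order described above.
import argparse
import os
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--host", default=os.environ.get("PINECONE_HOST"))
args = parser.parse_args()

if not args.host:
    sys.exit("--host is required (or set PINECONE_HOST)")
```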
Creating the index is idempotent; the script is safe to re-run.
```bash
export PINECONE_API_KEY=…
python -m scripts.db_create
# or
python scripts/db_create.py
```
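For reference, an idempotent create step with the Pinecone SDK can look like the sketch below. The index name, region, and field map come from this README; the embedding model name is a placeholder, since `db_create.py`'s actual choice isn't documented here.

```python
# Sketch only: create ragtag-db once, with integrated embedding mapped to
# the chunk_content field. Safe to re-run thanks to the has_index guard.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

if not pc.has_index("ragtag-db"):
    pc.create_index_for_model(
        name="ragtag-db",
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "multilingual-e5-large",  # placeholder, not necessarily the repo's model
            "field_map": {"text": "chunk_content"},
        },
    )
```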
You can operate the splitter in two modes: manual (explicit CLI flags) and config-driven (based on `scripts/docs_config_data.yaml`).

List configured documents:

```bash
python -m scripts.split_text --list
```

Process a configured document (downloads if missing, then splits using config metadata):

```bash
python -m scripts.split_text --process fastmcp-docs --dry-run --host https://your-index.svc.your-project.pinecone.io
```

The `--process` command pulls the document settings from `scripts/docs_config_data.yaml`, ensures the local file exists (triggering `scripts.doc_dwnld` when necessary), and then forwards the run to the splitter with the resolved parameters.
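The exact schema of `scripts/docs_config_data.yaml` isn't reproduced here, but an entry presumably carries the fields this README keeps referencing (`document-url`, `document-path`, `input-format`). A hypothetical shape, parsed the way the repo parses YAML:

```python
# Hypothetical config entry (field names from this README; the real YAML
# layout may differ). yaml.safe_load turns it into plain dicts.
import yaml

sample = """
fastmcp-docs:
  document-url: https://example.com/fastmcp
  document-path: docs/fastmcp-llms-full.txt.md
  input-format: markdown
"""
entries = yaml.safe_load(sample)
print(entries["fastmcp-docs"]["input-format"])  # -> markdown
```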
Manual mode remains available when you want to run the splitter against ad-hoc inputs.
`scripts/split_text.py` manual-mode CLI flags:

- `--document-id` (required)
- `--document-url` (required)
- `--document-path` (required, path to a file)
- `--input-format` (optional, one of `markdown`, `text`, `pdf`, `json`, `yaml`; default `markdown`)
- `--pinecone-namespace` or `--namespace` (required)
- `--host` (required unless `PINECONE_HOST` is set) – Pinecone index host URL
- `--dry-run` (optional) to write JSON instead of upserting
- `--output` (optional, JSON path for `--dry-run`; default `logs/document_chunks.json`)
Example (manual dry run, markdown):
```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl

# Inspect output
wc -l logs/document_chunks.json
```

Or pass host explicitly:
```bash
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Example (manual upsert to Pinecone):
```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.split_text \
  --document-id cribl-fastmcp \
  --document-url https://example.com/fastmcp \
  --document-path docs/fastmcp-llms-full.txt.md \
  --input-format markdown \
  --namespace cribl
```

PDFs are converted to text first (no OCR; images are ignored), then split as plain text.
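The extraction step is roughly what pypdf provides out of the box; a minimal sketch (not the exact code in `scripts/split_text.py`):

```python
# Sketch: concatenate per-page text with pypdf. Pages with no extractable
# text (e.g., pure images) contribute an empty string, so images are dropped.
from pypdf import PdfReader

def pdf_to_text(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```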
Dry run:
```bash
python -m scripts.split_text \
  --dry-run \
  --document-id cribl-edge-pdf \
  --document-url https://example.com/edge-pdf \
  --document-path docs/cribl-edge-docs-4.13.3.pdf \
  --input-format pdf \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Upsert:
```bash
python -m scripts.split_text \
  --document-id cribl-edge-pdf \
  --document-url https://example.com/edge-pdf \
  --document-path docs/cribl-edge-docs-4.13.3.pdf \
  --input-format pdf \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

> [!TIP]
> Check logs at `logs/split_text.<YYYY-MM-DD>.log` for per-run details and a compact summary of the upsert response.
Use the downloader to fetch the sources defined in `scripts/docs_config_data.yaml` (loaded through `scripts/docs_config.py`) into your local `docs/` directory.
List available document ids:
```bash
python -m scripts.doc_dwnld --list
```

The `--list` output is tab-separated with three columns per line: `<id>`, `<input-format>` (one of `markdown`, `text`, `pdf`, `json`, `yaml`), and `<document-url>`.
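For example, a configured markdown source might appear as `fastmcp-docs<TAB>markdown<TAB>https://example.com/fastmcp` (illustrative values; literal tab characters separate the columns).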
Download a single document by id:
```bash
python -m scripts.doc_dwnld --id fastmcp-docs
```

Download all configured documents:

```bash
python -m scripts.doc_dwnld --all
```

Notes:
- If an entry has `input-format` set to `text` and its `document-url` does not end with `.txt`, the page is fetched as HTML and converted to plain text by stripping tags (sketched after these notes), then saved as `docs/<document-id>.txt` (for example, `cribl-api` becomes `docs/cribl-api.txt`).
- Validation also expects `document-path` for `input-format: text` entries to already point to a `.txt` file, matching the downloader's output.
- For other formats, the file is saved to the configured `document-path` (typically under `docs/`).
- Existing files are overwritten when downloads occur.
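A minimal sketch of that HTML-to-text conversion, using only the standard library (the downloader's real implementation may differ):

```python
# Sketch: fetch a page and keep only its text content by stripping tags.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)

def fetch_as_text(url: str) -> str:
    html = urlopen(url).read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    return "".join(extractor.parts)
```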
Use `scripts/ns_delete.py` when you want to fully refresh a namespace before re-importing.
Using environment variables for host:
```bash
export PINECONE_API_KEY=…
export PINECONE_HOST=https://your-index.svc.your-project.pinecone.io
python -m scripts.ns_delete --namespace cribl
```

Or pass host explicitly:
```bash
python -m scripts.ns_delete \
  --namespace cribl \
  --host https://your-index.svc.your-project.pinecone.io
```

Programmatic usage:
```python
from scripts.ns_delete import delete_namespace_records

delete_namespace_records(
    namespace="cribl",
    host="https://your-index.svc.your-project.pinecone.io",
)
```
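Under the hood this presumably amounts to the SDK's namespace-wide delete; the raw equivalent, sketched:

```python
# Sketch: the equivalent raw Pinecone SDK call for clearing one namespace.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host="https://your-index.svc.your-project.pinecone.io")
index.delete(delete_all=True, namespace="cribl")
```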
JSON input, dry run:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id my-json \
  --document-url https://example.com/my-json \
  --document-path docs/sample.json \
  --input-format json \
  --namespace myns \
  --host https://your-index.svc.your-project.pinecone.io
```
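For reference, the JSON path comes down to LangChain's `RecursiveJsonSplitter`, which emits JSON strings; YAML takes the same route after `yaml.safe_load`. A sketch (the chunk size here is illustrative):

```python
# Sketch: split structured data into JSON-string chunks. YAML inputs are
# loaded with yaml.safe_load first and then follow the same path.
import yaml
from langchain_text_splitters import RecursiveJsonSplitter

splitter = RecursiveJsonSplitter(max_chunk_size=1792)

with open("docs/sample.yaml") as fh:
    data = yaml.safe_load(fh)

chunks = splitter.split_text(json_data=data)  # list[str], each a JSON string
```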
YAML input, dry run:

```bash
python -m scripts.split_text \
  --dry-run \
  --document-id my-yaml \
  --document-url https://example.com/my-yaml \
  --document-path docs/sample.yaml \
  --input-format yaml \
  --namespace myns \
  --host https://your-index.svc.your-project.pinecone.io
```

To update a previously ingested document, use a consistent `document_id` and the same namespace. Since records use `_id = "<document_id>:chunk<idx>"`, re-running ingestion with the updated document will overwrite per-chunk records. For large structural changes, consider clearing the namespace first using `scripts/ns_delete.py`.
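That overwrite behavior falls out of deterministic `_id` values plus upsert semantics. A sketch, reusing the hypothetical `build_records` helper from the record-schema section above (`upsert_records` assumes an index with integrated embedding, as `db_create.py` configures):

```python
# Sketch: re-ingest an updated document. Identical _id values mean the new
# chunks replace the old records instead of accumulating duplicates.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(host=os.environ["PINECONE_HOST"])

with open("docs/fastmcp-llms-full.txt.md") as fh:
    records = build_records("cribl-fastmcp", "https://example.com/fastmcp", fh.read())

BATCH = 96  # Pinecone caps records per upsert_records request
for start in range(0, len(records), BATCH):
    index.upsert_records("cribl", records[start:start + BATCH])
```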
Run scripts in place; no build step required.
The repository includes a pytest suite covering the document downloader, configuration helpers, namespace deletion, and text splitting utilities. Execute the tests from the project root:
```bash
pytest
```

Use `pytest -k <name>` to focus on specific modules under `tests/scripts/` when iterating on a single helper.
Optional linting (if you have ruff installed):
```bash
ruff check .
```

Common issues:

- Missing `PINECONE_API_KEY`
  - Symptom: `RuntimeError("PINECONE_API_KEY is not set in the environment")`
  - Fix: `export PINECONE_API_KEY=…` and retry.
- Missing or wrong Pinecone host
  - Symptom: CLI exits with `--host is required (or set PINECONE_HOST…)` or connection errors to Pinecone.
  - Fix: Set `export PINECONE_HOST=…` or pass `--host …` to the scripts. Double-check the host URL for your index.
- Empty or tiny chunk count
  - Symptom: Few records produced.
  - Fix: Adjust `chunk_size` / `chunk_overlap` in `scripts/split_text.py`.
- `docs/` – example input documents (markdown/PDF/YAML/Text)
- LangChain Text Splitters
- Pinecone Python SDK
Status: early-stage; interfaces may change. Feedback and issues welcome.
{ "_id": "<document_id>:chunk<idx>", "document_id": "<your-doc-id>", "document_url": "https://…", "document_date": "YYYY-MM-DD", // run date "chunk_content": "…", // text used for embedding (per index field map) "chunk_section_id": "Header_1|Header_2|…" // joined header path (present for markdown; omitted for non-markdown) }