PyPI: https://pypi.org/project/datasage-mds/
Video: https://youtu.be/mDjG_x7xhiY
A lightweight, modular Python package for building Retrieval-Augmented Generation (RAG) systems. DataSage enables you to query your documents using natural language by combining semantic search with large language models (LLMs).
- Document Ingestion: Support for multiple file formats (CSV, XLSX, PDF, TXT).
- Efficient Chunking: Configurable text splitting with overlap for context preservation.
- Vector Storage: ChromaDB-backed vector database for efficient similarity search.
- Semantic Search: HuggingFace embeddings for accurate document retrieval.
- LLM Integration: Local LLM support via Ollama for answer generation.
- Modular Architecture: Easy to extend and customize components.
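The "chunking with overlap" idea above can be sketched in a few lines of plain Python. Note that `chunk_text` and its defaults are illustrative, not DataSage's actual API; DataSage's real splitter lives in `rag_engine/ingestion/chunker.py`.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into fixed-size chunks, each overlapping the previous one.

    The overlap preserves context that would otherwise be cut at chunk
    boundaries. Sizes here are in characters for simplicity; a real
    splitter would typically count tokens or respect sentence breaks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

text = "".join(str(i % 10) for i in range(120))
chunks = chunk_text(text, chunk_size=50, overlap=10)
print(len(chunks))  # → 3 (spans 0-50, 40-90, 80-120)
```

Because each chunk starts `overlap` characters before the previous one ends, the last 10 characters of chunk N always equal the first 10 characters of chunk N+1.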
```
DataSage
├── Ingestion Layer → Load and chunk documents
├── Indexing Layer  → Embed and store in vector database
├── Query Layer     → Retrieve relevant context and generate answers
└── RAG Pipeline    → End-to-end question answering system
```
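The four layers above can be illustrated with a toy end-to-end pipeline. This sketch deliberately replaces the real components (HuggingFace embeddings, ChromaDB, Ollama) with a bag-of-words embedding, an in-memory list, and a prompt-formatting stub, so none of the names below are DataSage's actual API.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. DataSage uses HuggingFace models here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Indexing layer: embed and store documents (DataSage persists these in ChromaDB).
docs = ["Pandas loads CSV files", "Ollama runs local LLMs", "ChromaDB stores vectors"]
index = [(doc, embed(doc)) for doc in docs]

# Query layer: retrieve the most similar document, then build the LLM prompt.
def answer(question: str) -> str:
    context = max(index, key=lambda pair: cosine(embed(question), pair[1]))[0]
    # A real generator would send this prompt to Ollama; we just return it.
    return f"Context: {context}\nQuestion: {question}"

print(answer("Which tool runs LLMs locally?"))
```

The retrieval step picks `"Ollama runs local LLMs"` as context because it shares the most words with the question; the real pipeline does the same thing with dense embeddings and a vector database.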
- Python 3.10 or higher
- Ollama (for local LLM inference)
The package is published on PyPI:
https://pypi.org/project/datasage-mds/
### 1. Install datasage
```bash
pip install datasage-mds
```
### 2. Install Ollama
Download and install Ollama from [ollama.com](https://ollama.com/download).
Once installed, open a separate terminal and pull a model:

```bash
ollama pull llama3.1
```

Verify the installation:

```bash
ollama run llama3.1
```

Supported file formats:

- CSV: Loaded with metadata for each row
- PDF: Extracted page by page
- TXT: Loaded as a single document
- XLSX: Extracted sheet by sheet
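A format-dispatch loader like the one behind this list can be sketched as follows. The function names and the extension-to-loader mapping are hypothetical, not DataSage's actual implementation (that lives in `rag_engine/ingestion/loaders.py`); PDF and XLSX loaders are omitted here because they need third-party libraries.

```python
import csv
from pathlib import Path

def load_txt(path: str) -> list:
    # TXT: the whole file becomes a single document.
    return [Path(path).read_text()]

def load_csv(path: str) -> list:
    # CSV: one document per row, keeping the row index as metadata.
    with open(path, newline="") as f:
        return [({"row": i}, row) for i, row in enumerate(csv.DictReader(f))]

LOADERS = {".txt": load_txt, ".csv": load_csv}

def load(path: str) -> list:
    # Dispatch on the file extension; unknown formats fail loudly.
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"Unsupported format: {suffix}")
    return LOADERS[suffix](path)
```

Keeping the mapping in a dict makes adding a new format a one-line change: register another `load_*` function under its extension.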
- Document Q&A: Query large documents using natural language
- Knowledge Base Search: Build searchable knowledge bases
- Customer Support: Answer questions from documentation
- Research Assistant: Extract information from academic papers
- Code Documentation: Query codebases and technical docs
- Sub-package: ingestion
  - Modules: loaders.py, chunker.py
- Sub-package: indexing
  - Modules: embedder.py, vector_store.py, index_engine.py
- Sub-package: retrieval
  - Modules: rag_engine/__init__.py, generator.py, retriever.py, data_models.py
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Built with LangChain
- Embeddings powered by HuggingFace
- Vector storage by ChromaDB
- Local LLM inference via Ollama
For questions or support, please open an issue on GitHub.
Made with ❤️ by the DataSage Team
```
datasage_data533_step_3
├─ .DS_Store
├─ coverage.json
├─ datasage_store
│  └─ chroma.sqlite3
├─ main.py
├─ project_description.pdf
├─ rag_engine
│  ├─ .DS_Store
│  ├─ indexing
│  │  ├─ embedder.py
│  │  ├─ indexing_documentation_updated.md
│  │  ├─ index_engine.py
│  │  ├─ testing_readme.md
│  │  └─ vector_store.py
│  ├─ ingestion
│  │  ├─ chunker.py
│  │  ├─ coverage_ingestion
│  │  │  ├─ coveragehtml_ingestion.png
│  │  │  └─ coverage_ingestion.png
│  │  ├─ documentation.md
│  │  ├─ loaders.py
│  │  ├─ README.md
│  │  └─ __init__.py
│  ├─ retrieval
│  │  ├─ data_models.py
│  │  ├─ documentation.md
│  │  ├─ generator.py
│  │  ├─ README.md
│  │  ├─ retriever.py
│  │  └─ __init__.py
│  ├─ tests
│  │  ├─ coverage_report.png
│  │  ├─ test_csv_loader.py
│  │  ├─ test_data_models.py
│  │  ├─ test_embedder.py
│  │  ├─ test_generator.py
│  │  ├─ test_index_engine.py
│  │  ├─ test_pdf_loader.py
│  │  ├─ test_retriever.py
│  │  ├─ test_text_chunker.py
│  │  ├─ test_txt_loader.py
│  │  ├─ test_vector_store.py
│  │  └─ __init__.py
│  ├─ rag_engine.py
│  └─ __init__.py
├─ README.md
├─ pyproject.toml
├─ requirements.txt
├─ search_test.txt
├─ test_data.csv
└─ utils_test.txt
```