# 🔬 DeepResearcher

**Advanced AI Research Assistant with Local LLM Integration and RAG Pipeline**

## ✨ Features

- 🤖 **Local LLM Integration**: runs local language models via llama-cpp-python for privacy and control
- 📚 **RAG Pipeline**: Retrieval-Augmented Generation over vector embeddings using FAISS
- 🌐 **Web Search**: live web search integration using DuckDuckGo
- 📄 **Multi-modal File Support**:
  - PDF and DOCX document processing
  - Image OCR using Tesseract
  - Audio transcription with Faster-Whisper
  - Video analysis (audio transcription + frame OCR)
- 💻 **Streamlit Web UI**: intuitive interface with drag & drop, chat, and settings
- 💾 **Persistent Storage**: save and load knowledge bases
- 📊 **Export Options**: download chat history and results

## 🚀 Quick Start

### Prerequisites

- Python 3.8+
- Git
- Tesseract OCR (for image processing)
- FFmpeg (for audio/video processing)

### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Drxmukesh/DeepResearch.git
   cd DeepResearch
   ```

2. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install system dependencies:

   **Ubuntu/Debian:**

   ```bash
   sudo apt-get update
   sudo apt-get install tesseract-ocr ffmpeg
   ```

   **macOS:**

   ```bash
   brew install tesseract ffmpeg
   ```

   **Windows:** install the Visual Studio C++ Build Tools (required to compile llama-cpp-python), then install Tesseract and FFmpeg and add them to your `PATH`.

4. Download LLM models:

   ```bash
   python scripts/setup_models.py
   ```

   Or manually download models into the `models/` directory.

5. Run the application:

   ```bash
   streamlit run app.py
   ```

6. Open your browser and navigate to http://localhost:8501

## 📖 Usage Guide

### 1. Upload Documents

- Drag and drop files into the upload area
- Supported formats: PDF, DOCX, TXT, PNG, JPG, MP3, WAV, MP4, AVI
- Click **Process Files** to add them to your knowledge base

### 2. Configure Settings

Use the sidebar to adjust:

- **Model Selection**: choose your local LLM
- **Generation Parameters**: temperature, max tokens, context length
- **RAG Settings**: web search, similarity threshold
- **File Processing**: chunk size and overlap (sketched below)
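
Chunk size and overlap determine how documents are split before embedding; the overlap keeps sentences that straddle a boundary intact in at least one chunk. A minimal sketch of the idea (the project's actual splitter in `utils/file_processor.py` may differ):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size chunks that share `overlap` characters."""
    step = chunk_size - overlap  # how far the window advances each iteration
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```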

### 3. Ask Questions

- Type your research question in the chat interface
- DeepResearcher searches both your documents and the web (see the sketch below)
- Get comprehensive answers with source citations
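
Conceptually, answering follows the usual RAG pattern: retrieve, augment, generate. The sketch below is illustrative only; the object and method names are assumptions, not the project's actual API (see `utils/rag_pipeline.py` for the real implementation):

```python
def answer(question, vector_store, web_search, llm):
    """Illustrative RAG flow: retrieve context, augment the prompt, generate."""
    # 1. Retrieve the most similar chunks from the local knowledge base.
    local_hits = vector_store.search(question, k=5)
    # 2. Optionally pull in live web results for fresher context.
    web_hits = web_search.search(question, max_results=3)
    # 3. Build an augmented prompt and let the local LLM answer from it.
    context = "\n\n".join(hit["text"] for hit in local_hits + web_hits)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations:"
    return llm.generate(prompt)
```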

### 4. Manage Knowledge Base

- View the document count in the Knowledge Base tab
- Clear or back up your vector store
- Export chat history as Markdown (sketched below)
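
For reference, exporting a chat history to Markdown can be as simple as the following hypothetical helper (the in-app export button handles this for you):

```python
def export_history(messages, path="chat_history.md"):
    """Write each chat turn as its own Markdown section."""
    with open(path, "w", encoding="utf-8") as f:
        for msg in messages:
            f.write(f"### {msg['role'].title()}\n\n{msg['content']}\n\n")
```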

## 🏗️ Architecture

```
DeepResearcher/
├── app.py                 # Main Streamlit application
├── config.py              # Configuration settings
├── requirements.txt       # Python dependencies
├── utils/
│   ├── llm_handler.py     # Local LLM management
│   ├── rag_pipeline.py    # RAG implementation
│   ├── file_processor.py  # Multi-modal file processing
│   ├── web_search.py      # Web search functionality
│   └── vector_store.py    # FAISS vector database
├── models/                # Local LLM models
├── vector_store/          # Persistent vector storage
└── scripts/
    └── setup_models.py    # Model download script
```

## 🔧 Configuration

### Model Configuration

Edit `config.py` to customize:

- Model paths and settings
- RAG parameters
- File processing options
- Web search settings
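
For example, a `config.py` along these lines (the constant names here are illustrative; check the actual file for the real keys):

```python
# Illustrative config.py values -- the real keys may differ.
MODEL_PATH = "models/llama-2-7b-chat.Q4_0.gguf"  # local GGUF model
N_CTX = 2048            # context window passed to llama-cpp-python
TEMPERATURE = 0.7       # sampling temperature
CHUNK_SIZE = 500        # characters per document chunk
CHUNK_OVERLAP = 50      # overlap between consecutive chunks
TOP_K = 5               # chunks retrieved per query
MAX_WEB_RESULTS = 3     # DuckDuckGo results merged into context
```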

### Environment Variables

Create a `.env` file for sensitive settings:

```
OPENAI_API_KEY=your_key_here   # Optional: for embedding models
HUGGINGFACE_TOKEN=your_token   # Optional: for model downloads
```
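
Assuming the app reads these via python-dotenv (a common pattern, not confirmed from the source), the values become available like so:

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads key=value pairs from .env into the environment
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # None if the variable is unset
```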

## 📊 Supported File Types

| Type      | Extensions          | Processing Method                 |
|-----------|---------------------|-----------------------------------|
| Documents | PDF, DOCX, TXT      | Text extraction                   |
| Images    | PNG, JPG, JPEG, GIF | OCR with Tesseract                |
| Audio     | MP3, WAV, M4A, FLAC | Transcription with Faster-Whisper |
| Video     | MP4, AVI, MOV, MKV  | Audio transcription + frame OCR   |
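
A natural way to route these is a dispatch table keyed by file extension. The handler names below are hypothetical; the real implementations live in `utils/file_processor.py`:

```python
from pathlib import Path

# Hypothetical extension-to-handler map (handler names are illustrative).
HANDLERS = {
    ".pdf": "extract_document", ".docx": "extract_document", ".txt": "extract_document",
    ".png": "ocr_image", ".jpg": "ocr_image", ".jpeg": "ocr_image", ".gif": "ocr_image",
    ".mp3": "transcribe_audio", ".wav": "transcribe_audio",
    ".m4a": "transcribe_audio", ".flac": "transcribe_audio",
    ".mp4": "process_video", ".avi": "process_video",
    ".mov": "process_video", ".mkv": "process_video",
}

def handler_for(path: str) -> str:
    """Return the name of the handler for a file, or raise for unknown types."""
    ext = Path(path).suffix.lower()
    try:
        return HANDLERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}") from None
```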

## 🚀 Advanced Features

### Custom Prompts

Modify the system prompts in `utils/rag_pipeline.py` to customize the assistant's behavior.
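
For instance, a stricter research-oriented system prompt might look like this (illustrative wording, not the shipped prompt):

```python
# Illustrative system prompt -- adapt the actual string in utils/rag_pipeline.py.
SYSTEM_PROMPT = (
    "You are a careful research assistant. Answer strictly from the provided "
    "context and cite each source you use. If the context does not contain "
    "the answer, say so instead of guessing."
)
```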

### Model Integration

Add new LLM models by:

1. Placing the model files in the `models/` directory
2. Updating `config.py` with the model paths
3. Restarting the application

### Vector Store Backends

Switch between FAISS and Chroma by modifying `utils/vector_store.py`.
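
Under the hood, the FAISS path reduces to a small set of index operations. A self-contained sketch with `faiss-cpu` and NumPy, independent of the project's wrapper:

```python
import faiss      # pip install faiss-cpu
import numpy as np

dim = 384                                    # embedding dimensionality (model-dependent)
index = faiss.IndexFlatL2(dim)               # exact L2 nearest-neighbour index
chunks = np.random.rand(100, dim).astype("float32")  # stand-in for real embeddings
index.add(chunks)                            # index the document chunks

query = np.random.rand(1, dim).astype("float32")     # stand-in query embedding
distances, ids = index.search(query, 5)      # ids of the 5 closest chunks
```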

## 🐛 Troubleshooting

### Common Issues

**Model Loading Errors:**

- Ensure model files are in the correct format (GGML/GGUF)
- Check available RAM (models typically need 4-8 GB)
- Verify the model paths in your configuration

**File Processing Errors:**

- Install the system dependencies (Tesseract, FFmpeg)
- Check file permissions and sizes
- Confirm the file format is supported

**Web Search Issues:**

- Check your internet connection
- Verify DuckDuckGo search is not blocked on your network
- Try reducing the maximum number of web results

### Performance Optimization

**For Better Speed:**

- Use quantized models (Q4_0, Q5_0)
- Reduce the context length for faster inference
- Enable GPU acceleration if available (see the sketch below)
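
With llama-cpp-python, these knobs map directly onto the `Llama` constructor; a minimal sketch with a placeholder model filename:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_0.gguf",  # placeholder name; use your model
    n_ctx=2048,       # shorter context window -> faster inference
    n_gpu_layers=35,  # offload layers to GPU (0 = CPU only; needs a GPU-enabled build)
)
out = llm("Summarize retrieval-augmented generation in one sentence.", max_tokens=64)
```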

**For Better Quality:**

- Use larger models (13B or 30B parameters)
- Increase the context length
- Increase the temperature for more creative (less deterministic) responses

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and test thoroughly
4. Submit a pull request with a detailed description

### Development Setup

```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
python -m pytest tests/

# Format code
black .
isort .
```

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


**DeepResearcher** - *Empowering research with local AI and advanced RAG capabilities!* 🔬✨
