A powerful document search and question-answering system built with AWS Bedrock, LangChain, ChromaDB, and Streamlit. Upload your documents, ask questions in natural language, and get AI-powered answers with source citations.
CLICK HERE TO USE THE STRANDS VERSION
- Multi-format Support: Upload PDF, TXT, and Markdown files
- Intelligent Search: Vector-based similarity search using AWS Bedrock embeddings
- Natural Language Q&A: Ask questions and get contextual answers from your documents
- Source Citations: See which documents were used to generate each answer
- Web Interface: Easy-to-use Streamlit interface
- File Management: Upload, view, and delete documents with ease
- Real-time Indexing: Re-index your knowledgebase whenever you add new documents
```
Documents (PDF/TXT/MD) → Text Extraction → Chunking → Vector Embeddings → ChromaDB
                                                                             ↓
User Question → Similarity Search → Relevant Chunks → AWS Bedrock LLM → Answer + Sources
```
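The flow above can be illustrated with a self-contained toy: a hashed bag-of-words vector stands in for the Bedrock Titan embeddings, and a cosine top-k search stands in for ChromaDB. This is a sketch of the idea, not the app's actual code.

```python
import hashlib
import math

def embed(text, dims=64):
    """Stand-in for Bedrock Titan embeddings: a hashed bag-of-words vector."""
    vec = [0.0] * dims
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# "Indexing": embed each chunk once and keep (chunk, vector) pairs.
chunks = [
    "The invoice must be paid within 30 days.",
    "Employees accrue two vacation days per month.",
    "The server room is on the third floor.",
]
index = [(c, embed(c)) for c in chunks]

# "Query": embed the question and take the most similar chunk.
question = "How many vacation days do employees get?"
qvec = embed(question)
ranked = sorted(index, key=lambda pair: cosine(qvec, pair[1]), reverse=True)
top_chunk = ranked[0][0]
print(top_chunk)
```

The real pipeline swaps `embed` for a Bedrock embeddings call and the sorted list for a ChromaDB collection, but the ranking principle is the same.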
- AWS Account with Bedrock access
- Python 3.8+
- AWS CLI configured or environment variables set
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd streamlit-kb
  ```
- Install dependencies:

  Option A: Using pip (traditional)

  ```bash
  pip install -r requirements.txt
  ```

  Option B: Using uv (faster, recommended)

  ```bash
  # Install uv if you haven't already
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Install dependencies with uv
  uv pip install -r requirements.txt
  ```
- Set up AWS credentials (choose one method):

  Option A: AWS CLI

  ```bash
  aws configure
  ```

  Option B: Environment variables

  ```bash
  export AWS_ACCESS_KEY_ID=your_access_key
  export AWS_SECRET_ACCESS_KEY=your_secret_key
  export AWS_DEFAULT_REGION=us-west-2
  ```

  Option C: .env file

  ```
  # Create .env file in project root
  AWS_ACCESS_KEY_ID=your_access_key
  AWS_SECRET_ACCESS_KEY=your_secret_key
  AWS_DEFAULT_REGION=us-west-2
  ```
- Enable Bedrock Models (in the AWS Console):
  - Go to the AWS Bedrock Console
  - Navigate to "Model access"
  - Enable these models:
    - `amazon.titan-embed-text-v1` (for embeddings)
    - `us.amazon.nova-micro-v1:0` (for Q&A)
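Once the models are enabled, the embedding model is invoked through `bedrock-runtime` with a JSON request body. The request shape below (an `"inputText"` field, with the embedding returned under `"embedding"`) is an assumption based on the Titan text-embedding API; verify it against the current Bedrock model reference for your model version.

```python
import json

AWS_BEDROCK_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v1"

def titan_embedding_body(text):
    # Titan text embeddings expect a JSON body with an "inputText" field
    # (assumption -- check the Bedrock model reference for your model version).
    return json.dumps({"inputText": text})

body = titan_embedding_body("hello knowledgebase")
print(body)

# With credentials configured, the call would look roughly like:
#   import boto3
#   client = boto3.client("bedrock-runtime", region_name="us-west-2")
#   resp = client.invoke_model(modelId=AWS_BEDROCK_EMBEDDING_MODEL_ID, body=body)
#   embedding = json.loads(resp["body"].read())["embedding"]
```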
- Run the application:

  ```bash
  streamlit run app.py
  ```
- Open your browser to http://localhost:8501
- Navigate to the "Upload Files" tab
- Upload PDF, TXT, or MD files
- Click "Re-index Knowledgebase" to process documents
- Go to the "Ask Questions" tab
- Type your question in natural language
- Click "Generate Answer" to get AI-powered responses
- Review the source documents to verify accuracy
- Use the "Delete Files" tab to remove documents
- Re-index after deleting files to update the search index
You can modify these constants in app.py:
```python
# Embedding model for document vectorization
AWS_BEDROCK_EMBEDDING_MODEL_ID = "amazon.titan-embed-text-v1"

# Language model for question answering
AWS_BEDROCK_LLM_MODEL_ID = "us.amazon.nova-micro-v1:0"

# AWS region
AWS_REGION = "us-west-2"

# Text chunking parameters
chunk_size = 500    # Characters per chunk
chunk_overlap = 50  # Overlap between chunks
```

```
project/
├── app.py            # Main application
├── data/             # Uploaded documents (auto-created)
├── requirements.txt  # Python dependencies
├── .env              # AWS credentials (optional)
└── README.md         # This file
```
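The `chunk_size` and `chunk_overlap` settings can be illustrated with a simplified fixed-width splitter. LangChain's `RecursiveCharacterTextSplitter` (which the app likely uses) is smarter about breaking on paragraph and sentence boundaries; this sketch only shows how the size and overlap parameters interact.

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    """Naive fixed-width chunking: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

doc = "x" * 1200
chunks = split_text(doc)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval quality.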
Create a requirements.txt file with:
```
streamlit>=1.28.0
langchain>=0.1.0
langchain-community>=0.0.10
boto3>=1.34.0
chromadb>=0.4.0
PyPDF2>=3.0.0
python-dotenv>=1.0.0
```

- Text Extraction: PDFs are converted to text using PyPDF2
- Chunking: Documents are split into ~500 character chunks with 50 character overlap
- Vectorization: Each chunk is converted to embeddings using AWS Bedrock Titan
- Storage: Vectors are stored in ChromaDB for fast similarity search
- Query Processing: User question is converted to vector embedding
- Similarity Search: Find the 3 most relevant document chunks
- Context Assembly: Relevant chunks are sent to AWS Bedrock Nova Micro
- Answer Generation: LLM generates answer based on document context
- Source Attribution: Original document chunks are shown for verification
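The context-assembly step can be sketched as a small helper that formats the retrieved chunks into a grounded prompt. The template below is illustrative, not necessarily the one `app.py` uses; the source labels are what makes the later source-attribution step possible.

```python
def build_prompt(question, top_chunks):
    # Label each retrieved chunk so the answer can cite its sources.
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(top_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the sources you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = ["Refunds are issued within 14 days.", "Shipping is free over $50."]
prompt = build_prompt("When are refunds issued?", chunks)
print(prompt)
```

The resulting string is what gets sent to the Nova Micro model, which keeps the answer anchored to the uploaded documents rather than the model's general knowledge.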
- Temporary Storage: ChromaDB uses temporary directories that are cleaned up automatically
- Local Processing: Documents are processed locally before sending to AWS
- AWS IAM: Ensure your AWS credentials have minimal required Bedrock permissions
- Data Privacy: Consider data sensitivity when using cloud AI services
Q: "No module named 'streamlit'"
```bash
pip install streamlit
```

Q: "Unable to locate credentials"
- Verify AWS credentials are configured
- Check AWS region has Bedrock access
- Ensure Bedrock models are enabled in AWS Console
Q: "No valid documents found to index"
- Upload documents first via "Upload Files" tab
- Ensure files are PDF, TXT, or MD format
- Check that files have readable content
Q: "Error generating answer"
- Re-index your knowledgebase
- Verify Bedrock models are enabled
- Check AWS credentials and permissions
Add this to see more detailed logs:
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```

```bash
streamlit run app.py
```

- Streamlit Cloud: Connect your GitHub repo
- AWS EC2: Deploy on EC2 instance with IAM role
- Docker: Containerize the application
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- AWS Bedrock for AI models
- LangChain for AI framework
- Streamlit for web interface
- ChromaDB for vector storage
- Issues: GitHub Issues
- Email: patweb99@gmail.com

Star this repo if you find it helpful!