Note: Click on the picture above to play the video. If you want to download the video, it is located in the Video_Demo folder.
AI-powered tool for automated mining and analysis of large PDF financial documents, featuring semantic search, question-answering, and multi-language translation.
- Faisal Budhwani
- Shantanu Joshi
- Siddharth Kulkarni
Financial documents like analyst reports, annual filings, and statements are often complex and unstructured, making it difficult and time-consuming to extract actionable insights. Our analyser leverages AI and automation to transform these documents into structured, searchable data—empowering users to quickly find answers, analyze trends, and make better decisions.
For more details, see the report.pdf in the docs folder.
- Automated PDF Extraction: Extracts text, tables, images, and complex layouts using AI and OCR.
- Semantic Search & QA: Ask questions in natural language and get accurate answers with source references, powered by vector search and finance-specific language models.
- AI Analysis: Identifies key financial metrics, summarizes sections, and highlights risk factors for deeper insights.
- Interactive Visualizations: Generates charts and dashboards for trends, comparisons, and summaries.
- Multi-Language Support: Translates documents while preserving formatting and financial terminology, supporting global teams and compliance needs.
- Integration Ready: Extracted data can be exported or integrated with databases and analytics tools.
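To make the Semantic Search & QA feature above more concrete, here is a minimal sketch of top-k retrieval with source references. It is not the project's implementation: the project lists sentence-transformer embeddings and FAISS, whereas this sketch swaps in a toy bag-of-words vector and a brute-force cosine-similarity scan so that it runs with the standard library only. The names `embed`, `cosine`, `search`, and the sample `chunks` are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, chunks, k=2):
    """Return the k chunks most similar to the query as (score, page, text)."""
    q = embed(query)
    scored = [(cosine(q, embed(c["text"])), c) for c in chunks]
    # Sort on the score only, so the chunk dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(score, 3), c["page"], c["text"]) for score, c in scored[:k]]


# Hypothetical chunks, as if extracted from a PDF with their page numbers.
chunks = [
    {"page": 12, "text": "Revenue grew 8% driven by cloud subscriptions."},
    {"page": 31, "text": "Key risk factors include currency exposure and supplier concentration."},
    {"page": 47, "text": "The board approved a share buyback program."},
]

results = search("What are the key risk factors?", chunks, k=1)
for score, page, text in results:
    print(f"p.{page} (score {score}): {text}")
```

In a real pipeline the retrieved chunks, together with their page numbers, would be passed to the language model as context, which is what allows answers to cite their sources.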
- Clone & Install
    git clone https://github.com/siddharthck/255-pdf-data-mining.git
    cd 255-pdf-data-mining
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
- Set API Key
  Add your OpenAI key to a .env file:
    OPENAI_API_KEY=your_openai_api_key_here
- Run Streamlit
    streamlit run main.py
  Access at http://localhost:8501
Alternatively, you can run the FastAPI + React version (not stable).
Run FastAPI
    python -m uvicorn main.fastapi_app:app --reload
  Access at http://localhost:8000
Start Frontend
    cd frontend-v1
    npm install
    npm start
  Access at http://localhost:3000
Note: The Streamlit UI is more stable and is recommended for use.
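If the app fails to start because of a missing key, it can help to confirm that the OPENAI_API_KEY entry from the Set API Key step is readable from Python. The sketch below is an assumption-laden check, not code from the repository: the project may load the key differently (for example via a package such as python-dotenv), and the helpers `load_env_file` and `has_openai_key` are hypothetical names that use only the standard library.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Values already present in the environment are left untouched.
    Returns the key/value pairs found in the file.
    """
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def has_openai_key():
    """True if OPENAI_API_KEY is set and is not the placeholder value."""
    key = os.environ.get("OPENAI_API_KEY", "")
    return bool(key) and key != "your_openai_api_key_here"


if __name__ == "__main__":
    load_env_file()
    print("OPENAI_API_KEY set:", has_openai_key())
```

Run it from the repository root, next to the .env file. It reports only whether a key is present, never the key itself.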
- "Summarize the company’s main revenue sources."
- "What are the key risk factors?"
- "Show revenue trends over the last 3 years."
- Faster, More Accurate Analysis: Automates extraction and reduces human error, delivering insights in seconds.
- Handles Complex Documents: Interprets tables, charts, and unstructured text—no manual copy-paste needed.
- Supports Multilingual Teams: Translate and analyze documents for global collaboration and compliance.
- Integration & Scalability: Ready for integration with your analytics or reporting stack.
Pull requests welcome! For major changes, please open an issue first.
- OpenAI, FAISS, Hugging Face, Streamlit
Faisal Budhwani (017627363):
- PDF processing pipeline implementation (pdfplumber, Camelot, PyMuPDF integration)
- Text extraction and preprocessing algorithms
- Document structure recognition and section identification
- OCR integration and image processing workflows
Shantanu Joshi (018173987):
- Semantic search engine development using FAISS
- Sentence transformer integration and vector embedding generation
- Question-answering system implementation with GPT-4
- Search optimization and context management algorithms
Siddharth Kulkarni (018219435):
- EDA pipeline development and automated visualization generation
- Translation system implementation with dual-approach strategy
- Streamlit frontend development and user interface design
- System integration and comprehensive testing
