Axiom logo

About The Project:

The General-Purpose Document Q&A System is an AI-powered tool designed to extract and answer questions from diverse document formats. It supports PDFs, DOCX, PPTX, XLSX, CSV, JSON, TXT, and images (PNG/JPG) via OCR, and integrates semantic search, structured data handling, and external link crawling for comprehensive knowledge retrieval.

✨ Features

  • 📄 Multi-format Document Processing: Handles PDFs, PowerPoint presentations, Word documents, images, CSV files, and more
  • 🤖 Data Analytics: Answers SQL-style queries over tabular data, such as computing the mean, median, and mode, or generating projections
  • 🔍 Advanced OCR: Extracts text from images embedded within documents
  • 🔎 Semantic Search: Finds information based on meaning, not just keywords
  • 📊 Table & Image Recognition: Processes structured data and visual elements
  • 📝 Source Attribution: Responses include references to source materials

📜 Getting Started

⚙️ Installation

  1. Clone the repo

    git clone https://github.com/moyrsd/Axiom.git
  2. Set up the frontend

    cd root/client
    npm install
    npm run dev

    Add your backend URL to the .env file (see .env.example for reference).

  3. Set up the backend

    sudo apt install sqlite3
    sudo apt install poppler-utils
    cd root/server
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

    Add your Google API key to the .env file (see .env.example for reference; a minimal loading sketch follows these steps), then start the server:

    uvicorn main:app --reload
  4. Use the application: go to http://localhost:3000/, upload any supported file (for example, a PDF), and ask questions about it.
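For reference, here is a minimal sketch of how the backend could load these settings at startup, assuming python-dotenv and the variable name GOOGLE_API_KEY; check .env.example for the keys the project actually expects:

```python
# Hypothetical configuration loading; the variable name is an assumption.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise RuntimeError("GOOGLE_API_KEY is missing; add it to server/.env")
```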

Code Architecture

root/
├── client/
│   ├── src/
│   │   ├── app/
│   │   │   ├── actions/
│   │   │   │   └── delete_tempfile.ts      # Deletes temporary files (server-side)
│   │   │   └── page.tsx                    # Main page component
│   │   ├── components/
│   │   │   ├── Filesidebar.tsx             # Sidebar for uploaded files
│   │   │   ├── MessageList.tsx             # Displays message responses
│   │   │   └── PromptBox.tsx               # User input box
│   │   ├── interface/
│   │   │   └── Interface.ts                # TypeScript interfaces
│   │   └── pages/api/
│   │       ├── ask.ts                      # API for querying
│   │       └── upload.ts                   # API for uploads
└── server/
    ├── main.py                             # Server entry point
    ├── requirements.txt                    # Python dependencies
    └── src/
        ├── config/
        │   └── constant.py                 # Server constants
        ├── document_processing/
        │   ├── image_parser.py             # Image file processing
        │   ├── pdf_parser.py               # PDF file processing
        │   ├── ppt_parser.py               # PPT file processing
        │   ├── structured_data_parser.py   # Structured data processing
        │   ├── text_parser.py              # Text file processing
        │   └── web_crawl.py                # Web crawling
        ├── prompts/
        │   ├── beautify_prompt.py          # Beautifies user prompts
        │   ├── dataprocessing_prompt.py    # Prepares data processing prompts 
        │   └── ocr_prompt.py               # OCR-related prompts
        ├── routers/
        │   ├── ask.py                      # Routes for querying
        │   ├── delete_tempfiles.py         # Routes for temp file deletion
        │   └── upload.py                   # Routes for file uploads
        └── services/
            ├── convert_to_json.py          # Converts to JSON
            ├── file_processor.py           # File processing workflows
            ├── llm_calls.py                # Interacts with LLMs
            ├── qa_chain.py                 # QA pipeline implementation
            └── vector_store.py             # Vector storage

🛠️ Technologies Used

Backend

  • FastAPI: High-performance Python framework optimized for building APIs with automatic documentation
  • LangChain: Framework for developing applications powered by language models
  • ChromaDB: Vector database for efficient similarity search and metadata storage

AI & ML Components

  • Gemini-2.0-Flash: Multimodal LLM for text generation and image understanding
  • Vector Embeddings: Semantic representation of document content for intelligent retrieval

Document Processing

  • PyMuPDF: Versatile library for parsing PDFs and extracting complex elements
  • python-pptx: Advanced toolkit for PowerPoint presentation analysis
  • BeautifulSoup & Requests: Web content extraction and formatting tools

Frontend

  • Next.js: React framework for building a responsive and dynamic user interface

🚀 Processing Pipeline

1. Document Upload & Processing

When a user uploads a document, the following process occurs:

  1. The client sends the document to the /upload API endpoint, and a reference to the file is added to the user's sidebar via Filesidebar.tsx.
  2. The server's upload.py router receives the file and initiates processing.
  3. Based on the file type, the appropriate parser is selected:
    • pdf_parser.py for PDF documents
    • ppt_parser.py for PowerPoint presentations
    • text_parser.py for plain text files
    • image_parser.py for images requiring OCR
    • structured_data_parser.py for data files (CSV, Excel, etc.)
    • web_crawl.py for processing web content

The file_processor.py orchestrates the document processing workflow:

  • Text extraction
  • Chunking content appropriately with source metadata
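A minimal sketch of this dispatch-and-chunk flow, assuming simplified parser helpers and a fixed-size chunker (the real upload.py and file_processor.py differ in detail):

```python
# Hypothetical sketch of the parse-and-chunk flow; helper names are assumptions.
from pathlib import Path

import fitz  # PyMuPDF, the PDF library the project uses


def parse_pdf(path: str) -> str:
    """Extract plain text from every page of a PDF."""
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def parse_text(path: str) -> str:
    return Path(path).read_text(encoding="utf-8", errors="ignore")


# Map file extensions to a parser; the real project also covers PPTX, images, CSV, etc.
PARSERS = {".pdf": parse_pdf, ".txt": parse_text}


def process_file(path: str, chunk_size: int = 1000) -> list[dict]:
    """Pick a parser by extension, then chunk the extracted text with source metadata."""
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported file type: {path}")
    text = parser(path)
    return [
        {"text": text[i : i + chunk_size], "metadata": {"source": Path(path).name}}
        for i in range(0, len(text), chunk_size)
    ]
```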

2. Knowledge Indexing

After processing:

  1. vector_store.py generates embeddings for the extracted content and stores them in ChromaDB along with the source metadata.
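A hedged sketch of this indexing step using the chromadb client directly; the collection name and ID scheme are assumptions, and the project may use LangChain's Chroma wrapper instead:

```python
# Hypothetical indexing sketch; reuses process_file() from the parsing sketch above.
import chromadb

client = chromadb.Client()  # in-memory client; a real deployment would persist to disk
collection = client.get_or_create_collection(name="axiom_documents")  # assumed name

chunks = process_file("report.pdf")

# Chroma embeds the documents with its default embedding function unless one is
# supplied explicitly; each chunk's metadata is stored alongside its embedding.
collection.add(
    ids=[f"report.pdf-{i}" for i in range(len(chunks))],
    documents=[c["text"] for c in chunks],
    metadatas=[c["metadata"] for c in chunks],
)
```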

3. Query & Response Generation

When a user submits a question:

  1. The query is sent from PromptBox.tsx to the /ask API endpoint.
  2. The ask.py router processes the request and directs it to the appropriate service.
  3. beautify_prompt.py refines the user's query for optimal retrieval.
  4. qa_chain.py implements the RAG pipeline:
    • Converts the query to a vector representation.
    • Retrieves relevant document chunks from ChromaDB.
    • Formats the context and query for the LLM.
    • Processes the response to include source attribution.

The response is returned to the client, where MessageList.tsx renders it for the user in Markdown format.
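A hedged end-to-end sketch of this retrieve-then-generate step, reusing the Chroma collection from the indexing sketch and LangChain's Gemini wrapper; the prompt wording and helper name are assumptions, not the project's actual qa_chain.py:

```python
# Hypothetical RAG query sketch; the real qa_chain.py differs in detail.
from langchain_google_genai import ChatGoogleGenerativeAI  # pip install langchain-google-genai


def answer(question: str, collection, n_results: int = 4) -> str:
    """Retrieve relevant chunks from ChromaDB, then ask Gemini with that context."""
    # 1. Embed the query and fetch the most similar chunks (Chroma embeds query_texts itself).
    hits = collection.query(query_texts=[question], n_results=n_results)
    chunks = hits["documents"][0]
    sources = {m["source"] for m in hits["metadatas"][0]}

    # 2. Format the retrieved context and the question for the LLM.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate the answer with Gemini and append source attribution.
    #    Requires GOOGLE_API_KEY to be set in the environment.
    llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")
    reply = llm.invoke(prompt)
    return f"{reply.content}\n\nSources: {', '.join(sorted(sources))}"
```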

4. Cleanup & Maintenance

  • Temporary files are managed through delete_tempfiles.py on the server.
  • Client-side cleanup is handled by the delete_tempfile.ts server action.
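For illustration only, a cleanup route in the FastAPI style the server uses might look like the following; the route path and temp directory are assumptions:

```python
# Hypothetical cleanup route; the actual delete_tempfiles.py may differ.
from pathlib import Path

from fastapi import APIRouter, HTTPException

router = APIRouter()
TEMP_DIR = Path("/tmp/axiom_uploads")  # assumed location for uploaded files


@router.delete("/delete_tempfiles/{filename}")
def delete_tempfile(filename: str):
    path = TEMP_DIR / Path(filename).name  # strip any directory components
    if not path.exists():
        raise HTTPException(status_code=404, detail="File not found")
    path.unlink()
    return {"deleted": filename}
```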

🧑‍💻 Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue. Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/AmazingFeature)
  3. Commit your Changes (git commit -m 'feat: adds some amazing feature')
  4. Push to the Branch (git push origin feat/AmazingFeature)
  5. Open a Pull Request
