A local-first AI-powered tool to search, summarize, and understand your notes, built for students, by a student.
Uses chunking, vector embeddings, and LLaMA 3 via Ollama for real semantic understanding.
- Index PDFs, Word, Excel, and PowerPoint files into semantic chunks
- Embed and store those chunks using vector search
- Ask questions using LLaMA 3 via Ollama (offline + local)
- Get answers with real context grounding (like the ChatGPT Retrieval Plugin)
- Follow-up question support with persistent memory
- Ingest multiple files at once from a folder
- Test-ready with the test files/ folder
noteweb/
├── main.py                # CLI entrypoint
├── search.py              # Embedding search logic
├── llm_answerer.py        # Sends chunks to LLaMA via Ollama
├── embedder.py            # Generates embeddings
├── chunker.py             # Breaks files into semantic chunks
├── files_loader.py        # PDF loader (more formats soon)
├── generate_index.py      # Index generator (embeds + saves)
├── embeddings_index.json  # Your saved vector index
├── test files/            # Sample PDFs to test with
├── requirements.txt
└── venv/                  # Your virtual environment
- Python 3.11+
- Ollama installed and running
brew install ollama
ollama run llama3
- Install dependencies:
pip install -r requirements.txt
git clone https://github.com/marcanjoul/noteweb.git
cd noteweb
Or, if you downloaded the ZIP, unzip it and navigate into the folder:
cd ~/Desktop/noteweb-main # or wherever you saved it
We recommend using a virtual environment to keep dependencies clean:
python3 -m venv venv
source venv/bin/activate # macOS; on Windows, use venv\Scripts\activate
Run the following to install all required packages:
pip install -r requirements.txt
If needed, manually install these extras:
pip install python-docx python-pptx openpyxl sentence-transformers
Create or drop any files you want to search into the test files/ directory. Supported formats: .pdf .docx .pptx .xlsx
python generate_index.py
This will:
- Load all .pdf, .docx, .xlsx, and .pptx files from your test files/ folder
- Chunk the content into semantically meaningful parts
- Embed each chunk using sentence-transformers
- Save everything to embeddings_index.json
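The indexing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in generate_index.py: the function names, the word-window chunking, and the JSON layout are all assumptions, and `toy_embed` is a hashing stand-in so the sketch runs without downloading a model. NoteWeb itself embeds chunks with sentence-transformers.

```python
import hashlib
import json
import math
from pathlib import Path


def chunk_text(text, max_words=120, overlap=20):
    """Split text into overlapping word windows (a simple stand-in for semantic chunking)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dim=64):
    """Hypothetical embedder: hash each word into a bucket, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(docs, embed=toy_embed):
    """docs maps a file name to its extracted text; returns one record per chunk."""
    index = []
    for source, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({"source": source, "chunk_id": i,
                          "text": chunk, "embedding": embed(chunk)})
    return index


def save_index(index, path="embeddings_index.json"):
    """Write the index as JSON (the record layout here is assumed, not NoteWeb's)."""
    Path(path).write_text(json.dumps(index), encoding="utf-8")
```

Passing a different `embed` callable (for example, one that wraps a sentence-transformers model) swaps in real semantic embeddings without touching the rest of the sketch.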
python search.py
This will:
- Search your indexed chunks for relevant context
- Pass top matches to LLaMA 3
- Return an answer based on your notes
- You can also ask follow-up questions, as NoteWeb remembers the context!
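To make the search and follow-up steps concrete, here is a rough sketch of top-k retrieval by cosine similarity plus a small conversation memory. The names (`top_k`, `Conversation`) and the record shape with an "embedding" key are assumptions for illustration; the real logic lives in search.py and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, index, k=3):
    """Return the k index records most similar to the query embedding."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


class Conversation:
    """Hypothetical memory: keeps recent Q/A turns so follow-ups have context."""

    def __init__(self, max_turns=5):
        self.turns = []
        self.max_turns = max_turns

    def add(self, question, answer):
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def as_text(self):
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
```

A follow-up question can then be answered by prepending `Conversation.as_text()` to the prompt, so the model sees the earlier turns alongside the newly retrieved chunks.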
💡 Example Usage
What is the difference between supervised and unsupervised learning?
Requirements
Make sure you have:
- Python 3.9+
- pip
- Optional: Ollama installed and running (for local LLaMA 3 support)
💡 Optional: Skip venv (Not Recommended)
You can also run NoteWeb without using a virtual environment:
pip install -r requirements.txt
pip install python-docx python-pptx openpyxl sentence-transformers
python main.py --search "What is instruction-level parallelism?"
NoteWeb performs retrieval-augmented generation (RAG), the same strategy used in:
- ChatGPT w/ File Uploads
- Perplexity AI
- Open-source RAG pipelines (like LangChain, LlamaIndex)
But here, it's all:
- Local
- Educational
- Hackable
Perfect for learning how vector search + LLMs work together.
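For the "generation" half of RAG, the retrieved chunks are pasted into the prompt so the model answers from your notes rather than from memory. The sketch below builds such a prompt and a request for Ollama's local HTTP API (`/api/generate` on the default port 11434). The prompt wording and function names are assumptions, not what llm_answerer.py actually sends.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(question, chunks):
    """Number the retrieved chunks and ask the model to answer only from them."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the notes below. "
        "If the notes do not contain the answer, say so.\n\n"
        f"Notes:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def build_request(question, chunks, model="llama3"):
    """Prepare (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": build_prompt(question, chunks), "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, chunks, model="llama3"):
    """Send the grounded prompt to a running Ollama server and return its answer."""
    with urllib.request.urlopen(build_request(question, chunks, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask(...)` requires Ollama to be running locally with the llama3 model pulled; `build_request` can be inspected on its own without a server.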
- ✅ PDF
- ✅ DOCX (.docx)
- ✅ PowerPoint (.pptx)
- ✅ Excel (.xlsx)
- Planned: TXT, Markdown, web scraping
You can drop files into the test files/ folder!
- Multi-file indexing (entire folders at once)
- DOCX, PPTX, and XLSX support
- Follow-up question support
- Optional toggle in code for chunk visibility
- Index caching to skip re-embedding unchanged files
- Command-line UI / TUI
- Web UI
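As an illustration of the index-caching item above, one possible approach is to store a content hash per file and re-embed only the files whose hash has changed. This is a hedged sketch with hypothetical helper names and an in-memory dict as the cache; it is not taken from NoteWeb's code.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file's bytes; changes whenever the content changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def files_to_reembed(paths, cache):
    """Return the paths that are new or whose content differs from the cached hash."""
    return [p for p in paths if cache.get(str(p)) != file_digest(p)]


def update_cache(paths, cache):
    """Record the current hash of each path after it has been re-embedded."""
    for p in paths:
        cache[str(p)] = file_digest(p)
    return cache
```

The cache dict could be saved next to embeddings_index.json as JSON so it survives between runs.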
NoteWeb is my first AI-integrated project, made while learning:
- How LLMs like LLaMA work
- What "semantic search" really means
- How chunking, embeddings, and vector stores come together
This is the foundation for bigger projects: search tools, academic companions, even personalized AI.
- Built with 💻 and ☕ by @marcanjoul
- PDF parsing via PyMuPDF
- DOCX parsing via python-docx
- PowerPoint parsing via python-pptx
- Excel parsing via openpyxl
- Embeddings via sentence-transformers
- LLM answers via Ollama and Meta's LLaMA 3
Open an issue, drop a PR, or fork it and make it your own. I'd appreciate any feedback!