NoteWeb is a local-first AI tool that semantically indexes and searches your documents using LLaMA 3 and vector embeddings.

marcanjoul/NoteWeb

🧠 NoteWeb

A local-first AI-powered tool to search, summarize, and understand your notes. Built for students, by a student.
Uses chunking, vector embeddings, and LLaMA 3 via Ollama for real semantic understanding.


Features

  • Index PDFs, Word, Excel, and PowerPoint files into semantic chunks
  • Embed and store those chunks using vector search
  • Ask questions using LLaMA 3 via Ollama (offline + local)
  • Get answers with real context grounding (like the ChatGPT Retrieval Plugin)
  • Follow-up question support with persistent memory
  • Ingest multiple files at once from a folder
  • Test-ready with the test files/ folder

Folder Structure

noteweb/
├── main.py                # CLI entrypoint
├── search.py              # Embedding search logic
├── llm_answerer.py        # Sends chunks to LLaMA via Ollama
├── embedder.py            # Generates embeddings
├── chunker.py             # Breaks files into semantic chunks
├── files_loader.py        # PDF loader (more formats soon)
├── generate_index.py      # Index generator (embeds + saves)
├── embeddings_index.json  # Your saved vector index
├── test files/            # Sample PDFs to test with
├── requirements.txt
└── venv/                  # Your virtual environment

Requirements

  • Python 3.11+
  • Ollama installed and running
    brew install ollama
    ollama run llama3
  • Install dependencies:
    pip install -r requirements.txt

How to Use

1. Clone the Repository

git clone https://github.com/marcanjoul/noteweb.git
cd noteweb

Or, if you downloaded the ZIP, unzip it and navigate into the folder via:

cd ~/Desktop/noteweb-main  # or wherever you saved it

2. Set Up a Virtual Environment

We recommend using a virtual environment to keep dependencies clean:

python3 -m venv venv
source venv/bin/activate  # macOS/Linux; on Windows, use venv\Scripts\activate

3. Install Dependencies

Run the following to install all required packages:

pip install -r requirements.txt

If needed, manually install these extras:

pip install python-docx python-pptx openpyxl sentence-transformers

4. Add Your Files

Create or drop any files you want to search into the test files/ directory. Supported formats: .pdf .docx .pptx .xlsx
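As a rough illustration of how a folder could be filtered down to the supported formats, here is a minimal sketch. The function name `find_supported_files` and the constant `SUPPORTED_EXTENSIONS` are hypothetical and are not taken from NoteWeb's source.

```python
from pathlib import Path

# Extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}


def find_supported_files(folder):
    """Return sorted paths of supported files directly inside `folder`.

    Hypothetical helper for illustration; returns [] if the folder is missing.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_supported_files("test files")` would list the documents that are candidates for indexing. Matching on the lowercased suffix means `NOTES.PDF` is picked up as well.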

5. Generate the Index

Run this to generate semantic embeddings from the files in your folder:

python generate_index.py

This will:

  • Load all .pdf, .docx, .xlsx, and .pptx files from your test files/ folder
  • Chunk the content into semantically meaningful parts
  • Embed each chunk using sentence-transformers
  • Save everything to embeddings_index.json
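The steps above can be sketched roughly as follows. This is not NoteWeb's actual code: `chunk_text`, `build_index`, and `save_index` are hypothetical names, the fixed-size overlapping word window is a simple stand-in for semantic chunking, and the JSON record layout is an assumption. The embedding function is passed in as `embed` so the sketch runs without any model installed.

```python
import json


def chunk_text(text, max_words=120, overlap=20):
    """Split text into overlapping word windows (a stand-in for semantic chunking)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_index(documents, embed):
    """Build index records from {filename: text}.

    `embed` is any callable mapping a string to a sequence of numbers.
    The record fields below are an assumed layout, not NoteWeb's real schema.
    """
    index = []
    for filename, text in documents.items():
        for position, chunk in enumerate(chunk_text(text)):
            index.append({
                "source": filename,
                "chunk_id": position,
                "text": chunk,
                "embedding": [float(x) for x in embed(chunk)],
            })
    return index


def save_index(index, path="embeddings_index.json"):
    """Write the index to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)
```

With sentence-transformers installed, `embed` could be something like `SentenceTransformer("all-MiniLM-L6-v2").encode`. That model name is an assumption, since this README does not say which model NoteWeb loads.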

6. Search Your Files with the AI

python search.py

This will:

  • Search your indexed chunks for relevant context

  • Pass top matches to LLaMA 3

  • Return an answer based on your notes

  • You can also ask follow-up questions, as NoteWeb remembers the context!
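To make the retrieval step concrete, here is a minimal sketch of ranking indexed chunks by cosine similarity. It assumes each index record is a dict with an `"embedding"` list, which is an illustrative layout rather than NoteWeb's confirmed format; `cosine_similarity` and `top_matches` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_embedding, index, k=3):
    """Return the k index records most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record)
        for record in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:k]]
```

The query must be embedded with the same model as the chunks, otherwise the similarity scores are not meaningful. A linear scan like this is fine for a personal notes folder; larger collections usually move to a dedicated vector index.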

💡 Example Usage

What is the difference between supervised and unsupervised learning?

❗ Requirements

Make sure you have:

  • Python 3.9+
  • pip
  • Optional: Ollama installed and running (for local LLaMA 3 support)

💡 Optional: Skip venv (Not Recommended)

You can also run NoteWeb without using a virtual environment:

pip install -r requirements.txt
pip install python-docx python-pptx openpyxl sentence-transformers

7. Ask a Question

python main.py --search "What is instruction-level parallelism?"
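For readers curious what happens after retrieval, the sketch below shows one way the top chunks and earlier turns could be combined into a prompt and sent to a local Ollama server. It uses Ollama's `/api/generate` HTTP endpoint on the default port 11434. The prompt wording and the names `build_prompt`, `build_request`, and `ask` are assumptions for illustration, not what `llm_answerer.py` actually does.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(question, chunks, history=()):
    """Assemble a grounded prompt from retrieved chunks and earlier (question, answer) turns."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    past = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    parts = ["Answer using only the notes below.", "Notes:\n" + context]
    if past:
        parts.append("Earlier conversation:\n" + past)
    parts.append("Question: " + question)
    return "\n\n".join(parts)


def build_request(prompt, model="llama3"):
    """Create the HTTP POST request for a single non-streaming completion."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(question, chunks, history=()):
    """Send the prompt to a running Ollama server and return the answer text."""
    request = build_request(build_prompt(question, chunks, history))
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Follow-up questions can be supported by appending each `(question, answer)` pair to `history` and passing it to the next call. `ask` needs Ollama running with the `llama3` model pulled; the two builder functions work without it.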

Why This Matters

NoteWeb simulates real retrieval-augmented generation (RAG), the same strategy used in:

  • ChatGPT w/ File Uploads
  • Perplexity AI
  • Open-source RAG pipelines (like LangChain, LlamaIndex)

But here, it's all:

  • Local
  • Educational
  • Hackable

Perfect for learning how vector search + LLMs work together.


File Support

  • ✅ PDF
  • ✅ DOCX (.docx)
  • ✅ PowerPoint (.pptx)
  • ✅ Excel (.xlsx)
  • 🔜 TXT, Markdown, Web scraping

You can drop files into the test files/ folder!


Roadmap

  • Multi-file indexing (entire folders at once)
  • DOCX, PPTX, and XLSX support
  • Follow-up question support
  • Optional toggle in code for chunk visibility
  • Index caching to skip re-embedding unchanged files
  • Command-line UI / TUI
  • Web UI
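As a sketch of how the index-caching item could work, one option is to store a content hash per file and re-embed only files whose hash has changed. Everything here is hypothetical: the function names, the separate cache file, and the choice of SHA-256 are illustrative and not part of NoteWeb today.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_cache(cache_path):
    """Load the {path: digest} mapping, or an empty dict if no cache exists yet."""
    cache_file = Path(cache_path)
    if not cache_file.exists():
        return {}
    return json.loads(cache_file.read_text(encoding="utf-8"))


def files_needing_embedding(paths, cache_path):
    """Return the paths whose content differs from the digest stored in the cache."""
    cache = load_cache(cache_path)
    return [p for p in paths if cache.get(str(p)) != file_digest(p)]


def record_digests(paths, cache_path):
    """Store current digests so unchanged files can be skipped on the next run."""
    cache = load_cache(cache_path)
    cache.update({str(p): file_digest(p) for p in paths})
    Path(cache_path).write_text(json.dumps(cache), encoding="utf-8")
```

Hashing file contents rather than comparing modification times avoids re-embedding a file that was merely copied or touched, at the cost of reading every file once per run.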

Project Status

NoteWeb is my first AI-integrated project, made while learning:

  • How LLMs like LLaMA work
  • What "semantic search" really means
  • How chunking, embeddings, and vector stores come together

This is the foundation for bigger projects: search tools, academic companions, even personalized AI.


Credits


Want to improve or collaborate?

Open an issue, drop a PR, or fork it and make it your own. I'd appreciate any feedback!
