This is a document search application I built for my machine learning class project. Instead of matching keywords, it converts your query and each document into sentence embeddings and returns the documents whose meaning is closest to the query. You can upload TXT, PDF, and DOCX files and search through them semantically.
I built the web interface with Flask and used Sentence Transformers to compute the embeddings.
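To make "search by meaning" concrete, here is a minimal sketch of the ranking step. It assumes documents are ranked by cosine similarity between embedding vectors, which is a common choice for sentence embeddings but is not confirmed to be what app.py does. The function names `cosine_similarity` and `rank_documents` are made up for illustration, and the toy 3-dimensional vectors stand in for real model output so the example runs without downloading the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=5):
    """Return up to top_k (filename, score) pairs, most similar first.

    doc_vecs maps a filename to its embedding vector.
    """
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# In a real run the vectors would come from the model, for example:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   query_vec = model.encode("machine learning")
# Toy vectors (assumption, for illustration only):
toy_docs = {
    "doc1.txt": [0.9, 0.1, 0.0],
    "doc2.txt": [0.0, 1.0, 0.0],
    "doc3.txt": [0.7, 0.7, 0.0],
}
ranking = rank_documents([1.0, 0.0, 0.0], toy_docs)
```

With these toy vectors, doc1.txt points in almost the same direction as the query and ranks first, while doc2.txt is orthogonal to it and ranks last.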
- Search documents by meaning using sentence embeddings
- Upload TXT, PDF, and DOCX files
- Get snippets from relevant documents with similarity scores
- Keep track of search history and favorite documents
- Simple web interface that works on mobile too
- How to use pre-trained transformer models for semantic search
- Building a Flask web application with proper structure
- Handling file uploads and text extraction
- Session management and user interface design
- Basic security practices for web apps
Technologies used:
- Backend: Flask (Python web framework)
- AI model: Sentence Transformers (all-MiniLM-L6-v2 model)
- File processing: PyPDF2 for PDFs, python-docx for Word docs
- Frontend: HTML, CSS, and basic JavaScript
- Data storage: Files on disk, sessions for user data
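Since search history and favorites live in the session rather than a database, the bookkeeping can stay very small. The sketch below is one possible way to do it, not the actual code in app.py. The helper names `add_to_history` and `toggle_favorite`, the session keys, and the limit of 10 entries are all assumptions for illustration.

```python
def add_to_history(history, query, limit=10):
    """Return a new most-recent-first list with `query` moved to the front.

    Blank queries are ignored and duplicates (case-insensitive) are removed.
    """
    query = query.strip()
    if not query:
        return list(history)
    deduped = [q for q in history if q.lower() != query.lower()]
    return ([query] + deduped)[:limit]


def toggle_favorite(favorites, filename):
    """Return a new list with `filename` added if absent, removed if present."""
    updated = list(favorites)
    if filename in updated:
        updated.remove(filename)
    else:
        updated.append(filename)
    return updated


# Inside a Flask view this could be used roughly like (hypothetical keys):
#   session["history"] = add_to_history(session.get("history", []), query)
#   session["favorites"] = toggle_favorite(session.get("favorites", []), filename)
```

Returning new lists instead of mutating in place means the session value is reassigned each time, so Flask notices the change and saves it.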
search-engine/
├── app.py # Main Flask application
├── utils.py # Helper functions for document processing
├── static/
│ └── style.css # Styling for the web interface
├── templates/
│ ├── index.html # Main search page
│ ├── document.html # Document viewer
│ └── error.html # Error page
├── documents/ # Where uploaded documents are stored
│ ├── doc1.txt
│ ├── doc2.txt
│ └── doc3.txt
├── requirements.txt # Python dependencies
└── README.md
- Python 3.8 or newer
- pip for installing packages
- Download the code

      git clone <repository-url>
      cd semantic-search-engine

- Set up a virtual environment (recommended)

      python -m venv venv
      # On Windows
      venv\Scripts\activate
      # On Mac/Linux
      source venv/bin/activate

- Install the required packages

      pip install -r requirements.txt

- Optional: Set environment variables

      # Windows
      set SECRET_KEY=your-secret-key-here
      set DOCUMENTS_FOLDER=documents
      # Mac/Linux
      export SECRET_KEY=your-secret-key-here
      export DOCUMENTS_FOLDER=documents

- Run the application

      python app.py

- Open in your browser

  Go to http://localhost:5000
You can customize the app using environment variables:
| Variable | What it does | Default value |
|---|---|---|
| SECRET_KEY | Secret key for Flask sessions | Auto-generated |
| DOCUMENTS_FOLDER | Where to store uploaded files | documents |
| FLASK_DEBUG | Enable debug mode | False |
| PORT | Which port to run on | 5000 |
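For reference, here is a minimal sketch of how these four variables and their defaults could be read at startup. It is an illustration under assumptions, not a copy of app.py: the function name `load_config`, the returned dictionary keys, and the set of strings treated as "true" are all made up for this example.

```python
import os
import secrets


def load_config(env=None):
    """Read settings from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Auto-generate a random key when none is provided.
        "SECRET_KEY": env.get("SECRET_KEY") or secrets.token_hex(32),
        "DOCUMENTS_FOLDER": env.get("DOCUMENTS_FOLDER", "documents"),
        # Environment values are strings, so "false" must be parsed explicitly.
        "DEBUG": env.get("FLASK_DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        "PORT": int(env.get("PORT", "5000")),
    }


config = load_config({"FLASK_DEBUG": "true", "PORT": "8000"})
```

One thing to keep in mind with an auto-generated secret key: it changes on every restart, so existing sessions (search history, favorites) stop being readable. Setting SECRET_KEY explicitly avoids that.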
- Type your search query in the search box
- Look at the results ranked by how similar they are
- Click on document names to read the full content
- Click "Choose File" to pick a document
- Supports TXT, PDF, and DOCX files (up to 16MB)
- Click "Upload" to add it to the search index
- Click the star to favorite documents for easy access later
- Your recent searches are saved automatically
- Use /api/search?q=your-query for programmatic access
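The upload rules listed above (TXT, PDF, and DOCX only, up to 16MB) could be checked with something like the sketch below. This is an assumption about how such a check might look, not the code in app.py. The names `ALLOWED_EXTENSIONS`, `MAX_UPLOAD_BYTES`, and `validate_upload` are hypothetical.

```python
import os

ALLOWED_EXTENSIONS = {".txt", ".pdf", ".docx"}
MAX_UPLOAD_BYTES = 16 * 1024 * 1024  # 16MB


def validate_upload(filename, size_bytes):
    """Return (ok, reason) for an uploaded file name and its size in bytes."""
    # Keep only the last path component so "../../x.txt" cannot escape the folder.
    name = os.path.basename(filename.replace("\\", "/"))
    if not name or name.startswith("."):
        return False, "missing or hidden filename"
    ext = os.path.splitext(name)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, "unsupported file type"
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "file larger than 16MB"
    return True, "ok"


# In a Flask app, werkzeug.utils.secure_filename() and
# app.config["MAX_CONTENT_LENGTH"] are the usual tools for these two concerns.
```

For example, `validate_upload("notes.PDF", 1024)` passes because the extension check is case-insensitive, while `validate_upload("script.exe", 1024)` is rejected.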
If you want to use this programmatically:
GET /api/search?q=<your query>
Returns JSON like this:
    {
      "query": "machine learning",
      "results": [
        {
          "filename": "ml_notes.txt",
          "score": 85.7,
          "snippet": "Machine learning is a subset of artificial intelligence..."
        }
      ]
    }

To run in development mode:

    export FLASK_DEBUG=true
    python app.py

To run in production:

- Set production environment

      export SECRET_KEY=your-production-secret-key
      export FLASK_DEBUG=false
      export PORT=8000

- Use a proper web server

      pip install gunicorn
      gunicorn -w 4 -b 0.0.0.0:8000 app:app
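Once the server is running, the /api/search endpoint documented above can be called from a script. The following is a minimal client sketch using only the standard library. It assumes the response has the shape shown in the JSON example (a `results` list whose entries have `filename` and `score`); the helper names `build_search_url`, `parse_results`, and `search` are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query):
    """Build the GET URL for /api/search with the query properly encoded."""
    return f"{base_url.rstrip('/')}/api/search?{urlencode({'q': query})}"


def parse_results(payload):
    """Turn the JSON response text into a list of (filename, score) pairs."""
    data = json.loads(payload)
    return [(item["filename"], item["score"]) for item in data.get("results", [])]


def search(base_url, query, timeout=10):
    """Call the API on a running server and return the parsed results."""
    with urlopen(build_search_url(base_url, query), timeout=timeout) as response:
        return parse_results(response.read().decode("utf-8"))


# Example (requires the app to be running locally):
#   for filename, score in search("http://localhost:5000", "machine learning"):
#       print(filename, score)
```

Encoding the query with `urlencode` matters because search queries usually contain spaces and punctuation that are not valid in a raw URL.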
- How semantic search works and why it's better than keyword matching
- Working with pre-trained AI models without training them myself
- Building a complete web application from scratch
- Handling different file formats and text extraction
- Making web interfaces that work well on different devices
- Basic security practices like input validation and file handling
- Understanding how sentence transformers work and which model to use
- Figuring out efficient ways to handle document embeddings in memory
- Making the file upload feature secure and reliable
- Getting the CSS to look good on both desktop and mobile
- Learning proper Flask application structure
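On the point about handling document embeddings in memory: one simple approach is to compute each embedding once and only recompute it when the file changes. The sketch below shows that idea. It is an illustration under assumptions, not the approach app.py necessarily takes. The class name `EmbeddingCache` is hypothetical, `embed_fn` stands in for whatever function produces the embedding, and `version` could be the file's modification time.

```python
class EmbeddingCache:
    """Keep one embedding per file in memory and reuse it until the file changes."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: text -> embedding
        self._cache = {}           # filename -> (version, embedding)

    def get(self, filename, text, version):
        """Return the cached embedding, recomputing only if `version` changed."""
        cached = self._cache.get(filename)
        if cached is not None and cached[0] == version:
            return cached[1]
        embedding = self._embed_fn(text)
        self._cache[filename] = (version, embedding)
        return embedding

    def forget(self, filename):
        """Drop a file's embedding, for example after the file is deleted."""
        self._cache.pop(filename, None)
```

With the real model, `embed_fn` could be something like `model.encode`, and `version` could come from `os.path.getmtime(path)`.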
If I continue working on this, I'd like to add:
- User accounts and authentication
- Better document management (delete, organize)
- Support for more file types
- Advanced search filters
- Document highlighting for matched sections
All the Python packages needed are listed in requirements.txt. The main ones are:
- Flask for the web framework
- sentence-transformers for the AI model
- PyPDF2 and python-docx for file processing
- Standard libraries for everything else
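To show how PyPDF2 and python-docx fit in, here is a minimal sketch of text extraction that picks a reader based on the file extension. It is not the actual code in utils.py; the function name `extract_text` is an assumption. The third-party imports happen inside the branches, so the TXT path works even when those packages are not installed.

```python
import os


def extract_text(path):
    """Return the plain text of a TXT, PDF, or DOCX file."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".txt":
        # Replace undecodable bytes instead of failing on odd encodings.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            return handle.read()
    if ext == ".pdf":
        from PyPDF2 import PdfReader  # imported lazily; needs PyPDF2 installed
        reader = PdfReader(path)
        # extract_text() can return None for pages without a text layer.
        return "\n".join((page.extract_text() or "") for page in reader.pages)
    if ext == ".docx":
        from docx import Document  # provided by the python-docx package
        return "\n".join(paragraph.text for paragraph in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {ext or '(none)'}")
```

Scanned PDFs that contain only images will come back as empty text with this approach, since no OCR is involved.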
This is a learning project, but if you find bugs or have suggestions, feel free to open an issue or submit a pull request.
This project is open source under the MIT License.
Built as a machine learning class project using Flask and Sentence Transformers