Semantic Search Engine

A document search application I built for my machine learning class project. Instead of just matching keywords, it understands the meaning of your search using AI embeddings. You can upload different types of documents and search through them semantically.

What it does

This project lets you search through documents based on meaning rather than exact keyword matches. I implemented it using Flask for the web interface and Sentence Transformers for the AI part.

Main features

Search documents by meaning using sentence embeddings
Upload TXT, PDF, and DOCX files
Get snippets from relevant documents with similarity scores
Keep track of search history and favorite documents
Simple web interface that works on mobile too

Technical stuff I learned

How to use pre-trained transformer models for semantic search
Building a Flask web application with proper structure
Handling file uploads and text extraction
Session management and user interface design
Basic security practices for web apps

How I built it

Backend: Flask (Python web framework) AI Model: Sentence Transformers (all-MiniLM-L6-v2 model) File processing: PyPDF2 for PDFs, python-docx for Word docs Frontend: HTML, CSS, and basic JavaScript Data storage: Files on disk, sessions for user data

Project structure

search-engine/
├── app.py                 # Main Flask application
├── utils.py              # Helper functions for document processing
├── static/
│   └── style.css         # Styling for the web interface
├── templates/
│   ├── index.html        # Main search page
│   ├── document.html     # Document viewer
│   └── error.html        # Error page
├── documents/            # Where uploaded documents are stored
│   ├── doc1.txt
│   ├── doc2.txt
│   └── doc3.txt
├── requirements.txt      # Python dependencies
└── README.md

Setting it up

What you need

Python 3.8 or newer
pip for installing packages

Installation steps

Download the code

git clone <repository-url>
cd semantic-search-engine

Set up a virtual environment (recommended)

python -m venv venv

# On Windows
venv\Scripts\activate

# On Mac/Linux
source venv/bin/activate

Install the required packages
```
pip install -r requirements.txt
```

Optional: Set environment variables

# Windows
set SECRET_KEY=your-secret-key-here
set DOCUMENTS_FOLDER=documents

# Mac/Linux
export SECRET_KEY=your-secret-key-here
export DOCUMENTS_FOLDER=documents

Run the application
```
python app.py
```
Open in your browser Go to http://localhost:5000

Configuration options

You can customize the app using environment variables:

Variable	What it does	Default value
`SECRET_KEY`	Secret key for Flask sessions	Auto-generated
`DOCUMENTS_FOLDER`	Where to store uploaded files	`documents`
`FLASK_DEBUG`	Enable debug mode	`False`
`PORT`	Which port to run on	`5000`

How to use it

Basic searching

Type your search query in the search box
Look at the results ranked by how similar they are
Click on document names to read the full content

Uploading files

Click "Choose File" to pick a document
Supports TXT, PDF, and DOCX files (up to 16MB)
Click "Upload" to add it to the search index

Other features

Click the star to favorite documents for easy access later
Your recent searches are saved automatically
Use /api/search?q=your-query for programmatic access

API endpoint

If you want to use this programmatically:

GET /api/search?q=<your query>

Returns JSON like this:

{
  "query": "machine learning",
  "results": [
    {
      "filename": "ml_notes.txt",
      "score": 85.7,
      "snippet": "Machine learning is a subset of artificial intelligence..."
    }
  ]
}

Running for development vs production

For development

export FLASK_DEBUG=true
python app.py

For production deployment

Set production environment

export SECRET_KEY=your-production-secret-key
export FLASK_DEBUG=false
export PORT=8000

Use a proper web server

pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:8000 app:app

What I learned from this project

How semantic search works and why it's better than keyword matching
Working with pre-trained AI models without training them myself
Building a complete web application from scratch
Handling different file formats and text extraction
Making web interfaces that work well on different devices
Basic security practices like input validation and file handling

Challenges I faced

Understanding how sentence transformers work and which model to use
Figuring out efficient ways to handle document embeddings in memory
Making the file upload feature secure and reliable
Getting the CSS to look good on both desktop and mobile
Learning proper Flask application structure

Future improvements

If I continue working on this, I'd like to add:

User accounts and authentication
Better document management (delete, organize)
Support for more file types
Advanced search filters
Document highlighting for matched sections

Dependencies

All the Python packages needed are listed in requirements.txt. The main ones are:

Flask for the web framework
sentence-transformers for the AI model
PyPDF2 and python-docx for file processing
Standard libraries for everything else

Contributing

This is a learning project, but if you find bugs or have suggestions, feel free to open an issue or submit a pull request.

License

This project is open source under the MIT License.

Built as a machine learning class project using Flask and Sentence Transformers

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Semantic Search Engine

What it does

Main features

Technical stuff I learned

How I built it

Project structure

Setting it up

What you need

Installation steps

Configuration options

How to use it

Basic searching

Uploading files

Other features

API endpoint

Running for development vs production

For development

For production deployment

What I learned from this project

Challenges I faced

Future improvements

Dependencies

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
__pycache__		__pycache__
documents		documents
static		static
templates		templates
.gitignore		.gitignore
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

Semantic Search Engine

What it does

Main features

Technical stuff I learned

How I built it

Project structure

Setting it up

What you need

Installation steps

Configuration options

How to use it

Basic searching

Uploading files

Other features

API endpoint

Running for development vs production

For development

For production deployment

What I learned from this project

Challenges I faced

Future improvements

Dependencies

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages