Live on: https://learnwords.duckdns.org

This is the live production version of the project. You can directly enter English words and generate smart, context-aware quizzes powered by a fine-tuned LLM and semantic search.
The AI model powering this application is not an off-the-shelf pre-trained model. I personally fine-tuned google/flan-t5-small using the Stanford SQuAD dataset to transform it into a specialized Question Generator.
View My Training Notebook: Google Colab - Fine-Tuning & Quantization
Demo video: video.mp4
This project is a full-stack, production-ready Generative AI application designed to teach English vocabulary through dynamic quizzes. Unlike standard flashcard apps, it uses a Retrieval-Augmented Generation (RAG) pipeline to fetch precise definitions from a vector database and a Fine-Tuned Google Flan-T5 model to generate context-aware questions in real-time.
The project is designed as a complete End-to-End AI application, covering:
- Custom Model Training: Transforming a general-purpose LLM into a specialized question generator.
- Edge Optimization: Running heavy NLP tasks on limited CPU resources via ONNX Quantization.
- Data Engineering: Scraping, cleaning, and indexing 10,000+ words into a Vector DB.
- Containerization: Single-container microservice architecture.
- Cloud Deployment: Secure deployment on AWS EC2 with Nginx & SSL.
The main goal of this project is to solve the "Hallucination" problem in LLMs when generating educational content and to build a cost-effective AI product. The focus is not only on the model but on the entire engineering pipeline:
- Resource Optimization: Achieving <500ms inference latency on 1GB RAM (AWS Free Tier) using INT8 Quantization.
- Accuracy: Implementing a hybrid search algorithm (Metadata Filtering + Fuzzy Search) to ensure zero-miss retrieval.
- User Experience: "Sequential Learning" logic where regenerating a quiz fetches a different definition for the same word (see the sketch after this list).
- Production Deployment: Robust Dockerized environment served via HTTPS.
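As a purely hypothetical sketch of the "Sequential Learning" idea (the actual logic lives in `app/main.py`), regeneration can simply advance through the definitions stored for a word:

```python
# Hypothetical sketch: serve a different stored definition on each
# regeneration of the same word (not the actual app/main.py code).
from collections import defaultdict

_next_idx = defaultdict(int)  # word -> index of the definition to serve next

def next_definition(word: str, definitions: list) -> str:
    i = _next_idx[word] % len(definitions)
    _next_idx[word] += 1  # the next regeneration advances to another definition
    return definitions[i]

print(next_definition("apple", ["a round fruit", "the tech company"]))  # a round fruit
print(next_definition("apple", ["a round fruit", "the tech company"]))  # the tech company
```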
AI / Machine Learning:
- Task: Text-to-Text Generation (Question Generation)
- Models: Google Flan-T5 Small (Fine-Tuned); Sentence-Transformers (Embedding Model: all-MiniLM-L6-v2)
- Optimization: ONNX Runtime & Quantization (INT8)
- Database: Pinecone (Vector Database with Metadata Filtering)

Backend:
- Language: Python 3.9
- Framework: FastAPI (Asynchronous endpoints)
- Logic: Custom RAG pipeline with "Smart Masking" (Regex-based answer hiding) and "Dynamic Distractor Generation" (sketched below)
- Environment: Python-Dotenv (Configuration management)

Frontend:
- Tech: HTML5, JavaScript (ES6+)
- Design: Tailwind CSS (Modern "Dark Tech" theme)
- Interactivity: Bulk word processing, loading states, and interactive feedback

DevOps & Deployment:
- Containerization: Docker (Multi-stage build optimization)
- Cloud Provider: AWS EC2 (Ubuntu - Free Tier Optimized)
- Web Server: Nginx (Reverse Proxy)
- Security: Certbot (Let's Encrypt SSL/TLS)
- Networking: DuckDNS (Dynamic DNS Updater)
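The "Dynamic Distractor Generation" mentioned under Logic could look roughly like this; a hypothetical sketch only (the function name and sampling strategy are illustrative, not the actual `services/` code):

```python
# Hypothetical sketch of "Dynamic Distractor Generation": draw wrong answers
# from other words in the vocabulary so each quiz gets plausible options.
import random

def make_options(correct: str, vocabulary: list, k: int = 3) -> list:
    pool = [w for w in vocabulary if w.lower() != correct.lower()]
    options = random.sample(pool, k) + [correct]  # k distractors + the answer
    random.shuffle(options)
    return options

print(make_options("apple", ["banana", "carrot", "desk", "garden", "apple"]))
```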
Project structure:
- app/
  - main.py → FastAPI entry point. Handles bulk-generate, sequential logic, and CORS.
  - services/ → Core logic for Vector DB connection and Generator inference.
  - models/onnx_quantized/ → CUSTOM MODEL FILES (Encoder/Decoder) generated via Colab.
- frontend/
  - index.html → Modern landing page for bulk input.
  - quiz.html → Dynamic quiz interface with "Regenerate" capability.
- seed_db.py → ETL script to fetch 10,000+ words from dictionary APIs and populate Pinecone.
- Dockerfile → Optimized image build steps (installing CPU-only PyTorch first).
- .env → Environment variables (Pinecone API Keys).
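For orientation, a minimal sketch of how `app/main.py` might wire the `bulk-generate` endpoint and CORS together (the request model and the `generate_quiz` placeholder are hypothetical stand-ins for the real service layer):

```python
# Minimal sketch of the app/main.py wiring (hypothetical names; the real
# generation logic lives in services/ and loads the ONNX model at startup).
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

app = FastAPI(title="AI Quiz Generator")

# CORS so the static frontend can call the API from another origin.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

class BulkRequest(BaseModel):
    words: list[str]

def generate_quiz(word: str) -> dict:
    # Placeholder: the real version retrieves a definition from Pinecone,
    # runs the ONNX T5 generator, and masks the answer in the question.
    return {"word": word, "question": "_______ is ..."}

@app.post("/bulk-generate")
def bulk_generate(req: BulkRequest):
    return {"quizzes": [generate_quiz(w) for w in req.words]}
```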
The core "Brain" of this project was not just downloaded; it was engineered. I performed fine-tuning and optimization to make a Small Language Model (SLM) behave like a specialized teacher.
View Training Notebook: Google Colab - Fine-Tuning & Quantization
A pre-trained FLAN-T5 model is naturally instruction-tuned, but it struggles with Question Generation.
- Default Behavior: If you ask it to "Generate a question for Apple," it might say "Apple is a technology company" (answering instead of asking) or "What is apple?" (too simple).
- Fine-Tuning Goal: I needed to force the model to format output specifically for B2-level quizzes. It learned not just to ask, but to construct a question based only on the provided context.
I utilized the Stanford Question Answering Dataset (SQuAD), which consists of 100,000+ questions based on Wikipedia articles. However, I had to reverse the data structure:
- Standard SQuAD: (Context + Question) -> Answer
- My Engineering: (Context + Answer) -> Question
JSON Structure Example:
```json
{
  "context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL)...",
  "question": "Which NFL team won Super Bowl 50?",
  "answers": [ { "text": "Denver Broncos", "answer_start": 177 } ]
}
```

I processed the data to feed the model a specific prompt template during training:
- Input: `answer: Denver Broncos context: Super Bowl 50 was an American football game...`
- Target Output: `Which NFL team won Super Bowl 50?`
This forces the model to learn the relationship between a specific answer and its surrounding context to derive a question.
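A minimal sketch of this reversal, assuming the Hugging Face `datasets` library (the public `squad` dataset uses these field names; the notebook's exact preprocessing may differ):

```python
# Sketch: flip SQuAD from (context, question) -> answer
# into (answer, context) -> question for training.
from datasets import load_dataset

squad = load_dataset("squad", split="train")

def to_question_generation(example):
    answer = example["answers"]["text"][0]  # take the first gold answer
    return {
        "input_text": f"answer: {answer} context: {example['context']}",
        "target_text": example["question"],
    }

train_data = squad.map(to_question_generation, remove_columns=squad.column_names)
print(train_data[0]["input_text"][:60], "->", train_data[0]["target_text"])
```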
The standard PyTorch weights were too heavy to deploy on a free AWS instance (1 vCPU, 1GB RAM).
- Conversion: The fine-tuned model was exported to ONNX format.
- Quantization: I applied INT8 Quantization using the `optimum` library.
- Result: Reduced model size by 4x and improved CPU inference speed by ~20x.
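A minimal sketch of that export-and-quantize step with `optimum` (paths and the dynamic AVX2 quantization config are illustrative; the exact settings live in the Colab notebook):

```python
# Sketch: export the fine-tuned model to ONNX, then apply dynamic INT8
# quantization to each sub-model (illustrative paths and config).
from optimum.onnxruntime import ORTModelForSeq2SeqLM, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model = ORTModelForSeq2SeqLM.from_pretrained("./flan-t5-small-qg", export=True)
model.save_pretrained("./onnx_model")

qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
for file_name in ("encoder_model.onnx", "decoder_model.onnx"):
    quantizer = ORTQuantizer.from_pretrained("./onnx_model", file_name=file_name)
    quantizer.quantize(save_dir="./onnx_quantized", quantization_config=qconfig)
```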
This project relies on a comprehensive Vector Database to act as the "Memory" of the AI.
I scraped and curated data from high-quality sources to ensure definitions are accurate and modern:
- Word List: MIT 10,000 Most Frequent Words
- Definitions: Free Dictionary API (sourced from Oxford/Google)
- Vector Database Content: A semantic search engine containing 7,100+ curated English words (sourced from Oxford/Google Dictionary data), using Metadata Filtering for exact matches and Fuzzy Search as a fallback.
The data is stored in Pinecone with the following JSON structure, allowing for both semantic search and metadata filtering:
```jsonc
{
  "id": "word_apple",
  "vector": [0.12, 0.54, ...], // Embedding for semantic search
  "metadata": {
    "word": "Apple",
    "definition": "A round fruit with red or green skin and a whitish inside.",
    "synonyms": ["fruit", "red pome"],
    "example_sentence": "She eats an apple every day to stay healthy.",
    "difficulty": "A1"
  }
}
```
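A minimal sketch of how `seed_db.py` might build one such record, assuming the current Pinecone Python client and `sentence-transformers` (the values mirror the structure above):

```python
# Sketch: embed a definition and upsert it with its metadata
# (assumes the v3+ Pinecone client; not the literal seed_db.py code).
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["INDEX_NAME"])
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

definition = "A round fruit with red or green skin and a whitish inside."
index.upsert(vectors=[{
    "id": "word_apple",
    "values": embedder.encode(definition).tolist(),
    "metadata": {
        "word": "Apple",
        "definition": definition,
        "example_sentence": "She eats an apple every day to stay healthy.",
        "difficulty": "A1",
    },
}])
```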
When a user requests a quiz for a word (e.g., "Apple"), the system executes the following Runtime Flow:
- Retrieval: The code queries the Vector DB (Pinecone). It first attempts a Metadata Filter (`word='Apple'`) for precision, falling back to Vector Search if needed.
- Prompt Engineering: The retrieved definition is injected into a specific template.
  - System Prompt: `generate question: answer: Apple context: [Retrieved Definition]`
- Generation: The ONNX-optimized T5 model processes this prompt and generates a context-aware question.
- Post-Processing: The answer ("Apple") is masked in the generated question (replaced with `_______`) to create a fill-in-the-blank style quiz.
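A minimal sketch of the Retrieval fallback and the regex-based "Smart Masking" step (function names are illustrative; match access assumes the v3+ Pinecone client):

```python
# Sketch: metadata-filtered lookup with a semantic fallback, then
# regex masking of the answer (illustrative, not the exact services/ code).
import re
from typing import Optional

def retrieve_definition(index, embedder, word: str) -> Optional[str]:
    vector = embedder.encode(word).tolist()
    # Exact match first via metadata filter...
    res = index.query(vector=vector, top_k=1,
                      filter={"word": {"$eq": word}}, include_metadata=True)
    if not res.matches:
        # ...then fall back to pure semantic search.
        res = index.query(vector=vector, top_k=1, include_metadata=True)
    return res.matches[0].metadata["definition"] if res.matches else None

def mask_answer(question: str, answer: str) -> str:
    # Hide the answer (and simple inflections like plurals) in the question.
    return re.sub(rf"\b{re.escape(answer)}\w*", "_______", question, flags=re.IGNORECASE)

print(mask_answer("Which fruit is an apple?", "apple"))  # Which fruit is an _______?
```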
You can run the system in two different ways:
- Using Docker (recommended for consistency)
- Running manually with Python
You need a Pinecone API Key (Free Tier). Create a file named `.env` in the project root:

```
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_ENV=your_region
INDEX_NAME=your_index_name
```
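For reference, a minimal sketch of how the backend can read these values with `python-dotenv`:

```python
# Sketch: load configuration from .env at startup.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_ENV = os.getenv("PINECONE_ENV")
INDEX_NAME = os.getenv("INDEX_NAME")
```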
Building with Docker ensures all dependencies (including the specific ONNX runtime and CPU-only PyTorch) are installed correctly.

Build the image:

```bash
docker build -t ai-quiz-app .
```

Run the container:

```bash
docker run -p 8080:80 --env-file .env ai-quiz-app
```

Open your browser: http://localhost:8080

To run manually with Python instead:

```bash
pip install -r requirements.txt
uvicorn app.main:app --reload
```

The frontend then runs at: http://127.0.0.1:8000
- Deployed on: AWS EC2 (t2.micro) running Ubuntu.
- Optimization: Configured with 4GB Swap Space to prevent OOM (Out of Memory) kills during model loading.
- Docker Optimization: Uses `torch --index-url .../cpu` to reduce image size by removing CUDA dependencies.
- Access: Served via DuckDNS with Port 80 redirection.
- Security: Traffic secured with SSL/TLS certificates via Let's Encrypt (Certbot) and Nginx Reverse Proxy.
This project demonstrates the capability to build cost-effective, high-performance AI applications. It moves beyond simple API wrappers by implementing:
- Custom Model Fine-Tuning & Quantization.
- Hybrid Search Algorithms (Keyword + Vector).
- Full-Stack Microservice Architecture.
- Real-world Cloud Deployment.