A professional demo web application showcasing a Large Language Model (LLM) deployed with FastAPI.
Designed to run entirely on CPU (Windows-compatible) — this project demonstrates end-to-end AI deployment, from backend to frontend.
🚀 Powered by **google/flan-t5-large**, an instruction-tuned model that produces high-quality, natural-language responses.
## ✨ Features

- ✅ Web Interface — Simple HTML form for real-time LLM input/output.
- ✅ Instruction-Tuned Responses — Generates clear, context-aware answers.
- ✅ Beam Search (num_beams=3) — Deterministic, accurate text generation (see the sketch below).
- ✅ CPU-Only Operation — Runs locally on Windows; no GPU required.
- ✅ Error Handling — Graceful handling of invalid or overly long prompts.
- ✅ Extendable — Easy to adapt for RAG, summarization, or chatbots.
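
The beam-search generation mentioned above might look roughly like the following minimal sketch, which uses the standard Hugging Face Transformers API. The function name `generate_response` mirrors the one referenced later in this README; the actual code in `app.py` may differ.

```python
# Minimal sketch of CPU inference with beam search; details may differ from app.py.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)  # loads on CPU by default

def generate_response(prompt: str, max_new_tokens: int = 200) -> str:
    """Run beam-search generation on CPU and return the decoded answer."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(
        **inputs,
        num_beams=3,                    # beam search: deterministic, higher-quality output
        max_new_tokens=max_new_tokens,  # response length budget (affects latency)
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```
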
## 🛠️ Tech Stack

| Component | Description |
|---|---|
| 🐍 Python 3.11+ | Core language |
| ⚡ FastAPI | Backend framework |
| 🔥 Uvicorn | ASGI web server |
| 🧩 Jinja2 | HTML templating |
| 🧠 Transformers + PyTorch | LLM inference |
| 💻 HTML/CSS | Frontend interface |
## ⚙️ Installation

```bash
# 1️⃣ Clone the repository
git clone https://github.com/yourusername/llm-web-app.git
cd llm-web-app

# 2️⃣ Create a virtual environment
python -m venv venv
.\venv\Scripts\activate   # (Windows)

# 3️⃣ Install dependencies
pip install -r requirements.txt
```
## 🚀 Usage
1. Start the FastAPI app:

   ```bash
   uvicorn app:app --reload
   ```

2. Open your browser: http://127.0.0.1:8000
3. Enter a prompt in the web form and submit.
4. View the LLM response below the form.
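
The browser flow is all you need, but for a quick command-line smoke test, a request along these lines should work, assuming the form posts a `prompt` field to the root route `/` (check `templates/index.html` for the actual field name and form action):

```python
# Hypothetical smoke test; the field name "prompt" and the POST route "/" are
# assumptions, so adjust them to match the form defined in index.html.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/",
    data={"prompt": "Explain beam search in one sentence."},
)
print(resp.status_code)   # 200 on success
print(resp.text[:500])    # rendered HTML containing the model's answer
```
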
## 💬 Example Prompts
* "Explain the difference between supervised and unsupervised learning."
* "Write a Python function to calculate the Fibonacci sequence recursively."
* "Summarize the following text in 2 sentences:..."
* "Provide 3 tips for preparing for a technical interview."
💡 **Tip:** Frame prompts as clear, explicit instructions; the model is instruction-tuned and responds best to direct requests.
## 📂 Project Structure

```text
llm-web-app/
│
├─ app.py              # FastAPI backend with LLM integration
├─ templates/
│  └─ index.html       # HTML frontend
├─ requirements.txt    # Python dependencies
└─ README.md           # Project documentation
```
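
For orientation, here is a minimal sketch of how `app.py` might wire FastAPI, Jinja2, and the model together; route names, template variables, and the `generate_response` helper (from the earlier sketch) are illustrative and may not match the actual file.

```python
# Illustrative skeleton only; the real app.py may differ in routes and details.
from fastapi import FastAPI, Form, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="templates")

@app.get("/")
def index(request: Request):
    # Render the empty prompt form.
    return templates.TemplateResponse("index.html", {"request": request, "response": None})

@app.post("/")
def ask(request: Request, prompt: str = Form(...)):
    # Run CPU inference (generate_response from the earlier sketch) and
    # re-render the page with the answer shown below the form.
    answer = generate_response(prompt)
    return templates.TemplateResponse("index.html", {"request": request, "response": answer})
```
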
## 📝 Notes

- ⚙️ The first model load may take 5–15 seconds on CPU.
- ⚡ Inference speed depends on your CPU and the `max_new_tokens` setting.
- 🚀 For faster responses, reduce `max_new_tokens` in `generate_response()`; see the example below.
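
For example, assuming the `generate_response` signature sketched earlier:

```python
# A smaller token budget gives faster CPU responses at the cost of shorter answers.
answer = generate_response(prompt, max_new_tokens=80)
```
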
## 🌱 Future Improvements

- ⏳ Add a loading spinner while the LLM generates responses.
- 🧠 Integrate RAG (Retrieval-Augmented Generation) for domain knowledge.
- 🐳 Deploy via Docker for cross-platform portability.
- 💬 Enable multi-user chat sessions with memory.
## ⚖️ License

This project is licensed under the MIT License — free to use, modify, and share with attribution.
## 👩‍💻 Author

**Sherry Courington**

AI Engineer | MLOps | Computer Vision | LLM Applications

📫 Connect on [LinkedIn](https://www.linkedin.com/in/scourington1/)
