NLP Projects – Visual Infomedia

Internship Project by Shruti Sivakumar
Duration: 6 Weeks | Domain: NLP, Applied Machine Learning
Organization: Visual Infomedia

🔍 Overview

This repository contains two NLP automation tools built during a 6-week internship: a document parser for business memos and a multi-label keyword classifier. Both are served through a single Streamlit web app.

🧠 Projects

📄 1. Automated Data Extraction from Business Memos

Fine-tuned a GPT-4o-mini model to convert unstructured memo text into structured fields (e.g., title, agency, deadline). Includes dual-model versioning, schema validation, and Excel I/O.

Models: GPT-4o-mini (v1, v2)
Validation: Pydantic schemas
Accuracy: 94.2% field extraction, 99.1% schema compliance
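The schema-validation step above can be illustrated with a minimal sketch. It assumes the model returns one dictionary per memo; the class name MemoRecord, the helper validate_extraction, and the sample values are hypothetical, and only the field names title, agency, and deadline come from the description above. The actual schemas in memo_parser.py may differ.

```python
from datetime import date
from typing import Optional, Tuple

from pydantic import BaseModel, ValidationError


class MemoRecord(BaseModel):
    """Hypothetical schema for one extracted memo (field names from the README)."""

    title: str
    agency: str
    # Assumption: a memo may omit a deadline, so the field is optional.
    deadline: Optional[date] = None


def validate_extraction(raw: dict) -> Tuple[Optional[MemoRecord], Optional[str]]:
    """Return (record, None) if `raw` fits the schema, else (None, error text)."""
    try:
        return MemoRecord(**raw), None
    except ValidationError as exc:
        return None, str(exc)


# Illustrative data only; not taken from the project's dataset.
good, _ = validate_extraction(
    {"title": "Road Resurfacing", "agency": "Dept. of Transportation", "deadline": "2025-07-31"}
)
bad, problem = validate_extraction({"title": "Road Resurfacing", "deadline": "next Friday"})
```

Returning the error text instead of raising lets a batch job record which rows failed validation and keep processing the rest.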

🏷️ 2. Multi-Label Keyword Classification System

Trained a DistilRoBERTa model to tag 95,000+ bid titles with relevant keywords from a 4,000+ class vocabulary. Integrated confidence scoring and Excel-based output.

Architecture: DistilRoBERTa
Metrics: Hamming loss 0.0004 · Micro F1-score 0.356
Inference: Color-coded prediction with fallback logic
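One way the confidence scoring, color coding, and fallback described above could fit together is sketched below. The thresholds, color names, and the function tag_title are assumptions made for illustration; the source does not state the actual cut-offs or fallback rule. The sketch treats each label independently with a sigmoid, which is the usual setup for multi-label classification.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical thresholds; the project's real cut-offs are not documented here.
HIGH_CONFIDENCE = 0.80
LOW_CONFIDENCE = 0.50


def sigmoid(x: float) -> float:
    """Map a raw logit to an independent per-label probability."""
    return 1.0 / (1.0 + math.exp(-x))


def color_for(score: float) -> str:
    """Assign a display color based on the confidence tier."""
    if score >= HIGH_CONFIDENCE:
        return "green"
    if score >= LOW_CONFIDENCE:
        return "yellow"
    return "red"


def tag_title(logits: Dict[str, float], top_k_fallback: int = 1) -> List[Tuple[str, float, str]]:
    """Select labels above the low threshold; fall back to the top-k labels if none qualify."""
    scores = {label: sigmoid(z) for label, z in logits.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = [(label, s) for label, s in ranked if s >= LOW_CONFIDENCE]
    if not selected:
        # Fallback: never return an empty tag list; surface the best guesses in red.
        selected = ranked[:top_k_fallback]
    return [(label, round(s, 3), color_for(s)) for label, s in selected]


# Illustrative logits for made-up keywords.
confident = tag_title({"construction": 3.0, "hvac": 0.2, "roofing": -2.0})
uncertain = tag_title({"construction": -1.0, "roofing": -3.0})
```

In this sketch a title with no label above the threshold still receives its highest-scoring keyword, flagged red so a reviewer can check it in the Excel output.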

💻 Application Architecture

Multi-page Streamlit app with:

Upload → Process → Download flow
Real-time progress tracking
Error handling and Excel export
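The Upload → Process → Download flow could be wired up roughly as follows. This is a sketch under assumptions, not the code in main.py or pages/: the function names process_rows and render_page are hypothetical, the input is assumed to hold one text per row in the first column under a header row, and str.upper stands in for the real model call. Streamlit and OpenPyXL are imported inside render_page so that the processing function can be exercised without either installed.

```python
from io import BytesIO
from typing import Callable, Iterable, List, Optional


def process_rows(
    rows: Iterable[str],
    handler: Callable[[str], str],
    on_progress: Optional[Callable[[float], None]] = None,
) -> List[dict]:
    """Apply `handler` to each row, reporting progress and capturing per-row errors."""
    items = list(rows)
    results = []
    for i, text in enumerate(items, start=1):
        try:
            results.append({"input": text, "output": handler(text), "error": ""})
        except Exception as exc:
            # Keep going so one bad row does not abort the whole batch.
            results.append({"input": text, "output": "", "error": str(exc)})
        if on_progress is not None:
            on_progress(i / len(items))
    return results


def render_page() -> None:
    """Hypothetical Streamlit page implementing upload, processing, and download."""
    import streamlit as st
    from openpyxl import Workbook, load_workbook

    uploaded = st.file_uploader("Upload an Excel file", type=["xlsx"])
    if uploaded is None:
        return

    sheet = load_workbook(uploaded).active
    # Assumption: row 1 is a header and column A holds the text to process.
    texts = [str(row[0]) for row in sheet.iter_rows(min_row=2, values_only=True)
             if row and row[0] is not None]
    if not texts:
        st.error("No rows found in the first column.")
        return

    bar = st.progress(0.0)
    # str.upper is a placeholder for the real parser or tagger call.
    results = process_rows(texts, handler=str.upper, on_progress=bar.progress)

    out = Workbook()
    ws = out.active
    ws.append(["input", "output", "error"])
    for r in results:
        ws.append([r["input"], r["output"], r["error"]])
    buffer = BytesIO()
    out.save(buffer)

    st.download_button(
        "Download results",
        data=buffer.getvalue(),
        file_name="results.xlsx",
        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    )
```

A page file would end with a call to render_page(). Keeping process_rows free of Streamlit calls means the batch logic and its error handling can be unit-tested separately from the UI.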

⚠️ This repo contains a demonstration version. Proprietary model weights and business data are excluded.

📁 Project Structure

summer-internship-25/
├── main.py
├── pages/
├── model_artifacts/
├── memo_parser.py
├── keyword_tagger.py
├── requirements.txt
├── .streamlit/secrets
└── README.md

⚙️ Local Setup

git clone https://github.com/shruti-sivakumar/summer-internship-25.git
cd summer-internship-25
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install -r requirements.txt
streamlit run main.py

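Before running the app, add your API keys to .streamlit/secrets (by default, Streamlit reads secrets from a file named .streamlit/secrets.toml).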

🧰 Tech Stack

Python · Streamlit · OpenAI API · HuggingFace Transformers · Pydantic · Scikit-learn · PyTorch · OpenPyXL

🛡️ Legal & Ethical Notes

This repo is a sanitized version of the internship deliverables.
All sensitive data and proprietary models are excluded.
Commercial use requires separate licensing from Visual Infomedia.
