A production-ready Named Entity Recognition (NER) system powered by BERT Transformers and Bi-LSTM-CRF architectures.
This project demonstrates an end-to-end MLOps pipeline for extracting structured insights from unstructured text, tailored for high-impact business use cases in Finance, Legal, and Healthcare.
- Dual Architecture Support:
  - Transformer (BERT): State-of-the-art accuracy, handling out-of-vocabulary (OOV) words and deep context.
  - Bi-LSTM-CRF: Efficient, custom-implemented sequence labeling for resource-constrained environments.
- Interactive Dashboard: A professional Streamlit app for real-time inference and visualization.
- Industry Solutions: Pre-configured modules for:
  - 💰 Finance: Extracting tickers, companies, and executives from news.
  - ⚖️ Legal: Identifying parties and jurisdictions in contracts.
  - 🏥 Healthcare: De-identifying patient records (HIPAA compliance).
  - 👥 HR: Extracting skills and qualifications from resumes.
- Confidence Scoring: Probabilistic outputs for risk-adjusted decision making.
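One common way to obtain a per-token confidence score is a softmax over the model's raw tag scores. The sketch below is illustrative only and is not taken from this repository: the tag set, the logit values, the threshold, and the `token_confidence` helper are all hypothetical, and it assumes the model exposes one raw score per tag for each token.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidence(logits, tags):
    """Return the most likely tag and its probability for one token."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return tags[best], probs[best]

# Hypothetical tag set and logits for a single token.
TAGS = ["O", "B-ORG", "B-PER"]
tag, conf = token_confidence([0.2, 3.1, 0.5], TAGS)

# A downstream rule might only accept predictions above a threshold.
THRESHOLD = 0.8  # illustrative value, not from the project
accepted = conf >= THRESHOLD
print(tag, round(conf, 3), accepted)
```

A threshold like this lets low-confidence predictions be routed to manual review instead of being acted on automatically.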
graph TD
A[Unstructured Text] --> B{Model Selector}
subgraph "Deep Learning Path"
B -->|Bi-LSTM-CRF| C[Word Embeddings]
C --> D[Bi-Directional LSTM]
D --> E[CRF Layer]
E --> F[Sequence Tags]
end
subgraph "Transformer Path"
B -->|BERT| G[Tokenizer]
G --> H[BERT-Base]
H --> I[Token Classification Head]
I --> F
end
F --> J[Streamlit Dashboard]
J --> K[Visualizations & Analytics]
- Clone the repository
  git clone https://github.com/victoropp/enterprise-ner-intelligence.git
  cd enterprise-ner-intelligence
- Download Model Files
  The trained model files are tracked with Git LFS. After cloning, ensure Git LFS is installed:
  git lfs install
  git lfs pull
- Install Dependencies
  pip install -r requirements.txt
- Run the Application
  streamlit run deployment/app.py
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| BERT-Base | 0.91 | 0.93 | 0.92 |
| Bi-LSTM-CRF | 0.67 | 0.64 | 0.65 |
Note: BERT outperforms the Bi-LSTM-CRF model by a wide margin (F1 of 0.92 vs. 0.65), especially on unseen entities, demonstrating the value of transfer learning.
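The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The short sketch below recomputes it from the reported figures, assuming each row's precision and recall were measured over the same set of predictions; the `f1_score` helper is written here for illustration and is not the project's `evaluate.py`.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the table above.
reported = {
    "BERT-Base": (0.91, 0.93),
    "Bi-LSTM-CRF": (0.67, 0.64),
}

for model, (p, r) in reported.items():
    print(f"{model}: F1 = {f1_score(p, r):.2f}")
```

Both recomputed values round to the figures in the table (0.92 and 0.65).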
├── data/                  # CoNLL-2003 Dataset
├── deployment/            # Streamlit Application
│   └── app.py             # Main Dashboard
├── models/                # Saved Models & Checkpoints
│   ├── ner_model.h5       # Trained Bi-LSTM-CRF model
│   ├── word2idx.pkl       # Vocabulary mappings
│   └── tag2idx.pkl        # Tag mappings
├── src/                   # Source Code
│   ├── train_bert.py      # BERT Fine-tuning Script
│   ├── train.py           # Bi-LSTM Training Script
│   ├── model.py           # Bi-LSTM Architecture
│   ├── crf.py             # Custom CRF Layer (TensorFlow)
│   ├── data_loader.py     # Data preprocessing utilities
│   └── evaluate.py        # Evaluation Metrics
├── notebooks/             # Jupyter notebooks for exploration
├── tests/                 # Unit tests
└── requirements.txt       # Dependencies
Automatically scan thousands of earnings call transcripts to extract:
- Organizations: Competitors, partners, subsidiaries.
- Persons: Key executives, analysts.
- Locations: Emerging markets, factory locations.
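Turning per-token tags into the entity lists above usually means grouping BIO-style tags (the scheme used by CoNLL-2003 style data) into spans. The following is a minimal sketch under that assumption; the `bio_to_entities` helper and the example sentence are hypothetical and not part of this repository.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current.append(token)
        else:
            # "O" tag, or an I- tag with no matching B-: close the entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# Hypothetical example sentence with hand-written tags.
tokens = ["Tim", "Cook", "said", "Apple", "will", "expand", "in", "India"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "O", "O", "B-LOC"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

This version is deliberately strict: an `I-` tag that does not continue a matching `B-` tag is dropped rather than promoted to a new entity. A more lenient decoder is an equally valid design choice.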
Automate contract review by extracting:
- Parties: "Alpha Corp" vs "Beta Ltd".
- Jurisdictions: "State of Delaware", "London".
De-identify medical records by detecting and masking:
- Patient Names: HIPAA compliance
- Locations: Hospital names, addresses
- Organizations: Healthcare providers
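Once entities are detected, masking is a string replacement over their character spans. The sketch below assumes the NER model returns non-overlapping `(start, end, entity_type)` character offsets; the `mask_entities` helper, the placeholder format, and the sample sentence are hypothetical and only illustrate the idea.

```python
def mask_entities(text, spans, types_to_mask=("PER", "LOC", "ORG")):
    """Replace each detected span with a placeholder such as [PER].

    spans: list of (start, end, entity_type) character offsets, end exclusive.
    Spans are assumed not to overlap.
    """
    masked = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, etype in sorted(spans, reverse=True):
        if etype in types_to_mask:
            masked = masked[:start] + f"[{etype}]" + masked[end:]
    return masked

# Hypothetical record and hand-written spans standing in for model output.
text = "John Smith was admitted to Mercy Hospital in Boston."
spans = [(0, 10, "PER"), (27, 41, "ORG"), (45, 51, "LOC")]
masked = mask_entities(text, spans)
print(masked)
```

Passing a narrower `types_to_mask`, for example `("PER",)`, masks only patient names and leaves the other entities in place.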
Extract structured information from resumes:
- Skills: Programming languages, certifications
- Organizations: Previous employers
- Locations: Work locations, willingness to relocate
This application is ready to deploy on Streamlit Cloud:
- Fork this repository to your GitHub account
- Go to share.streamlit.io
- Click "New app"
- Select your repository, branch, and deployment/app.py
- Click "Deploy"
See deployment/README.md for detailed deployment instructions.
python src/train.py --epochs 50 --batch-size 32
python src/train_bert.py --model bert-base-cased --epochs 3
Evaluate model performance on the CoNLL-2003 test set:
python src/evaluate.py --model models/ner_model.h5
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Victor Collins Oppon
NLP Engineer | Data Scientist | FCCA, MBA, BSc
Portfolio website coming soon!
- CoNLL-2003 Dataset: Erik F. Tjong Kim Sang and Fien De Meulder
- HuggingFace Transformers: For the excellent BERT implementation
- Streamlit: For the intuitive web framework
Built with ❤️ for the NLP Community