Approaching KLUE baseline performance through architectural engineering: understanding what makes Korean NER work.
```bash
# Clone repository
git clone https://github.com/yourusername/korean_ner.git
cd korean_ner

# Install dependencies
pip install -r requirements.txt

# Download model weights from HuggingFace
pip install huggingface_hub
huggingface-cli download mrleast/koelectra_ner --local-dir weights/

# Run the demo
python app.py
```

Navigate to http://localhost:7860 to use the interactive NER demo.
Named Entity Recognition extracts structured information from text—identifying who, where, when, and what in unstructured content. While English NER benefits from mature benchmarks and abundant annotated data (e.g., CoNLL-2003), Korean NER presents unique linguistic and computational challenges.
1. Agglutinative Morphology
Korean grammatical particles attach directly to noun stems, blurring entity boundaries:
- "서울에서" (in Seoul) = "서울" (Seoul entity) + "에서" (locative particle)
Unlike English where whitespace typically delimits entities ("in Seoul"), Korean requires morpheme-aware segmentation. The model must learn to identify entity spans despite particles that attach without word boundaries.
2. Character-Level Tokenization
The KLUE benchmark uses character-level annotation where each character—including spaces—is a token:
```
Input:  "김민수 교수는"
Tokens: ['김', '민', '수', ' ', '교', '수', '는']
Labels: [B-PS, I-PS, I-PS, O, O, O, O]
```
This creates three challenges:
- Span aggregation: Entities span multiple tokens without clear delimiters
- OOV generalization: Out-of-vocabulary names must be recognized character-by-character
- Semantic composition: The model must compose meaning across character sequences
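The scheme above can be sketched in a few lines of Python, together with the span aggregation it forces on the decoder (an illustrative toy, not the project's actual preprocessing code):

```python
def char_tokenize(text):
    """Character-level tokenization in the KLUE style: every character,
    including spaces, becomes its own token."""
    return list(text)

def bio_to_spans(tokens, labels):
    """Aggregate BIO-tagged characters into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [tok], lab[2:]
        elif lab.startswith("I-") and current_type == lab[2:]:
            current.append(tok)
        else:
            if current:
                spans.append(("".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append(("".join(current), current_type))
    return spans

tokens = char_tokenize("김민수 교수는")
labels = ["B-PS", "I-PS", "I-PS", "O", "O", "O", "O"]
print(bio_to_spans(tokens, labels))  # [('김민수', 'PS')]
```

Note that the person name only emerges after joining three single-character tokens, which is exactly the span-aggregation challenge listed above.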
3. Structural Ambiguity
Korean location and organization names share morphological patterns:
- "서울시" (Seoul City) → Location: geographic entity OR Organization: metropolitan government
Disambiguation requires contextual understanding beyond surface forms.
This project investigates what architectural and training choices enable competitive Korean NER performance. The central questions are:
- Architecture: How do structured prediction layers (BiLSTM, CRF) complement transformer representations for sequence labeling?
- Regularization: What is the contribution of modern techniques (R-Drop, adversarial training) to Korean NER?
- Engineering: Can systematic debugging and preprocessing validation achieve near-baseline results despite hardware constraints?
Performance Target: The KLUE baseline (KoELECTRA-base) achieves 86.11% F1. This implementation reaches 85.90% F1 (99.8% of baseline) using:
- Architecture: KoELECTRA-base + BiLSTM + CRF
- Training: FGM adversarial training + R-Drop regularization + mixed precision
- Hardware: NVIDIA RTX 3050 (4GB VRAM)
Next Steps: An ablation study framework is implemented to systematically quantify each technique's contribution. Running this study will provide empirical answers about how much each component (BiLSTM, CRF, FGM, R-Drop) contributes to the final performance.
```
korean_ner/
├── app.py                   # Gradio web demo with NER inference and attention visualization
├── train_model.ipynb        # Complete training notebook with all SOTA techniques
├── requirements.txt         # Python dependencies
├── LICENSE                  # MIT License
├── README.md                # This file
│
├── weights/                 # Model weights (download from HuggingFace)
│   └── best_model.pt        # Final trained KoELECTRA-BiLSTM-CRF model
│
├── evaluate/                # Evaluation and analysis tools
│   ├── evaluate.py          # Core evaluation metrics and confusion matrices
│   ├── error_analysis.py    # Error categorization (boundary, spurious, missed, type)
│   ├── attention_viz.py     # Transformer attention pattern visualization
│   ├── benchmark.py         # GPU/CPU inference speed benchmarking
│   └── calibration.py       # Confidence calibration analysis (ECE, reliability)
│
├── ablation/                # Ablation study for technique contribution analysis
│   └── train.py             # Resumable training script for 6 experimental variants
│
└── assets/                  # Generated visualizations and metrics
    ├── per_entity_f1.png    # Per-entity F1 score bar chart
    ├── confusion_matrix.png # Token-level confusion matrix
    ├── metrics.json         # Quantitative results in JSON format
    └── attention/           # Attention heatmaps and layer evolution plots
```
Model weights are hosted on HuggingFace Hub and must be downloaded before running inference:
```bash
# Install HuggingFace CLI (if not already installed)
pip install huggingface_hub

# Download weights to the weights/ directory
huggingface-cli download mrleast/koelectra_ner --local-dir weights/
```

Alternatively, download manually from: https://huggingface.co/mrleast/koelectra_ner
The choice of KoELECTRA-base-v3 as the encoder was motivated by three factors:
1. Korean-Specific Pretraining
Unlike multilingual models (e.g., mBERT), KoELECTRA was pretrained exclusively on Korean corpora:
- 54GB of Korean text from web crawls, news, and Wikipedia
- Morpheme-aware pretraining: Better understanding of Korean grammatical structure
- Domain relevance: Pretrained on the same distribution as KLUE benchmark data
2. ELECTRA's Sample Efficiency
The ELECTRA pretraining objective (replaced token detection) is more sample-efficient than masked language modeling:
- While BERT learns from ~15% of tokens (masked positions), ELECTRA learns from 100% of tokens (detecting which tokens were replaced)
- This yields stronger representations at equivalent compute, critical for a 110M-parameter model
3. Manageable Model Size
At 110M parameters, KoELECTRA-base fits within modest compute budgets while maintaining competitive performance.
The final architecture adds two components atop the transformer encoder:
BiLSTM Layer (256 hidden units, bidirectional)
Transformers capture long-range dependencies through self-attention, but LSTMs provide complementary sequential inductive biases:
- Explicit directionality: Left-to-right and right-to-left passes capture sequential dependencies
- Memory cells: LSTMs maintain state across long sequences, useful for multi-token entities
- Implementation: Processes 768-dim KoELECTRA outputs → 512-dim contextualized representations
CRF Layer (Conditional Random Field)
Standard classification heads predict each token independently, permitting invalid sequences:
```
Invalid prediction: [B-PER, I-LOC, I-LOC]
Problem: I-LOC cannot follow B-PER without an intervening B-LOC
```
The CRF models transition scores between tags, learning:
- B-X → I-X has a high score (valid continuation)
- B-X → I-Y has a low score (invalid type change)
During inference, Viterbi decoding finds the highest-scoring valid path through the label lattice, eliminating structurally impossible sequences.
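To make the Viterbi step concrete, here is a toy decoder over a three-tag lattice. The scores are hand-picked for illustration (a trained CRF learns them; the project itself uses pytorch-crf, not this code):

```python
# Toy Viterbi decoding, illustrating how CRF transition scores rule out
# invalid sequences such as B-PER -> I-LOC.
TAGS = ["B-PER", "I-PER", "I-LOC"]
NEG = -1e9  # effectively forbids a transition

# transition[i][j]: score of moving from TAGS[i] to TAGS[j]
transition = [
    [NEG, 2.0, NEG],   # B-PER -> I-PER allowed, B-PER -> I-LOC forbidden
    [NEG, 1.0, NEG],   # I-PER -> I-PER allowed
    [NEG, NEG, 1.0],   # I-LOC -> I-LOC allowed
]

def viterbi(emissions):
    """emissions[t][i]: classifier score for TAGS[i] at position t.
    Returns the highest-scoring valid tag path."""
    n_tags = len(TAGS)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transition[i][j])
            new_score.append(score[best_i] + transition[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score, back = new_score, back + [ptr]
    best = max(range(n_tags), key=lambda i: score[i])
    path = [best]
    for ptr in reversed(back):       # follow backpointers to recover the path
        path.append(ptr[path[-1]])
    return [TAGS[i] for i in reversed(path)]

# The emission scores alone slightly prefer I-LOC at positions 2-3,
# but the transition scores force a structurally valid continuation.
emissions = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.2], [0.0, 1.0, 1.2]]
print(viterbi(emissions))  # ['B-PER', 'I-PER', 'I-PER']
```

This is the mechanism by which the CRF eliminates sequences like [B-PER, I-LOC, I-LOC] at decoding time.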
```
Input Characters
        ↓
┌─────────────────────────────────┐
│   KoELECTRA-base-v3 Encoder     │ → 768-dimensional representations
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│   BiLSTM Layer (256 × 2)        │ → Bidirectional context aggregation
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│   Linear Projection (512 → 13)  │ → Emission scores per tag
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│   CRF Layer                     │ → Viterbi decoding for optimal sequence
└─────────────────────────────────┘
        ↓
BIO-tagged Entity Predictions
```
To maximize performance, I employed several modern NLP techniques:
| Technique | Purpose | Implementation | Expected Gain |
|---|---|---|---|
| FGM (Fast Gradient Method) | Adversarial perturbations on embeddings improve robustness | ε=0.5 perturbation on word embeddings | +0.5-1.0% F1 |
| R-Drop | Consistency regularization via KL divergence between dropout variations | α=0.5 weight on KL term | +0.3-0.5% F1 |
| Multi-Sample Dropout | Averaging predictions across 5 dropout masks | 5 forward passes with different masks | +0.2-0.4% F1 |
| Mixed Precision (FP16) | Reduce memory footprint and accelerate training | PyTorch GradScaler with dynamic loss scaling | 2× memory efficiency |
| Differential Learning Rates | Preserve pretrained encoder knowledge while training new layers | Encoder: 3e-5, Head: 1e-3 | Faster convergence |
| Gradient Clipping | Prevent exploding gradients from CRF layer | max_norm=1.0 | Training stability |
Adversarial training improves model robustness by injecting targeted noise during training. FGM (Fast Gradient Method) perturbs embeddings in the direction that increases the loss:
1. Compute the loss L and backpropagate to get the gradient g with respect to the embeddings
2. Add the perturbation e_adv = ε · g / ‖g‖ (the normalized gradient direction)
3. Compute the adversarial loss L_adv with the perturbed embeddings
4. Backpropagate L_adv to accumulate gradients for the parameter update
5. Restore the original embeddings
This forces the model to learn representations that remain stable under small input perturbations, improving generalization.
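The perturbation step can be sketched with NumPy (just the math of the normalized-gradient step; the actual implementation operates on the model's embedding tensors in PyTorch):

```python
import numpy as np

def fgm_perturbation(grad, epsilon=0.5):
    """Compute the FGM perturbation e_adv = epsilon * g / ||g||,
    pushing the embedding in the direction that most increases the loss."""
    norm = np.linalg.norm(grad)
    if norm == 0:                      # guard against a zero gradient
        return np.zeros_like(grad)
    return epsilon * grad / norm

# Toy example: a gradient on a 3-dim embedding with ||g|| = 5
g = np.array([3.0, 0.0, 4.0])
delta = fgm_perturbation(g, epsilon=0.5)
print(delta)                            # [0.3 0.  0.4]
print(np.linalg.norm(delta))            # 0.5 — always exactly epsilon long
```

Because the perturbation is normalized, ε directly controls the attack strength regardless of the raw gradient magnitude.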
Dropout is usually viewed purely as regularization, but different dropout masks yield different predictions for the same input. R-Drop turns this stochasticity into a training signal by enforcing consistency:
- Forward pass with dropout → distribution P1
- Forward pass with different dropout mask → distribution P2
- Minimize KL divergence: KL(P1 || P2) + KL(P2 || P1)
This penalizes the model when different dropout masks produce divergent predictions, encouraging it to learn features that are robust to dropout noise.
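The consistency term can be sketched with NumPy (the distributions below are toy values; the real implementation computes the KL terms from logits in PyTorch and adds the penalty to the task loss):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

def rdrop_penalty(p1, p2, alpha=0.5):
    """Symmetric KL consistency term in the R-Drop style:
    alpha * 0.5 * (KL(P1 || P2) + KL(P2 || P1))."""
    return alpha * 0.5 * (kl(p1, p2) + kl(p2, p1))

# Two dropout passes over the same input produce slightly different
# tag distributions; the penalty is zero only when they agree.
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.6, 0.3, 0.1])
print(rdrop_penalty(p1, p2))   # small positive penalty
print(rdrop_penalty(p1, p1))   # 0.0 — identical distributions
```

The α=0.5 weight from the hyperparameter table balances this consistency term against the task loss.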
Transformer training typically requires 16-24GB VRAM for batch sizes that enable stable gradient estimates. Training on a consumer GPU (NVIDIA RTX 3050 with 4GB VRAM) imposed constraints:
| Parameter | Typical Setup | Constrained Setup | Trade-off |
|---|---|---|---|
| Batch Size | 64-128 | 32 | Noisier gradients, slower convergence |
| Sequence Length | 512 | 128 | Truncated long documents |
| Precision | FP32 | FP16 (mixed) | Numerical stability concerns |
Despite these limitations:
- Training time: 17 epochs in ~34 hours (2 hours/epoch)
- Memory management: Aggressive garbage collection after each epoch
- Result: Achieved 99.8% of baseline performance
This demonstrates that systematic engineering choices can compensate for hardware constraints—meaningful NLP research remains accessible without enterprise resources.
| Parameter | Value | Rationale |
|---|---|---|
| Batch Size | 32 | Maximum fitting in 4GB VRAM |
| Max Sequence Length | 128 | Balance coverage vs memory |
| Encoder Learning Rate | 3e-5 | Preserve pretrained knowledge |
| Head Learning Rate | 1e-3 | Faster learning for new layers |
| Warmup Ratio | 10% | Stabilize early training |
| Epochs | 17 | Until convergence (early stopping) |
| Gradient Clipping | 1.0 | Prevent CRF gradient explosions |
| FGM Epsilon | 0.5 | Standard adversarial strength |
| R-Drop Alpha | 0.5 | Balance task loss + consistency |
After 34 hours of training and achieving 85.9% validation F1, I built a Gradio demo for interactive testing. The results were catastrophic:
```
Input:    "삼성전자 이재용 회장" (Samsung Electronics Chairman Lee Jae-yong)
Expected: [ORG: 삼성전자] [PER: 이재용]
Actual:   Random nonsense predictions
```
Despite strong validation metrics, the model failed completely on custom inputs.
I systematically traced the inference pipeline:
- ✓ Model weights: Loaded correctly from checkpoint
- ✓ Label mapping: IDs matched training labels
- ✗ Tokenization: Mismatch discovered
The KLUE dataset uses character-level tokenization, not the morpheme-level tokenization I assumed:
| My Assumption | KLUE Reality |
|---|---|
| Tokens: `['김민수', '교수', '는']` | Tokens: `['김', '민', '수', ' ', '교', '수', '는']` |
| 3 tokens | 7 tokens |
| Labels aligned per morpheme | Labels aligned per character |
When I tokenized "김민수" as a single token, the model received 1 embedding instead of 3, completely misaligning the label sequence.
```python
def char_tokenize(text):
    """Character-level tokenization matching KLUE format."""
    return list(text)  # Each character becomes a token
```

This immediately fixed the demo. The model was never broken—only the preprocessing was misaligned.
A second subtle bug emerged: label IDs differed between training and inference:
```
Training (KLUE): [B-DT, I-DT, B-LC, I-LC, B-OG, ...]
Inference (bug): [O, B-PS, I-PS, B-LC, B-OG, ...]
```
When the model predicted ID 6 (B-PS in KLUE), the inference code interpreted it as B-OG. Every entity was systematically misclassified.
Root cause: I manually defined labels instead of loading them from the dataset's .features metadata.
Lesson learned: Always load label lists programmatically from dataset metadata to avoid synchronization bugs.
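The failure mode can be reproduced in a few lines (the label orderings below are invented for illustration, not the actual KLUE or buggy lists):

```python
# Two different orderings of the same tag set: the model's predicted
# IDs are only meaningful under the ordering used at training time.
training_labels = ["B-DT", "I-DT", "B-LC", "I-LC", "B-OG", "I-OG", "B-PS", "I-PS"]
buggy_labels    = ["O", "B-PS", "I-PS", "B-LC", "I-LC", "B-DT", "B-OG", "I-OG"]

pred_id = 6
print(training_labels[pred_id])  # B-PS — what the model actually predicted
print(buggy_labels[pred_id])     # B-OG — what the buggy demo displayed

# Fix (sketch): load the list from the dataset metadata instead of
# hard-coding it, e.g. with HuggingFace datasets something like:
#   labels = dataset.features["ner_tags"].feature.names
```

A single hard-coded list silently shifts every prediction; loading the names from the dataset makes the mismatch impossible by construction.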
| Metric | This Model | KLUE Baseline | Gap |
|---|---|---|---|
| F1 | 85.90% | 86.11% | -0.21% |
| Precision | 86.39% | — | — |
| Recall | 85.41% | — | — |
The model reaches 99.8% of baseline performance through careful architectural engineering and training strategy.
| Entity | F1 | Precision | Recall | Support | Analysis |
|---|---|---|---|---|---|
| QT (Quantity) | 91.8% | 91.6% | 92.0% | 3,150 | Strong: Numeric patterns are highly regular |
| TI (Time) | 91.5% | 90.6% | 92.3% | 545 | Strong: Temporal expressions are formulaic |
| PS (Person) | 88.3% | 90.8% | 85.9% | 4,418 | High precision, lower recall: Conservative on names |
| DT (Date) | 88.1% | 87.7% | 88.5% | 2,312 | Balanced: Calendar expressions well-structured |
| OG (Organization) | 77.8% | 77.2% | 78.4% | 2,182 | Weak: Confused with locations (noun + "-시" suffix pattern) |
| LC (Location) | 74.2% | 74.2% | 74.2% | 1,648 | Weak: Ambiguous with organizations |
Key Findings:
- Temporal/quantitative entities exceed 90% F1: Structured patterns (dates, times, numbers) are easiest to recognize
- Person entities show precision-recall gap: The model is conservative—it misses some names (85.9% recall) but rarely false-positives (90.8% precision)
- Location-Organization confusion: Both entity types fall below the 86% overall F1 threshold, indicating systematic confusion
The normalized confusion matrix reveals:
- Strong diagonal (0.82-0.99): Most predictions are correct
- B-LC ↔ B-OG spillover (6% each direction): Locations and organizations share naming patterns (e.g., "서울시" = Seoul City OR Seoul Metropolitan Government)
- I-OG → O leakage (10%): The model struggles with organization entity boundaries, prematurely terminating multi-token organizations
- Clean temporal tags: DT, TI, QT show minimal off-diagonal confusion
Analysis of 5,000 validation samples identified 4,592 total errors:
| Error Type | Count | Percentage | Example |
|---|---|---|---|
| Boundary Errors | 2,674 | 58.2% | Predicted "5월" instead of "5월 6일" |
| Spurious Entities | 888 | 19.3% | Tagged non-entity as entity |
| Missed Entities | 680 | 14.8% | Failed to detect entity |
| Type Confusion | 350 | 7.6% | Predicted "서울" as ORG instead of LOC |
Dominant pattern: 58% of errors are boundary errors—the model identifies that an entity exists but misjudges its span. This is particularly common with:
- Compound expressions: "5월 6일" (May 6th) split into "5월" and "6일"
- Multi-word organizations: "대한민국 정부" (Government of South Korea) truncated to "대한민국"
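The error taxonomy can be sketched at the span level (a toy illustration; the real logic lives in evaluate/error_analysis.py, and spans here are simplified `(start, end, type)` tuples):

```python
def categorize(gold, pred):
    """Classify prediction errors as type, boundary, spurious, or missed.
    gold and pred are lists of (start, end, entity_type) spans."""
    errors = []
    matched_gold = set()
    for p in pred:
        exact = [g for g in gold if g == p]
        if exact:                                 # exact match: not an error
            matched_gold.add(exact[0])
            continue
        same_span = [g for g in gold if g[:2] == p[:2]]
        overlap = [g for g in gold if g[0] < p[1] and p[0] < g[1]]
        if same_span:                             # right span, wrong type
            errors.append(("type", p))
            matched_gold.add(same_span[0])
        elif overlap:                             # right entity, wrong span
            errors.append(("boundary", p))
            matched_gold.add(overlap[0])
        else:                                     # no gold entity here at all
            errors.append(("spurious", p))
    for g in gold:
        if g not in matched_gold:                 # gold entity never predicted
            errors.append(("missed", g))
    return errors

gold = [(0, 5, "DT"), (10, 12, "LC"), (20, 23, "OG")]
pred = [(0, 2, "DT"),     # "5월" instead of "5월 6일" -> boundary error
        (10, 12, "OG")]   # correct span, wrong type   -> type confusion
print(categorize(gold, pred))
```

Counting each category over the validation set yields the distribution in the table above.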
| True↓ Pred→ | PS | LC | OG | DT | TI | QT |
|---|---|---|---|---|---|---|
| PS | — | 32 | 29 | 2 | 0 | 5 |
| LC | 20 | — | 99 | 1 | 0 | 1 |
| OG | 25 | 89 | — | 1 | 0 | 0 |
| DT | 0 | 0 | 2 | — | 1 | 12 |
| TI | 0 | 0 | 0 | 1 | — | 4 |
| QT | 2 | 3 | 1 | 15 | 5 | — |
LC ↔ OG confusion dominates (188 errors total). Korean administrative divisions use the suffix "-시" (city), creating ambiguity:
- Location context: "서울시는 대한민국의 수도이다" (Seoul is the capital of South Korea)
- Organization context: "서울시가 발표한 정책" (Policy announced by Seoul Metropolitan Government)
Without world knowledge, surface forms are insufficient to disambiguate.
The Gradio application provides real-time entity extraction with color-coded highlighting:
```bash
python app.py
```

Navigate to http://localhost:7860 to test custom Korean text.
Example output for "삼성전자 이재용 회장이 서울에서 기자회견을 열었다":
| Entity | Type | Color |
|---|---|---|
| 삼성전자 | 🏢 Organization | Blue |
| 이재용 | 👤 Person | Red |
| 서울 | 📍 Location | Teal |
- Python 3.11+
- CUDA-capable GPU (recommended) or CPU
- 4GB+ VRAM for inference, 4GB+ for training with mixed precision
```bash
# Clone the repository
git clone https://github.com/yourusername/korean_ner.git
cd korean_ner

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Download model weights
pip install huggingface_hub
huggingface-cli download mrleast/koelectra_ner --local-dir weights/
```

Run the demo:

```bash
python app.py
```

Run the evaluation:

```bash
cd evaluate
python evaluate.py
```

Generates classification reports, confusion matrices, and per-entity visualizations.
The training notebook (train_model.ipynb) includes:
- KLUE NER dataset loading and preprocessing
- Model architecture definition with CRF
- FGM adversarial training implementation
- R-Drop consistency regularization
- Mixed precision training with gradient scaling
- Checkpoint saving and early stopping
```bash
cd evaluate
python error_analysis.py  # Error categorization
python attention_viz.py   # Attention heatmaps
python benchmark.py       # GPU/CPU speed test
python calibration.py     # Confidence analysis
```

| Category | Technology |
|---|---|
| Framework | PyTorch 2.0+, Transformers |
| Encoder | KoELECTRA-base-v3 (monologg) |
| Structured Prediction | pytorch-crf |
| Evaluation | seqeval |
| Demo | Gradio |
| Visualization | Matplotlib, Seaborn |
A comprehensive ablation study framework is ready to quantify individual technique contributions:
```bash
cd ablation
python train.py                    # Run all 6 experiments
python train.py --experiment full  # Run a specific experiment
python train.py --status           # Check progress
python train.py --report           # Generate comparison report
```

Experiments:
| Name | BiLSTM | CRF | FGM | R-Drop |
|---|---|---|---|---|
| baseline | ✗ | ✗ | ✗ | ✗ |
| +crf | ✗ | ✓ | ✗ | ✗ |
| +bilstm | ✓ | ✗ | ✗ | ✗ |
| full_no_fgm | ✓ | ✓ | ✗ | ✓ |
| full_no_rdrop | ✓ | ✓ | ✓ | ✗ |
| full | ✓ | ✓ | ✓ | ✓ |
The script supports resume from interruption—checkpoints are saved after each epoch, and training automatically continues from the last completed epoch.
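The resume mechanism can be sketched as follows (the file names, status layout, and function names are hypothetical, not the script's actual interface):

```python
import json
import os

def load_status(path="ablation_status.json"):
    """Return the last completed epoch per experiment, or an empty dict."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def run_experiment(name, total_epochs, status_path="ablation_status.json"):
    """Run (or resume) one experiment, persisting progress after each epoch."""
    status = load_status(status_path)
    start = status.get(name, 0)        # resume after the last completed epoch
    for epoch in range(start, total_epochs):
        # ... train one epoch and save a model checkpoint here ...
        status[name] = epoch + 1
        with open(status_path, "w") as f:
            json.dump(status, f)       # persist progress after every epoch
    return status[name]
```

Because the status file is rewritten after every epoch, a crash mid-study loses at most one epoch of work, and re-running the same command simply skips completed experiments.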
1. Verify preprocessing assumptions early: The tokenization mismatch cost significant debugging time. Always inspect raw dataset examples before building pipelines.
2. Label mappings are subtle but critical: A simple ordering difference caused complete prediction failure. Load label lists programmatically from dataset metadata.
3. Structured prediction enforces coherence: CRFs eliminate invalid tag sequences through transition modeling. The full contribution will be quantified through the planned ablation study.
4. Modern regularization techniques are promising: FGM and R-Drop were adopted for their strong theoretical motivation. The ablation study will measure their individual contributions empirically.
1. The Importance of Structured Prediction
Independent token classification permits invalid sequences. CRFs enforce global coherence by modeling transition probabilities, providing consistent (if modest) improvements on sequence labeling.
2. Adversarial Training as Data Augmentation
FGM provides an efficient form of data augmentation—perturbing embeddings creates "synthetic" examples without additional annotation. This is particularly valuable when training data is limited.
3. Consistency Regularization for Robustness
R-Drop's KL penalty encourages predictions to remain stable across dropout masks. This reduces overfitting to specific neuron configurations, improving generalization.
4. Engineering Matters as Much as Architecture
The tokenization and label mapping bugs demonstrate that correctness is as important as sophistication. Systematic validation of preprocessing assumptions prevents subtle failures that undermine model performance.
- Park, S., et al. (2021). KLUE: Korean Language Understanding Evaluation. NeurIPS Datasets and Benchmarks Track.
- Park, J. (2020). KoELECTRA: Pretrained ELECTRA Model for Korean. GitHub Repository.
- Clark, K., et al. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. ICLR.
- Lample, G., et al. (2016). Neural Architectures for Named Entity Recognition. NAACL. (BiLSTM-CRF architecture)
- Miyato, T., et al. (2017). Adversarial Training Methods for Semi-Supervised Text Classification. ICLR. (FGM inspiration)
- Liang, X., et al. (2021). R-Drop: Regularized Dropout for Neural Networks. NeurIPS.
Amirbek Yaqubboyev
📧 akubbaevamirbek@gmail.com
🔗 GitHub
This project was developed as part of my graduate school application portfolio, demonstrating end-to-end NLP pipeline development from problem formulation through deployment and analysis.
Last updated: January 2026

