OpenNyAI is an open-source Natural Language Processing (NLP) initiative designed to build Sovereign Legal AI for the Indian Judicial System. With over 5.3 crore (53 million) pending cases, the Indian judiciary faces an unprecedented operational challenge. This project develops custom, domain-specific NLP models that understand the linguistic nuances of Indian Legal English, the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and multilingual judgments.
⚠️ Sovereign AI Philosophy: This project prioritizes indigenous, self-hosted models over generic Western-centric LLMs to ensure data sovereignty, linguistic alignment, and reduced hallucination in high-stakes legal environments.
| Task | Model Architecture | Purpose |
|---|---|---|
| Legal NER | InLegalBERT | Extract 14 entity types (PETITIONER, RESPONDENT, STATUTE, etc.) |
| Rhetorical Role Labeling | BiLSTM-CRF + Transformer | Segment judgments into 13 functional parts |
| Case Summarization | Llama 3 + RAG | Generate legally sound, structured summaries |
| Legal Reasoning | Instruction-Tuned Llama 3 | Draft arguments, simplify legal language |
| Judgment Prediction | InLegalBERT Classifier | Predict case outcomes based on ILDC corpus |
```
OpenNyAI/
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── setup.py                    # Package setup file
├── .env.example                # Environment variables template
│
├── data/
│   ├── raw/                    # Raw legal documents (Indian Kanoon, e-Courts)
│   ├── processed/              # Preprocessed data (CoNLL format, JSONL)
│   ├── annotations/            # Annotated datasets (NER, RRL)
│   └── corpora/                # Standard datasets (ILDC, InJudgements, Aalap)
│
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── scraper.py          # Indian Kanoon & e-Courts scraping
│   │   ├── loader.py           # Data loading utilities
│   │   ├── preprocessor.py     # Regex-based cleaning & normalization
│   │   └── chunker.py          # Semantic chunking for long documents
│   │
│   ├── models/
│   │   ├── inlegalbert.py      # InLegalBERT base encoder
│   │   ├── ner_model.py        # Legal Named Entity Recognition (14 classes)
│   │   ├── rrl_model.py        # Rhetorical Role Labeling (13 roles)
│   │   ├── summarizer.py       # Extractive + Abstractive Summarization
│   │   ├── classifier.py       # Document Classification
│   │   └── llama_reasoning.py  # Llama 3 instruction-tuned reasoning
│   │
│   ├── training/
│   │   ├── trainer.py          # HuggingFace Trainer wrapper
│   │   ├── lora_finetuning.py  # LoRA/QLoRA for Llama 3
│   │   └── evaluate.py         # seqeval, ROUGE, classification metrics
│   │
│   ├── rag/
│   │   ├── vectorstore.py      # Milvus/Chroma integration
│   │   ├── retriever.py        # Precedent retrieval
│   │   └── rag_pipeline.py     # RAG for evidence-based reasoning
│   │
│   └── utils/
│       ├── config.py           # Configuration management
│       ├── regex_patterns.py   # Indian legal citation patterns
│       └── helpers.py          # Utility functions
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_inlegalbert_ner.ipynb
│   ├── 04_rhetorical_roles.ipynb
│   └── 05_llama_finetuning.ipynb
│
├── scripts/
│   ├── scrape_indian_kanoon.py # Data collection script
│   ├── train.py                # Training script
│   ├── predict.py              # Inference script
│   └── evaluate.py             # Evaluation script
│
├── configs/
│   ├── model_config.yaml       # Model configurations
│   ├── training_config.yaml    # Training parameters
│   └── lora_config.yaml        # LoRA hyperparameters
│
└── tests/
    ├── test_data.py
    ├── test_models.py
    └── test_regex.py
```
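Because full judgments routinely exceed a BERT-style encoder's 512-token window, the layout includes `src/data/chunker.py` for chunking long documents. A minimal sliding-window sketch of the idea (the function name and the `overlap` parameter are illustrative, not the repository's actual API):

```python
def chunk_tokens(tokens: list[str], max_len: int = 510, overlap: int = 64) -> list[list[str]]:
    """Split a token sequence into overlapping windows that fit a BERT-style encoder."""
    if max_len <= overlap:
        raise ValueError("max_len must exceed overlap")
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += max_len - overlap  # step forward, keeping `overlap` tokens of shared context
    return chunks

# A 1200-token judgment becomes three overlapping windows
windows = chunk_tokens([f"tok{i}" for i in range(1200)])
```

The overlap preserves context across window boundaries so that entities straddling a cut are still seen whole by at least one window.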
Based on the OpenNyAI specification:
| Entity Type | Description | Example |
|---|---|---|
| COURT | Court name | "Supreme Court of India" |
| PETITIONER | Person/Org filing case | "Kesavananda Bharati" |
| RESPONDENT | Person/Org defending | "State of Kerala" |
| JUDGE | Presiding judge | "Hon'ble Justice DY Chandrachud" |
| LAWYER | Legal representatives | "Adv. Fali S. Nariman" |
| STATUTE | Legal Act | "Indian Penal Code" |
| PROVISION | Section/Article | "Section 302", "Article 21" |
| PRECEDENT | Case citations | "AIR 1973 SC 1461" |
| CASE_NUMBER | Case reference | "Writ Petition (C) No. 135/2019" |
| DATE | Important dates | "14th February 2024" |
| GPE | Geopolitical entity | "Maharashtra", "New Delhi" |
| ORG | Organization | "CBI", "RBI" |
| WITNESS | Witnesses | "PW-1", "DW-3" |
| EVIDENCE | Evidence references | "Exhibit P-1" |
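Several of these entity types follow rigid citation formats that regex can capture before any model runs, which is what `src/utils/regex_patterns.py` is for. An illustrative sketch (these exact patterns are examples, not necessarily the ones the repository ships):

```python
import re

# Illustrative patterns for a few of the entity types above
PRECEDENT = re.compile(r"\bAIR\s+(\d{4})\s+([A-Z][A-Za-z]*)\s+(\d+)\b")  # e.g. "AIR 1973 SC 1461"
PROVISION = re.compile(r"\b(Section|Article)\s+(\d+[A-Z]?)\b")           # e.g. "Section 302", "Article 21"
WITNESS = re.compile(r"\b([PD])W-(\d+)\b")                               # e.g. "PW-1" (prosecution witness 1)

text = "Relying on AIR 1973 SC 1461, the Court read Article 21 with Section 302; PW-1 deposed."
precedents = PRECEDENT.findall(text)  # [('1973', 'SC', '1461')]
provisions = PROVISION.findall(text)  # [('Article', '21'), ('Section', '302')]
witnesses = WITNESS.findall(text)     # [('P', '1')]
```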
| Role | Description |
|---|---|
| PREAMBLE | Case header, parties, court info |
| FACTS | Factual background of the case |
| ISSUE | Legal questions to be decided |
| ARGUMENT_PETITIONER | Arguments by petitioner's counsel |
| ARGUMENT_RESPONDENT | Arguments by respondent's counsel |
| ANALYSIS | Court's examination of issues |
| STATUTE | Statutory provisions discussed |
| PRECEDENT_RELIED | Cases cited and followed |
| PRECEDENT_NOT_RELIED | Cases cited but distinguished |
| RATIO | Legal principle established |
| RULING_LOWER_COURT | Lower court's decision |
| RULING_PRESENT_COURT | Current court's decision |
| NONE | Non-classifiable content |
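To make the labeling task concrete, here is a toy cue-word baseline over these roles. This is purely illustrative (the cue phrases are invented for the example); the project's actual labeler is a BiLSTM-CRF over transformer embeddings, not a keyword matcher:

```python
# Toy cue-word baseline for rhetorical role labeling -- illustrative only
CUES = {
    "ISSUE": ("whether", "question for consideration"),
    "RATIO": ("we hold", "it is held"),
    "PRECEDENT_RELIED": ("relied upon", "followed in"),
    "FACTS": ("the appellant filed", "the incident occurred"),
}

def guess_role(sentence: str) -> str:
    """Return the first role whose cue phrase appears in the sentence, else NONE."""
    s = sentence.lower()
    for role, cues in CUES.items():
        if any(cue in s for cue in cues):
            return role
    return "NONE"

print(guess_role("The question is whether Article 368 permits such an amendment."))  # ISSUE
print(guess_role("We hold that the basic structure cannot be abrogated."))           # RATIO
```

A learned sequence model is needed in practice because roles depend on discourse context (e.g. ANALYSIS vs FACTS), not on surface cues alone.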
```python
from transformers import AutoModel, AutoTokenizer

# Load InLegalBERT (pre-trained on 5.4M Indian legal documents)
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
```

Performance Comparison:
| Metric | BERT-Base | InLegalBERT | Improvement |
|---|---|---|---|
| NER F1 | ~78% | ~84% | +6% |
| RRL Accuracy | ~72% | ~79% | +7% |
| Convergence | Slower | Faster | ~30% fewer epochs |
| Dataset | Size | Use Case | Source |
|---|---|---|---|
| ILDC | 35K cases | Judgment prediction | OpenDataLab |
| InJudgements | Balanced sample | General training | HuggingFace |
| Aalap Instruction | Instruction pairs | LLM fine-tuning | HuggingFace |
| BUILD | Annotated | Rhetorical roles | GitHub |
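Instruction datasets like Aalap are typically distributed as JSONL. A minimal stdlib reader sketch; the field names (`instruction`, `input`, `output`) are an assumed common instruction-tuning schema, so check the actual dataset card before relying on them:

```python
import io
import json

def read_instruction_pairs(fp) -> list[dict]:
    """Parse JSONL instruction records into prompt/completion pairs.

    Assumes an 'instruction'/'input'/'output' schema -- an assumption,
    not a documented guarantee about the Aalap dataset.
    """
    pairs = []
    for line in fp:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        pairs.append({
            "prompt": (rec["instruction"] + "\n" + rec.get("input", "")).strip(),
            "completion": rec["output"],
        })
    return pairs

sample = io.StringIO(
    '{"instruction": "Summarize the holding.", "input": "", "output": "The Court held..."}\n'
)
pairs = read_instruction_pairs(sample)
```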
- Python 3.9+
- CUDA 11.8+ (for GPU training)
- 16GB+ RAM (32GB recommended for Llama 3)
```bash
# Clone the repository
git clone https://github.com/viru0909-dev/OpenNyAI.git
cd OpenNyAI

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm
```

```python
from src.models import LegalNERModel, RhetoricalRoleLabeler
from src.data import LegalTextPreprocessor

# Initialize preprocessor
preprocessor = LegalTextPreprocessor()

# Load NER model
ner_model = LegalNERModel(model_name="law-ai/InLegalBERT")
ner_model.load_model()

# Extract entities from judgment text
text = """
In the matter of Kesavananda Bharati v. State of Kerala,
the Hon'ble Supreme Court examined Article 368 of the Constitution.
The Court, comprising a 13-judge bench, delivered its verdict on 24th April 1973.
"""

entities = ner_model.predict(text)
print(entities)
# [{'text': 'Kesavananda Bharati', 'label': 'PETITIONER', ...},
#  {'text': 'State of Kerala', 'label': 'RESPONDENT', ...},
#  {'text': 'Article 368', 'label': 'PROVISION', ...}, ...]
```

Train the NER model:

```bash
python scripts/train.py \
  --model ner \
  --base-model law-ai/InLegalBERT \
  --train-data data/processed/ner_train.json \
  --val-data data/processed/ner_val.json \
  --output-dir models/legal_ner \
  --epochs 10 \
  --batch-size 16
```

Fine-tune Llama 3 with LoRA:

```bash
python scripts/train.py \
  --model llama \
  --base-model meta-llama/Meta-Llama-3-8B \
  --train-data data/corpora/aalap_instructions.jsonl \
  --output-dir models/legal_llama \
  --use-lora \
  --lora-r 16 \
  --lora-alpha 32
```

| Layer | Technology | Purpose |
|---|---|---|
| Encoder | InLegalBERT | Indian legal text understanding |
| Generator | Llama 3 (8B/70B) | Legal reasoning & drafting |
| Fine-tuning | LoRA/QLoRA | Parameter-efficient training |
| Serving | vLLM / Groq | High-throughput inference |
| Vector DB | Milvus / Chroma | RAG retrieval |
| Backend | FastAPI | Model serving |
| Orchestration | Spring Boot 3.2 | Enterprise integration |
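A quick sketch of why LoRA/QLoRA makes Llama 3 fine-tuning feasible on modest hardware: with `--lora-r 16`, each weight update is factored into two low-rank matrices instead of a full matrix. The arithmetic below uses Llama-3-8B's hidden size (4096) for a single square attention projection; it illustrates the parameter savings, not the repository's exact adapter placement:

```python
# LoRA replaces a full-rank weight update dW (d_out x d_in) with B @ A,
# where B is (d_out x r) and A is (r x d_in).
d_out = d_in = 4096  # one square attention projection in Llama-3-8B
r = 16               # matches --lora-r 16 above

full_params = d_out * d_in            # weights updated without LoRA
lora_params = r * d_in + d_out * r    # trainable adapter weights
fraction = lora_params / full_params  # well under 1%

print(full_params, lora_params, round(fraction * 100, 2))  # 16777216 131072 0.78
```

QLoRA adds 4-bit quantization of the frozen base weights on top of this, shrinking memory further at training time.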
- Project structure and base models
- Indian Kanoon scraping pipeline
- InLegalBERT NER fine-tuning
- Rhetorical Role Labeling (BiLSTM-CRF)
- Llama 3 instruction tuning with Aalap dataset
- RAG pipeline with Milvus
- Bhashini integration (multilingual)
- FastAPI inference server
- vLLM deployment configuration
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions! Please see our CONTRIBUTING.md for guidelines.
Building Sovereign Legal Intelligence for Accessible Justice in India
Made with ❤️ by the OpenNyAI Community