πŸ›οΈ OpenNyAI - Sovereign Legal Intelligence for Indian Judiciary


## 📋 Overview

OpenNyAI is an open-source Natural Language Processing (NLP) initiative that builds Sovereign Legal AI for the Indian judicial system. With over 5.3 crore pending cases, the Indian judiciary faces an unprecedented operational challenge. This project develops custom, domain-specific NLP models that understand the linguistic nuances of Indian Legal English, the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and multilingual judgments.

> ⚠️ **Sovereign AI Philosophy:** This project prioritizes indigenous, self-hosted models over generic, Western-centric LLMs to ensure data sovereignty, linguistic alignment, and reduced hallucination in high-stakes legal environments.

## 🎯 Project Objectives

| Task | Model Architecture | Purpose |
| --- | --- | --- |
| Legal NER | InLegalBERT | Extract 14 entity types (PETITIONER, RESPONDENT, STATUTE, etc.) |
| Rhetorical Role Labeling | BiLSTM-CRF + Transformer | Segment judgments into 13 functional parts |
| Case Summarization | Llama 3 + RAG | Generate legally sound, structured summaries |
| Legal Reasoning | Instruction-tuned Llama 3 | Draft arguments, simplify legal language |
| Judgment Prediction | InLegalBERT classifier | Predict case outcomes based on the ILDC corpus |

πŸ—οΈ Project Structure

OpenNyAI/
β”œβ”€β”€ README.md                    # Project documentation
β”œβ”€β”€ requirements.txt             # Python dependencies
β”œβ”€β”€ setup.py                     # Package setup file
β”œβ”€β”€ .env.example                 # Environment variables template
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ raw/                     # Raw legal documents (Indian Kanoon, e-Courts)
β”‚   β”œβ”€β”€ processed/               # Preprocessed data (CoNLL format, JSONL)
β”‚   β”œβ”€β”€ annotations/             # Annotated datasets (NER, RRL)
β”‚   └── corpora/                 # Standard datasets (ILDC, InJudgements, Aalap)
β”‚
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ data/
β”‚   β”‚   β”œβ”€β”€ scraper.py           # Indian Kanoon & e-Courts scraping
β”‚   β”‚   β”œβ”€β”€ loader.py            # Data loading utilities
β”‚   β”‚   β”œβ”€β”€ preprocessor.py      # Regex-based cleaning & normalization
β”‚   β”‚   └── chunker.py           # Semantic chunking for long documents
β”‚   β”‚
β”‚   β”œβ”€β”€ models/
β”‚   β”‚   β”œβ”€β”€ inlegalbert.py       # InLegalBERT base encoder
β”‚   β”‚   β”œβ”€β”€ ner_model.py         # Legal Named Entity Recognition (14 classes)
β”‚   β”‚   β”œβ”€β”€ rrl_model.py         # Rhetorical Role Labeling (13 roles)
β”‚   β”‚   β”œβ”€β”€ summarizer.py        # Extractive + Abstractive Summarization
β”‚   β”‚   β”œβ”€β”€ classifier.py        # Document Classification
β”‚   β”‚   └── llama_reasoning.py   # Llama 3 instruction-tuned reasoning
β”‚   β”‚
β”‚   β”œβ”€β”€ training/
β”‚   β”‚   β”œβ”€β”€ trainer.py           # HuggingFace Trainer wrapper
β”‚   β”‚   β”œβ”€β”€ lora_finetuning.py   # LoRA/QLoRA for Llama 3
β”‚   β”‚   └── evaluate.py          # seqeval, ROUGE, classification metrics
β”‚   β”‚
β”‚   β”œβ”€β”€ rag/
β”‚   β”‚   β”œβ”€β”€ vectorstore.py       # Milvus/Chroma integration
β”‚   β”‚   β”œβ”€β”€ retriever.py         # Precedent retrieval
β”‚   β”‚   └── rag_pipeline.py      # RAG for evidence-based reasoning
β”‚   β”‚
β”‚   └── utils/
β”‚       β”œβ”€β”€ config.py            # Configuration management
β”‚       β”œβ”€β”€ regex_patterns.py    # Indian legal citation patterns
β”‚       └── helpers.py           # Utility functions
β”‚
β”œβ”€β”€ notebooks/
β”‚   β”œβ”€β”€ 01_data_exploration.ipynb
β”‚   β”œβ”€β”€ 02_preprocessing.ipynb
β”‚   β”œβ”€β”€ 03_inlegalbert_ner.ipynb
β”‚   β”œβ”€β”€ 04_rhetorical_roles.ipynb
β”‚   └── 05_llama_finetuning.ipynb
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ scrape_indian_kanoon.py  # Data collection script
β”‚   β”œβ”€β”€ train.py                 # Training script
β”‚   β”œβ”€β”€ predict.py               # Inference script
β”‚   └── evaluate.py              # Evaluation script
β”‚
β”œβ”€β”€ configs/
β”‚   β”œβ”€β”€ model_config.yaml        # Model configurations
β”‚   β”œβ”€β”€ training_config.yaml     # Training parameters
β”‚   └── lora_config.yaml         # LoRA hyperparameters
β”‚
└── tests/
    β”œβ”€β”€ test_data.py
    β”œβ”€β”€ test_models.py
    └── test_regex.py

## 🧠 Model Architectures

### 1. Legal Named Entity Recognition (14 Entity Classes)

Based on the OpenNyAI specification:

| Entity Type | Description | Example |
| --- | --- | --- |
| COURT | Court name | "Supreme Court of India" |
| PETITIONER | Person/Org filing the case | "Kesavananda Bharati" |
| RESPONDENT | Person/Org defending | "State of Kerala" |
| JUDGE | Presiding judge | "Hon'ble Justice DY Chandrachud" |
| LAWYER | Legal representatives | "Adv. Fali S. Nariman" |
| STATUTE | Legal Act | "Indian Penal Code" |
| PROVISION | Section/Article | "Section 302", "Article 21" |
| PRECEDENT | Case citations | "AIR 1973 SC 1461" |
| CASE_NUMBER | Case reference | "Writ Petition (C) No. 135/2019" |
| DATE | Important dates | "14th February 2024" |
| GPE | Geopolitical entity | "Maharashtra", "New Delhi" |
| ORG | Organization | "CBI", "RBI" |
| WITNESS | Witnesses | "PW-1", "DW-3" |
| EVIDENCE | Evidence references | "Exhibit P-1" |
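Several of these entity types (PROVISION, PRECEDENT, CASE_NUMBER) follow rigid citation formats, which is why the project keeps regex patterns in `src/utils/regex_patterns.py`. The patterns below are an illustrative sketch of what such rules might look like; the module's actual pattern set may differ.

```python
import re

# Hypothetical Indian legal citation patterns, in the spirit of
# src/utils/regex_patterns.py (the real module's patterns may differ).
PROVISION_RE = re.compile(r"\b(?:Section|Sec\.?|Article|Art\.?)\s+\d+[A-Z]?\b")
PRECEDENT_RE = re.compile(r"\bAIR\s+\d{4}\s+SC\s+\d+\b")
CASE_NUMBER_RE = re.compile(r"\bWrit Petition \(C\) No\. \d+/\d{4}\b")

text = ("The Court examined Article 368 and Section 302, "
        "relying on AIR 1973 SC 1461 in Writ Petition (C) No. 135/2019.")

provisions = PROVISION_RE.findall(text)   # ['Article 368', 'Section 302']
precedents = PRECEDENT_RE.findall(text)   # ['AIR 1973 SC 1461']
```

Regex pre-tagging like this is often used to generate weak labels that seed the transformer-based NER annotation loop.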

### 2. Rhetorical Role Labeling (13 Roles)

| Role | Description |
| --- | --- |
| PREAMBLE | Case header, parties, court info |
| FACTS | Factual background of the case |
| ISSUE | Legal questions to be decided |
| ARGUMENT_PETITIONER | Arguments by petitioner's counsel |
| ARGUMENT_RESPONDENT | Arguments by respondent's counsel |
| ANALYSIS | Court's examination of issues |
| STATUTE | Statutory provisions discussed |
| PRECEDENT_RELIED | Cases cited and followed |
| PRECEDENT_NOT_RELIED | Cases cited but distinguished |
| RATIO | Legal principle established |
| RULING_LOWER_COURT | Lower court's decision |
| RULING_PRESENT_COURT | Current court's decision |
| NONE | Non-classifiable content |
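Annotated judgments carry one of these 13 roles per sentence. The sketch below shows one plausible per-sentence record layout and a validation helper; the exact JSONL schema used in `data/annotations/` is an assumption here, not the project's fixed format.

```python
# The 13 rhetorical roles from the table above. The record layout below is an
# illustrative assumption, not the project's canonical annotation schema.
RHETORICAL_ROLES = {
    "PREAMBLE", "FACTS", "ISSUE", "ARGUMENT_PETITIONER", "ARGUMENT_RESPONDENT",
    "ANALYSIS", "STATUTE", "PRECEDENT_RELIED", "PRECEDENT_NOT_RELIED",
    "RATIO", "RULING_LOWER_COURT", "RULING_PRESENT_COURT", "NONE",
}

def validate_record(record: dict) -> bool:
    """Check that every sentence in an annotated judgment carries a known role."""
    return all(s["role"] in RHETORICAL_ROLES for s in record["sentences"])

record = {
    "doc_id": "example_001",
    "sentences": [
        {"text": "IN THE SUPREME COURT OF INDIA ...", "role": "PREAMBLE"},
        {"text": "The petitioner challenged the amendment.", "role": "FACTS"},
        {"text": "Whether Article 368 permits such an amendment.", "role": "ISSUE"},
    ],
}
```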

### 3. InLegalBERT - Sovereign Foundation

```python
from transformers import AutoModel, AutoTokenizer

# Load InLegalBERT (pre-trained on 5.4M Indian legal documents)
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
```

**Performance Comparison:**

| Metric | BERT-Base | InLegalBERT | Improvement |
| --- | --- | --- | --- |
| NER F1 | ~78% | ~84% | +6% |
| RRL Accuracy | ~72% | ~79% | +7% |
| Convergence | Slower | Faster | ~30% fewer epochs |

## 📊 Datasets

| Dataset | Size | Use Case | Source |
| --- | --- | --- | --- |
| ILDC | 35K cases | Judgment prediction | OpenDataLab |
| InJudgements | Balanced sample | General training | HuggingFace |
| Aalap | Instruction pairs | LLM fine-tuning | HuggingFace |
| BUILD | Annotated | Rhetorical roles | GitHub |
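For NER training, judgments from these corpora are serialized into the CoNLL format referenced under `data/processed/`. A minimal sketch of that two-column serialization; the tokenization and BIO tagging shown are illustrative assumptions:

```python
# Serialize token/BIO-tag pairs into two-column CoNLL lines, the format
# referenced under data/processed/. Tokenization details are assumptions.
def to_conll(tokens: list[str], tags: list[str]) -> str:
    assert len(tokens) == len(tags), "one tag per token"
    return "\n".join(f"{tok}\t{tag}" for tok, tag in zip(tokens, tags))

tokens = ["Kesavananda", "Bharati", "v.", "State", "of", "Kerala"]
tags   = ["B-PETITIONER", "I-PETITIONER", "O",
          "B-RESPONDENT", "I-RESPONDENT", "I-RESPONDENT"]
conll = to_conll(tokens, tags)
```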

## 🚀 Quick Start

### Prerequisites

- Python 3.9+
- CUDA 11.8+ (for GPU training)
- 16GB+ RAM (32GB recommended for Llama 3)

### Installation

```bash
# Clone the repository
git clone https://github.com/viru0909-dev/OpenNyAI.git
cd OpenNyAI

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download the spaCy model
python -m spacy download en_core_web_sm
```

### Basic Usage

```python
from src.models import LegalNERModel, RhetoricalRoleLabeler
from src.data import LegalTextPreprocessor

# Initialize preprocessor
preprocessor = LegalTextPreprocessor()

# Load NER model
ner_model = LegalNERModel(model_name="law-ai/InLegalBERT")
ner_model.load_model()

# Extract entities from judgment text
text = """
In the matter of Kesavananda Bharati v. State of Kerala,
the Hon'ble Supreme Court examined Article 368 of the Constitution.
The Court, comprising a 13-judge bench, delivered its verdict on 24th April 1973.
"""

entities = ner_model.predict(text)
print(entities)
# [{'text': 'Kesavananda Bharati', 'label': 'PETITIONER', ...},
#  {'text': 'State of Kerala', 'label': 'RESPONDENT', ...},
#  {'text': 'Article 368', 'label': 'PROVISION', ...}, ...]
```

## 🔧 Training Custom Models

### Fine-tune InLegalBERT for NER

```bash
python scripts/train.py \
    --model ner \
    --base-model law-ai/InLegalBERT \
    --train-data data/processed/ner_train.json \
    --val-data data/processed/ner_val.json \
    --output-dir models/legal_ner \
    --epochs 10 \
    --batch-size 16
```
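The project evaluates NER with seqeval-style, entity-level metrics (see `src/training/evaluate.py`). The pure-Python sketch below shows the BIO span decoding that underlies that entity-level F1; it is an illustration of the metric, not the project's actual evaluation code.

```python
# Decode BIO tag sequences into labeled spans, then score at the entity level,
# the way seqeval-style F1 works. Illustrative only; evaluate.py uses seqeval.
def bio_to_spans(tags: list[str]) -> list[tuple[str, int, int]]:
    """Decode a BIO tag sequence into (label, start, end) spans, end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        boundary = (tag.startswith("B-") or tag == "O"
                    or (tag.startswith("I-") and tag[2:] != label))
        if boundary:
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag.startswith(("B-", "I-")) else (None, None)
    return spans

gold = ["B-PETITIONER", "I-PETITIONER", "O", "B-STATUTE"]
pred = ["B-PETITIONER", "I-PETITIONER", "O", "O"]

tp = len(set(bio_to_spans(gold)) & set(bio_to_spans(pred)))
precision = tp / max(len(bio_to_spans(pred)), 1)  # 1.0: the predicted span is exact
recall = tp / max(len(bio_to_spans(gold)), 1)     # 0.5: the STATUTE span was missed
```

Entity-level scoring is stricter than token accuracy: a span only counts if both its boundaries and its label match exactly.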

### Fine-tune Llama 3 with LoRA

```bash
python scripts/train.py \
    --model llama \
    --base-model meta-llama/Meta-Llama-3-8B \
    --train-data data/corpora/aalap_instructions.jsonl \
    --output-dir models/legal_llama \
    --use-lora \
    --lora-r 16 \
    --lora-alpha 32
```
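The same hyperparameters can live in `configs/lora_config.yaml` instead of the command line. The fragment below is an illustrative shape for that file, with field names mirroring the CLI flags above and common PEFT hyperparameters; the project's actual keys are assumptions here.

```yaml
# Illustrative shape of configs/lora_config.yaml (field names are assumptions).
lora:
  r: 16                # rank (--lora-r)
  alpha: 32            # scaling factor (--lora-alpha)
  dropout: 0.05
  target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"]
  task_type: CAUSAL_LM
```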

πŸ› οΈ Technology Stack

Layer Technology Purpose
Encoder InLegalBERT Indian legal text understanding
Generator Llama 3 (8B/70B) Legal reasoning & drafting
Fine-tuning LoRA/QLoRA Parameter-efficient training
Serving vLLM / Groq High-throughput inference
Vector DB Milvus / Chroma RAG retrieval
Backend FastAPI Model serving
Orchestration Spring Boot 3.2 Enterprise integration
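Long judgments exceed encoder context limits, so they must be chunked before being indexed into the vector DB for RAG retrieval. A minimal sliding-window sketch; the real `src/data/chunker.py` performs semantic chunking, which this window-based illustration does not attempt.

```python
# Split a long judgment into overlapping word windows before vector indexing.
# A simplification of src/data/chunker.py, which chunks semantically.
def chunk_text(words: list[str], max_words: int = 128, overlap: int = 32) -> list[str]:
    """Return overlapping windows of at most max_words words each."""
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last window reached the end
            break
    return chunks

words = ("the court held that the basic structure doctrine limits " * 40).split()
chunks = chunk_text(words, max_words=128, overlap=32)
```

The overlap keeps sentences that straddle a window boundary retrievable from at least one chunk, at the cost of some index redundancy.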

## 📈 Roadmap

- Project structure and base models
- Indian Kanoon scraping pipeline
- InLegalBERT NER fine-tuning
- Rhetorical Role Labeling (BiLSTM-CRF)
- Llama 3 instruction tuning with the Aalap dataset
- RAG pipeline with Milvus
- Bhashini integration (multilingual)
- FastAPI inference server
- vLLM deployment configuration

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🤝 Contributing

We welcome contributions! Please see our CONTRIBUTING.md for guidelines.


**Building Sovereign Legal Intelligence for Accessible Justice in India**

Made with ❤️ by the OpenNyAI Community
