Viswa2-20/ALM-MVP-SIH

ALM (Audio Language Model) MVP

A hackathon-ready MVP that processes audio clips to perform ASR (Automatic Speech Recognition) and sound event detection, fuses the outputs into an Audio Memory Graph (AMG), and answers natural language questions using LLMs.

Features

  • Audio Processing: Upload 20-30 second audio clips
  • ASR: Whisper-based speech transcription with timestamps
  • Event Detection: YAMNet-based non-speech sound detection
  • AMG Fusion: Combines transcripts and events into unified timeline
  • Q&A Interface: Natural language questions about audio content
  • Dual LLM Support: OpenAI API or local Hugging Face models
  • Evaluation Suite: WER and F1 metrics for performance assessment
  • Privacy-First: Auto-deletes uploaded audio (configurable)

Quick Start

Option 1: Docker (Recommended)

# Clone and navigate to the repository
git clone <repository-url>
cd ALM-MVP

# Set your OpenAI API key
export OPENAI_API_KEY="your-openai-api-key"

# Build and run with Docker Compose
docker-compose up --build

# Access the app at http://localhost:8501

Option 2: Local Installation

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-openai-api-key"  # Required when LLM_MODE="openai"
export LLM_MODE="openai"  # or "local"
export MAX_AUDIO_SEC=30
export SAVE_RAW=false

# Run the application
streamlit run app/app.py

πŸ“ Project Structure

ALM-MVP/
├── app/
│   └── app.py                 # Streamlit web interface
├── models/
│   ├── asr.py                 # Whisper ASR module
│   └── events.py              # YAMNet event detection
├── fusion/
│   └── build_amg.py           # AMG timeline fusion
├── reasoning/
│   └── reasoner.py            # LLM reasoning module
├── tests/
│   ├── emergency_labels.json  # Test case 1: Emergency
│   ├── calm_chat_labels.json  # Test case 2: Calm conversation
│   ├── traffic_ambiguous_labels.json  # Test case 3: Ambiguous
│   └── README.md              # Instructions for creating test audio
├── config/                    # Configuration files
├── requirements.txt           # Python dependencies
├── Dockerfile                 # Docker configuration
├── docker-compose.yml         # Docker Compose setup
├── evaluate.py                # Evaluation script
└── README.md                  # This file

Configuration

Environment Variables

Variable         Default   Description
LLM_MODE         openai    LLM mode: openai or local
OPENAI_API_KEY   -         OpenAI API key (required for OpenAI mode)
MAX_AUDIO_SEC    30        Maximum audio duration in seconds
SAVE_RAW         false     Save uploaded audio files (debug only)
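These variables can be read once at startup with standard-library defaults. A minimal sketch of such a loader (the `Config` dataclass and `load_config` name are illustrative, not the repository's actual code):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Config:
    llm_mode: str
    openai_api_key: Optional[str]
    max_audio_sec: int
    save_raw: bool

def load_config() -> Config:
    """Build a Config from environment variables, using the documented defaults."""
    return Config(
        llm_mode=os.environ.get("LLM_MODE", "openai"),
        openai_api_key=os.environ.get("OPENAI_API_KEY"),
        max_audio_sec=int(os.environ.get("MAX_AUDIO_SEC", "30")),
        save_raw=os.environ.get("SAVE_RAW", "false").lower() == "true",
    )
```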

Model Configuration

  • ASR: Whisper (configurable model size: tiny, base, small, medium, large)
  • Event Detection: YAMNet (Google's AudioSet model)
  • LLM: OpenAI GPT-3.5-turbo or local Hugging Face models

Usage

Web Interface

  1. Upload Audio: Select an audio file (WAV, MP3, M4A, FLAC, OGG)
  2. Process: Click "Process Audio" to run ASR and event detection
  3. View Timeline: See the unified AMG timeline with speech and events
  4. Ask Questions: Use the chat interface to ask about the audio content

Example Questions

  • "Was this an emergency situation?"
  • "What sounds were detected in the audio?"
  • "What did the person say?"
  • "Describe what happened in this audio clip."

API Usage

from models.asr import transcribe
from models.events import detect_events
from fusion.build_amg import build_amg
from reasoning.reasoner import answer_question

# Process audio
transcripts = transcribe("audio.wav")
events = detect_events("audio.wav")
amg_timeline = build_amg(transcripts, events)

# Ask questions
result = answer_question(amg_timeline, "Was this an emergency?")
print(result["answer"])
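Conceptually, the fusion step merges timestamped ASR segments and sound events into one ordered timeline. A hedged sketch of that merge (the segment dictionaries and their `start`/`end`/`label` fields are an assumed schema, not necessarily what `build_amg` uses internally):

```python
def merge_timeline(transcripts, events):
    """Merge speech segments and sound events into one list sorted by start time.

    Each item is a dict with 'start', 'end', and 'label' keys; a 'type' key is
    added to distinguish speech from non-speech events (illustrative schema).
    """
    timeline = (
        [{**t, "type": "speech"} for t in transcripts]
        + [{**e, "type": "event"} for e in events]
    )
    return sorted(timeline, key=lambda item: (item["start"], item["end"]))

# Example usage with toy segments
speech = [{"start": 0.0, "end": 2.5, "label": "Help, there's a fire!"}]
sounds = [{"start": 1.0, "end": 4.0, "label": "Siren"}]
timeline = merge_timeline(speech, sounds)
```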

Evaluation

Running Evaluation

# Evaluate all test cases
python evaluate.py --test_dir tests

# Evaluate single file
python evaluate.py --single_file tests/emergency.wav --ground_truth tests/emergency_labels.json

# View results
cat evaluation_results.json

Metrics

  • WER (Word Error Rate): ASR transcription accuracy
  • F1 Score: Event detection precision/recall
  • Emergency Detection Accuracy: Binary classification performance
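For reference, WER is the word-level edit distance between hypothesis and reference, normalised by the reference word count. A self-contained sketch of the computation (evaluate.py may compute it differently, e.g. via a library):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```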

Test Cases

  1. Emergency: Siren + panicked voice + alarm sounds
  2. Calm Chat: Friendly conversation + door closing
  3. Traffic Ambiguous: Car horn + shouting + friendly resolution

Docker Deployment

Development

# Build and run
docker-compose up --build

# Run evaluation
docker-compose --profile eval up alm-eval

Production

# Build production image
docker build -t alm-mvp .

# Run with environment variables
docker run -p 8501:8501 \
  -e OPENAI_API_KEY="your-key" \
  -e LLM_MODE="openai" \
  alm-mvp

Privacy & Security

  • Auto-delete: Uploaded audio files are automatically deleted after processing
  • No Persistence: No audio data is stored permanently
  • Configurable: Set SAVE_RAW=true for debugging (not recommended for production)
  • Consent Required: Document user consent for audio processing
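The auto-delete guarantee amounts to removing the uploaded file even when processing fails. A minimal sketch using try/finally (the `pipeline` callable stands in for the real ASR + event-detection run; function names are illustrative):

```python
import os

def process_with_cleanup(audio_path, pipeline, save_raw=False):
    """Run `pipeline` on the uploaded file, then delete it unless save_raw is set."""
    try:
        return pipeline(audio_path)
    finally:
        if not save_raw and os.path.exists(audio_path):
            os.remove(audio_path)
```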

Development

Prerequisites

  • Python 3.9+
  • FFmpeg (for audio processing)
  • CUDA (optional, for GPU acceleration)

Setup Development Environment

# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
pytest

# Format code
black .

# Lint code
flake8 .

Adding New Models

  1. ASR: Extend models/asr.py with new model class
  2. Events: Extend models/events.py with new detector
  3. LLM: Extend reasoning/reasoner.py with new reasoner
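One way to keep new backends swappable across these modules is a small registry keyed by name. A hypothetical sketch (the repository's modules may simply use plain functions; the decorator and `dummy` backend are illustrative):

```python
from typing import Callable, Dict, List

ASR_BACKENDS: Dict[str, Callable[[str], List[dict]]] = {}

def register_asr(name: str):
    """Decorator that registers a transcription function under a backend name."""
    def wrap(fn):
        ASR_BACKENDS[name] = fn
        return fn
    return wrap

@register_asr("dummy")
def dummy_transcribe(audio_path: str) -> List[dict]:
    # A stand-in backend; a real one would call Whisper or another model.
    return [{"start": 0.0, "end": 1.0, "text": "(silence)"}]

def transcribe_with(backend: str, audio_path: str) -> List[dict]:
    return ASR_BACKENDS[backend](audio_path)
```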

Performance

Benchmarks

  • Processing Time: < 30 seconds for 30-second audio
  • Memory Usage: ~2GB RAM (with small models)
  • Accuracy:
    • WER: ~0.1-0.3 (depending on audio quality)
    • Event F1: ~0.7-0.9 (depending on event types)

Optimization

  • Use smaller models for faster processing
  • Enable GPU acceleration for better performance
  • Implement streaming for longer audio files
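Streaming longer files can be approximated by processing fixed-length windows with a small overlap, so events at window boundaries are not cut in half. A sketch of the chunking arithmetic only (window and overlap sizes are illustrative; the real pipeline would also handle sample rates):

```python
def chunk_spans(duration_sec: float, window_sec: float = 30.0, overlap_sec: float = 2.0):
    """Yield (start, end) spans covering the clip, overlapping by overlap_sec."""
    step = window_sec - overlap_sec
    start = 0.0
    while start < duration_sec:
        yield (start, min(start + window_sec, duration_sec))
        start += step

spans = list(chunk_spans(70.0))
```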

Troubleshooting

Common Issues

  1. CUDA Out of Memory: Use smaller models or CPU-only mode
  2. Audio Format Issues: Convert to WAV format
  3. OpenAI API Errors: Check API key and rate limits
  4. Model Download Issues: Ensure internet connection for first run

Debug Mode

# Enable debug logging
export LOG_LEVEL=DEBUG

# Save raw audio for inspection
export SAVE_RAW=true

# Use local models only
export LLM_MODE=local

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OpenAI Whisper: For ASR capabilities
  • Google YAMNet: For audio event detection
  • Streamlit: For the web interface
  • Hugging Face: For local LLM support
