A hackathon-ready MVP that processes audio clips to perform ASR (Automatic Speech Recognition) and sound event detection, fuses the outputs into an Audio Memory Graph (AMG), and answers natural language questions using LLMs.
- Audio Processing: Upload 20-30 second audio clips
- ASR: Whisper-based speech transcription with timestamps
- Event Detection: YAMNet-based non-speech sound detection
- AMG Fusion: Combines transcripts and events into unified timeline
- Q&A Interface: Natural language questions about audio content
- Dual LLM Support: OpenAI API or local Hugging Face models
- Evaluation Suite: WER and F1 metrics for performance assessment
- Privacy-First: Auto-deletes uploaded audio (configurable)
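The AMG fusion step can be pictured as a merge of two timestamped streams into one sorted timeline. A minimal sketch, assuming a plausible segment/event shape (the real `fusion/build_amg.py` schema may differ):

```python
# Hypothetical sketch of AMG fusion: merge timestamped ASR segments and
# detected sound events into one chronologically sorted timeline.
# The dict shapes below are illustrative assumptions, not the project's schema.

def build_amg_sketch(transcripts, events):
    timeline = []
    for seg in transcripts:  # e.g. {"start": 0.0, "end": 2.1, "text": "help!"}
        timeline.append({"t": seg["start"], "type": "speech", "content": seg["text"]})
    for ev in events:        # e.g. {"start": 1.5, "label": "Siren", "score": 0.92}
        timeline.append({"t": ev["start"], "type": "event", "content": ev["label"]})
    return sorted(timeline, key=lambda item: item["t"])

amg = build_amg_sketch(
    [{"start": 0.0, "end": 2.1, "text": "help!"}],
    [{"start": 1.5, "label": "Siren", "score": 0.92}],
)
```

Sorting on a single timestamp keeps the downstream LLM prompt simple: speech and non-speech evidence interleave in the order they occurred.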
```bash
# Clone and navigate to the repository
git clone <repository-url>
cd ALM-MVP

# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"

# Build and run with Docker Compose
docker-compose up --build

# Access the app at http://localhost:8501
```

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export OPENAI_API_KEY="your-api-key-here"  # Optional for OpenAI mode
export LLM_MODE="openai"  # or "local"
export MAX_AUDIO_SEC=30
export SAVE_RAW=false

# Run the application
streamlit run app/app.py
```

```
ALM-MVP/
├── app/
│   └── app.py                         # Streamlit web interface
├── models/
│   ├── asr.py                         # Whisper ASR module
│   └── events.py                      # YAMNet event detection
├── fusion/
│   └── build_amg.py                   # AMG timeline fusion
├── reasoning/
│   └── reasoner.py                    # LLM reasoning module
├── tests/
│   ├── emergency_labels.json          # Test case 1: Emergency
│   ├── calm_chat_labels.json          # Test case 2: Calm conversation
│   ├── traffic_ambiguous_labels.json  # Test case 3: Ambiguous
│   └── README.md                      # Instructions for creating test audio
├── config/                            # Configuration files
├── requirements.txt                   # Python dependencies
├── Dockerfile                         # Docker configuration
├── docker-compose.yml                 # Docker Compose setup
├── evaluate.py                        # Evaluation script
└── README.md                          # This file
```
| Variable | Default | Description |
|---|---|---|
| `LLM_MODE` | `openai` | LLM mode: `openai` or `local` |
| `OPENAI_API_KEY` | - | OpenAI API key (required for OpenAI mode) |
| `MAX_AUDIO_SEC` | `30` | Maximum audio duration in seconds |
| `SAVE_RAW` | `false` | Save uploaded audio files (debug only) |
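These variables would typically be read once at startup. A minimal sketch of that, assuming defaults matching the table (the app's actual config loading may differ):

```python
import os

# Read configuration from the environment; variable names match the table
# above, defaults and parsing details are assumptions.
LLM_MODE = os.getenv("LLM_MODE", "openai")
MAX_AUDIO_SEC = int(os.getenv("MAX_AUDIO_SEC", "30"))
SAVE_RAW = os.getenv("SAVE_RAW", "false").lower() == "true"
```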
- ASR: Whisper (configurable model size: tiny, base, small, medium, large)
- Event Detection: YAMNet (Google's AudioSet model)
- LLM: OpenAI GPT-3.5-turbo or local Hugging Face models
- Upload Audio: Select an audio file (WAV, MP3, M4A, FLAC, OGG)
- Process: Click "Process Audio" to run ASR and event detection
- View Timeline: See the unified AMG timeline with speech and events
- Ask Questions: Use the chat interface to ask about the audio content
- "Was this an emergency situation?"
- "What sounds were detected in the audio?"
- "What did the person say?"
- "Describe what happened in this audio clip."
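To answer questions like these, the reasoner presumably serializes the AMG timeline into the LLM prompt alongside the question. A hypothetical sketch of that serialization (the entry shape and prompt format are assumptions, not the actual `reasoning/reasoner.py` internals):

```python
def timeline_to_prompt(timeline, question):
    # Render each timeline entry as a "[t=1.5s] EVENT: Siren" style line;
    # the entry shape here is an assumption about the AMG format.
    lines = [f"[t={e['t']:.1f}s] {e['type'].upper()}: {e['content']}" for e in timeline]
    return "Audio timeline:\n" + "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"

prompt = timeline_to_prompt(
    [{"t": 0.0, "type": "speech", "content": "help!"},
     {"t": 1.5, "type": "event", "content": "Siren"}],
    "Was this an emergency?",
)
```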
```python
from models.asr import transcribe
from models.events import detect_events
from fusion.build_amg import build_amg
from reasoning.reasoner import answer_question

# Process audio
transcripts = transcribe("audio.wav")
events = detect_events("audio.wav")
amg_timeline = build_amg(transcripts, events)

# Ask questions
result = answer_question(amg_timeline, "Was this an emergency?")
print(result["answer"])
```

```bash
# Evaluate all test cases
python evaluate.py --test_dir tests

# Evaluate single file
python evaluate.py --single_file tests/emergency.wav --ground_truth tests/emergency_labels.json

# View results
cat evaluation_results.json
```

- WER (Word Error Rate): ASR transcription accuracy
- F1 Score: Event detection precision/recall
- Emergency Detection Accuracy: Binary classification performance
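WER is the word-level edit distance between the reference transcript and the hypothesis, normalized by the reference length. `evaluate.py` may compute it via a library such as `jiwer`; this standalone sketch shows the metric itself:

```python
# Word Error Rate: Levenshtein distance over word sequences, normalized
# by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("call an ambulance now", "call the ambulance now"))  # 0.25
```

One substitution in a four-word reference gives WER 0.25, which is why the target range below is quoted per clip rather than per word.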
- Emergency: Siren + panicked voice + alarm sounds
- Calm Chat: Friendly conversation + door closing
- Traffic Ambiguous: Car horn + shouting + friendly resolution
```bash
# Build and run
docker-compose up --build

# Run evaluation
docker-compose --profile eval up alm-eval
```

```bash
# Build production image
docker build -t alm-mvp .

# Run with environment variables
docker run -p 8501:8501 \
  -e OPENAI_API_KEY="your-key" \
  -e LLM_MODE="openai" \
  alm-mvp
```

- Auto-delete: Uploaded audio files are automatically deleted after processing
- No Persistence: No audio data is stored permanently
- Configurable: Set `SAVE_RAW=true` for debugging (not recommended for production)
- Consent Required: Document user consent for audio processing
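The auto-delete behavior amounts to a try/finally around processing. A sketch under assumptions (the `process` callback stands in for the real ASR/event pipeline; the app's actual upload handling may differ):

```python
import os
import tempfile

# Write the upload to a temp file, process it, then remove it
# unconditionally unless SAVE_RAW-style debugging is enabled.

def handle_upload(raw_bytes: bytes, process, save_raw: bool = False):
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        return process(path)
    finally:
        if not save_raw:
            os.remove(path)
```

Putting the cleanup in `finally` guarantees deletion even when processing raises, which is what makes the privacy claim hold under error paths.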
- Python 3.9+
- FFmpeg (for audio processing)
- CUDA (optional, for GPU acceleration)
```bash
# Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8

# Run tests
pytest

# Format code
black .

# Lint code
flake8 .
```

- ASR: Extend `models/asr.py` with a new model class
- Events: Extend `models/events.py` with a new detector
- LLM: Extend `reasoning/reasoner.py` with a new reasoner
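As an illustration of the ASR extension point: a new backend only needs to return segments in the shape the rest of the pipeline consumes. Everything below (the class name, the segment schema, the pluggable `backend` parameter) is a hypothetical sketch, not the project's confirmed interface:

```python
# Hypothetical sketch of plugging a new backend into models/asr.py.
# The segment schema ({"start", "end", "text"}) is an assumption.

class DummyASR:
    """Stand-in backend that 'transcribes' by returning canned segments."""

    def transcribe(self, audio_path: str):
        # A real backend would load audio_path and run a model here.
        return [{"start": 0.0, "end": 1.2, "text": "hello"}]

def transcribe(audio_path: str, backend=None):
    backend = backend or DummyASR()
    return backend.transcribe(audio_path)

segments = transcribe("clip.wav")
```

Keeping the segment schema stable means `build_amg` and the reasoner never need to know which model produced the transcript.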
- Processing Time: < 30 seconds for 30-second audio
- Memory Usage: ~2GB RAM (with small models)
- Accuracy:
- WER: ~0.1-0.3 (depending on audio quality)
- Event F1: ~0.7-0.9 (depending on event types)
- Use smaller models for faster processing
- Enable GPU acceleration for better performance
- Implement streaming for longer audio files
- CUDA Out of Memory: Use smaller models or CPU-only mode
- Audio Format Issues: Convert to WAV format
- OpenAI API Errors: Check API key and rate limits
- Model Download Issues: Ensure internet connection for first run
```bash
# Enable debug logging
export LOG_LEVEL=DEBUG

# Save raw audio for inspection
export SAVE_RAW=true

# Use local models only
export LLM_MODE=local
```

- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper: For ASR capabilities
- Google YAMNet: For audio event detection
- Streamlit: For the web interface
- Hugging Face: For local LLM support