DepoSense aims to revolutionize the traditional legal deposition process. Currently, depositions are recorded for post-mortem review, but the live session often lacks real-time analytical support. DepoSense enhances both the active deposition and the subsequent review by providing immediate, AI-driven insights.
Our software processes multi-modal streaming data from a web browser to a Python backend, delivering actionable intelligence back to the user interface in real-time.
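To make that data flow concrete, here is a minimal sketch of a WebSocket endpoint that accepts binary audio chunks from the browser and pushes JSON insights back. It assumes a `websockets`-based transport similar in spirit to VoiceStreamAI's server; the actual protocol is defined in that repository, and the handler below is only a placeholder.

```python
# Minimal sketch of the browser -> backend -> browser loop. This is NOT the
# VoiceStreamAI protocol; it only illustrates the streaming architecture.
import asyncio
import json

import websockets  # Assumes a recent websockets release (single-argument handlers)


async def handle_client(websocket):
    """Receive raw audio chunks and send back JSON 'insight' messages."""
    async for message in websocket:
        if isinstance(message, bytes):
            # Placeholder: a real backend would run VAD, ASR, and emotion
            # analysis here before replying with structured insights.
            reply = {"type": "transcript", "text": f"<{len(message)} bytes received>"}
            await websocket.send(json.dumps(reply))


async def main():
    async with websockets.serve(handle_client, "127.0.0.1", 8765):
        await asyncio.Future()  # Serve until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```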
Key Components & Features:
- Real-time Audio Analysis:
  - Live Transcription: Converts spoken dialogue into text instantly.
  - Contradiction Detection: Flags inconsistencies in the transcribed testimony on the fly.
  - Key Moment Highlighting: Identifies and marks significant conversational points for streamlined post-mortem review.
- Real-time Video Analysis:
  - Emotional Cue Detection: Processes facial expressions and body language to analyze emotional states.
  - Contextual Follow-up Questions: Integrates emotional cues with conversational content to suggest pertinent follow-up questions in real time, maximizing deposition effectiveness (a toy sketch of this idea follows the list).
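The follow-up suggestion logic itself is not shown in this README; the sketch below only illustrates the idea of pairing a transcript snippet with a detected emotion, and every rule and name in it is hypothetical rather than DepoSense's actual algorithm.

```python
# Toy illustration of emotion-aware follow-up suggestions.
# The mapping and function are hypothetical, not DepoSense's real logic.

FOLLOW_UPS = {
    "fear": "You seem uneasy; can you walk me through that moment again?",
    "angry": "What specifically about that event frustrates you?",
    "neutral": "Can you provide more detail about what happened next?",
}


def suggest_follow_up(transcript_snippet: str, dominant_emotion: str) -> str:
    """Pick a follow-up prompt based on the witness's dominant emotion."""
    question = FOLLOW_UPS.get(dominant_emotion, FOLLOW_UPS["neutral"])
    return f'Regarding "{transcript_snippet.strip()}": {question}'


print(suggest_follow_up("I left the office around 9 pm.", "fear"))
```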
This project is built for a hackathon, demonstrating a prototype for a multi-modal AI pipeline capable of processing streaming data from a web browser to a Python backend, and delivering real-time insights back to the user interface. This prototype leverages innovative open-source contributions, including VoiceStreamAI for real-time audio processing and transcription, and DeepFace for advanced facial analysis.
- Real-time Speech-to-Text (STT): Utilizes `faster-whisper` for efficient, high-quality audio transcription from live microphone input (a minimal usage sketch follows this list).
- Voice Activity Detection (VAD): Employs `pyannote.audio` to intelligently detect speech segments, ensuring only relevant audio is processed for transcription.
- Real-time Facial Emotion Detection: Integrates `DeepFace` to analyze facial expressions from a live webcam feed and classify emotions (e.g., happy, sad, angry, neutral).
- Multi-Modal Insights (Prototype): Lays the foundation for combining insights from both the audio and visual streams to derive a more comprehensive understanding of communication.
- Seamless Web Interface: A user-friendly web interface allows easy connection to the backend, live audio/video streaming control, and real-time display of transcription and facial emotion data.
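For the STT item above, a minimal stand-alone `faster-whisper` call looks roughly like the sketch below. The model size and the `output_audio.wav` test file are assumptions; adjust them to your setup.

```python
# Stand-alone faster-whisper example, independent of the VoiceStreamAI server.
from faster_whisper import WhisperModel

# "base" is an assumed model size; output_audio.wav is the sample file from the repo layout.
model = WhisperModel("base", device="cpu", compute_type="float32")
segments, info = model.transcribe("output_audio.wav", vad_filter=True)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```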
This outlines the main components and their organization within the DepoSense repository:
```
DepoSense/
├── VoiceStreamAI/                      # Integrated VoiceStreamAI repository (for audio transcription)
│   ├── client/                         # VoiceStreamAI's Frontend (HTML, CSS, JavaScript)
│   │   ├── index.html                  # Main web client application for audio ASR
│   │   ├── realtime-audio-processor.js # AudioWorklet for efficient audio processing
│   │   └── utils.js                    # Frontend utility functions
│   └── src/                            # VoiceStreamAI's Python Backend for ASR and VAD
│       ├── main.py                     # VoiceStreamAI server entry point
│       ├── asr/                        # ASR (Automatic Speech Recognition) pipeline components
│       │   ├── faster_whisper_asr.py   # Faster-Whisper ASR integration
│       │   └── asr_factory.py          # ASR pipeline factory
│       ├── vad/                        # VAD (Voice Activity Detection) pipeline components
│       │   ├── pyannote_vad.py         # Pyannote VAD integration
│       │   └── vad_factory.py          # VAD pipeline factory
│       └── config.py                   # VoiceStreamAI server configuration
├── video_analysis/                     # Custom Python module for real-time video analysis
│   └── video_analysis.py               # Script for webcam processing and emotion analysis (using DeepFace)
├── output_audio.wav                    # Example audio file for testing (if applicable)
├── .env                                # Environment variables (e.g., Google Cloud credentials)
├── .venv/                              # Python virtual environment for DepoSense project (created upon setup)
├── requirements.txt                    # Consolidated Python dependencies for the entire DepoSense project
└── README.md                           # This project's main documentation file
```
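The `video_analysis.py` script is not reproduced in this README; the sketch below is only an assumption about how a minimal OpenCV + DeepFace webcam loop of that kind can be structured.

```python
# Minimal webcam emotion-analysis loop with OpenCV + DeepFace.
# A sketch of the kind of processing video_analysis.py performs, not the script itself.
import cv2
from deepface import DeepFace

cap = cv2.VideoCapture(0)  # Default webcam
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # enforce_detection=False keeps the loop running when no face is visible.
        results = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
        for face in results:
            print("Dominant emotion:", face["dominant_emotion"])
        cv2.imshow("DepoSense webcam", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # Press 'q' to quit
            break
finally:
    cap.release()
    cv2.destroyAllWindows()
```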
This guide assumes you have Python 3.9+ and pip installed.
First, clone the DepoSense repository to your local machine:

```bash
git clone https://github.com/Mohammed-Faizzzz/DepoSense.git
```

Then, clone the VoiceStreamAI repository and move it into the DepoSense folder:

```bash
git clone https://github.com/alesaccoia/VoiceStreamAI.git   # If you're starting from the base repo
mv VoiceStreamAI DepoSense/                                 # Place it inside the DepoSense folder
cd DepoSense
```

Repeat the following steps in both directories: ./DepoSense and ./VoiceStreamAI.
It's highly recommended to use a Python virtual environment to manage dependencies.
```bash
python3 -m venv .venv        # Create a new virtual environment
source .venv/bin/activate    # Activate the virtual environment (on macOS/Linux)
# On Windows, use: .venv\Scripts\activate
```

Your terminal prompt should now show `(.venv)` at the beginning, indicating the environment is active.
Now, install the dependencies within your active virtual environment:
```bash
pip install -r requirements.txt
```

Note on `torch` for macOS (Apple Silicon): If you experience issues or want MPS acceleration, please refer to the official PyTorch website (https://pytorch.org/get-started/locally/) for the precise pip install command for your system.
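If you install a custom PyTorch build and want to confirm which accelerators it can see, a quick check (assuming `torch` is importable) is:

```python
import torch

# Report which accelerators this PyTorch build can use.
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())  # Apple Silicon GPU backend
```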
The `pyannote/segmentation` model used for VAD is gated and requires a Hugging Face token.

- Agree to User Conditions: Go to https://huggingface.co/pyannote/segmentation and accept the user conditions.
- Generate a Token: Go to your Hugging Face settings (https://huggingface.co/settings/tokens), click "New token", give it a name (e.g., `depo_sense`), and select the "read" role. Copy the token immediately!
- Log in via CLI (Recommended):

```bash
pip install huggingface_hub   # If not already installed
huggingface-cli login         # Paste your token when prompted
```

Alternatively, you can set an environment variable:

```bash
export HUGGING_FACE_HUB_TOKEN="hf_YOUR_TOKEN_HERE"
```
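To confirm the token actually grants access to the gated model, you can try loading it directly. This is only a convenience check and assumes pyannote.audio 3.x with the token exported as above:

```python
# Quick access check for the gated pyannote/segmentation model.
import os

from pyannote.audio import Model

token = os.environ.get("HUGGING_FACE_HUB_TOKEN")  # Or paste your "hf_..." token here
model = Model.from_pretrained("pyannote/segmentation", use_auth_token=token)
print("Loaded model:", model.__class__.__name__)
```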
Ensure your ASR and VAD models are configured to run on your CPU.
- Open `src/asr/faster_whisper_asr.py`:
  - Locate the `WhisperModel` initialization (around line 117).
  - Ensure `device="cpu"` and `compute_type="float32"` are explicitly set:

```python
self.asr_pipeline = WhisperModel(
    model_size,
    device="cpu",            # Ensure this is 'cpu' for non-GPU machines
    compute_type="float32",  # Use float32 for CPU efficiency
    **kwargs,
)
```
- Open `src/vad/pyannote_vad.py`:
  - Change the import (`VoiceActivityDetection` lives in `pyannote.audio.pipelines`):

```python
from pyannote.audio.pipelines import VoiceActivityDetection
```

  - In the `PyannoteVAD` class's `__init__`, ensure the model is moved to CPU:

```python
import torch  # Add this import if not present

# ... inside __init__
self.vad_pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
self.device = torch.device("cpu")
self.vad_pipeline.to(self.device)  # Move the pipeline to CPU
```
If you still see warnings about `pyannote.audio 0.0.1` after updating `pyannote.audio` to 3.2.0, delete the old cached model:

```bash
rm -rf ~/.cache/torch/pyannote/models--pyannote--segmentation
# Or, if it's in the Hugging Face Hub cache:
# rm -rf ~/.cache/huggingface/hub/models--pyannote--segmentation
```

With your `(.venv)` active, run the following from the DepoSense/ root directory:

```bash
python3 -m src.main --vad-args '{"auth_token": "HF_TOKEN"}'
```

You should see output indicating the WebSocket server is ready (e.g., `WebSocket server ready to accept secure connections on 127.0.0.1:8765`). Ideally, run the backend on a CUDA-capable machine to achieve real-time transcription.
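Before wiring up the browser client, you can verify the backend is reachable with a small `websockets` smoke test (a convenience sketch, not part of the repository; it assumes the server accepts plain `ws://` connections, as the web client does):

```python
# Connectivity smoke test for the backend WebSocket server.
import asyncio

import websockets


async def check():
    async with websockets.connect("ws://127.0.0.1:8765") as ws:
        await ws.ping()  # Round-trip to confirm the server is responsive
        print("WebSocket backend is reachable.")


asyncio.run(check())
```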
The frontend needs to be served by a local web server to bypass browser security restrictions (CORS).
Do NOT close the terminal running your Python backend server.
```bash
cd client/
python3 -m http.server 8000
```

You should see output like `Serving HTTP on :: port 8000 (http://[::]:8000/) ...`.
- Open your web browser.
- Go to `http://localhost:8000/index.html` (or simply `http://localhost:8000/`).
- In the web interface, ensure the "WebSocket Address" is set to `ws://localhost:8765`.
- Click "Connect" and then "Start Streaming".
- Allow microphone access when prompted.

You should now see real-time transcriptions appear! (And, if you implement the DeepFace integration, facial analysis as well.)
- Robust Multi-Modal Fusion: Develop advanced algorithms to combine STT sentiment and facial emotion for holistic sentiment analysis (a naive starting point is sketched after this list).
- Real-time DeepFace Integration: Implement streaming video frame processing to continuously analyze facial expressions in the backend.
- Speaker Diarization: Integrate `pyannote.audio`'s full diarization capabilities to identify different speakers in a conversation.
- Enhanced UI/UX: Further refine the user interface for clearer visualization of multi-modal insights and historical data.
- Speech-to-Text Library (Post-Hackathon): Package the robust real-time STT component into a reusable Python library for wider adoption and developer convenience.
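As a naive starting point for the Robust Multi-Modal Fusion item above, the sketch below blends a text sentiment score with a facial emotion label. The labels, weights, and scoring scheme are illustrative assumptions, not a designed algorithm:

```python
# Naive rule-based fusion of transcript sentiment and facial emotion.
# Labels, weights, and the scoring scheme are illustrative assumptions only.

POSITIVE_EMOTIONS = {"happy", "surprise"}
NEGATIVE_EMOTIONS = {"angry", "sad", "fear", "disgust"}


def fuse(text_sentiment: float, facial_emotion: str, text_weight: float = 0.6) -> float:
    """Blend a text sentiment score in [-1, 1] with a facial emotion label.

    Returns a combined score in [-1, 1]; positive means overall positive affect.
    """
    if facial_emotion in POSITIVE_EMOTIONS:
        face_score = 1.0
    elif facial_emotion in NEGATIVE_EMOTIONS:
        face_score = -1.0
    else:  # e.g. "neutral" or an unknown label
        face_score = 0.0
    return text_weight * text_sentiment + (1.0 - text_weight) * face_score


# Mildly positive wording but a fearful expression nets out slightly negative.
print(fuse(0.3, "fear"))  # 0.6 * 0.3 + 0.4 * (-1.0) = -0.22
```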