DepoSense: Real-time Multi-Modal Sentiment Analysis

Project Overview

DepoSense aims to revolutionize the traditional legal deposition process. Currently, depositions are recorded for post-mortem review, but the live session often lacks real-time analytical support. DepoSense enhances both the active deposition and the subsequent review by providing immediate, AI-driven insights.

Our software processes multi-modal streaming data from a web browser to a Python backend, delivering actionable intelligence back to the user interface in real-time.

Key Components & Features:

  • Real-time Audio Analysis:

    • Live Transcription: Converts spoken dialogue into text instantly.
    • Contradiction Detection: Flags inconsistencies in the transcribed script on the fly.
    • Key Moment Highlighting: Identifies and marks significant conversational points for streamlined post-mortem review.
  • Real-time Video Analysis:

    • Emotional Cue Detection: Processes facial expressions and body language to analyze emotional states.
    • Contextual Follow-up Questions: Integrates emotional cues with conversational content to suggest pertinent follow-up questions in real time, maximizing deposition effectiveness.

This project was built for a hackathon as a prototype of a multi-modal AI pipeline: it streams data from a web browser to a Python backend and delivers real-time insights back to the user interface. The prototype leverages open-source projects, including VoiceStreamAI for real-time audio processing and transcription, and DeepFace for facial analysis.

Key Features

  • Real-time Speech-to-Text (STT): Utilizes faster-whisper for efficient, high-quality audio transcription from live microphone input (a minimal usage sketch follows this list).
  • Voice Activity Detection (VAD): Employs pyannote.audio to intelligently detect speech segments, ensuring only relevant audio is processed for transcription.
  • Real-time Facial Emotion Detection: Integrates DeepFace to analyze facial expressions from a live webcam feed and classify emotions (e.g., happy, sad, angry, neutral).
  • Multi-Modal Insights (Prototype): The foundation is laid for combining insights from both audio and visual streams to derive a more comprehensive understanding of communication.
  • Seamless Web Interface: A user-friendly web interface allows easy connection to the backend, live audio/video streaming control, and real-time display of transcription and facial emotion data.
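
For context, here is a minimal, hedged sketch of faster-whisper transcription; the "tiny" model size is an illustrative choice, and output_audio.wav is the example file from the repository tree:

from faster_whisper import WhisperModel

# "tiny" keeps the download small; larger models trade speed for accuracy.
model = WhisperModel("tiny", device="cpu", compute_type="float32")

# Transcribe the example audio file shipped in the repository root.
segments, info = model.transcribe("output_audio.wav")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")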

Project Structure

This outlines the main components and their organization within the DepoSense repository:

DepoSense/
├── VoiceStreamAI/           # Integrated VoiceStreamAI repository (for audio transcription)
│   ├── client/                  # VoiceStreamAI's Frontend (HTML, CSS, JavaScript)
│   │   ├── index.html           # Main web client application for audio ASR
│   │   ├── realtime-audio-processor.js # AudioWorklet for efficient audio processing
│   │   └── utils.js             # Frontend utility functions
│   └── src/                     # VoiceStreamAI's Python Backend for ASR and VAD
│       ├── main.py              # VoiceStreamAI server entry point
│       ├── asr/                 # ASR (Automatic Speech Recognition) pipeline components
│       │   ├── faster_whisper_asr.py # Faster-Whisper ASR integration
│       │   └── asr_factory.py     # ASR pipeline factory
│       ├── vad/                 # VAD (Voice Activity Detection) pipeline components
│       │   ├── pyannote_vad.py    # Pyannote VAD integration
│       │   └── vad_factory.py     # VAD pipeline factory
│       └── config.py            # VoiceStreamAI server configuration
├── video_analysis/          # Custom Python module for real-time video analysis
│   └── video_analysis.py    # Script for webcam processing and emotion analysis (using DeepFace)
├── output_audio.wav         # Example audio file for testing (if applicable)
├── .env                     # Environment variables (e.g., Google Cloud credentials)
├── .venv/                   # Python virtual environment for DepoSense project (created upon setup)
├── requirements.txt         # Consolidated Python dependencies for the entire DepoSense project
└── README.md                # This project's main documentation file

How to Run DepoSense (Local Setup)

This guide assumes you have Python 3.9+ and pip installed.

1. Clone the Repository

First, clone the DepoSense repository to your local machine:

git clone https://github.com/Mohammed-Faizzzz/DepoSense.git
cd DepoSense

Then, clone the VoiceStreamAI repository inside the DepoSense folder, so it sits at DepoSense/VoiceStreamAI as shown in the project structure above:

git clone https://github.com/alesaccoia/VoiceStreamAI.git

2. Set Up the Python Backend

Repeat the following steps in both directories: the DepoSense root and DepoSense/VoiceStreamAI.

A. Create and Activate a Virtual Environment

It's highly recommended to use a Python virtual environment to manage dependencies.

python3 -m venv .venv        # Create a new virtual environment
source .venv/bin/activate    # Activate the virtual environment (on macOS/Linux)
                             # On Windows, use: .venv\Scripts\activate

Your terminal prompt should now show (.venv) at the beginning, indicating the environment is active.

B. Install Python Dependencies

Now, install the dependencies within your active virtual environment:

pip install -r requirements.txt

Note on torch for macOS (Apple Silicon): If you experience issues or want MPS acceleration, please refer to the official PyTorch website (https://pytorch.org/get-started/locally/) for the precise pip install command for your system.
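
As an illustrative example only (the exact command changes over time, so verify it against pytorch.org), the default macOS install at the time of writing is:

pip install torch torchvision torchaudio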

C. Configure Hugging Face Token for pyannote.audio

The pyannote/segmentation model used for VAD is gated and requires a Hugging Face token.

  1. Agree to User Conditions: Go to https://huggingface.co/pyannote/segmentation and accept the user conditions.
  2. Generate a Token: Go to your Hugging Face settings (https://huggingface.co/settings/tokens), click "New token", give it a name (e.g., depo_sense), and select the "read" role. Copy the token immediately!
  3. Log in via CLI (Recommended):
    pip install huggingface_hub # If not already installed
    huggingface-cli login       # Paste your token when prompted
    Alternatively, you can set an environment variable: export HUGGING_FACE_HUB_TOKEN="hf_YOUR_TOKEN_HERE".
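
To confirm your token grants access to the gated model, here is a minimal hedged check (assuming pyannote.audio 3.x; replace the placeholder token with your own):

from pyannote.audio import Model

# Attempts to load the gated segmentation model; this fails if the token is
# missing or the user conditions were not accepted.
model = Model.from_pretrained(
    "pyannote/segmentation",
    use_auth_token="hf_YOUR_TOKEN_HERE",
)
print("pyannote/segmentation loaded successfully")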

D. Adjust Backend Code for CPU (if running on a non-GPU machine such as a MacBook Pro)

Ensure your ASR and VAD models are configured to run on your CPU.

  1. Open src/asr/faster_whisper_asr.py:
    • Locate the WhisperModel initialization (around line 117).
    • Ensure device="cpu" and compute_type="float32" are explicitly set:
      self.asr_pipeline = WhisperModel(
          model_size,
          device="cpu", # Ensure this is 'cpu' for non-GPU machines
          compute_type="float32", # Use float32 for CPU efficiency
          **kwargs
      )
  2. Open src/vad/pyannote_vad.py:
    • Change the import:
      from pyannote.audio.pipelines import VoiceActivityDetection
    • In the PyannoteVAD class's __init__, ensure the model is moved to CPU:
      import torch # Add this import if not present
      # ... inside __init__
      self.vad_pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
      self.device = torch.device("cpu")
      self.vad_pipeline.to(self.device) # Move pipeline to CPU

E. Clear Pyannote Model Cache (Optional, if warnings persist)

If you still see warnings referencing a model trained with pyannote.audio 0.0.1 after upgrading pyannote.audio to 3.2.0, delete the old cached model:

rm -rf ~/.cache/torch/pyannote/models--pyannote--segmentation
# Or if it's in the Hugging Face Hub cache:
# rm -rf ~/.cache/huggingface/hub/models--pyannote--segmentation

F. Run the Backend Server

With your (.venv) active, from the DepoSense/VoiceStreamAI directory:

python3 -m src.main --vad-args '{"auth_token": "HF_TOKEN"}'   # Replace HF_TOKEN with your Hugging Face token

You should see output indicating the WebSocket server is ready (e.g., WebSocket server ready to accept secure connections on 127.0.0.1:8765). Ideally, run the backend on a CUDA-capable machine to achieve true real-time transcription.
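
To sanity-check the server from a script before opening the browser client, here is a minimal hedged example using the third-party websockets package (pip install websockets; the address matches the server output above):

import asyncio
import websockets

async def check_connection():
    # Connects to the backend's default address and confirms the handshake.
    async with websockets.connect("ws://localhost:8765") as ws:
        print("Connected to the DepoSense backend")

asyncio.run(check_connection())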

3. Set Up and Run the Frontend

The frontend must be served by a local web server rather than opened directly from the filesystem; browsers block microphone access (getUserMedia) and apply other security restrictions on file:// pages.

A. Open a New Terminal Window

Do NOT close the terminal running your Python backend server.

B. Navigate to the Frontend Directory

cd VoiceStreamAI/client/   # from the DepoSense/ root

C. Start a Simple HTTP Server

python3 -m http.server 8000

You should see output like Serving HTTP on :: port 8000 (http://[::]:8000/) ....

D. Access the Web Interface

  1. Open your web browser.
  2. Go to http://localhost:8000/index.html (or simply http://localhost:8000/).
  3. In the web interface, ensure the "WebSocket Address" is set to ws://localhost:8765.
  4. Click "Connect" and then "Start Streaming".
  5. Allow microphone access when prompted.

You should now see real-time transcriptions appear! (And if you implement the DeepFace integration, facial analysis as well).
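
For reference, here is a minimal, hedged sketch of a webcam emotion-analysis loop in the spirit of video_analysis/video_analysis.py (the actual script may differ; this assumes opencv-python and deepface are installed, and a DeepFace version that returns a list of results):

import cv2
from deepface import DeepFace

cap = cv2.VideoCapture(0)  # open the default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    try:
        # enforce_detection=False avoids an exception when no face is visible.
        results = DeepFace.analyze(frame, actions=["emotion"], enforce_detection=False)
        emotion = results[0]["dominant_emotion"]
        cv2.putText(frame, emotion, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    except Exception:
        pass  # skip frames that fail analysis
    cv2.imshow("DepoSense emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break
cap.release()
cv2.destroyAllWindows()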

Future Enhancements & Vision

  • Robust Multi-Modal Fusion: Develop advanced algorithms to combine STT sentiment and facial emotion for holistic sentiment analysis.
  • Real-time DeepFace Integration: Implement streaming video frame processing to continuously analyze facial expressions in the backend.
  • Speaker Diarization: Integrate pyannote.audio's full diarization capabilities to identify different speakers in a conversation (see the sketch after this list).
  • Enhanced UI/UX: Further refine the user interface for clearer visualization of multi-modal insights and historical data.
  • Speech-to-Text Library (Post-Hackathon): Package the robust real-time STT component into a reusable Python library for wider adoption and developer convenience.
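
As a rough illustration of the diarization idea above, a hedged sketch assuming pyannote.audio 3.x and a Hugging Face token with access to the gated pyannote/speaker-diarization-3.1 pipeline:

from pyannote.audio import Pipeline

# Requires accepting the pipeline's user conditions on Hugging Face first.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_YOUR_TOKEN_HERE",
)

# Reuses the example audio file from the repository root.
diarization = pipeline("output_audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")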
