Advanced AI/ML Backend Service for Emotion Recognition & Multi-Modal Analysis
Powering the SoulSync application with cutting-edge machine learning capabilities
SoulSync AI Engine is a sophisticated machine learning backend service that provides real-time emotion recognition and analysis across multiple modalities. Built with FastAPI and Python 3.11, it serves as the intelligent core of the SoulSync ecosystem, processing audio, video, and textual data to understand human emotions and preferences.
- 🎵 Music Emotion Analysis - Extract emotional features from audio files using advanced signal processing
- 👤 Human Emotion Detection - Multi-modal emotion recognition combining facial expressions and voice analysis
- NEW: Deep Learning CNN for face emotion recognition (7 emotions)
- NEW: LSTM-CNN hybrid for voice emotion recognition
- NEW: Intelligent fusion with confidence weighting
- 🗣️ Voice Emotion Recognition - Real-time analysis of vocal patterns and emotional states
- 🌐 Language Detection - Automatic language identification from audio inputs
- 📷 Face Detection & Analysis - Advanced facial recognition using OpenCV
- 🔄 Real-time Processing - Asynchronous processing for optimal performance
- 🌍 Cross-Origin Support - CORS-enabled API for seamless frontend integration
🚀 Enhanced Emotion Detection (March 2026)
- Upgraded from basic heuristics to state-of-the-art deep learning models
- Face Emotion: 4-block CNN with BatchNorm (65-70% accuracy on FER2013)
- Voice Emotion: Hybrid CNN-BiLSTM with 180-dimensional audio features
- Intelligent Fusion: Confidence-weighted multi-modal integration
- Low Latency: All models run locally (150-300ms total processing)
- 7 Emotions: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral
- See EMOTION_MODEL_IMPROVEMENTS.md for details
- Backend Framework: FastAPI (async/await support)
- Python Version: 3.11.14 (optimized for ML packages)
- Environment Management: Conda
- API Documentation: OpenAPI/Swagger + ReDoc
- Audio Processing: Librosa, SoundFile
- Computer Vision: OpenCV, MediaPipe
- Data Science: NumPy, Pandas, Scikit-learn
- Feature Extraction: MFCC, Chroma, Spectral Analysis
- Model Support: Ready for PyTorch, TensorFlow, ONNX
- Database: PyMongo (MongoDB integration)
- File Handling: aiofiles (async file operations)
- HTTP Client: HTTPx
- Configuration: python-dotenv
- Deployment: Uvicorn ASGI server
```
ai_engine/
│
├── 📱 app/                              # Main application directory
│   ├── 🚀 main.py                       # FastAPI application entry point
│   ├── 📍 routers/                      # API route definitions
│   │   ├── 🎵 music_router.py           # Music emotion analysis endpoints
│   │   ├── 👤 human_router.py           # Human emotion detection endpoints
│   │   └── 🗣️ language_router.py        # Language detection endpoints
│   │
│   ├── 🧠 services/                     # Business logic & ML services
│   │   ├── 🎵 music_emotion_service.py  # Audio feature extraction & analysis
│   │   ├── 👤 human_emotion_service.py  # Multi-modal emotion recognition
│   │   ├── 🗣️ voice_emotion_service.py  # Voice pattern analysis
│   │   ├── 📷 face_emotion_service.py   # Facial expression analysis
│   │   └── 🌐 language_service.py       # Language identification
│   │
│   ├── 📊 models/                       # Data models & schemas
│   ├── 🛠️ utils/                        # Utility functions
│   └── ⚙️ config/                       # Configuration files
│       └── settings.py                  # Application settings
│
├── 📋 requirements.txt                  # Python dependencies
├── 🚀 activate.sh                       # Environment activation script
├── 🤖 face_detection_short_range.tflite # MediaPipe model file
└── 📖 README.md                         # This comprehensive guide
```
```bash
# 1. Install all dependencies
./install.sh
# OR manually:
# pip install -r requirements.txt

# 2. Setup emotion detection models
python setup_models.py

# 3. Verify installation
python test_system.py

# 4. Start the server
python -m uvicorn app.main:app --reload
```

```bash
# Make the activation script executable
chmod +x activate.sh

# Activate environment and see all available commands
source activate.sh

# Start the development server
python app/main.py
```

```bash
# Activate the conda environment
conda activate soulsync_ai

# Navigate to app directory
cd app

# Start the server
python main.py

# Alternative: Use uvicorn directly
uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

```bash
# Production server with optimized settings
uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 4
```

- Conda/Miniconda installed on your system
- Python 3.11+ (handled by conda environment)
- macOS/Linux/Windows compatibility
| Setting | Value |
|---|---|
| Environment Name | soulsync_ai |
| Python Version | 3.11.14 |
| Location | /Users/vignesvm/miniconda3/envs/soulsync_ai |
| Package Manager | pip (within conda env) |
```
fastapi>=0.108.0         # Modern, high-performance web framework
uvicorn>=0.24.0          # Lightning-fast ASGI server
pydantic>=2.5.0          # Data validation using Python type hints
python-multipart>=0.0.6  # Multipart form-data support
httpx>=0.25.0            # Async HTTP client

numpy>=1.24.0            # Numerical computing foundation
pandas>=2.1.0            # Data manipulation and analysis
scikit-learn>=1.3.0      # Machine learning algorithms

librosa                  # Audio analysis library
soundfile                # Audio file I/O
opencv-python            # Computer vision library

pymongo>=4.6.0           # MongoDB driver
python-dotenv>=1.0.0     # Environment variable management
aiofiles>=23.2.0         # Asynchronous file operations
```

For enhanced capabilities, install these packages as needed:
```bash
# Deep Learning Frameworks
pip install torch torchaudio  # PyTorch ecosystem
pip install transformers      # Hugging Face transformers
pip install tensorflow        # Google's ML framework

# Advanced Computer Vision & Audio
pip install mediapipe         # Google's on-device ML pipelines
pip install onnxruntime       # ONNX model inference
```

- Development: `http://localhost:8000`
- Production: `https://your-domain.com/api/v1`
| Method | Endpoint | Description | Response |
|---|---|---|---|
| GET | `/` | Server status check | `{"message": "SoulSync AI Engine Running"}` |
| GET | `/health` | Health monitoring | `{"status": "healthy", "service": "ai_engine"}` |
| Method | Endpoint | Description | Parameters |
|---|---|---|---|
| POST | `/detect-song-emotion` | Analyze musical emotion | `file`: UploadFile (audio) |

Request Example:

```bash
curl -X POST "http://localhost:8000/detect-song-emotion" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@song.mp3"
```

Response Example:
```json
{
  "emotion": "happy",
  "confidence": 0.87,
  "features": {
    "tempo": 128.5,
    "energy": 0.76,
    "valence": 0.82
  }
}
```

| Method | Endpoint | Description | Parameters |
|---|---|---|---|
| POST | `/detect-human-emotion` | Multi-modal emotion analysis | `image`: UploadFile, `audio`: UploadFile |

Request Example:

```bash
curl -X POST "http://localhost:8000/detect-human-emotion" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "image=@face.jpg" \
  -F "audio=@voice.wav"
```

Response Example:
```json
{
  "overall_emotion": "neutral",
  "face_emotion": {
    "emotion": "neutral",
    "confidence": 0.78
  },
  "voice_emotion": {
    "emotion": "calm",
    "confidence": 0.65
  },
  "combined_confidence": 0.72
}
```

| Method | Endpoint | Description | Parameters |
|---|---|---|---|
| POST | `/detect-language` | Audio language identification | `file`: UploadFile (audio) |
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
```python
import requests

# Analyze emotion in a music file
with open('song.mp3', 'rb') as audio_file:
    response = requests.post(
        'http://localhost:8000/detect-song-emotion',
        files={'file': audio_file}
    )
result = response.json()
print(f"Detected emotion: {result['emotion']}")
```

```python
import requests

# Analyze human emotion using face and voice
with open('face.jpg', 'rb') as img_file, open('voice.wav', 'rb') as audio_file:
    response = requests.post(
        'http://localhost:8000/detect-human-emotion',
        files={
            'image': img_file,
            'audio': audio_file
        }
    )
result = response.json()
print(f"Overall emotion: {result['overall_emotion']}")
```

```javascript
// Frontend JavaScript example
const formData = new FormData();
formData.append('file', audioFile);

fetch('http://localhost:8000/detect-language', {
  method: 'POST',
  body: formData
})
  .then(response => response.json())
  .then(data => console.log('Detected language:', data.language));
```

This section details the mathematical models, formulas, and algorithms powering SoulSync's emotion recognition capabilities.
MFCCs are the primary features for audio emotion analysis, capturing the spectral envelope of sound.
Step 1: Pre-emphasis Filter

$$y[n] = x[n] - \alpha \, x[n-1]$$

where $\alpha \approx 0.97$ is the pre-emphasis coefficient.

Step 2: Frame Segmentation

Split the signal into overlapping frames and apply a Hamming window:

$$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right)$$

where $N$ is the frame length.

Step 3: Fast Fourier Transform (FFT)

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N}$$

Step 4: Mel Filter Bank

Convert frequency to Mel scale:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

Apply triangular filters $H_m[k]$ and take the log energy of each band:

$$S_m = \log \sum_{k} |X[k]|^2 \, H_m[k]$$

Step 5: Discrete Cosine Transform (DCT)

$$c_i = \sum_{m=1}^{M} S_m \cos\left(\frac{\pi i (m - 0.5)}{M}\right)$$

Typically extract 13-40 MFCC coefficients.
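The pre-emphasis and Mel-scale steps above can be sketched in plain NumPy (a minimal illustration only; the service's actual feature extraction uses Librosa):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """Step 1: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def hz_to_mel(f):
    """Step 4: convert frequency (Hz) to the Mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# A 1 kHz sine sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 1000 * t)

y = pre_emphasis(x)
print(y.shape)          # same length as the input
print(hz_to_mel(1000))  # 1000 Hz lands near 1000 mel by construction
```

Librosa wraps the whole pipeline in one call (`librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)`), which is what a service like this would use in practice.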
Represent the 12 pitch classes of the musical octave:

$$\text{chroma}[p] = \sum_{k \,:\, \text{pitch}(k) \bmod 12 \,=\, p} |X[k]|^2$$

where $p \in \{0, 1, \ldots, 11\}$ indexes the pitch class and $\text{pitch}(k)$ maps FFT bin $k$ to its nearest semitone.

Spectral Centroid (brightness):

$$C = \frac{\sum_{k} f_k \, |X[k]|}{\sum_{k} |X[k]|}$$

Spectral Rolloff (frequency below which 85% of energy is contained):

$$\sum_{k=0}^{R} |X[k]|^2 = 0.85 \sum_{k=0}^{N-1} |X[k]|^2$$

Zero Crossing Rate:

$$\text{ZCR} = \frac{1}{N-1} \sum_{n=1}^{N-1} \mathbb{1}\left[\operatorname{sign}(x[n]) \neq \operatorname{sign}(x[n-1])\right]$$

Onset Strength Envelope:

$$O[t] = \sum_{k} \max\left(0,\; |X_t[k]| - |X_{t-1}[k]|\right)$$

Autocorrelation for Tempo:

$$R[\tau] = \sum_{t} O[t] \, O[t+\tau]$$

Tempo (BPM) is extracted from peaks in $R[\tau]$: the dominant lag $\tau^*$ (in frames) is converted to beats per minute using the frame rate.
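As an illustration of the spectral descriptors above, here is a plain-NumPy sketch (Librosa provides production versions of all of these):

```python
import numpy as np

def spectral_centroid(magnitudes, freqs):
    """Magnitude-weighted mean frequency ("brightness")."""
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)

def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(x)
    return np.mean(signs[1:] != signs[:-1])

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)            # one second of a 440 Hz tone
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / sr)

print(spectral_centroid(spectrum, freqs))  # close to 440 for a pure tone
print(zero_crossing_rate(x))               # close to 2 * 440 / 8000 = 0.11
```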
The convolution operation:

$$h = \sigma(W * x + b)$$

where:

- $W$ = learnable kernel weights
- $b$ = bias term
- $\sigma$ = activation function (ReLU)

ReLU Activation:

$$\text{ReLU}(z) = \max(0, z)$$

Normalize activations across mini-batch:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, and $\gamma$, $\beta$ are learnable scale and shift parameters.

Reduces spatial dimensions while retaining important features:

$$y_{i,j} = \max_{(p,q) \in \mathcal{R}_{i,j}} x_{p,q}$$

Final layer outputs probability distribution over 7 emotions:

$$p_c = \frac{e^{z_c}}{\sum_{c'=1}^{7} e^{z_{c'}}}$$

where $z_c$ is the logit and $p_c$ the predicted probability for emotion class $c$.
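The softmax output layer transcribes directly to NumPy (a sketch; the real head lives inside the trained model):

```python
import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def softmax(z):
    """Numerically stable softmax over the 7 emotion logits."""
    z = z - np.max(z)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([0.2, -1.3, 0.1, 2.4, 0.5, -0.7, 1.1])
probs = softmax(logits)
print(EMOTIONS[int(np.argmax(probs))])  # "happy" -- the largest logit wins
print(float(probs.sum()))               # 1.0
```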
Combines 180 audio features:
- 40 MFCCs + deltas + delta-deltas = 120 features
- 20 Chroma features
- 20 Spectral features (centroid, rolloff, contrast, etc.)
- 20 Additional prosodic features
1D convolutions over time dimension:

$$h_t = \sigma\left(\sum_{\tau} W[\tau]\, x_{t+\tau} + b\right)$$

Forward LSTM:

$$\overrightarrow{h}_t = \overrightarrow{\text{LSTM}}(x_t, \overrightarrow{h}_{t-1})$$

Backward LSTM computes $\overleftarrow{h}_t = \overleftarrow{\text{LSTM}}(x_t, \overleftarrow{h}_{t+1})$ over the reversed sequence.

Combined hidden state:

$$h_t = [\overrightarrow{h}_t \,;\, \overleftarrow{h}_t]$$

Context vector $c$ from attention-weighted pooling over time:

$$\alpha_t = \frac{\exp(e_t)}{\sum_{t'} \exp(e_{t'})}, \qquad c = \sum_t \alpha_t h_t$$
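The attention pooling step can be sketched as follows (shapes and the scoring form $e_t = v \cdot h_t$ are illustrative assumptions; the actual layer is part of the trained CNN-BiLSTM):

```python
import numpy as np

def attention_pool(H, v):
    """Collapse a (T, d) sequence of BiLSTM states into one context vector.

    H : (T, d) hidden states h_t
    v : (d,)   learned scoring vector (assumed scoring e_t = v . h_t)
    """
    scores = H @ v                          # e_t for each time step
    alphas = np.exp(scores - scores.max())  # stable softmax numerator
    alphas = alphas / alphas.sum()          # attention weights sum to 1
    return alphas, alphas @ H               # weights and context c = sum_t a_t h_t

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 8))   # 50 time steps, 8-dim hidden states
v = rng.normal(size=8)
alphas, c = attention_pool(H, v)
print(alphas.shape, c.shape)   # (50,) (8,)
```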
For face emotion confidence $c_f$ and voice emotion confidence $c_v$, with base weights $w_f$ and $w_v$:

Normalized Weights:

$$\tilde{w}_f = \frac{w_f\, c_f}{w_f c_f + w_v c_v}, \qquad \tilde{w}_v = \frac{w_v\, c_v}{w_f c_f + w_v c_v}$$

where $w_f = 0.65$ and $w_v = 0.35$ are the configured face and voice weights.

Emotion Score Fusion:

$$s(e) = \tilde{w}_f\, p_f(e) + \tilde{w}_v\, p_v(e)$$

for each emotion class $e$, with $p_f$ and $p_v$ the per-modality probabilities.

Final Emotion:

$$\hat{e} = \operatorname*{arg\,max}_e \, s(e)$$

Combined Confidence:

$$c = \max_e \, s(e)$$

If one modality fails (confidence < 0.3), the result falls back to the remaining modality alone.
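The fusion rules above translate to a short sketch (the function name is illustrative; `0.65`/`0.35` are the `FACE_WEIGHT`/`VOICE_WEIGHT` defaults from `settings.py`):

```python
def fuse_emotions(face_probs, face_conf, voice_probs, voice_conf,
                  w_face=0.65, w_voice=0.35, min_conf=0.3):
    """Confidence-weighted fusion of per-class emotion scores (dicts)."""
    # Fallback: if one modality is unreliable, trust the other alone
    if face_conf < min_conf:
        return max(voice_probs, key=voice_probs.get), voice_conf
    if voice_conf < min_conf:
        return max(face_probs, key=face_probs.get), face_conf

    # Normalize the base weights by each modality's confidence
    total = w_face * face_conf + w_voice * voice_conf
    wf = w_face * face_conf / total
    wv = w_voice * voice_conf / total

    # Weighted per-class score, then argmax
    scores = {e: wf * face_probs.get(e, 0.0) + wv * voice_probs.get(e, 0.0)
              for e in set(face_probs) | set(voice_probs)}
    best = max(scores, key=scores.get)
    return best, scores[best]

face = {"neutral": 0.78, "happy": 0.22}
voice = {"neutral": 0.65, "sad": 0.35}
print(fuse_emotions(face, 0.78, voice, 0.65))  # neutral has the highest fused score
```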
Voice emotions are mapped to face emotion space:
| Voice Emotion | Face Emotion |
|---|---|
| calm, neutral | neutral |
| happy, excited | happy |
| sad | sad |
| angry, frustrated | angry |
| fearful | fear |
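The mapping table reads directly as a dictionary lookup (a sketch; the default-to-neutral behavior for unmapped labels is an assumption):

```python
# Voice-emotion labels projected into the 7-class face emotion space
VOICE_TO_FACE = {
    "calm": "neutral", "neutral": "neutral",
    "happy": "happy", "excited": "happy",
    "sad": "sad",
    "angry": "angry", "frustrated": "angry",
    "fearful": "fear",
}

def map_voice_emotion(label):
    # Unmapped labels fall back to "neutral" (an assumption for this sketch)
    return VOICE_TO_FACE.get(label, "neutral")

print(map_voice_emotion("excited"))  # happy
```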
Music emotions are represented in 2D valence-arousal space.

Energy Calculation:

$$E = \frac{1}{N} \sum_{n=0}^{N-1} x[n]^2$$

Spectral Flux (measure of change):

$$F_t = \sum_{k} \left(|X_t[k]| - |X_{t-1}[k]|\right)^2$$

The emotion label is chosen based on the quadrant in valence-arousal space:

- High valence, high arousal → happy/excited
- Low valence, high arousal → angry/tense
- Low valence, low arousal → sad
- High valence, low arousal → calm/relaxed

Median Filtering:

$$\hat{y}[t] = \operatorname{median}\left(y[t-k], \ldots, y[t+k]\right)$$

where $2k+1$ is the smoothing window length.
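Temporal smoothing by median filtering, sketched in NumPy (edge frames are handled by repeating the boundary values, an assumption of this sketch):

```python
import numpy as np

def median_smooth(values, k=1):
    """Median filter with window 2k+1 over a 1-D sequence of scores."""
    v = np.asarray(values, dtype=float)
    padded = np.pad(v, k, mode="edge")  # repeat edge values at the boundaries
    windows = np.lib.stride_tricks.sliding_window_view(padded, 2 * k + 1)
    return np.median(windows, axis=1)

# A single-frame spike gets suppressed
raw = [0.2, 0.2, 0.9, 0.2, 0.2]
print(median_smooth(raw, k=1))  # [0.2 0.2 0.2 0.2 0.2]
```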
Per-class accuracy:

$$\text{Acc}_c = \frac{\text{correct predictions for class } c}{\text{total samples of class } c}$$
Default hyperparameters (Adam optimizer):

$$\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where $\eta$ is the learning rate and $\hat{m}_t$, $\hat{v}_t$ are the bias-corrected first- and second-moment estimates ($\beta_1 = 0.9$, $\beta_2 = 0.999$).

During training, dropout randomly zeroes activations:

$$\tilde{h}_i = \frac{m_i\, h_i}{1 - p}, \qquad m_i \sim \text{Bernoulli}(1 - p)$$

Typical dropout rate: $p = 0.5$ in the fully connected layers.
For Audio:

- Time stretching: $y[n] = x[\alpha \cdot n]$ where $\alpha \in [0.8, 1.2]$
- Pitch shifting: shift by $\pm 2$ semitones
- Add noise: $\tilde{x}[n] = x[n] + \epsilon[n]$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$

For Images:

- Random rotation: $\theta \in [-15°, 15°]$
- Horizontal flip with probability 0.5
- Brightness/contrast adjustment: $I' = \alpha I + \beta$
Configuration is managed through `app/config/settings.py`:

```python
class Settings:
    APP_NAME = "SoulSync AI Engine"
    VERSION = "1.0.0"

    # Audio processing limits
    MAX_AUDIO_DURATION = 30  # seconds
    MAX_VOICE_DURATION = 10  # seconds

    # Emotion analysis weights
    FACE_WEIGHT = 0.65   # Face emotion contribution
    VOICE_WEIGHT = 0.35  # Voice emotion contribution
```

Create a `.env` file in the root directory:
```env
# Server Configuration
HOST=0.0.0.0
PORT=8000
DEBUG=True

# Database Configuration
MONGODB_URL=mongodb://localhost:27017
DATABASE_NAME=soulsync

# ML Model Configuration
FACE_MODEL_PATH=face_detection_short_range.tflite
MIN_DETECTION_CONFIDENCE=0.5
MIN_SUPPRESSION_THRESHOLD=0.3

# Logging Configuration
LOG_LEVEL=INFO
```
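These variables would typically be read at startup. A minimal sketch using `os.environ` with defaults (the project lists python-dotenv, whose `load_dotenv()` populates the environment from the `.env` file first; the function name here is illustrative):

```python
import os

def get_settings():
    """Read server settings from the environment, with safe defaults."""
    return {
        "host": os.getenv("HOST", "0.0.0.0"),
        "port": int(os.getenv("PORT", "8000")),
        "debug": os.getenv("DEBUG", "False").lower() == "true",
        "min_detection_confidence": float(os.getenv("MIN_DETECTION_CONFIDENCE", "0.5")),
    }

os.environ["PORT"] = "9000"     # simulate a .env override
print(get_settings()["port"])   # 9000, parsed as an int
```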
1. New API Endpoints

   ```bash
   # Create new router in app/routers/
   touch app/routers/new_feature_router.py
   # Register in main.py: app.include_router(new_feature_router)
   ```

2. New ML Services

   ```bash
   # Create service in app/services/
   touch app/services/new_ml_service.py
   # Follow the existing pattern with extract_features() and predict() methods
   ```

3. Data Models

   ```bash
   # Define Pydantic models in app/models/
   touch app/models/new_model.py
   ```
- Python Style: Follow PEP 8 guidelines
- Type Hints: Use type annotations for all functions
- Async/Await: Prefer async functions for I/O operations
- Error Handling: Implement comprehensive exception handling
- Documentation: Include docstrings for all classes and functions
```bash
# Install testing dependencies
pip install pytest pytest-asyncio httpx

# Run tests
pytest tests/

# Run with coverage
pytest --cov=app tests/
```

- Use async/await for file operations
- Implement caching for frequently used models
- Optimize audio/video processing pipelines
- Monitor memory usage with large files
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

- Use a reverse proxy (nginx) for static files
- Implement load balancing for multiple instances
- Set up proper logging and monitoring
- Configure SSL/TLS certificates
- Implement rate limiting and authentication
```bash
# Check active environment
conda info --envs

# Recreate environment if corrupted
conda remove --name soulsync_ai --all
conda create -n soulsync_ai python=3.11 -y
conda activate soulsync_ai
pip install -r requirements.txt
```

```bash
# Force binary installation (avoid compilation)
pip install --only-binary=:all: package_name

# Clear pip cache
pip cache purge

# Use conda for problematic packages
conda install -c conda-forge package_name
```

```bash
# Install additional audio codecs
pip install ffmpeg-python

# For macOS audio issues
brew install ffmpeg
```

```bash
# MediaPipe installation issues
pip uninstall mediapipe
pip install mediapipe --no-deps
pip install opencv-python
```
1. Enable Debug Logging

   ```python
   import logging
   logging.basicConfig(level=logging.DEBUG)
   ```

2. Check File Permissions

   ```bash
   chmod 755 activate.sh
   ls -la temp_*  # Check temporary file creation
   ```

3. Monitor Resource Usage

   ```bash
   # Memory usage
   ps aux | grep python

   # Disk space
   df -h
   ```
We welcome contributions to the SoulSync AI Engine! Here's how you can help:
1. Fork the Repository

   ```bash
   git clone https://github.com/yourusername/soulsync-ai-engine.git
   cd soulsync-ai-engine
   ```

2. Set Up Development Environment

   ```bash
   source activate.sh
   pip install -r requirements.txt
   ```

3. Create Feature Branch

   ```bash
   git checkout -b feature/amazing-new-feature
   ```
- 🐛 Bug Fixes - Help us squash bugs
- ✨ New Features - Add exciting capabilities
- 📚 Documentation - Improve our guides
- 🧪 Tests - Increase code coverage
- 🎨 UI/UX - Enhance user experience
- ⚡ Performance - Optimize algorithms
- Make your changes
- Add tests for new functionality
- Update documentation
- Submit a pull request with detailed description
⭐ If you find this project helpful, please give it a star! ⭐
Built with ❤️ by the SoulSync Team