A patient-focused speech-to-intent assistive system designed for stroke/aphasia patients. The system converts short audio recordings into actionable intents, enabling caregivers to quickly understand and respond to patient needs.
Built for the Microsoft Imagine Cup - powered by HuBERT & Wav2Vec2 models on Azure ML
## Table of Contents

- Overview
- Features
- System Architecture
- Supported Intents
- Tech Stack
- Getting Started
- API Documentation
- Audio Requirements
- Project Structure
- Configuration
## Overview

This assistive system enables patients with speech difficulties to communicate their needs through simple voice commands. The system:

- Records short audio clips (1-3 seconds) from patients
- Processes audio using HuBERT/Wav2Vec2 deep learning models on Azure ML
- Detects intent using embedding-based classification with cosine similarity (sketched after this list)
- Displays actionable buttons for caregivers to confirm and respond
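To make the classification step concrete, here is a minimal sketch using NumPy (already a backend dependency). The centroid vectors and the 768-dimension size are illustrative placeholders for the reference embeddings the backend accumulates in `intent_embeddings.json`; real embeddings come from the HuBERT/Wav2Vec2 endpoints.

```python
# Minimal sketch of embedding-based intent classification with cosine
# similarity. INTENT_CENTROIDS stands in for the per-intent reference
# embeddings stored in intent_embeddings.json.
import numpy as np

INTENT_CENTROIDS: dict[str, np.ndarray] = {
    "WATER": np.random.rand(768),  # placeholder vectors
    "HELP": np.random.rand(768),
    "PAIN": np.random.rand(768),
}

def classify(embedding: np.ndarray) -> tuple[str, float]:
    """Return the intent whose stored embedding is most similar."""
    best_intent, best_score = "UNKNOWN", -1.0
    for intent, centroid in INTENT_CENTROIDS.items():
        score = float(
            np.dot(embedding, centroid)
            / (np.linalg.norm(embedding) * np.linalg.norm(centroid))
        )
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent, best_score
```

In the real system, each confirmed recording presumably adds another reference embedding, which is how the learning loop improves accuracy over time.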
## Features

- Real-time Speech Processing - Process patient speech in under 2 seconds
- Dual Model Architecture - HuBERT primary with Wav2Vec2 fallback for reliability (see the sketch after this list)
- Learning Loop - System learns from confirmed intents to improve accuracy
- Neumorphic UI - Accessible, visually clear interface with dark/light themes
- Low Confidence Handling - Shows alternatives when the intent is unclear
- Visual Feedback - Animated system diagram showing processing stages
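The dual-model behavior can be sketched as a simple try-primary-then-fallback loop using httpx. The request payload shape, header, and function signature below are assumptions; only the endpoint/key pairs correspond to the `.env` variables described under Getting Started.

```python
# Sketch of the HuBERT-primary / Wav2Vec2-fallback pattern using httpx.
# The payload and auth header shapes are assumptions, not the project's
# actual Azure ML contract.
import httpx

async def get_embedding(audio_bytes: bytes, hubert_url: str, hubert_key: str,
                        wave2vec_url: str, wave2vec_key: str) -> tuple[dict, str]:
    """Try the HuBERT endpoint first; fall back to Wav2Vec2 on failure."""
    async with httpx.AsyncClient(timeout=120.0) as client:
        for url, key, model in ((hubert_url, hubert_key, "HuBERT"),
                                (wave2vec_url, wave2vec_key, "Wav2Vec2")):
            try:
                resp = await client.post(
                    url,
                    content=audio_bytes,
                    headers={"Authorization": f"Bearer {key}"},
                )
                resp.raise_for_status()
                return resp.json(), model  # embedding payload + model used
            except httpx.HTTPError:
                continue  # primary failed; try the fallback
    raise RuntimeError("Both ML endpoints are unreachable")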
## System Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│                 │     │                 │     │                 │
│    React App    │────▶│     FastAPI     │────▶│    Azure ML     │
│   (Frontend)    │     │     Backend     │     │    (HuBERT/     │
│                 │◀────│                 │◀────│    Wav2Vec2)    │
│                 │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │
        │ Audio Recording       │ Intent Detection
        │ (Web Audio API)       │ + Embedding Storage
        │                       │
        ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│ 16kHz Mono WAV  │     │    Intent DB    │
│  Max 3 seconds  │     │  (JSON-based)   │
└─────────────────┘     └─────────────────┘
```
### Data Flow

1. Patient presses "Speak" on the frontend interface
2. Frontend records audio using the Web Audio API (auto-stops at 3 seconds)
3. Audio is converted to 16 kHz mono WAV format
4. Frontend sends the WAV file to `POST /api/audio`
5. Backend calls Azure ML for speech-to-intent processing (a simplified sketch of this endpoint follows the list)
6. Backend returns the intent with a confidence score and UI options
7. Patient or caregiver confirms the detected action
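Below is a heavily simplified sketch of what the `POST /api/audio` route could look like. The real route in `Backend/app/routes/audio.py` calls Azure ML and the embedding classifier; here that pipeline is stubbed out so the example stays self-contained.

```python
# Hypothetical, heavily simplified version of the POST /api/audio route.
# The ML step is replaced with a stub; the real implementation adds the
# Azure ML calls, Wav2Vec2 fallback, and embedding storage.
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

MAX_AUDIO_SIZE_BYTES = 1_048_576  # 1 MB, per the configuration table

def classify_stub(audio_bytes: bytes) -> tuple[str, float]:
    """Stand-in for the Azure ML embedding + cosine-similarity pipeline."""
    return "WATER", 0.89

@app.post("/api/audio")
async def process_audio(audio: UploadFile = File(...)):
    data = await audio.read()
    if len(data) > MAX_AUDIO_SIZE_BYTES:
        raise HTTPException(status_code=413, detail="Audio file exceeds 1 MB")

    intent, confidence = classify_stub(data)
    return {
        "intent": intent,
        "confidence": confidence,
        "ui_options": ["Confirm Water", "Cancel"],
        "next_action": "show_buttons",
    }
```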
## Supported Intents

| Intent | Description | UI Options |
|---|---|---|
| `HELP` | Patient needs assistance | Confirm Help, Cancel |
| `EMERGENCY` | Urgent medical situation | Cancel Emergency |
| `WATER` | Patient needs hydration | Confirm Water, Cancel |
| `PAIN` | Patient is in discomfort | Confirm Pain, Where?, Cancel |
| `BATHROOM` | Toileting needs | Confirm Bathroom, Cancel |
| `TIRED` | Rest/sleep needed | Confirm Rest, Cancel |
| `COLD` | Temperature: feels cold | Confirm Cold, Cancel |
| `HOT` | Temperature: feels hot | Confirm Hot, Cancel |
| `YES` | Affirmative confirmation | OK |
| `NO` | Negative/cancellation | OK |
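One natural way to encode this table in the backend is a plain mapping from intent to UI options. This is an illustrative sketch, not the actual structure in `intent_logic.py`:

```python
# Hypothetical mapping of intents to caregiver UI options, mirroring the
# table above; the project's actual definitions live in
# Backend/app/services/intent_logic.py.
INTENT_UI_OPTIONS: dict[str, list[str]] = {
    "HELP": ["Confirm Help", "Cancel"],
    "EMERGENCY": ["Cancel Emergency"],
    "WATER": ["Confirm Water", "Cancel"],
    "PAIN": ["Confirm Pain", "Where?", "Cancel"],
    "BATHROOM": ["Confirm Bathroom", "Cancel"],
    "TIRED": ["Confirm Rest", "Cancel"],
    "COLD": ["Confirm Cold", "Cancel"],
    "HOT": ["Confirm Hot", "Cancel"],
    "YES": ["OK"],
    "NO": ["OK"],
}
```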
## Tech Stack

### Backend

- FastAPI - Modern Python web framework
- Azure ML - HuBERT & Wav2Vec2 model hosting
- FAISS - Vector similarity search for intent matching
- NumPy - Embedding computations
- httpx - Async HTTP client for Azure ML calls
### Frontend

- React 19 - UI framework
- Vite - Build tool and dev server
- Tailwind CSS 4 - Utility-first styling
- Framer Motion - Animations and transitions
- Radix UI - Accessible dialog and toast components
- Axios - HTTP client
## Getting Started

### Prerequisites

- Python 3.11+
- Node.js 18+ and npm
- Azure ML endpoints configured with the HuBERT/Wav2Vec2 models
### Backend Setup

1. Navigate to the backend directory:

   ```bash
   cd Backend
   ```

2. Create and activate a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate   # Linux/Mac
   # or
   .\venv\Scripts\activate    # Windows
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables by creating a `.env` file in the `Backend` directory:

   ```env
   # Azure ML - HuBERT (Primary)
   REST_END_POINT__HUBERT=https://your-hubert-endpoint.azureml.net/score
   PRIMARY_KEY__HUBERT=your_hubert_api_key

   # Azure ML - Wav2Vec (Fallback)
   REST_END_POINT__WAVE2VEC=https://your-wav2vec-endpoint.azureml.net/score
   PRIMARY_KEY__WAVE2VEC=your_wav2vec_api_key

   # Server Config (Optional)
   HOST=127.0.0.1
   PORT=8000
   DEBUG=false
   ```

5. Run the backend server:

   ```bash
   uvicorn main:app --reload --host 127.0.0.1 --port 8000
   ```

6. Verify the API is running:

   - OpenAPI Docs: http://127.0.0.1:8000/docs
   - ReDoc: http://127.0.0.1:8000/redoc
   - Health Check: http://127.0.0.1:8000/api/health
### Frontend Setup

1. Navigate to the frontend directory:

   ```bash
   cd frontend
   ```

2. Install dependencies:

   ```bash
   npm install
   ```

3. Optionally configure the API endpoint by creating a `.env` file:

   ```env
   VITE_API_URL=http://127.0.0.1:8000
   ```

4. Run the development server:

   ```bash
   npm run dev
   ```

5. Access the application by opening http://localhost:5173 in your browser.
## API Documentation

### POST /api/audio

Process audio and detect intent.

**Request:**

- Content-Type: `multipart/form-data`
- Body: `audio` - WAV file (16 kHz, mono, max 3 seconds, max 1 MB)

**Response:**

```json
{
  "intent": "WATER",
  "confidence": 0.89,
  "status": "confirmed",
  "ui_options": ["Confirm Water", "Cancel"],
  "next_action": "show_buttons",
  "transcription": null,
  "alternatives": null,
  "embedding_id": "uuid-here",
  "model_used": "HuBERT"
}
```

### GET /api/health

Check system health and ML endpoint status.

**Response:**

```json
{
  "status": "ok",
  "ml_endpoints": {
    "hubert": {"reachable": true},
    "wave2vec": {"reachable": true}
  }
}
```

### Other Endpoints

- Get the list of available intents.
- Get embedding database statistics.
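For a quick end-to-end test, the audio endpoint can be exercised from a short Python script. This sketch assumes the default local server address and uses httpx, which is already a backend dependency:

```python
# Example client for POST /api/audio. API_URL assumes the default local
# server address from Getting Started.
import httpx

API_URL = "http://127.0.0.1:8000"

def detect_intent(wav_path: str) -> dict:
    """Upload a 16 kHz mono WAV file and return the detected intent payload."""
    with open(wav_path, "rb") as f:
        files = {"audio": (wav_path, f, "audio/wav")}
        response = httpx.post(f"{API_URL}/api/audio", files=files, timeout=30.0)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = detect_intent("sample.wav")
    print(result["intent"], result["confidence"], result["ui_options"])
```

Run against a valid recording, this should print something like `WATER 0.89 ['Confirm Water', 'Cancel']`.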
## Audio Requirements

| Parameter | Value |
|---|---|
| Format | WAV |
| Sample Rate | 16000 Hz (16kHz) |
| Channels | Mono (1 channel) |
| Bit Depth | 16-bit |
| Max Duration | 3 seconds |
| Max File Size | 1 MB |
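These constraints are easy to check before uploading. Here is a sketch using only the standard-library `wave` module; the backend performs its own validation, so this is purely illustrative:

```python
# Illustrative pre-upload check against the audio requirements table.
import os
import wave

MAX_DURATION_SECONDS = 3
MAX_SIZE_BYTES = 1_048_576  # 1 MB

def validate_wav(path: str) -> None:
    """Raise ValueError if the file violates the audio requirements."""
    if os.path.getsize(path) > MAX_SIZE_BYTES:
        raise ValueError("File exceeds 1 MB")
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError("Sample rate must be 16000 Hz")
        if wav.getnchannels() != 1:
            raise ValueError("Audio must be mono")
        if wav.getsampwidth() != 2:
            raise ValueError("Audio must be 16-bit PCM")
        if wav.getnframes() / wav.getframerate() > MAX_DURATION_SECONDS:
            raise ValueError("Recording exceeds 3 seconds")
```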
## Project Structure

```
Imagine_cup_Backend/
├── Backend/
│   ├── main.py                    # FastAPI application entry point
│   ├── requirements.txt           # Python dependencies
│   ├── intent_embeddings.json     # Stored intent embeddings (learning)
│   ├── .env                       # Environment variables (create this)
│   └── app/
│       ├── config.py              # Application settings
│       ├── models/
│       │   └── schemas.py         # Pydantic request/response models
│       ├── routes/
│       │   ├── audio.py           # Audio processing endpoints
│       │   └── health.py          # Health check endpoints
│       └── services/
│           ├── azure_ml.py        # Azure ML integration
│           ├── intent_embeddings.py  # Embedding-based classification
│           ├── intent_logic.py    # Intent detection logic
│           └── logger.py          # Logging utilities
│
└── frontend/
    ├── package.json               # NPM dependencies
    ├── vite.config.js             # Vite configuration
    ├── tailwind.config.js         # Tailwind CSS config
    └── src/
        ├── App.jsx                # Root React component
        ├── main.jsx               # Application entry point
        ├── index.css              # Global styles
        ├── components/
        │   ├── app/               # Application-specific components
        │   │   ├── RecordButton.jsx
        │   │   ├── IntentCard.jsx
        │   │   ├── ActionButtons.jsx
        │   │   ├── DiagramLayout.jsx
        │   │   ├── ListeningWave.jsx
        │   │   └── QuickActions.jsx
        │   └── ui/                # Reusable UI components
        │       ├── button.jsx
        │       ├── card.jsx
        │       ├── dialog.jsx
        │       ├── toast.jsx
        │       └── ThemeToggle.jsx
        ├── hooks/
        │   ├── useAudioRecorder.js  # Audio recording hook
        │   └── use-toast.js       # Toast notifications
        ├── pages/
        │   └── Home.jsx           # Main application page
        └── services/
            └── api.js             # Backend API client
```
## Configuration

Key backend settings (see `Backend/app/config.py`):

| Setting | Default | Description |
|---|---|---|
| `MAX_AUDIO_SIZE_BYTES` | `1048576` | Maximum audio file size (1 MB) |
| `MAX_AUDIO_DURATION_SECONDS` | `3` | Maximum recording duration |
| `SAMPLE_RATE` | `16000` | Required audio sample rate (Hz) |
| `AZURE_ML_TIMEOUT_SECONDS` | `120` | ML endpoint timeout |
| `CONFIDENCE_CONFIRMED` | `0.75` | Threshold for auto-confirmation |
| `CONFIDENCE_NEEDS_CONFIRMATION` | `0.4` | Threshold for showing alternatives |
Confidence scores map to behavior as follows (see the sketch after this list):

- ≥ 0.75: Intent confirmed, show action buttons
- 0.4 to 0.75: Needs confirmation, show alternatives
- < 0.4: Unknown, prompt the patient to repeat
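This mapping can be expressed as a small function. The `needs_confirmation` and `unknown` status strings are assumptions for illustration; only `confirmed` appears in the API example above:

```python
# Illustrative mapping from confidence score to response status, mirroring
# the thresholds above; the project's real logic lives in intent_logic.py.
CONFIDENCE_CONFIRMED = 0.75
CONFIDENCE_NEEDS_CONFIRMATION = 0.4

def resolve_status(confidence: float) -> str:
    """Map a similarity score to the status returned by the API."""
    if confidence >= CONFIDENCE_CONFIRMED:
        return "confirmed"           # show action buttons
    if confidence >= CONFIDENCE_NEEDS_CONFIRMATION:
        return "needs_confirmation"  # show alternatives
    return "unknown"                 # prompt the patient to repeat
```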
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is developed for the Microsoft Imagine Cup competition.
## Acknowledgments

- Microsoft Azure ML for model hosting infrastructure
- HuBERT and Wav2Vec2 research teams at Meta AI
- Radix UI for accessible component primitives
- Tailwind CSS for rapid UI development