MedKit is a hands-free first-aid coaching application that transforms Meta Ray-Ban smart glasses into a real-time medical assistance system. The app provides voice-guided, AI-powered first-aid instructions to bystanders during medical emergencies, helping them take immediate action while waiting for emergency services.
"Point-of-care first-aid guidance in your ear, triggered by what you're seeing, with timers and rhythm assistance."
The app uses the wearer's point-of-view video and voice to understand emergency situations and provides step-by-step guidance through the glasses' speakers, with visual aids displayed on a connected iPhone.
- Voice-activated: Wake word "Medkit" activates the system
- Natural conversation: AI responds to user questions and provides guidance
- Scene analysis: Continuously analyzes video frames to understand the situation
- Proactive check-ins: Automatically checks in if user goes quiet during emergencies
Supports three primary emergency scenarios:
- CPR Assistance: Guides through cardiopulmonary resuscitation with metronome timing
- Severe Bleeding: Provides wound care instructions and pressure application guidance
- Adult Choking: Guides through Heimlich maneuver and airway clearing
- Metronome: Audio beats at 110 BPM for CPR compressions
- Timers: Countdown timers for pressure checks, rescuer switches, etc.
- Visual Checklists: Step-by-step instructions displayed on phone
- 3D Wireframe Guides: Animated body guides showing where to focus (chest, arm, etc.)
- Video Recording: Automatically records session video, exports as MP4
- Transcript Logging: Captures all conversations with timestamps
- PDF Export: Generates formatted transcript PDFs with session metadata
- EMS Reports: Creates comprehensive text reports for emergency medical services
- Face Blurring: MediaPipe automatically blurs faces in video frames
- Safety Disclaimers: Always displays "Decision support only - call emergency services"
- Confidence Gating: Only provides instructions when confident about the situation
- No Diagnosis: System explicitly avoids medical diagnosis
Technology Stack:
- SwiftUI for user interface
- Meta Ray-Ban SDK (MWDATCamera, MWDATCore) for glasses integration
- AVFoundation for audio/video processing
- WebSocket for real-time backend communication
Key Components:
- MetaCameraView: Main UI with streaming interface
- StreamViewModel: Manages streaming session and state
- AudioManager: Handles audio capture, wake word detection, playback
- WebSocketManager: Manages backend communication
- ToolExecutor: Executes metronome, timers, UI cards locally
- SessionLogger: Records video, transcripts, generates exports
- ExportView: UI for exporting session data
Features:
- Real-time video streaming from glasses
- Audio capture with wake word detection
- Local tool execution (metronome, timers)
- 3D wireframe visualization for body regions
- Session recording and export capabilities
Technology Stack:
- Modal for cloud hosting
- FastAPI for WebSocket gateway
- OpenAI Realtime API for voice conversation
- GPT-4o Vision for scene analysis
- Python async/await for concurrent processing
Key Components:
- app.py: Modal deployment configuration
- orchestrator.py: Core orchestration engine managing four concurrent loops:
  - iOS → Realtime: Audio/frames from client
  - Realtime → iOS: AI responses, transcripts, tools
  - Scene Analysis Loop: Periodic VLM analysis
  - Follow-up Loop: Proactive check-ins during emergencies
- dedalus_agent.py: Scene analysis using GPT-4o Vision
- session_logger.py: Backend logging and report generation
- prompts.py: System prompts for AI behavior
- tools.py: Tool definitions (metronome, timers, UI cards)
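The four concurrent loops can be sketched with asyncio queues. This is a minimal illustration of the structure, not the actual orchestrator.py API; all function and parameter names here are assumptions:

```python
import asyncio

async def ios_to_realtime(inbound: asyncio.Queue, realtime_in: asyncio.Queue):
    # Forward audio chunks and video frames from the iOS client to the Realtime API.
    while True:
        await realtime_in.put(await inbound.get())

async def realtime_to_ios(realtime_out: asyncio.Queue, outbound: asyncio.Queue):
    # Relay AI responses, transcripts, and tool calls back to the client.
    while True:
        await outbound.put(await realtime_out.get())

async def scene_analysis_loop(analyze, interval_s: float = 8.0):
    # Run the VLM on the latest frame every `interval_s` seconds.
    while True:
        await analyze()
        await asyncio.sleep(interval_s)

async def follow_up_loop(check_in, quiet_timeout_s: float = 15.0):
    # Proactively check in when the user has been quiet too long.
    while True:
        await asyncio.sleep(quiet_timeout_s)
        await check_in()

async def run_session(inbound, outbound, realtime_in, realtime_out, analyze, check_in):
    # All four loops share one session; cancelling the gather ends them together.
    await asyncio.gather(
        ios_to_realtime(inbound, realtime_in),
        realtime_to_ios(realtime_out, outbound),
        scene_analysis_loop(analyze),
        follow_up_loop(check_in),
    )
```

Running all four under one `asyncio.gather` keeps the session's lifetime in a single place: one cancellation tears down forwarding, analysis, and check-ins together.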
Architecture Flow:
iOS App → WebSocket → Modal Backend → OpenAI Realtime API
Modal Backend → Scene Analysis (GPT-4o Vision)
Scene Analysis → Tool Execution → iOS App
- User opens app and connects Meta Ray-Ban glasses
- Taps "Start Session" button
- App requests camera and microphone permissions
- Establishes WebSocket connection to Modal backend
- Begins streaming video frames (every 3 seconds) and audio
- User says "Medkit" (or variations like "med kit", "medic")
- Speech recognition detects wake word
- System activates and starts listening
- Audio streams to backend for processing
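The wake-word step can be illustrated with a simple transcript matcher. This is a toy sketch: the real AudioManager relies on on-device speech recognition, and only the accepted phrases come from the spec above:

```python
import re

# Wake phrases the app accepts, per the flow above.
WAKE_PHRASES = ("medkit", "med kit", "medic")

def detect_wake_word(transcript: str) -> bool:
    # Normalize: lowercase, strip punctuation, collapse whitespace.
    text = " ".join(re.sub(r"[^a-z ]", " ", transcript.lower()).split())
    # Whole-word match so e.g. "medicine" does not trigger "medic".
    return any(re.search(rf"\b{p}\b", text) for p in WAKE_PHRASES)
```

The word-boundary match matters in practice: a substring check would fire on innocuous words that merely contain a wake phrase.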
- User describes situation: "Someone collapsed!"
- AI asks clarifying questions: "Are they responding? Are they breathing?"
- Scene analysis: VLM analyzes video frames every 8 seconds
- Scenario identification: System determines emergency type (CPR, bleeding, choking)
- Confidence check: Only proceeds if confident or user confirms
- Initial instruction: "Call emergency services now"
- Tool activation:
- Metronome starts for CPR (110 BPM)
- Timer starts for pressure checks or rescuer switches
- Checklist appears on phone screen
- Step-by-step guidance: AI provides next steps via voice
- Visual aids: 3D wireframe highlights relevant body region
- Proactive check-ins: System checks in if user is quiet
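The metronome's timing reduces to simple arithmetic: at 110 BPM each compression beat lands 60/110 ≈ 0.545 s after the previous one. A sketch of the schedule a local metronome tool might precompute (helper names are illustrative):

```python
def beat_interval_s(bpm: float = 110.0) -> float:
    # Seconds between beats: 110 BPM ≈ 0.545 s per CPR compression.
    return 60.0 / bpm

def beat_times(bpm: float = 110.0, duration_s: float = 60.0) -> list:
    # Timestamps (seconds) at which beats fire; computed by multiplication
    # rather than accumulation to avoid floating-point drift.
    n = int(duration_s * bpm / 60.0)
    return [round(i * (60.0 / bpm), 3) for i in range(n)]
```

Precomputing timestamps (rather than sleeping one interval at a time) keeps the audio cue from drifting over a long CPR session.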
- During session: Video frames recorded, transcripts logged with timestamps
- After session: User can export:
- Video (MP4)
- Transcript PDF
- EMS Report (TXT)
- Video: Frames from glasses camera (sampled every 3 seconds)
- Audio: User voice from glasses microphone
- Scene Context: Current scenario state, recent transcripts
- Voice → Text: OpenAI Realtime API transcribes user speech
- Vision Analysis: GPT-4o Vision analyzes video frames
- Decision Making: AI coordinator integrates voice + vision to determine actions
- Tool Execution: Commands sent to iOS app for local execution
- Audio Response: AI voice guidance through glasses speakers
- Visual UI: Checklists, timers, wireframes on phone
- Audio Tools: Metronome beats, timer alerts
- Session Logs: Video, transcripts, reports
- Purpose: Natural voice conversation
- Features: Streaming audio, real-time transcription, low latency
- Voice: "alloy" voice model
- Format: PCM16, 24kHz audio
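The PCM16/24 kHz figures pin down the audio bandwidth exactly: 2 bytes per sample, mono, so raw audio costs 48,000 bytes per second (about 2.9 MB per minute before any compression). A one-line helper makes the arithmetic explicit:

```python
def pcm16_bytes(seconds: float, sample_rate_hz: int = 24_000, channels: int = 1) -> int:
    # PCM16 stores each sample as a signed 16-bit integer (2 bytes per channel).
    return int(seconds * sample_rate_hz * channels * 2)
```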
- Purpose: Scene analysis from video frames
- Frequency: Every 8 seconds
- Output: Factual scene descriptions (1-2 sentences)
- Detail Level: Currently "low" (can upgrade to "high")
- Purpose: Decision making, scenario management, tool execution
- Capabilities:
- Integrates voice + vision inputs
- Maintains scenario state
- Executes tools (metronome, timers, UI)
- Enforces safety rules
- Always recommends calling 911 for any emergency
- Confidence gating: Only provides instructions when confident
- No diagnosis: Explicitly avoids medical diagnosis
- Playbook-based: Only provides established first-aid procedures
- User confirmation: Asks clarifying questions before acting
- Safety disclaimers: Always visible in UI
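Confidence gating can be sketched as a single predicate. The 0.8 threshold and all names here are illustrative assumptions, not the system's actual parameters:

```python
def may_instruct(confidence: float, user_confirmed: bool, threshold: float = 0.8) -> bool:
    # Give first-aid instructions only when the scene assessment is confident
    # enough, or the user has explicitly confirmed the scenario.
    return user_confirmed or confidence >= threshold

def next_step(confidence: float, user_confirmed: bool) -> str:
    # Below the gate, fall back to a clarifying question instead of acting.
    return "instruct" if may_instruct(confidence, user_confirmed) else "ask_clarifying_question"
```

The key design point is the `user_confirmed` escape hatch: an explicit "yes, they're not breathing" from the bystander overrides a hesitant vision model.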
- Face blurring: MediaPipe blurs faces before cloud upload
- Local storage: Session data stored locally on device
- User control: User decides what to export/share
- No persistent video: Only processes frames, doesn't store full video
- CPR: Unresponsive person, not breathing normally
- Severe Bleeding: Heavy external bleeding
- Adult Choking: Airway obstruction
- Burns
- Fractures
- Allergic reactions
- Wound care
- Minor injuries
- Status Bar: Shows connection status, wake word state, scenario type
- Video Feed: Live view from glasses camera
- Audio Visualizer: Visual feedback when listening
- 3D Wireframe: Animated guide showing body regions
- Transcript Display: Real-time conversation transcript
- Tool Overlays: Metronome, timers, checklists
- Export Button: Access to session exports
- Wake Word Mode: System waits for "Medkit" activation
- Emergency Mode: System stays active during critical scenarios
- Conversation Mode: Natural back-and-forth dialogue
- Guidance Mode: Step-by-step instruction delivery
- Automatic Video Recording: Records entire session as MP4
- Transcript Logging: Captures all conversations with precise timestamps
- PDF Export: Formatted transcripts with session metadata
- EMS Report Generation: Comprehensive reports for medical professionals
- Backend Logging: Server-side logging for redundancy
- Video (MP4): Full session recording, 30 FPS, H.264 encoding
- Transcript PDF: Includes session info, scenarios, scene observations, full conversation
- EMS Report: Text report with session details, key information, tool calls
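A minimal sketch of how the text EMS report could be rendered from a session log. The dict schema here is assumed for illustration; session_logger.py defines the real one:

```python
def render_ems_report(session: dict) -> str:
    # Assemble the plain-text report: session details, transcript, tool calls.
    lines = [
        "MEDKIT SESSION REPORT",
        "Decision support only - call emergency services",
        f"Started: {session['started_at']}",
        f"Scenario: {session.get('scenario', 'unknown')}",
        "",
        "Transcript:",
    ]
    lines += [f"[{e['t']}] {e['speaker']}: {e['text']}" for e in session["transcript"]]
    lines += ["", "Tool calls:"]
    lines += [f"[{c['t']}] {c['name']}" for c in session.get("tool_calls", [])]
    return "\n".join(lines)
```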
- Video Frame Rate: 24 FPS from glasses, sampled every 3 seconds
- Audio: PCM16, 24kHz, mono
- Latency: <3 seconds for voice responses
- Scene Analysis: Every 8 seconds
- Wake Word: ~15 second timeout if inactive
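The frame-sampling numbers above combine straightforwardly: at 24 FPS with one frame kept every 3 seconds, the client uploads 1 of every 72 frames. A sketch of the arithmetic (function names are illustrative):

```python
def sample_stride(fps: float = 24.0, sample_period_s: float = 3.0) -> int:
    # Keep every Nth frame so one frame is sent per sample period.
    return max(1, round(fps * sample_period_s))

def frames_uploaded(session_s: float, sample_period_s: float = 3.0) -> int:
    # Total frames sent to the backend over a session.
    return int(session_s / sample_period_s)
```

So a 10-minute session uploads only 200 frames for analysis, which is what keeps the bandwidth and VLM cost modest.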
- Video: ~2MB per minute (640x480, H.264)
- Transcripts: Minimal storage (text)
- Session Logs: JSON format, ~10-50KB per session
- iOS: iOS 15+
- Hardware: Meta Ray-Ban smart glasses
- Network: Internet connection for backend
- Permissions: Camera, microphone, speech recognition
Bystander encounters medical emergency → Activates MedKit → Receives real-time guidance → Takes action while waiting for EMS
- Training: Practice first-aid procedures with AI guidance
- Review: Analyze session transcripts to improve responses
- Documentation: Generate reports for medical professionals
- Education: Learn proper first-aid techniques
- Cloud storage for automatic backup
- Multi-language support
- Pediatric emergency support
- Integration with emergency services
- Offline mode with on-device models
- Advanced analytics dashboard
- Custom scenario training
- ✅ Core streaming infrastructure
- ✅ Wake word detection
- ✅ Scene analysis integration
- ✅ Tool execution (metronome, timers, UI)
- ✅ Session logging and export
- ✅ Backend orchestration
- ✅ Safety features
- 🔄 VLM model optimization
- 🔄 Enhanced scene analysis accuracy
- 🔄 Additional emergency scenarios
- 🔄 Performance optimizations
MedKit is a comprehensive first-aid assistance system that combines:
- Wearable technology (Meta Ray-Ban glasses)
- AI-powered guidance (OpenAI Realtime + Vision)
- Real-time scene analysis (GPT-4o Vision)
- Interactive tools (metronome, timers, checklists)
- Complete session logging (video, transcripts, reports)
The system helps non-expert bystanders provide effective first-aid during emergencies while maintaining safety, privacy, and comprehensive documentation for medical professionals.