[2025 ICT Award Korea, University Division: Grand Prize, Korea Information Processing Society]
"Hear the road, see the road." VisualVroom is an innovative wearable application that pairs smartphones and smartwatches to provide deaf drivers with real-time visual and haptic alerts for traffic sounds. Using AI-powered audio analysis, the app detects emergency vehicles, motorcycles, and car horns while determining their direction, delivering critical safety information through visual cues and vibration patterns.
- Vehicle Type Detection: Distinguishes between sirens, motorcycles, and car horns
- Directional Awareness: Uses smartphone stereo microphones to identify sound direction (left/right)
- Real-time Processing: Continuous audio monitoring with instant alerts
- Live Transcription: Converts speech to text using Google Speech-to-Text
- Sign Language Generation: Creates sign language images using Google Gemini
- Accessibility Support: Helps deaf drivers communicate with law enforcement and others
- Audio Capture: Stereo microphones capture ambient sound
- Feature Extraction:
- Generate spectrograms for frequency analysis
- Extract MFCC (Mel-Frequency Cepstral Coefficients) features
- Stitch features into a unified image representation
- AI Classification: Vision Transformer (ViT) model processes the stitched image representation of the audio
- Direction Detection: Amplitude analysis determines left/right orientation
- Alert Delivery: Results sent to smartwatch for haptic and visual feedback
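The pipeline above can be sketched in a few lines of Python. This is a minimal, self-contained illustration using only NumPy; the production backend uses librosa and a ViT model, and the function names here (`stitch_features`, `detect_direction`) are illustrative, not the project's actual API.

```python
import numpy as np

def spectrogram(mono: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude spectrogram via a short-time FFT (a stand-in for librosa.stft)."""
    frames = [mono[i:i + n_fft] for i in range(0, len(mono) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames) * np.hanning(n_fft), axis=1)).T

def stitch_features(spec: np.ndarray, mfcc_like: np.ndarray) -> np.ndarray:
    """Stack feature maps vertically into one grayscale 'image' for the classifier."""
    width = min(spec.shape[1], mfcc_like.shape[1])
    return np.vstack([spec[:, :width], mfcc_like[:, :width]])

def detect_direction(left: np.ndarray, right: np.ndarray) -> str:
    """Amplitude (RMS) comparison between the stereo channels decides left vs. right."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return "L" if rms(left) > rms(right) else "R"

# Toy stereo clip: the same tone, louder on the left channel.
sr = 48_000
t = np.linspace(0, 1, sr, endpoint=False)
left = 0.8 * np.sin(2 * np.pi * 700 * t)
right = 0.2 * np.sin(2 * np.pi * 700 * t)

spec = spectrogram(left)
image = stitch_features(spec, spec[:20])  # MFCC stand-in: a few spectrogram rows
print(detect_direction(left, right))      # -> L
```

In the real app the stitched image is resized to the model's input resolution before inference; here the shapes are only meant to show the stacking step.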
- Language: Java
- IDE: Android Studio
- Framework: Android SDK (API Level 30+)
- Wearable: WearOS by Google
- UI Components:
- Lottie animations
- Material Design components
- ViewPager2 for tabbed interface
- Framework: FastAPI (Python)
- AI/ML:
- PyTorch with Vision Transformer (ViT)
- librosa for audio processing
- Whisper AI for speech-to-text
- Audio Processing:
- soundfile, pydub for audio manipulation
- numpy for numerical operations
- Infrastructure: Google Compute Engine
- Google Speech-to-Text: Audio transcription
- Google Gemini: Sign language image generation
- Google Wearable API: Watch communication
mobile/
├── src/main/java/edu/skku/cs/visualvroomandroid/
│ ├── MainActivity.java # Main activity with tab navigation
│ ├── AudioRecorderFragment.java # Sound detection interface
│ ├── SpeechToTextFragment.java # Speech-to-sign conversion
│ ├── AudioRecorder.java # Audio recording logic
│ ├── AudioRecordingService.java # Background audio service
│ ├── WearNotificationService.java # Watch communication
│ └── dto/ # Data transfer objects
├── src/main/res/
│ ├── layout/ # UI layouts (portrait/landscape)
│ ├── raw/ # Lottie animation files
│ └── values/ # App resources
└── AndroidManifest.xml # App permissions and services
wear/
├── src/main/java/edu/skku/cs/visualvroomandroid/presentation/
│ └── MainActivity.java # Watch app main activity
├── src/main/res/layout/
│ └── activity_main.xml # Watch UI layout
└── AndroidManifest.xml # Watch app manifest
backend/
└── main.py # FastAPI server with:
# - ViT model inference
# - Audio processing pipeline
# - Whisper transcription
# - API endpoints
- Android Studio (latest version)
- Android device with API level 30+
- WearOS smartwatch (optional but recommended)
- Python 3.8+ (for backend development)
- Clone the frontend repository:
git clone https://github.com/GDG-SKKU/VisualVroom_Android_GDG.git
cd VisualVroom_Android_GDG
- Open the project in Android Studio and build it
- Grant required permissions:
- Microphone access
- Location access
- Notification permissions
- Clone the backend repository:
git clone https://github.com/GDG-SKKU/VisualVroom_Backend_GDG.git
cd VisualVroom_Backend_GDG
- Install dependencies:
pip install -r requirements.txt
- Run the FastAPI server:
python main.py
Sound Detection Mode:
- Launch the app and navigate to "Audio Recorder" tab
- Tap the microphone button to start continuous monitoring
- Visual alerts appear on phone, haptic feedback on watch
Speech-to-Sign Mode:
- Navigate to "Speech to Text" tab
- Tap record button and speak
- View transcribed text and generated sign language images
- Sample Rate: 16kHz for speech, 48kHz for sound detection
- Channels: Stereo recording for directional detection
- Processing Interval: 3-second windows for continuous monitoring
- Confidence Threshold: 97% for high-accuracy alerts
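The capture settings above imply a simple processing loop: slice the stereo stream into 3-second windows and raise an alert only when classifier confidence clears 0.97. A NumPy sketch of that loop; the `should_alert` helper and constant names are illustrative, not the app's actual identifiers.

```python
import numpy as np

SAMPLE_RATE = 48_000   # sound-detection rate from the settings above
WINDOW_SEC = 3         # processing interval
CONF_THRESHOLD = 0.97  # only high-confidence predictions trigger alerts

def windows(stereo: np.ndarray, sr: int = SAMPLE_RATE, sec: int = WINDOW_SEC):
    """Yield consecutive non-overlapping (samples, 2) windows from a stereo buffer."""
    step = sr * sec
    for start in range(0, stereo.shape[0] - step + 1, step):
        yield stereo[start:start + step]

def should_alert(confidence: float) -> bool:
    return confidence >= CONF_THRESHOLD

# 10 seconds of stereo audio -> three full 3-second windows (the remainder waits).
audio = np.zeros((SAMPLE_RATE * 10, 2))
chunks = list(windows(audio))
print(len(chunks), chunks[0].shape)  # -> 3 (144000, 2)
```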
- Architecture: Vision Transformer (ViT-B/16)
- Classes: 6 total (Siren_L, Siren_R, Bike_L, Bike_R, Horn_L, Horn_R)
- Input Size: 224x224 grayscale images
- Checkpoint:
feb_25_checkpoint.pth
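Each of the six classes encodes both a vehicle type and a direction. A small sketch of decoding a 6-way logit vector into a (vehicle, side, confidence) triple; the class ordering is taken from the list above, while the softmax decode itself is an assumption about how the checkpoint's outputs are consumed.

```python
import numpy as np

# Class order as listed above: vehicle type crossed with left/right.
CLASSES = ["Siren_L", "Siren_R", "Bike_L", "Bike_R", "Horn_L", "Horn_R"]

def decode(logits: np.ndarray):
    """Softmax over the 6 logits, then split the winning label into (vehicle, side)."""
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    probs = exp / exp.sum()
    idx = int(probs.argmax())
    vehicle, side = CLASSES[idx].split("_")
    return vehicle, side, float(probs[idx])

vehicle, side, conf = decode(np.array([0.1, 0.2, 4.0, 0.3, 0.1, 0.2]))
print(vehicle, side)  # -> Bike L
```

Pairing the confidence with the 97% threshold from the audio configuration would then gate whether the watch receives an alert at all.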
- Free Tier: 0-60 minutes per month
- Paid Tier: $0.016/minute beyond 60 minutes
- Monthly Estimate: ~$4 per driver (based on 300 minutes usage)
- Cost: Sign language generation assumed free (not billed in this estimate)
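The monthly estimate follows directly from the pricing tiers above. A quick check of the arithmetic, assuming the quoted 300 minutes of monthly usage:

```python
FREE_MINUTES = 60         # free tier: first 60 minutes per month
RATE_PER_MINUTE = 0.016   # USD per minute beyond the free tier

def monthly_cost(minutes_used: float) -> float:
    """Bill only the minutes beyond the free tier."""
    billable = max(0, minutes_used - FREE_MINUTES)
    return billable * RATE_PER_MINUTE

print(monthly_cost(300))  # (300 - 60) * 0.016 = 3.84, i.e. ~$4 per driver
```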
