Sera is an AI system that connects real-time vision with natural language understanding. It observes your surroundings, builds a memory of the environment, and lets you interact with it through voice or text.
(Note: The video and images are for reference only and have nothing to do with the project's outcomes.)
Visual Reference: Click Here
- Real-time video processing from smartphone or webcam
- Object tracking and spatial memory across time
- Natural language interface to query stored visual context
- 3D visualization of the room or environment
- Lightweight models using transfer learning for vision and speech
- Modular architecture for easy extension and experimentation
Ask questions like:
- "Where did I place my glasses yesterday?"
- "How many books are on the table right now?"
- "What was on the desk last night?"
- Walk through a room and view its 3D memory map on your laptop
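As a rough illustration of how the example questions above might be turned into something a memory lookup can use, here is a minimal sketch of a rule-based parser. Everything in it (`ParsedQuery`, `parse_question`, the intent names, the time phrases) is hypothetical and only covers the three sample questions; the project itself may well use a language model for this step instead.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical time phrases; only those used in the sample questions plus "today".
TIME_PHRASES = ("right now", "last night", "yesterday", "today")


@dataclass
class ParsedQuery:
    intent: str              # "locate", "count", "list", or "unknown"
    target: Optional[str]    # object being asked about, if any
    place: Optional[str]     # surface or location named in the question, if any
    when: str                # time phrase from the question, "now" if none given


def parse_question(text: str) -> ParsedQuery:
    """Map a question to a structured query using a few fixed patterns."""
    q = text.strip().lower().rstrip("?").strip()

    # Peel a trailing time phrase off the question, defaulting to "now".
    when = "now"
    for phrase in TIME_PHRASES:
        if q.endswith(phrase):
            when = "now" if phrase == "right now" else phrase
            q = q[: -len(phrase)].strip()
            break

    m = re.match(r"where did i (?:place|put|leave) my (.+)$", q)
    if m:
        return ParsedQuery("locate", m.group(1), None, when)

    m = re.match(r"how many (.+?) (?:are|were) (?:on|in) the (.+)$", q)
    if m:
        return ParsedQuery("count", m.group(1), m.group(2), when)

    m = re.match(r"what (?:is|was) (?:on|in) the (.+)$", q)
    if m:
        return ParsedQuery("list", None, m.group(1), when)

    return ParsedQuery("unknown", None, None, when)
```

Calling `parse_question("Where did I place my glasses yesterday?")` yields a `locate` query for `glasses` with the time phrase `yesterday`, which a memory module could then resolve to a concrete time range.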
- Python with PyTorch or TensorFlow
- OpenCV for video input and object tracking
- YOLO/Segment Anything for perception (fine-tuned)
- Whisper or SpeechT5 for voice input
- Flask/FastAPI backend + WebSocket for real-time streaming
- Three.js/Blender/Unity for 3D scene visualization
Sera/
│
├── models/           # Vision, language and memory modules
├── data/             # Sample datasets and recorded sessions
├── src/
│   ├── vision/       # Object detection, tracking, scene mapping
│   ├── memory/       # Spatial and temporal memory system
│   ├── interface/    # Voice and text query processing
│   ├── server/       # API and real-time streaming logic
│   └── ui/           # 3D visualization frontend
│
├── requirements.txt
├── README.md
└── LICENSE
- A video stream from a phone or webcam is sent to the system
- Objects are detected, tracked, and positioned in 3D space
- A memory module stores each object's location and time context
- Users ask questions using voice or text
- The natural language query is matched against the visual-memory data and answered
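The memory step in the pipeline above can be pictured as a log of timestamped observations that is filtered at query time. The sketch below is an assumption about one way to do this, not the project's actual implementation: `Observation`, `SpatialMemory`, and their methods are hypothetical names, positions are plain `(x, y, z)` tuples, and detection and tracking are assumed to have already produced a label for each object.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

Position = Tuple[float, float, float]  # assumed (x, y, z) in room coordinates


@dataclass(frozen=True)
class Observation:
    label: str          # object class or name, e.g. "glasses"
    position: Position  # where the object was seen
    seen_at: datetime   # when it was seen


class SpatialMemory:
    """Append-only log of observations with simple time-aware lookups."""

    def __init__(self) -> None:
        self._observations: List[Observation] = []

    def record(self, label: str, position: Position, seen_at: datetime) -> None:
        self._observations.append(Observation(label, position, seen_at))

    def last_seen(
        self, label: str, before: Optional[datetime] = None
    ) -> Optional[Observation]:
        """Return the most recent sighting of `label`, optionally up to `before`."""
        candidates = [
            o for o in self._observations
            if o.label == label and (before is None or o.seen_at <= before)
        ]
        return max(candidates, key=lambda o: o.seen_at, default=None)

    def labels_between(self, start: datetime, end: datetime) -> List[str]:
        """Return the sorted, de-duplicated labels seen in [start, end]."""
        return sorted({
            o.label for o in self._observations if start <= o.seen_at <= end
        })


memory = SpatialMemory()
memory.record("glasses", (1.0, 0.2, 0.8), datetime(2024, 5, 1, 21, 30))
memory.record("book", (0.4, 0.2, 0.8), datetime(2024, 5, 1, 21, 31))
memory.record("glasses", (2.5, 1.0, 0.4), datetime(2024, 5, 2, 9, 0))

# "Where did I place my glasses yesterday?" asked on 2024-05-02
hit = memory.last_seen("glasses", before=datetime(2024, 5, 1, 23, 59))
```

A linear scan keeps the sketch short; a real system that records many frames per second would likely need an index by label and time, plus per-object track IDs so that repeated sightings of the same item are not counted twice.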
- Personal object recognition and face identification
- Multi-room or outdoor mapping
- Voice assistant integration with smart devices
- On-device processing for privacy-focused use
Contributions, ideas, and research suggestions are welcome. Fork the repo, open an issue, or submit a pull request.
This project is licensed under the MIT License.