Developed by: Sergejs Zahovskis, Dmitry Knorre, James Conant, and Yael Fassbind
Course: 2025 Mixed Reality at ETH Zurich
Supervised by: Alexander Veicht
This project enables users to interact with photorealistic 3D scenes in VR using natural language voice commands (e.g., "highlight the bonsai tree"). While standard VR applications rely on cumbersome controllers for text input, our system allows for intuitive, immersive scene manipulation directly on a Meta Quest 3.
We overcome the hardware limitations of mobile VR (8GB RAM, Adreno 740 GPU) by offloading heavy language embedding computations to a local server while maintaining real-time rendering natively on the headset.
- Real-Time Rendering: Native 3D Gaussian Splatting on VR headset at stable frame rates (~12 FPS)
- Semantic Understanding: Integration of Occam's Language Gaussian Splatting (LGS) to link 3D objects with natural language
- Voice-Driven Interaction: Speech-to-text via OpenAI's Whisper and semantic querying via CLIP (ViT-B-16)
- Multiple Rendering Modes:
  - Standard colored Gaussian Splats
  - Occam Similarities (grayscale with heatmap highlighting)
  - Occam Colored Similarities (heatmap overlay on colored scene)
- Unity Version: 2023.1.14f1
- Hardware: Meta Quest 3 and a laptop/PC (both must be on the same Wi-Fi network)
- Python: 3.9 with virtual environment support
- Navigate to the `occam_backend/` directory:
  ```bash
  cd occam_backend/
  ```
- Create a virtual environment:
  ```bash
  python3 -m venv venv
  ```
- Activate the virtual environment:
  ```bash
  source venv/bin/activate
  ```
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download the required `.ply` files:
  - [occam_bonsai_mcmc.ply](https://polybox.ethz.ch/index.php/s/Qb9QYnDNMqS8XjC)
  - [occam_meeting_room_mcmc.ply](https://polybox.ethz.ch/index.php/s/3t9sLdEF6nKATAw)
- Place both `.ply` files inside the `occam_backend/` directory of the project.
- Open the project in Unity 2023.1.14f1
- Go to File > Build Settings, select Android, and click Switch Platform
- In the Project window, navigate to the
Scenesfolder and load the main scene - In the Hierarchy, you will see two game objects:
meeting_roomandbonsai- Important: Enable only ONE of these objects at a time (disable the other)
- The enabled object determines which scene you will view
- Connect your Meta Quest 3 via USB
- Click Build and Run to deploy the application to the headset
The system requires both the VR app and the Python backend server running simultaneously.
Critical: The server and the VR application must be configured for the same scene (either bonsai or meeting_room) and connected to the same Wi-Fi network.
- Open a terminal and navigate to the `occam_backend/` directory:
  ```bash
  cd occam_backend/
  ```
- Activate the virtual environment:
  ```bash
  source venv/bin/activate
  ```
- Launch the server for your chosen scene:
  - For the Bonsai scene:
    ```bash
    python occam_server.py occam_bonsai_mcmc.ply
    ```
  - For the Meeting Room scene:
    ```bash
    python occam_server.py occam_meeting_room_mcmc.ply
    ```
- Put on the Meta Quest 3 headset
- Ensure the headset is connected to the same Wi-Fi network as your PC
- Launch the deployed application on the headset
- Use the controllers to interact:
  - Left Trigger: Start/end voice input
  - Right Trigger (toggle): Switch between rendering modes (Gaussian Splats / Occam Similarities in black and white)
  - A Button (right controller): Activate Occam Colored Similarities mode; press A again or the right trigger to switch to the other rendering modes
- Press the left trigger, say the name of an object (e.g., "bonsai tree", "camera", "push toy"), and press the trigger again
- The system will process your query and highlight matching objects in real-time
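Under the hood, a query is a two-step pipeline: Whisper turns speech into text, and CLIP turns the text into an embedding that can be scored against the scene's language features. The snippet below is a minimal sketch of that flow using the `openai-whisper` and `open_clip_torch` packages; the audio file name, the CLIP checkpoint, and the surrounding plumbing are illustrative assumptions, not the project's exact server code.

```python
# Minimal sketch of the voice-query pipeline (illustrative, not the exact
# server code): transcribe speech with Whisper, then embed the transcript
# with CLIP ViT-B-16 so it can be scored against per-Gaussian features.
import torch
import whisper      # pip install openai-whisper
import open_clip    # pip install open_clip_torch

# 1. Speech-to-text: "query.wav" is a placeholder for the recorded voice clip.
stt_model = whisper.load_model("base")
transcript = stt_model.transcribe("query.wav")["text"].strip()
print(f"Heard: {transcript!r}")

# 2. Text-to-embedding with CLIP ViT-B-16 (the checkpoint tag is an assumption).
clip_model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")

with torch.no_grad():
    tokens = tokenizer([transcript])
    query_emb = clip_model.encode_text(tokens)
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)  # unit length

# query_emb is now ready to be compared against each Gaussian's language feature.
```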
The current rendering mode is displayed in a panel at the top-left corner of your view:
- Standard Mode: Colored Gaussian Splats
- Occam Similarities: Grayscale scene with heatmap highlighting
- Occam Colored Similarities: Original colors with relevancy heatmap overlay
The meeting room scene was created using our customized version of SplatFactory. If you want to create your own language-queryable Gaussian scenes:
- Visit our SplatFactory fork for scene capture and training
- Follow the pipeline described in the report (Section 3.3):
- Capture the scene (video or multiple overlapping photos)
- Prepare a COLMAP dataset with camera poses
- Train the 3D Gaussian scene
- Extend with language features using Occam's LGS approach
- Optionally apply MCMC pruning to reduce Gaussian count for better performance
- Export the final scene as a `.ply` file with language feature fields
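To sanity-check an export, the `plyfile` package can list which per-Gaussian fields made it into the file. This is a small illustrative sketch, not part of the project's tooling; the exact names of the language feature fields depend on your export, so inspect the output rather than assuming them.

```python
# Sketch: inspect an exported Gaussian .ply and list its per-vertex fields.
from plyfile import PlyData  # pip install plyfile

ply = PlyData.read("occam_bonsai_mcmc.ply")
vertex = ply["vertex"]

print(f"Gaussian count: {vertex.count}")
print("Per-Gaussian fields:")
for prop in vertex.properties:
    print(f"  {prop.name}")

# A language-queryable scene should carry extra feature fields alongside the
# usual position/color/opacity fields; if they are missing, the LGS step of
# the pipeline did not run or the export dropped them.
```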
The bonsai scene used in this project is from the NeRF baselines paper.
```
├── Assets/                      # Unity application components
│   └── ...                      # Interaction scripts, UI, camera controls
├── occam_backend/               # Python server
│   ├── occam_server.py          # Main server script
│   ├── venv/                    # Virtual environment
│   └── *.ply                    # Scene files (download separately)
└── package/                     # Custom Gaussian Splatting rendering package
    └── GaussianSplatRenderer.cs
```
Instead of alpha-blending full 512-dimensional language feature vectors (infeasible on mobile GPU), we:
- Precompute cosine similarities between the user query and each Gaussian's language feature on the server
- Send only scalar relevancy scores to the headset
- Alpha-blend single scalar values during rendering, achieving performance comparable to standard RGBA rendering
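Concretely, the server-side step reduces to one normalized matrix-vector product per query. Below is a minimal numpy sketch of that computation; the shapes and variable names are assumptions for illustration, not the actual `occam_server.py` code.

```python
# Sketch of the server-side relevancy precomputation: one cosine similarity
# per Gaussian, so the headset only ever receives N scalars, never N x 512
# feature vectors. Shapes and names are illustrative assumptions.
import numpy as np

def relevancy_scores(query_emb: np.ndarray, gauss_feats: np.ndarray) -> np.ndarray:
    """query_emb: (512,) CLIP text embedding; gauss_feats: (N, 512) per-Gaussian
    language features. Returns (N,) cosine similarities."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gauss_feats / np.linalg.norm(gauss_feats, axis=1, keepdims=True)
    return g @ q  # one scalar per Gaussian -- this is all that crosses the network

# For ~1 million Gaussians this is ~4 MB of float32 scores per query, versus
# ~2 GB if the raw 512-d float32 features had to be shipped and blended instead.
scores = relevancy_scores(np.random.randn(512), np.random.randn(10_000, 512))
print(scores.shape)
```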
We use canonical queries ("object", "thing", "texture", "material") with softmax normalization to reduce noise and ensure highlighted items truly match user intent.
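One common way to implement this is the pairwise-softmax relevancy from LangSplat, which Occam's LGS builds on: the query is pitted against each canonical phrase and the worst case is kept. The sketch below assumes that variant and consumes the raw similarities from the previous snippet; see the report for the exact formulation used here.

```python
# Sketch of canonical-query normalization (LangSplat-style pairwise softmax;
# assumed here -- the report documents the exact variant used).
import numpy as np

def normalized_relevancy(query_sim: np.ndarray, canon_sims: np.ndarray) -> np.ndarray:
    """query_sim: (N,) similarity of each Gaussian to the user query;
    canon_sims: (C, N) similarities to the C canonical phrases
    ("object", "thing", "texture", "material"). Returns (N,) relevancy in
    (0, 1): high only if a Gaussian matches the query better than every
    generic canonical phrase."""
    # Pairwise softmax of the query against each canonical phrase, then the
    # minimum over phrases, so one strong generic match suppresses the noise.
    pair = np.exp(query_sim) / (np.exp(query_sim) + np.exp(canon_sims))  # (C, N)
    return pair.min(axis=0)

# canon_sims stacks one relevancy_scores() call per canonical phrase, e.g.:
#   canon_sims = np.stack([relevancy_scores(e, gauss_feats) for e in canon_embs])
rel = normalized_relevancy(np.random.randn(1000), np.random.randn(4, 1000))
print(rel.shape, float(rel.min()), float(rel.max()))
```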
- MCMC Pruning: Reduces Gaussian count for better frame rates
- Near-Plane Culling: Prevents flickering when the camera is close to objects (see the sketch after this list)
- Multi-Pass Rendering: Optional colored overlay mode for enhanced visualization
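The real near-plane test lives in the rendering package (`GaussianSplatRenderer.cs` and its shaders); purely to illustrate the idea, here is a numpy sketch of the same check, with the camera convention as an assumption.

```python
# Illustrative near-plane culling: drop Gaussians whose centers lie closer
# than the near plane, where splats would otherwise blow up to screen-filling
# size and flicker. Conventions are assumptions; the real test runs in the
# renderer, not in Python.
import numpy as np

def near_plane_mask(centers_world: np.ndarray, world_to_cam: np.ndarray,
                    near: float = 0.1) -> np.ndarray:
    """centers_world: (N, 3) Gaussian centers; world_to_cam: (4, 4) view matrix
    with the camera looking down -Z (OpenGL convention, assumed). Returns an
    (N,) boolean mask of Gaussians that survive culling."""
    n = centers_world.shape[0]
    homog = np.hstack([centers_world, np.ones((n, 1))])  # (N, 4) homogeneous
    cam = homog @ world_to_cam.T                         # (N, 4) camera space
    depth = -cam[:, 2]          # positive depth in front of the camera
    return depth > near         # keep only splats beyond the near plane

mask = near_plane_mask(np.random.randn(1000, 3), np.eye(4))
print(f"kept {mask.sum()} of {mask.size} Gaussians")
```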
Our user study with 12 participants yielded:
- SUS Score: 78/100 (good usability)
- UEQ-S Pragmatic Quality: 1.542/3
- UEQ-S Hedonic Quality: 1.583/3
- Voice Input Preference: 72.7% of users preferred voice over keyboard
- Performance is limited by Gaussian count (~1 million max for stable frame rate)
- Requires external server connection (adds latency)
- Multi-pass rendering mode (Occam Colored Similarities) reduces the frame rate and may introduce intense flickering when used with the meeting room or larger scenes
- Object highlighting can be noisy in some cases
This project builds upon:
- 3D Gaussian Splatting (Kerbl et al., 2023)
- Occam's LGS (Cheng et al., 2025)
- LangSplat (Qin et al., 2023)
- UnityGaussianSplatting (Aras Pranckevičius)
- SplatFactory (Our customized fork)
Special thanks to Alexander Veicht for supervision and providing the Occam implementation.