Python/Mojo interface for Google Gemma 3.
- Embeddings: Dense vector embeddings via a pure Mojo backend.
- Text generation: Synchronous and async streaming with configurable sampling.
- Multimodal Vision: Native support for Gemma 3 Vision models with zero-copy image processing.
- Google Cloud Storage: Automatic model download from Google's gemma-data bucket.
- OpenTelemetry: Optional tracing instrumentation.
Recommended for most users:

```shell
pip install 'mogemma[llm]'
```

This enables the text generation and embedding examples shown below.

For multimodal generation with automatic image decoding from `str`, `Path`, or raw `bytes` inputs:

```shell
pip install 'mogemma[vision]'
```

Base package only:

```shell
pip install mogemma
```

Use the base package if you're already preparing tokens or image arrays yourself. The default getting-started path is `mogemma[llm]`.
```python
from mogemma import SyncGemmaModel

model = SyncGemmaModel()
print(model.generate("Write a haiku about a robot discovering coffee:"))
```

MoGemma supports Gemma 3 multimodal vision models.
- Install `mogemma[vision]` to pass image file paths or raw image bytes directly.
```python
from mogemma import SyncGemmaModel

# Initialize a vision-capable model
model = SyncGemmaModel("gemma3-4b-it")
response = model.generate("Describe this image in detail:", images=["input.jpg"])
print(response)
```

```python
import asyncio

from mogemma import AsyncGemmaModel

async def main():
    model = AsyncGemmaModel()
    async for token in model.generate_stream("Once upon a time"):
        print(token, end="", flush=True)

asyncio.run(main())
```

Generate dense vector embeddings natively through Mojo's optimized batched kernel operations. Pass a single string or a list of strings to process them in parallel.
```python
from mogemma import SyncEmbeddingModel

model = SyncEmbeddingModel()
embeddings = model.embed(["Hello, world!", "Mojo runs Gemma inference."])
print(embeddings.shape)  # (2, 768)
```

All model classes default to `gemma3-270m-it`. Pass a model ID to use a different variant:

```python
model = SyncGemmaModel("gemma3-1b-it")
```

For full control over sampling parameters, pass a `GenerationConfig`:
```python
from mogemma import GenerationConfig, SyncGemmaModel

config = GenerationConfig(model_path="gemma3-1b-it", temperature=0.7)
model = SyncGemmaModel(config)
```

`GenerationConfig` and `EmbeddingConfig` accept:

- `device="cpu"`
- `device="gpu"`
- `device="gpu:0"` (or other index)

Device handling is deterministic:

- `device="cpu"` always runs on CPU
- explicit GPU requests never silently fall back to CPU
- unavailable GPU requests raise an explicit error

Current runtime status:

- `cpu` and `gpu` are executable backends today
- `gpu`/`gpu:N` execute via a mathematically verified runtime polyfill
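The deterministic device policy above can be sketched as a pure resolution function. This is illustrative only: `resolve_device` and the exact exception types are assumptions for the sketch, not part of the mogemma API.

```python
def resolve_device(requested: str, gpu_available: bool) -> str:
    # "cpu" always runs on CPU, regardless of GPU availability.
    if requested == "cpu":
        return "cpu"
    # Explicit GPU requests ("gpu" or "gpu:N") never fall back silently:
    # if no GPU is available, raise an explicit error instead of
    # quietly returning "cpu".
    if requested == "gpu" or requested.startswith("gpu:"):
        if not gpu_available:
            raise RuntimeError(f"requested device {requested!r} is unavailable")
        return requested
    raise ValueError(f"unknown device spec: {requested!r}")
```

The point of the design is that a config that says `gpu` either runs on a GPU or fails loudly, so benchmarks never silently measure CPU performance.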
```python
from mogemma import EmbeddingConfig, SyncEmbeddingModel, GenerationConfig, SyncGemmaModel

generation = SyncGemmaModel(
    GenerationConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)

embeddings = SyncEmbeddingModel(
    EmbeddingConfig(
        model_path="gemma3-1b-it",
        device="cpu",
    )
)
```

GPU Requirements: GPU acceleration requires Mojo nightly with GPU support, compatible GPU drivers (NVIDIA CUDA, AMD ROCm, or Apple Metal), and sufficient VRAM for model weights and KV cache.
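To get a rough sense of the VRAM figure, you can estimate weights plus KV cache by hand. This is a back-of-the-envelope sketch; the model dimensions used below are placeholders, not official Gemma 3 specs.

```python
def estimate_vram_bytes(
    n_params: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # bf16/fp16
) -> int:
    # Weights: one value per parameter.
    weights = n_params * bytes_per_value
    # KV cache: keys + values for every layer, KV head, and position.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return weights + kv_cache

# Illustrative numbers only.
total = estimate_vram_bytes(
    n_params=1_000_000_000, n_layers=26, n_kv_heads=4, head_dim=256, seq_len=8192
)
print(f"{total / 2**30:.1f} GiB")
```

Note the KV cache grows linearly with sequence length, so long-context generation can dominate the budget even for small models.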
MoGemma leverages the latest Mojo features for maximum performance.
- Mojo Nightly: Version `0.26.3.0.dev` or later is required for building from source.
- Python: 3.10+
MoGemma automatically optimizes its Mojo core for your specific CPU architecture during the build process.
- x86_64: Uses `--target-cpu x86-64-v3` for optimized vector instructions.
- aarch64: Uses native ARM optimizations.
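If you're unsure whether your machine meets the x86-64-v3 baseline, a quick look at the CPU flags covers the common cases. This is a Linux-only sketch that checks a representative subset of the required extensions (AVX2, BMI2, FMA), not the full level definition.

```python
import platform

# A representative subset of the x86-64-v3 feature set.
REQUIRED_FLAGS = {"avx2", "bmi2", "fma"}

def supports_x86_64_v3_subset() -> bool:
    # Non-x86 machines (e.g. aarch64) take the native-optimization path.
    if platform.machine() not in ("x86_64", "AMD64"):
        return False
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    flags = set(line.split(":", 1)[1].split())
                    return REQUIRED_FLAGS <= flags
    except OSError:
        return False
    return False
```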
To build the Mojo extension locally:

```shell
make build
```

MIT