Native Deep Learning Framework for Cerebras WSE
PRE-RELEASE ALPHA 2.0
This software is in early development and is not yet ready for production use. APIs may change without notice. Use at your own risk. The project is moving toward beta; transforms.py still contains placeholder functions and needs a working implementation.
PyFlame is a tensor computation library designed natively for the Cerebras Wafer-Scale Engine (WSE), featuring lazy evaluation, automatic CSL code generation, and a Python-first API.
The Nvidia vendor lock-in created by CUDA/PyTorch has become a multi-million-dollar nightmare for many enterprises that now struggle to get GPUs for their own AI projects.
I released PyFlame, and the very next day my phone was ringing almost off the hook. By the end of the next day I had secured THREE consulting agreements at $250k each and started discussions on a contract that, as of this writing (just 6 days after PyFlame's release), is looking like $2.8M.
Understand what that means. I found a massive pain point, figured out how to solve it, actually GAVE AWAY the solution (sort of), and now I've obtained over $3.5 million in revenue.
Want to learn the SYSTEM I use for doing that? (This is the 9th time I've done this exact thing, this exact way.)
Just go to: https://oaqlabs.com/vibe.html
PyFlame is developed by OA Quantum Labs, a specialized engineering firm focused on high-performance and quantum computing.
As part of the PyFlame project, we help organizations unlock the full potential of specialized hardware through custom developer tools, optimized frameworks, and performance engineering:
- Custom Framework Development — Native tooling designed for your specific accelerator architecture
- Performance Optimization — Squeeze maximum throughput from your existing hardware investments
- Migration & Porting — Adapt existing ML workloads to new accelerator platforms
- Training & Enablement — Get your team productive on specialized hardware faster
PyFlame demonstrates our approach: rather than forcing general-purpose tools onto specialized hardware, we build native solutions that leverage the unique strengths of each architecture. The result is dramatically better performance and a more intuitive developer experience.
If your organization is working with specialized AI accelerators, FPGAs, or custom silicon, we'd love to discuss how purpose-built tooling could transform your development workflow.
Danny Wall — CTO, OA Quantum Labs dwall@oaqlabs.com | oaqlabs.com
- Native WSE Design: Built from the ground up for Cerebras architecture
- Lazy Evaluation: Computation graphs are built lazily and executed on demand
- CSL Code Generation: Automatic generation of optimized CSL kernels
- 2D Mesh Layouts: First-class support for tensor distribution across PEs
- Python + C++ API: Use from Python or C++ with the same abstractions
- NumPy Interoperability: Easy conversion to/from NumPy arrays
- PyTorch-like API: Familiar nn.Module system for building models
- Automatic Differentiation: Full autograd support for training
- Complete Training Stack: Optimizers, loss functions, and LR schedulers
Version: Pre-Release Alpha 2.0
Phase 1 (Core Infrastructure) - Complete
- Core tensor class with lazy evaluation
- Computation graph (IR) system
- Shape inference
- Elementwise operations (add, mul, relu, sigmoid, etc.)
- Reduction operations (sum, mean, max, min)
- Matrix multiplication
- CSL code generation framework
- Python bindings via pybind11
- CPU reference implementation
Phase 2 (ML Primitives) - Complete
- Automatic differentiation (autograd)
- Neural network module system (nn.Module)
- Linear layers (Linear)
- Convolutional layers (Conv1d, Conv2d)
- Normalization layers (BatchNorm, LayerNorm, GroupNorm)
- Pooling layers (MaxPool, AvgPool, AdaptivePool)
- Dropout layers
- Multi-head attention
- Loss functions (MSE, CrossEntropy, BCE, etc.)
- Optimizers (SGD, Adam, AdamW, RMSprop)
- Learning rate schedulers
- CMake 3.18+
- C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
- Python 3.8+
- pybind11 2.10+
PyFlame includes a CPU reference implementation that allows you to develop, test, and validate your models without access to Cerebras hardware. All tensor operations, graph building, and CSL code generation work without the SDK.
To actually execute on Cerebras WSE hardware, you need:
- Cerebras SDK - This is proprietary software available only to Cerebras customers and partners. It is not publicly downloadable.
- Access to Cerebras hardware - Either on-premises CS-2/CS-3 systems or Cerebras Cloud.
Supported deployment options:
| Environment | Runtime Address | Notes |
|---|---|---|
| On-premises CS-2/CS-3 | localhost:9000 or system IP | Direct hardware access |
| Cerebras Cloud | Cloud endpoint URL | Provided by your cloud instance |
If you are interested in running PyFlame on Cerebras hardware, please contact Cerebras Systems to inquire about SDK access and hardware availability.
To build with Cerebras SDK support (once you have access):
cmake .. -DPYFLAME_USE_CEREBRAS_SDK=ON -DCEREBRAS_SDK_PATH=/path/to/sdk
To configure the runtime endpoint (for cloud or remote on-premises):
export CEREBRAS_RUNTIME_ADDRESS="your-endpoint:port"
# Clone the repository
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build . -j$(nproc)
# Run tests
ctest --output-on-failure
# Install Python package (development mode)
pip install -e .
# Clone and build
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
ctest -C Release --output-on-failure
import pyflame as pf
# Create tensors
a = pf.randn([1024, 512])
b = pf.randn([512, 256])
# Build computation graph (lazy)
c = a @ b # Matrix multiply
d = pf.relu(c) # Activation
e = d.sum() # Reduction
# Execute
result = pf.eval(e)
print(result.numpy())
# With explicit mesh layout for WSE
x = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
y = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
z = x @ y  # Distributed across 256 PEs
import pyflame as pf
from pyflame import nn, optim
# Define a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Setup optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
x = pf.randn([32, 784]) # Batch of inputs
y = pf.randint(0, 10, [32]) # Labels
optimizer.zero_grad()
output = model(x)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
#include <pyflame/pyflame.hpp>
#include <iostream>
using namespace pyflame;
int main() {
auto a = Tensor::randn({1024, 512});
auto b = Tensor::randn({512, 256});
auto c = matmul(a, b);
auto d = relu(c);
auto e = d.sum();
e.eval();
std::cout << "Result: " << e.data<float>()[0] << "\n";
return 0;
}
┌─────────────────────────────────────────────────────────────┐
│ PyFlame User API (Python/C++) │
│ - Tensor abstraction with lazy evaluation │
│ - Dataflow-aware operators │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame Intermediate Representation │
│ - Computation graph with shape inference │
│ - Optimization passes (fusion, layout, etc.) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame CSL Backend │
│ - Template-based code generation │
│ - PE placement and routing optimization │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CSL Runtime / Cerebras Hardware │
│ - 850,000+ Processing Elements │
│ - 2D mesh fabric with wavelet communication │
└─────────────────────────────────────────────────────────────┘
pyflame/
├── CMakeLists.txt # Build configuration
├── include/pyflame/ # C++ headers
│ ├── core/ # Tensor, DType, Layout
│ ├── ir/ # Graph IR, operations
│ └── backend/ # CSL code generation
├── src/ # C++ implementation
├── python/ # Python bindings
│ ├── pyflame/ # Python package
│ └── bindings.cpp # pybind11 bindings
├── tests/ # Unit tests
│ ├── cpp/ # C++ tests (Google Test)
│ └── python/ # Python tests (pytest)
├── examples/ # Example programs
│ ├── cpp/
│ └── python/
└── docs/ # Design documentation
New to PyFlame? Start here:
- Getting Started - Installation and first steps
- Integration Guide - Adding PyFlame to your project
- API Reference - Complete function documentation
- Examples - Practical code examples
- Best Practices - Optimization tips and patterns
Internal architecture documentation:
- CSL Code Generation Strategy
- Lazy Evaluation & Graph Building
- Memory Management & Layouts
- Build System Setup
- Phase 2: ML Primitives Plan
| Function | Description |
|---|---|
| pf.zeros(shape) | Create tensor filled with zeros |
| pf.ones(shape) | Create tensor filled with ones |
| pf.full(shape, value) | Create tensor filled with value |
| pf.randn(shape) | Random normal distribution |
| pf.rand(shape) | Random uniform [0, 1) |
| pf.arange(start, end) | Range of values |
| pf.from_numpy(arr) | From NumPy array |
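A quick sketch combining these constructors (shapes are arbitrary, and dtype defaults are not documented here, so treat this as illustrative):

```python
import numpy as np
import pyflame as pf

# Lazy tensors from the constructors above; nothing executes until pf.eval()
zeros = pf.zeros([4, 4])
noise = pf.randn([4, 4])
ramp  = pf.arange(0, 16)

# Round-trip through NumPy
arr = np.arange(16, dtype=np.float32).reshape(4, 4)
t = pf.from_numpy(arr)
result = pf.eval(t + noise)
print(result.numpy())
```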
| Category | Functions |
|---|---|
| Arithmetic | +, -, *, /, @ (matmul) |
| Activations | relu, sigmoid, tanh, gelu, silu, softmax |
| Math | abs, sqrt, exp, log, sin, cos |
| Reductions | sum, mean, max, min |
| Shape | reshape, transpose, squeeze, unsqueeze |
| Combination | cat, stack |
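A small sketch chaining several of these ops. Whether the shape ops are exposed as methods, free functions, or both is not spelled out above, and the dim keyword on cat is an assumption borrowed from the PyTorch-style API:

```python
import pyflame as pf

x = pf.randn([8, 16])
y = pf.randn([8, 16])

# Arithmetic feeding an activation
z = pf.gelu(x + y)

# Shape manipulation (method form assumed, mirroring d.sum() in the Quick Start)
z = z.reshape([4, 32])

# Combine and reduce; the dim keyword is an assumption
both = pf.cat([x, y], dim=0)
print(pf.eval(both.mean()).numpy())
```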
| Layer | Description |
|---|---|
| nn.Linear | Fully connected layer |
| nn.Conv1d, nn.Conv2d | Convolutional layers |
| nn.BatchNorm1d, nn.BatchNorm2d | Batch normalization |
| nn.LayerNorm, nn.GroupNorm | Layer and group normalization |
| nn.MaxPool2d, nn.AvgPool2d | Pooling layers |
| nn.Dropout | Dropout regularization |
| nn.MultiheadAttention | Multi-head attention |
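For example, a small convolutional block can be assembled from these layers. The constructor arguments below (channel counts, kernel_size, pool size, dropout probability) are assumptions based on the PyTorch-like conventions this API follows:

```python
import pyflame as pf
from pyflame import nn

# A minimal conv block; argument names follow PyTorch conventions (assumed)
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(0.1),
)

x = pf.randn([8, 3, 32, 32])   # NCHW batch layout (assumed)
out = block(x)
print(pf.eval(out).numpy().shape)
```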
| Loss | Description |
|---|---|
| nn.MSELoss | Mean squared error |
| nn.L1Loss | Mean absolute error |
| nn.CrossEntropyLoss | Cross-entropy for classification |
| nn.BCELoss, nn.BCEWithLogitsLoss | Binary cross-entropy |
| nn.NLLLoss | Negative log likelihood |
| nn.KLDivLoss | KL divergence |
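A sketch of the binary-classification case using BCEWithLogitsLoss. The loss_fn(prediction, target) call signature mirrors the CrossEntropyLoss usage in the Quick Start, and the target values here are placeholders:

```python
import pyflame as pf
from pyflame import nn

loss_fn = nn.BCEWithLogitsLoss()   # applies the sigmoid internally

logits  = pf.randn([32, 1])        # raw model outputs
targets = pf.rand([32, 1])         # placeholder soft labels in [0, 1)

loss = loss_fn(logits, targets)
loss.backward()                    # gradients via autograd, as in the Quick Start
```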
| Optimizer | Description |
|---|---|
| optim.SGD | Stochastic gradient descent with momentum |
| optim.Adam | Adam optimizer |
| optim.AdamW | Adam with decoupled weight decay |
| optim.RMSprop | RMSprop optimizer |
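A minimal optimizer setup; the weight_decay keyword is an assumption carried over from the PyTorch-style API rather than a documented PyFlame parameter:

```python
import pyflame as pf
from pyflame import nn, optim

model = nn.Linear(128, 10)

# Decoupled weight decay via AdamW (weight_decay keyword assumed)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# One update step, mirroring the Quick Start training example
optimizer.zero_grad()
loss = nn.MSELoss()(model(pf.randn([4, 128])), pf.randn([4, 10]))
loss.backward()
optimizer.step()
```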
| Scheduler | Description |
|---|---|
| optim.StepLR | Decay LR every N steps |
| optim.CosineAnnealingLR | Cosine annealing schedule |
| optim.ReduceLROnPlateau | Reduce LR when metric plateaus |
| optim.OneCycleLR | One-cycle learning rate policy |
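Schedulers wrap an optimizer and adjust its learning rate over time. In the sketch below, the StepLR constructor arguments (step_size, gamma) and the scheduler.step() call are assumptions based on the PyTorch-style API:

```python
import pyflame as pf
from pyflame import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# step_size/gamma arguments and scheduler.step() are assumed semantics
scheduler = optim.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(pf.randn([32, 784])), pf.randint(0, 10, [32]))
    loss.backward()
    optimizer.step()
    scheduler.step()   # decays the LR every 10 epochs under these assumptions
```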
| Layout | Description |
|---|---|
| MeshLayout.single_pe() | All data on one PE |
| MeshLayout.row_partition(n) | Split rows across n PEs |
| MeshLayout.col_partition(n) | Split columns across n PEs |
| MeshLayout.grid(r, c) | 2D tiling across r×c PEs |
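A sketch combining these layouts (PE counts and shapes are illustrative; how the backend reconciles a row-partitioned operand with a grid-tiled one is left to its layout passes, per the architecture diagram above):

```python
import pyflame as pf

# Row-partition a tall activation matrix and tile the weights as an 8x8 grid
acts    = pf.zeros([8192, 1024], layout=pf.MeshLayout.row_partition(64))
weights = pf.zeros([1024, 1024], layout=pf.MeshLayout.grid(8, 8))

out = acts @ weights   # distributed matmul across the chosen PEs
pf.eval(out)
```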
Contributions are welcome! Please see our contributing guidelines (coming soon).
This project is licensed under the Apache License 2.0.
- Cerebras Systems for the WSE architecture and SDK
- The PyTorch team for API inspiration
- The JAX/XLA team for compiler architecture insights