Native Deep Learning Framework for Cerebras WSE
PRE-RELEASE ALPHA 2.0
This software is in early development and is not yet ready for production use. APIs may change without notice. Use at your own risk. The project is moving toward beta; transforms.py still contains placeholder functions and needs a working implementation.
PyFlame is a tensor computation library designed natively for the Cerebras Wafer-Scale Engine (WSE), featuring lazy evaluation, automatic CSL code generation, and a Python-first API.
The Nvidia vendor lock-in created by CUDA/PyTorch has become a multi-million-dollar nightmare for many enterprises that now struggle to get GPUs for their own AI projects.
I released PyFlame, and the very next day my phone was ringing almost off the hook. By the end of the next day I had secured THREE consulting agreements at $250k each and started discussions on a contract that, as of this writing (just 6 days after PyFlame's release), is looking like $2.8M.
Understand what that means. I found a massive pain point, figured out how to solve it, actually GAVE AWAY the solution (sort of), and now I've obtained over $3.5 million in revenue.
Want to learn the SYSTEM I use for doing that? (This is the 9th time I've done this exact thing, this exact way.)
Just go to: https://oaqlabs.com/vibe.html
PyFlame is developed by OA Quantum Labs, a specialized engineering firm focused on high-performance and quantum computing.
As part of the PyFlame project, we help organizations unlock the full potential of specialized hardware through custom developer tools, optimized frameworks, and performance engineering:
- Custom Framework Development — Native tooling designed for your specific accelerator architecture
- Performance Optimization — Squeeze maximum throughput from your existing hardware investments
- Migration & Porting — Adapt existing ML workloads to new accelerator platforms
- Training & Enablement — Get your team productive on specialized hardware faster
PyFlame demonstrates our approach: rather than forcing general-purpose tools onto specialized hardware, we build native solutions that leverage the unique strengths of each architecture. The result is dramatically better performance and a more intuitive developer experience.
If your organization is working with specialized AI accelerators, FPGAs, or custom silicon, we'd love to discuss how purpose-built tooling could transform your development workflow.
Danny Wall — CTO, OA Quantum Labs dwall@oaqlabs.com | oaqlabs.com
- Native WSE Design: Built from the ground up for Cerebras architecture
- Lazy Evaluation: Computation graphs are built lazily and executed on demand
- CSL Code Generation: Automatic generation of optimized CSL kernels
- 2D Mesh Layouts: First-class support for tensor distribution across PEs
- Python + C++ API: Use from Python or C++ with the same abstractions
- NumPy Interoperability: Easy conversion to/from NumPy arrays
- PyTorch-like API: Familiar nn.Module system for building models
- Automatic Differentiation: Full autograd support for training
- Complete Training Stack: Optimizers, loss functions, and LR schedulers
Version: Pre-Release Alpha 2.0
Phase 1 (Core Infrastructure) - Complete
- Core tensor class with lazy evaluation
- Computation graph (IR) system
- Shape inference
- Elementwise operations (add, mul, relu, sigmoid, etc.)
- Reduction operations (sum, mean, max, min)
- Matrix multiplication
- CSL code generation framework
- Python bindings via pybind11
- CPU reference implementation
Phase 2 (ML Primitives) - Complete
- Automatic differentiation (autograd)
- Neural network module system (nn.Module)
- Linear layers (Linear)
- Convolutional layers (Conv1d, Conv2d)
- Normalization layers (BatchNorm, LayerNorm, GroupNorm)
- Pooling layers (MaxPool, AvgPool, AdaptivePool)
- Dropout layers
- Multi-head attention
- Loss functions (MSE, CrossEntropy, BCE, etc.)
- Optimizers (SGD, Adam, AdamW, RMSprop)
- Learning rate schedulers
- CMake 3.18+
- C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
- Python 3.8+
- pybind11 2.10+
PyFlame includes a CPU reference implementation that allows you to develop, test, and validate your models without access to Cerebras hardware. All tensor operations, graph building, and CSL code generation work without the SDK.
To actually execute on Cerebras WSE hardware, you need:
- Cerebras SDK - This is proprietary software available only to Cerebras customers and partners. It is not publicly downloadable.
- Access to Cerebras hardware - Either on-premises CS-2/CS-3 systems or Cerebras Cloud.
Supported deployment options:
| Environment | Runtime Address | Notes |
|---|---|---|
| On-premises CS-2/CS-3 | localhost:9000 or system IP | Direct hardware access |
| Cerebras Cloud | Cloud endpoint URL | Provided by your cloud instance |
If you are interested in running PyFlame on Cerebras hardware, please contact Cerebras Systems to inquire about SDK access and hardware availability.
To build with Cerebras SDK support (once you have access):
cmake .. -DPYFLAME_USE_CEREBRAS_SDK=ON -DCEREBRAS_SDK_PATH=/path/to/sdk
To configure the runtime endpoint (for cloud or remote on-premises):
export CEREBRAS_RUNTIME_ADDRESS="your-endpoint:port"
# Clone the repository
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
# Create build directory
mkdir build && cd build
# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build . -j$(nproc)
# Run tests
ctest --output-on-failure
# Install Python package (development mode)
pip install -e .
# Clone and build
git clone https://github.com/CTO92/PyFlame.git
cd pyflame
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release
ctest -C Release --output-on-failure
import pyflame as pf
# Create tensors
a = pf.randn([1024, 512])
b = pf.randn([512, 256])
# Build computation graph (lazy)
c = a @ b # Matrix multiply
d = pf.relu(c) # Activation
e = d.sum() # Reduction
# Execute
result = pf.eval(e)
print(result.numpy())
# With explicit mesh layout for WSE
x = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
y = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
z = x @ y  # Distributed across 256 PEs
import pyflame as pf
from pyflame import nn, optim
# Define a model
model = nn.Sequential(
nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# Setup optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
# Training step
x = pf.randn([32, 784]) # Batch of inputs
y = pf.randint(0, 10, [32]) # Labels
optimizer.zero_grad()
output = model(x)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
#include <pyflame/pyflame.hpp>
#include <iostream>
using namespace pyflame;
int main() {
auto a = Tensor::randn({1024, 512});
auto b = Tensor::randn({512, 256});
auto c = matmul(a, b);
auto d = relu(c);
auto e = d.sum();
e.eval();
std::cout << "Result: " << e.data<float>()[0] << "\n";
return 0;
}
┌─────────────────────────────────────────────────────────────┐
│ PyFlame User API (Python/C++) │
│ - Tensor abstraction with lazy evaluation │
│ - Dataflow-aware operators │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame Intermediate Representation │
│ - Computation graph with shape inference │
│ - Optimization passes (fusion, layout, etc.) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PyFlame CSL Backend │
│ - Template-based code generation │
│ - PE placement and routing optimization │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CSL Runtime / Cerebras Hardware │
│ - 850,000+ Processing Elements │
│ - 2D mesh fabric with wavelet communication │
└─────────────────────────────────────────────────────────────┘
pyflame/
├── CMakeLists.txt # Build configuration
├── include/pyflame/ # C++ headers
│ ├── core/ # Tensor, DType, Layout
│ ├── ir/ # Graph IR, operations
│ └── backend/ # CSL code generation
├── src/ # C++ implementation
├── python/ # Python bindings
│ ├── pyflame/ # Python package
│ └── bindings.cpp # pybind11 bindings
├── tests/ # Unit tests
│ ├── cpp/ # C++ tests (Google Test)
│ └── python/ # Python tests (pytest)
├── examples/ # Example programs
│ ├── cpp/
│ └── python/
└── docs/ # Design documentation
New to PyFlame? Start here:
- Getting Started - Installation and first steps
- Integration Guide - Adding PyFlame to your project
- API Reference - Complete function documentation
- Examples - Practical code examples
- Best Practices - Optimization tips and patterns
Internal architecture documentation:
- CSL Code Generation Strategy
- Lazy Evaluation & Graph Building
- Memory Management & Layouts
- Build System Setup
- Phase 2: ML Primitives Plan
| Function | Description |
|---|---|
| pf.zeros(shape) | Create tensor filled with zeros |
| pf.ones(shape) | Create tensor filled with ones |
| pf.full(shape, value) | Create tensor filled with value |
| pf.randn(shape) | Random normal distribution |
| pf.rand(shape) | Random uniform [0, 1) |
| pf.arange(start, end) | Range of values |
| pf.from_numpy(arr) | From NumPy array |
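A quick sketch combining these constructors (shapes are arbitrary, and dtype defaults are not documented here, so treat this as illustrative):

```python
import numpy as np
import pyflame as pf

# Lazy tensors from the constructors above; nothing executes until pf.eval()
zeros = pf.zeros([4, 4])
noise = pf.randn([4, 4])
ramp  = pf.arange(0, 16)

# Round-trip through NumPy
arr = np.arange(16, dtype=np.float32).reshape(4, 4)
t = pf.from_numpy(arr)
result = pf.eval(t + noise)
print(result.numpy())
```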
| Category | Functions |
|---|---|
| Arithmetic | +, -, *, /, @ (matmul) |
| Activations | relu, sigmoid, tanh, gelu, silu, softmax |
| Math | abs, sqrt, exp, log, sin, cos |
| Reductions | sum, mean, max, min |
| Shape | reshape, transpose, squeeze, unsqueeze |
| Combination | cat, stack |
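A small sketch chaining several of these ops. Whether the shape ops are exposed as methods, free functions, or both is not spelled out above, and the dim keyword on cat is an assumption borrowed from the PyTorch-style API:

```python
import pyflame as pf

x = pf.randn([8, 16])
y = pf.randn([8, 16])

# Arithmetic feeding an activation
z = pf.gelu(x + y)

# Shape manipulation (method form assumed, mirroring d.sum() in the Quick Start)
z = z.reshape([4, 32])

# Combine and reduce; the dim keyword is an assumption
both = pf.cat([x, y], dim=0)
print(pf.eval(both.mean()).numpy())
```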
| Layer | Description |
|---|---|
| nn.Linear | Fully connected layer |
| nn.Conv1d, nn.Conv2d | Convolutional layers |
| nn.BatchNorm1d, nn.BatchNorm2d | Batch normalization |
| nn.LayerNorm, nn.GroupNorm | Layer and group normalization |
| nn.MaxPool2d, nn.AvgPool2d | Pooling layers |
| nn.Dropout | Dropout regularization |
| nn.MultiheadAttention | Multi-head attention |
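For example, a small convolutional block can be assembled from these layers. The constructor arguments below (channel counts, kernel_size, pool size, dropout probability) are assumptions based on the PyTorch-like conventions this API follows:

```python
import pyflame as pf
from pyflame import nn

# A minimal conv block; argument names follow PyTorch conventions (assumed)
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Dropout(0.1),
)

x = pf.randn([8, 3, 32, 32])   # NCHW batch layout (assumed)
out = block(x)
print(pf.eval(out).numpy().shape)
```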
| Loss | Description |
|---|---|
| nn.MSELoss | Mean squared error |
| nn.L1Loss | Mean absolute error |
| nn.CrossEntropyLoss | Cross-entropy for classification |
| nn.BCELoss, nn.BCEWithLogitsLoss | Binary cross-entropy |
| nn.NLLLoss | Negative log likelihood |
| nn.KLDivLoss | KL divergence |
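A sketch of the binary-classification case using BCEWithLogitsLoss. The loss_fn(prediction, target) call signature mirrors the CrossEntropyLoss usage in the Quick Start, and the target values here are placeholders:

```python
import pyflame as pf
from pyflame import nn

loss_fn = nn.BCEWithLogitsLoss()   # applies the sigmoid internally

logits  = pf.randn([32, 1])        # raw model outputs
targets = pf.rand([32, 1])         # placeholder soft labels in [0, 1)

loss = loss_fn(logits, targets)
loss.backward()                    # gradients via autograd, as in the Quick Start
```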
| Optimizer | Description |
|---|---|
| optim.SGD | Stochastic gradient descent with momentum |
| optim.Adam | Adam optimizer |
| optim.AdamW | Adam with decoupled weight decay |
| optim.RMSprop | RMSprop optimizer |
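A minimal optimizer setup; the weight_decay keyword is an assumption carried over from the PyTorch-style API rather than a documented PyFlame parameter:

```python
import pyflame as pf
from pyflame import nn, optim

model = nn.Linear(128, 10)

# Decoupled weight decay via AdamW (weight_decay keyword assumed)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# One update step, mirroring the Quick Start training example
optimizer.zero_grad()
loss = nn.MSELoss()(model(pf.randn([4, 128])), pf.randn([4, 10]))
loss.backward()
optimizer.step()
```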
| Scheduler | Description |
|---|---|
| optim.StepLR | Decay LR every N steps |
| optim.CosineAnnealingLR | Cosine annealing schedule |
| optim.ReduceLROnPlateau | Reduce LR when metric plateaus |
| optim.OneCycleLR | One-cycle learning rate policy |
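Schedulers wrap an optimizer and adjust its learning rate over time. In the sketch below, the StepLR constructor arguments (step_size, gamma) and the scheduler.step() call are assumptions based on the PyTorch-style API:

```python
import pyflame as pf
from pyflame import nn, optim

model = nn.Linear(784, 10)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# step_size/gamma arguments and scheduler.step() are assumed semantics
scheduler = optim.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(pf.randn([32, 784])), pf.randint(0, 10, [32]))
    loss.backward()
    optimizer.step()
    scheduler.step()   # decays the LR every 10 epochs under these assumptions
```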
| Layout | Description |
|---|---|
| MeshLayout.single_pe() | All data on one PE |
| MeshLayout.row_partition(n) | Split rows across n PEs |
| MeshLayout.col_partition(n) | Split columns across n PEs |
| MeshLayout.grid(r, c) | 2D tiling across r×c PEs |
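A sketch combining these layouts (PE counts and shapes are illustrative; how the backend reconciles a row-partitioned operand with a grid-tiled one is left to its layout passes, per the architecture diagram above):

```python
import pyflame as pf

# Row-partition a tall activation matrix and tile the weights as an 8x8 grid
acts    = pf.zeros([8192, 1024], layout=pf.MeshLayout.row_partition(64))
weights = pf.zeros([1024, 1024], layout=pf.MeshLayout.grid(8, 8))

out = acts @ weights   # distributed matmul across the chosen PEs
pf.eval(out)
```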
Contributions are welcome! Please see our contributing guidelines (coming soon).
This project is licensed under the Apache License 2.0.
- Cerebras Systems for the WSE architecture and SDK
- The PyTorch team for API inspiration
- The JAX/XLA team for compiler architecture insights