PyFlame

Native Deep Learning Framework for Cerebras WSE

PRE-RELEASE ALPHA 2.0

This software is in early development and is not yet ready for production use. APIs may change without notice. Use at your own risk. The project is moving toward beta; transforms.py still contains placeholder functions that need to be implemented.

PyFlame is a tensor computation library designed natively for the Cerebras Wafer-Scale Engine (WSE), featuring lazy evaluation, automatic CSL code generation, and a Python-first API.
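To make the lazy-evaluation idea concrete, here is a minimal pure-Python sketch of the concept (this is an illustration only, not PyFlame's internals): operations build a graph of nodes, and nothing is computed until evaluation is requested.

```python
# Minimal sketch of lazy evaluation (NOT PyFlame's implementation):
# operators build a graph; arithmetic happens only inside eval().

class LazyNode:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, inputs, value

    def __add__(self, other):
        return LazyNode("add", (self, other))

    def __mul__(self, other):
        return LazyNode("mul", (self, other))

    def eval(self):
        # Recursively evaluate the graph on demand.
        if self.op == "const":
            return self.value
        a, b = (n.eval() for n in self.inputs)
        return a + b if self.op == "add" else a * b

def const(v):
    return LazyNode("const", value=v)

x, y = const(3), const(4)
z = x * y + const(1)     # builds a graph; no arithmetic has run yet
print(z.eval())          # → 13
```

Deferring execution like this is what lets a real framework fuse operations and plan data placement before any kernel runs.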

A Quick Note For Developers

The Nvidia vendor lock-in created by CUDA/PyTorch has become a multi-million-dollar problem for many enterprises that now struggle to get GPUs for their own AI projects. Understand what that means.

I released PyFlame, and the very next day my phone was ringing almost off the hook. By the end of the following day I had secured THREE consulting agreements at $250k each and started discussions on a contract that, as of this writing (just 6 days after the release of PyFlame), is looking like $2.8M.

Understand what that means. I found a massive pain point, figured out how to solve it, actually GAVE AWAY the solution (sort of) and now I've obtained over $3.5 million in revenue.

Want to learn the SYSTEM I use for doing that? (This is the 9th time I've done this exact thing, this exact way.)

Just go to: https://oaqlabs.com/vibe.html

About OA Quantum Labs

PyFlame is developed by OA Quantum Labs, a specialized engineering firm focused on high-performance and quantum computing.

What We Do

In the context of the PyFlame project, we help organizations unlock the full potential of specialized hardware through custom developer tools, optimized frameworks, and performance engineering:

  • Custom Framework Development — Native tooling designed for your specific accelerator architecture
  • Performance Optimization — Squeeze maximum throughput from your existing hardware investments
  • Migration & Porting — Adapt existing ML workloads to new accelerator platforms
  • Training & Enablement — Get your team productive on specialized hardware faster

Why Work With Us

PyFlame demonstrates our approach: rather than forcing general-purpose tools onto specialized hardware, we build native solutions that leverage the unique strengths of each architecture. The result is dramatically better performance and a more intuitive developer experience.

If your organization is working with specialized AI accelerators, FPGAs, or custom silicon, we'd love to discuss how purpose-built tooling could transform your development workflow.

Get In Touch

Danny Wall — CTO, OA Quantum Labs
dwall@oaqlabs.com | oaqlabs.com

Features

  • Native WSE Design: Built from the ground up for Cerebras architecture
  • Lazy Evaluation: Computation graphs are built lazily and executed on demand
  • CSL Code Generation: Automatic generation of optimized CSL kernels
  • 2D Mesh Layouts: First-class support for tensor distribution across PEs
  • Python + C++ API: Use from Python or C++ with the same abstractions
  • NumPy Interoperability: Easy conversion to/from NumPy arrays
  • PyTorch-like API: Familiar nn.Module system for building models
  • Automatic Differentiation: Full autograd support for training
  • Complete Training Stack: Optimizers, loss functions, and LR schedulers
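The automatic-differentiation feature above can be illustrated with a tiny scalar reverse-mode sketch. This is purely conceptual and not PyFlame's implementation: each value records how gradients flow back to its inputs, and backward() accumulates them.

```python
# Conceptual sketch of reverse-mode autodiff (not PyFlame's autograd):
# each Var stores its parents and the local gradient toward each one.

class Var:
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then push it to parents.
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x = Var(2.0)
y = Var(3.0)
loss = x * y + x        # d(loss)/dx = y + 1 = 4, d(loss)/dy = x = 2
loss.backward()
print(x.grad, y.grad)   # → 4.0 2.0
```

Note how x receives gradient contributions from both terms; a production autograd does the same accumulation over a full tensor graph.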

Project Status

Version: Pre-Release Alpha 2.0

Phase 1 (Core Infrastructure) - Complete

  • Core tensor class with lazy evaluation
  • Computation graph (IR) system
  • Shape inference
  • Elementwise operations (add, mul, relu, sigmoid, etc.)
  • Reduction operations (sum, mean, max, min)
  • Matrix multiplication
  • CSL code generation framework
  • Python bindings via pybind11
  • CPU reference implementation

Phase 2 (ML Primitives) - Complete

  • Automatic differentiation (autograd)
  • Neural network module system (nn.Module)
  • Linear layers (Linear)
  • Convolutional layers (Conv1d, Conv2d)
  • Normalization layers (BatchNorm, LayerNorm, GroupNorm)
  • Pooling layers (MaxPool, AvgPool, AdaptivePool)
  • Dropout layers
  • Multi-head attention
  • Loss functions (MSE, CrossEntropy, BCE, etc.)
  • Optimizers (SGD, Adam, AdamW, RMSprop)
  • Learning rate schedulers

Requirements

  • CMake 3.18+
  • C++17 compiler (GCC 9+, Clang 10+, MSVC 2019+)
  • Python 3.8+
  • pybind11 2.10+

Cerebras SDK (Optional)

PyFlame includes a CPU reference implementation that allows you to develop, test, and validate your models without access to Cerebras hardware. All tensor operations, graph building, and CSL code generation work without the SDK.

To actually execute on Cerebras WSE hardware, you need:

  • Cerebras SDK - This is proprietary software available only to Cerebras customers and partners. It is not publicly downloadable.
  • Access to Cerebras hardware - Either on-premises CS-2/CS-3 systems or Cerebras Cloud.

Supported deployment options:

| Environment | Runtime Address | Notes |
| --- | --- | --- |
| On-premises CS-2/CS-3 | `localhost:9000` or system IP | Direct hardware access |
| Cerebras Cloud | Cloud endpoint URL | Provided by your cloud instance |

If you are interested in running PyFlame on Cerebras hardware, please contact Cerebras Systems to inquire about SDK access and hardware availability.

To build with Cerebras SDK support (once you have access):

```shell
cmake .. -DPYFLAME_USE_CEREBRAS_SDK=ON -DCEREBRAS_SDK_PATH=/path/to/sdk
```

To configure the runtime endpoint (for cloud or remote on-premises):

```shell
export CEREBRAS_RUNTIME_ADDRESS="your-endpoint:port"
```
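From application code, the endpoint can be resolved the same way. The helper below is hypothetical (it is not part of the PyFlame API): it reads `CEREBRAS_RUNTIME_ADDRESS` and falls back to the on-premises default from the table above.

```python
import os

# Hypothetical helper (NOT part of PyFlame): pick the runtime endpoint
# from CEREBRAS_RUNTIME_ADDRESS, defaulting to the on-premises address.
def resolve_runtime_address(default="localhost:9000"):
    return os.environ.get("CEREBRAS_RUNTIME_ADDRESS", default)

print(resolve_runtime_address())  # prints "localhost:9000" when the variable is unset
```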

Building

Linux/macOS

```shell
# Clone the repository
git clone https://github.com/CTO92/PyFlame.git
cd pyflame

# Create build directory
mkdir build && cd build

# Configure
cmake .. -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build . -j$(nproc)

# Run tests
ctest --output-on-failure

# Install Python package (development mode)
pip install -e .
```

Windows

```shell
# Clone and build
git clone https://github.com/CTO92/PyFlame.git
cd pyflame

mkdir build
cd build

cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release

ctest -C Release --output-on-failure
```

Quick Start

Python

```python
import pyflame as pf

# Create tensors
a = pf.randn([1024, 512])
b = pf.randn([512, 256])

# Build computation graph (lazy)
c = a @ b              # Matrix multiply
d = pf.relu(c)         # Activation
e = d.sum()            # Reduction

# Execute
result = pf.eval(e)
print(result.numpy())

# With explicit mesh layout for WSE
x = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
y = pf.zeros([4096, 4096], layout=pf.MeshLayout.grid(16, 16))
z = x @ y  # Distributed across 256 PEs
```

Training a Neural Network

```python
import pyflame as pf
from pyflame import nn, optim

# Define a model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Setup optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

# Training step
x = pf.randn([32, 784])  # Batch of inputs
y = pf.randint(0, 10, [32])  # Labels

optimizer.zero_grad()
output = model(x)
loss = loss_fn(output, y)
loss.backward()
optimizer.step()
```
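The optimizer step at the end of that sequence can be understood through the plain gradient-descent update rule. The sketch below shows the conceptual math only (PyFlame's optimizers add momentum, Adam moments, etc.): after backward() fills each parameter's gradient, step() applies p ← p − lr · grad.

```python
import math

# Conceptual SGD update (not PyFlame's optimizer code):
# new_param = param - learning_rate * gradient
def sgd_step(params, grads, lr=0.001):
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -0.2]
grads = [0.1, -0.4]
print(sgd_step(params, grads, lr=0.1))
```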

C++

```cpp
#include <pyflame/pyflame.hpp>
using namespace pyflame;

int main() {
    auto a = Tensor::randn({1024, 512});
    auto b = Tensor::randn({512, 256});

    auto c = matmul(a, b);
    auto d = relu(c);
    auto e = d.sum();

    e.eval();
    std::cout << "Result: " << e.data<float>()[0] << "\n";

    return 0;
}
```

Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                 PyFlame User API (Python/C++)               │
│   - Tensor abstraction with lazy evaluation                 │
│   - Dataflow-aware operators                                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              PyFlame Intermediate Representation            │
│   - Computation graph with shape inference                  │
│   - Optimization passes (fusion, layout, etc.)              │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    PyFlame CSL Backend                      │
│   - Template-based code generation                          │
│   - PE placement and routing optimization                   │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              CSL Runtime / Cerebras Hardware                │
│   - 850,000+ Processing Elements                            │
│   - 2D mesh fabric with wavelet communication               │
└─────────────────────────────────────────────────────────────┘
```

Project Structure

```
pyflame/
├── CMakeLists.txt           # Build configuration
├── include/pyflame/         # C++ headers
│   ├── core/                # Tensor, DType, Layout
│   ├── ir/                  # Graph IR, operations
│   └── backend/             # CSL code generation
├── src/                     # C++ implementation
├── python/                  # Python bindings
│   ├── pyflame/             # Python package
│   └── bindings.cpp         # pybind11 bindings
├── tests/                   # Unit tests
│   ├── cpp/                 # C++ tests (Google Test)
│   └── python/              # Python tests (pytest)
├── examples/                # Example programs
│   ├── cpp/
│   └── python/
└── docs/                    # Design documentation
```

Documentation

Developer Guides

New to PyFlame? Start here:

Design Documents

Internal architecture documentation:

API Reference

Tensor Creation

| Function | Description |
| --- | --- |
| `pf.zeros(shape)` | Create tensor filled with zeros |
| `pf.ones(shape)` | Create tensor filled with ones |
| `pf.full(shape, value)` | Create tensor filled with value |
| `pf.randn(shape)` | Random normal distribution |
| `pf.rand(shape)` | Random uniform [0, 1) |
| `pf.arange(start, end)` | Range of values |
| `pf.from_numpy(arr)` | From NumPy array |

Operations

| Category | Functions |
| --- | --- |
| Arithmetic | `+`, `-`, `*`, `/`, `@` (matmul) |
| Activations | `relu`, `sigmoid`, `tanh`, `gelu`, `silu`, `softmax` |
| Math | `abs`, `sqrt`, `exp`, `log`, `sin`, `cos` |
| Reductions | `sum`, `mean`, `max`, `min` |
| Shape | `reshape`, `transpose`, `squeeze`, `unsqueeze` |
| Combination | `cat`, `stack` |

Neural Network Layers (nn)

| Layer | Description |
| --- | --- |
| `nn.Linear` | Fully connected layer |
| `nn.Conv1d`, `nn.Conv2d` | Convolutional layers |
| `nn.BatchNorm1d`, `nn.BatchNorm2d` | Batch normalization |
| `nn.LayerNorm`, `nn.GroupNorm` | Layer and group normalization |
| `nn.MaxPool2d`, `nn.AvgPool2d` | Pooling layers |
| `nn.Dropout` | Dropout regularization |
| `nn.MultiheadAttention` | Multi-head attention |

Loss Functions (nn)

| Loss | Description |
| --- | --- |
| `nn.MSELoss` | Mean squared error |
| `nn.L1Loss` | Mean absolute error |
| `nn.CrossEntropyLoss` | Cross-entropy for classification |
| `nn.BCELoss`, `nn.BCEWithLogitsLoss` | Binary cross-entropy |
| `nn.NLLLoss` | Negative log likelihood |
| `nn.KLDivLoss` | KL divergence |
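As a numeric reference, cross-entropy for classification is conventionally the composition of softmax and negative log likelihood. The sketch below computes that standard definition directly; treating it as the semantics of `nn.CrossEntropyLoss` is an assumption based on the PyTorch-like API.

```python
import math

# Standard cross-entropy from raw logits (an assumed match for
# nn.CrossEntropyLoss): CE(logits, t) = -log(softmax(logits)[t])
def cross_entropy(logits, target):
    m = max(logits)                              # stabilize the exponentials
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]              # = -log softmax[target]

loss = cross_entropy([2.0, 1.0, 0.1], target=0)
print(round(loss, 4))
```

The max-subtraction trick keeps the exponentials from overflowing for large logits without changing the result.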

Optimizers (optim)

| Optimizer | Description |
| --- | --- |
| `optim.SGD` | Stochastic gradient descent with momentum |
| `optim.Adam` | Adam optimizer |
| `optim.AdamW` | Adam with decoupled weight decay |
| `optim.RMSprop` | RMSprop optimizer |

Learning Rate Schedulers (optim)

| Scheduler | Description |
| --- | --- |
| `optim.StepLR` | Decay LR every N steps |
| `optim.CosineAnnealingLR` | Cosine annealing schedule |
| `optim.ReduceLROnPlateau` | Reduce LR when metric plateaus |
| `optim.OneCycleLR` | One-cycle learning rate policy |
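For concreteness, the conventional StepLR schedule multiplies the learning rate by a factor gamma once every step_size epochs. Assuming `optim.StepLR` follows this standard behavior (the usual PyTorch-style semantics), the resulting rate is:

```python
# Conventional StepLR schedule (assumed semantics for optim.StepLR):
# lr(epoch) = base_lr * gamma ** (epoch // step_size)
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    return base_lr * gamma ** (epoch // step_size)

print([step_lr(0.1, e) for e in (0, 29, 30, 60)])
```

With the defaults above, the rate stays at 0.1 through epoch 29, drops to 0.01 at epoch 30, and to 0.001 at epoch 60.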

Layouts

| Layout | Description |
| --- | --- |
| `MeshLayout.single_pe()` | All data on one PE |
| `MeshLayout.row_partition(n)` | Split rows across n PEs |
| `MeshLayout.col_partition(n)` | Split columns across n PEs |
| `MeshLayout.grid(r, c)` | 2D tiling across r×c PEs |
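The per-PE tile shape implied by a grid layout is simple arithmetic. The helper below is illustrative only, and assumes the straightforward even-tiling semantics for `MeshLayout.grid`: an (R, C) tensor on an r×c PE grid gives each PE a contiguous (R // r, C // c) tile.

```python
# Illustrative tile-shape arithmetic (assumed MeshLayout.grid semantics):
# each PE in an r×c grid holds a (rows // r, cols // c) tile.
def tile_shape(tensor_shape, grid):
    rows, cols = tensor_shape
    r, c = grid
    assert rows % r == 0 and cols % c == 0, "shape must divide evenly"
    return (rows // r, cols // c)

print(tile_shape((4096, 4096), (16, 16)))  # → (256, 256)
print(16 * 16)                             # → 256 PEs, as in the Quick Start
```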

Contributing

Contributions are welcome! Please see our contributing guidelines (coming soon).

License

This project is licensed under the Apache License 2.0.

Acknowledgments

  • Cerebras Systems for the WSE architecture and SDK
  • The PyTorch team for API inspiration
  • The JAX/XLA team for compiler architecture insights
