5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
__pycache__/
data/
dllm/__pycache__/
outputs/
tests/__pycache__/
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
109 changes: 109 additions & 0 deletions README.md
@@ -0,0 +1,109 @@
# Diffusion Transformer for ARC

This repository contains an experimental diffusion transformer training pipeline for ARC-AGI style reasoning tasks.

## Dataset

The training script expects the canonical ARC task JSON files from the official [fchollet/ARC](https://github.com/fchollet/ARC) repository. Download or clone that repository and point the training command at the `data` directory inside it, which contains the `training/` and `evaluation/` folders:

```bash
# Example setup
mkdir -p data
cd data
curl -L https://github.com/fchollet/ARC/archive/refs/heads/master.zip -o arc.zip
unzip arc.zip 'ARC-AGI-master/data/*'
cd ..

# Run training with a YAML config (see the Configuration section below)
python train_diffusion_arc.py path/to/config.yaml
```

Any mirror with the same folder structure will also work. The `ARCTaskDataset` loader simply walks every `*.json` file inside the specified split directory.
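The walk-every-JSON-file behavior can be sketched as follows. This is a hypothetical, simplified stand-in for the real `ARCTaskDataset` constructor, not the actual implementation:

```python
import json
from pathlib import Path

def load_split(split_dir):
    """Collect every (input, output) pair from the JSON task files in a split."""
    examples = []
    for task_file in sorted(Path(split_dir).glob("*.json")):
        task = json.loads(task_file.read_text())
        for pair in task["train"]:
            examples.append((pair["input"], pair["output"]))
    return examples
```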

## Visualization

To inspect the batches used during training, including how the diffusion process corrupts the targets, run the visualization helper:

```bash
python batch_visualization.py data/ARC-AGI-master/data --checkpoint outputs/diffusion_arc/final_model.pt
```

This command saves `train_batches.png` and `val_batches.png` under `outputs/visualizations/`, each showing five batches of samples with the condition, target, and a randomly corrupted view at different diffusion timesteps (defaulting to a compact 0–99 range).

## Configuration

Training is configured through a YAML file validated by `DiffusionArcTrainingConfig` from `train_diffusion_arc.py` using [`pydantic_config`](https://github.com/samsja/pydantic_config). Install the dependency with:

```bash
pip install pydantic_config pyyaml
```

Create a YAML file describing your run. Every field has a sensible default except `data_dir`, which must point at the ARC dataset root. The available options are:

| Field | Description |
| --- | --- |
| `data_dir` | Path to the ARC dataset root containing `training/` and `evaluation/` folders. The directory must exist. |
| `output_dir` | Directory where checkpoints and the final model will be written (created automatically when missing). |
| `batch_size` | Batch size for both training and validation loaders (must be ≥ 1). |
| `epochs` | Number of full passes over the training set (must be ≥ 1). |
| `lr` / `weight_decay` | AdamW optimizer hyper-parameters (learning rate must be > 0). |
| `timesteps` | Number of diffusion steps in the schedule (must be ≥ 1). |
| `val_fraction` | Fraction of the dataset used for validation. Values > 0 reserve at least one example when possible and must be < 1. |
| `seed` | Random seed for Python, PyTorch and data splits. |
| `grad_clip` | Gradient clipping value (set to `0` to disable). |
| `device` | Device string understood by `torch.device`, defaults to `cuda` when available. |
| `ema` | Exponential moving average decay for model weights (`0` disables EMA, must be between `0` and `1`). |
| `duality_weight` | Weight applied to the clean target reconstruction loss term (must be ≥ 0). |
| `log_interval` | Number of training steps between log messages (must be ≥ 1). |
| `num_workers` | Data loader worker count (must be ≥ 0). |
| `save_interval` | Save a checkpoint every N epochs (must be ≥ 1). |
| `resume` | Optional path to a checkpoint to resume from. The file must exist when provided. |
| `augment` | Enable random grid flips during dataset loading. |
| `mixed_precision` | Enable automatic mixed precision training. |
| `max_grid_size`, `d_model`, `num_heads`, `num_layers`, `dim_feedforward`, `time_embed_dim` | Architectural parameters passed to `DiffusionTransformerConfig`. |

Relative paths are resolved from the directory that contains the YAML file, so a configuration can live alongside the data and checkpoints.
Absolute paths continue to work as usual.
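The resolution rule can be illustrated with a small hypothetical helper (`resolve_relative` is not a function in the codebase; it only mirrors the behavior described above):

```python
from pathlib import Path

def resolve_relative(config_path, value):
    """Resolve a path from the config against the YAML file's directory.

    Absolute paths pass through untouched; relative paths are anchored at
    the directory containing the configuration file.
    """
    p = Path(value)
    return p if p.is_absolute() else Path(config_path).parent / p
```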

Example configuration:

```yaml
data_dir: data/ARC-AGI-master/data
output_dir: outputs/diffusion_arc
batch_size: 32
epochs: 50
lr: 0.0003
weight_decay: 0.01
timesteps: 1000
val_fraction: 0.1
seed: 42
grad_clip: 1.0
device: cuda
ema: 0.0
duality_weight: 0.5
log_interval: 100
num_workers: 2
save_interval: 5
augment: false
mixed_precision: false
max_grid_size: 30
d_model: 288
num_heads: 8
num_layers: 7
dim_feedforward: 1152
time_embed_dim: 512
```

Run training by pointing the script at your YAML file:

```bash
python train_diffusion_arc.py path/to/config.yaml
```

## Tests

A minimal CPU smoke test is available via:

```bash
pytest tests/test_train_diffusion_arc.py -k tiny_cpu
```
146 changes: 146 additions & 0 deletions docs/arc_dataset.md
@@ -0,0 +1,146 @@
# ARC Dataset, DataLoader, and Known Problems

This document explains how the project loads ARC-AGI style tasks, how the
`torch.utils.data.Dataset` and `DataLoader` are configured, what tensors are
contained in each training batch, and the main problems with the current
implementation.

## Directory structure and input format

The dataset utilities expect the canonical directory layout distributed in the
[`fchollet/ARC`](https://github.com/fchollet/ARC) repository. When you download
that dataset the root directory contains the sub-folders:

```
<root>/training/
<root>/evaluation/
```

Each sub-folder stores multiple `*.json` task files. Every file contains a list
of training examples under the `"train"` key (the original ARC format also
provides a `"test"` list, which we do not consume during model training).

Within the JSON file each entry inside `"train"` is a dictionary with `"input"`
and `"output"` fields. Each field is a 2-D list of integers representing a color
grid. The integers fall in the range `[0, 9]` for the ten canonical ARC colors.
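As a hedged illustration (the grids here are invented, not taken from a real task), a file matching this description looks like:

```json
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2]], "output": [[2]]}
  ],
  "test": [
    {"input": [[3]], "output": [[3, 3]]}
  ]
}
```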

## `ARCTaskDataset`

[`dllm/arc_dataset.py`](../dllm/arc_dataset.py) defines the
`ARCTaskDataset` class, which inherits from `torch.utils.data.Dataset`.
Key behaviors:

* **Initialization** – the constructor walks the chosen `training` or
`evaluation` split directory and loads every JSON file. For every pair inside
the `"train"` list the dataset stores an `ARCExample` dataclass with
`input_grid` and `output_grid` attributes.【F:dllm/arc_dataset.py†L39-L73】
* **Grid padding** – ARC tasks contain grids of varying size. Before they can
be fed to the model, each grid is padded to a fixed `max_grid_size ×
max_grid_size` square (default 30×30). Padding is handled by the private
`_pad_grid` helper, which returns both the flattened token tensor and a mask
that marks real (value `1.0`) versus padded (value `0.0`) cells. The padding
token defaults to `10`, which lies outside the normal color range so models
can distinguish padding from real pixels.【F:dllm/arc_dataset.py†L22-L63】
* **Samples** – calling `dataset[idx]` yields a dictionary with four keys:
`"condition"`, `"condition_mask"`, `"target"`, and `"target_mask"`. Each is a
1-D tensor of length `max_grid_size ** 2`. `condition` and `condition_mask`
correspond to the example’s input grid, while `target` and `target_mask`
describe the desired output grid. When `augment=True`, random horizontal and
vertical flips are applied to both grids (and masks) with independent
probability `0.5` each.【F:dllm/arc_dataset.py†L65-L116】

The dataset’s length equals the number of `train` pairs found across every JSON
file in the selected split. Importantly, ARC refers to each JSON file as a
single *task* that bundles several input/output demonstrations. The
`ARCTaskDataset` flattens those demonstrations so that every individual
`{"input": ..., "output": ...}` pair becomes its own dataset element. When a
`DataLoader` batches items together (often with `shuffle=True`), the batch may
contain examples originating from many different tasks. There is no special
grouping to keep demonstrations from the same task adjacent, because the
current training objective treats every demonstration independently.
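The padding behavior described above can be sketched as a standalone function. This is a hypothetical stand-in for the private `_pad_grid` helper, assuming the default 30×30 grid and pad token `10`:

```python
import torch

PAD_TOKEN = 10  # outside the 0-9 ARC color range

def pad_grid(grid, max_size=30, pad_token=PAD_TOKEN):
    """Pad a ragged 2-D color grid to max_size x max_size and flatten it.

    Returns the flattened token tensor plus a float mask that is 1.0 on
    real cells and 0.0 on padding.
    """
    tokens = torch.full((max_size, max_size), pad_token, dtype=torch.long)
    mask = torch.zeros(max_size, max_size)
    h, w = len(grid), len(grid[0])
    tokens[:h, :w] = torch.tensor(grid, dtype=torch.long)
    mask[:h, :w] = 1.0
    return tokens.flatten(), mask.flatten()
```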

## Collation and DataLoader configuration

Training scripts construct PyTorch `DataLoader` instances using the custom
`arc_collate` function defined alongside the dataset class.

* **`arc_collate`** – this function receives a list of per-item dictionaries and
stacks the `condition`, `condition_mask`, `target`, and `target_mask` tensors
into batched tensors with shape `(batch_size, max_grid_size**2)`. The output
is a dictionary with the same four keys expected by the model.【F:dllm/arc_dataset.py†L118-L128】
* **`DataLoader` setup** – for example, `train_diffusion_arc.py` creates the
dataset, randomly splits it into training and validation subsets, and then
wraps them with `DataLoader` objects that specify:
* `collate_fn=arc_collate`
* `shuffle=True` for the training loader and `False` for validation
  * `batch_size` taken from the training configuration (default `32`)
* `num_workers` and `pin_memory` tuned for efficient GPU feeding.【F:train_diffusion_arc.py†L69-L115】
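The collation step amounts to stacking per-item tensors key by key. A simplified stand-in for the repository's `arc_collate` (not the actual implementation):

```python
import torch

def arc_collate_sketch(items):
    """Stack per-item tensors into (batch_size, max_grid_size**2) batches."""
    keys = ("condition", "condition_mask", "target", "target_mask")
    return {k: torch.stack([item[k] for item in items]) for k in keys}
```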

## Batch contents

Each batch produced by the `DataLoader` is a dictionary with four entries:

| Key | Shape | DType | Description |
| ------------------ | -------------------------------- | --------------- | ------------------------------------------------------------------------ |
| `"condition"` | `(batch_size, max_grid_size**2)` | `torch.long` | Flattened input grid tokens with padding tokens (`10`) filling leftovers. |
| `"condition_mask"` | `(batch_size, max_grid_size**2)` | `torch.float32` | Binary mask (1.0 where the input grid is real, 0.0 on padding). |
| `"target"` | `(batch_size, max_grid_size**2)` | `torch.long` | Flattened output grid tokens padded to the same length. |
| `"target_mask"` | `(batch_size, max_grid_size**2)` | `torch.float32` | Binary mask for the output grid, matching the padding pattern. |

You can move the entire batch to a device using a simple comprehension, as done
in the training script’s `to_device` helper.【F:train_diffusion_arc.py†L57-L64】

These tensors supply the diffusion transformer with both the conditioning input
and the desired target, while the masks allow the loss function to ignore padded
cells when computing reconstruction errors.
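The device-moving comprehension mentioned above is a one-liner; this sketch mirrors the shape of the training script's `to_device` helper:

```python
import torch

def to_device(batch, device):
    """Move every tensor in a batch dictionary onto the given device."""
    return {k: v.to(device) for k, v in batch.items()}
```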

## Known problems

Although the sections above describe the intended pipeline, several issues in
the current codebase prevent the ARC loader from matching the canonical task
structure and modelling objective.

### Tasks are flattened into unrelated examples

`ARCTaskDataset` loads every `{"input", "output"}` pair independently and stores
them as separate items in the `examples` list.【F:dllm/arc_dataset.py†L44-L63】 In
effect, the dataset breaks the ARC convention that all demonstrations belonging
to a task should be seen together. When the training script later shuffles the
dataset and slices it with `random_split`, individual demonstrations from the
same task can land in different batches, and even in different train/validation
splits.【F:train_diffusion_arc.py†L74-L105】 This destroys the contextual signal
that ARC solvers rely on (observing multiple demonstrations before producing an
answer for a held-out input), and it introduces leakage where the validation set
may still expose partial information from training tasks.
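One way to avoid this leakage (a hypothetical fix, not present in the codebase) is to split at the task-file level before flattening, so that all demonstrations from one task land on the same side of the split:

```python
import random

def split_by_task(task_files, val_fraction=0.1, seed=42):
    """Assign whole task files to train/val so demonstrations never straddle the split."""
    files = sorted(task_files)
    rng = random.Random(seed)
    rng.shuffle(files)
    n_val = max(1, int(len(files) * val_fraction)) if val_fraction > 0 else 0
    return files[n_val:], files[:n_val]
```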

### Data augmentation corrupts the provided grids

When the `augment` option is enabled, `_augment` performs random horizontal and
vertical flips on both the input (`condition`) and the output (`target`) grids in
each sample.【F:dllm/arc_dataset.py†L86-L116】 ARC demonstrations are carefully
constructed; transforming the input grid changes the puzzle itself and can make
the paired output meaningless. Because the goal is to generate the output grid
given an unmodified input, these flips effectively corrupt the supervision
signal by altering the examples that should remain fixed.

A safer strategy would be to apply the **same** augmentation to every
demonstration belonging to a task so that relative relationships stay intact, or
to restrict augmentation to the generated output while leaving the conditioning
input untouched. Another promising idea is to treat all demonstrations in a task
as a candidate target: for a task that ships four examples, the loader could
pick one of the demonstrations as the "output" and repurpose the remaining three
as conditioning inputs, cycling this choice across epochs. Either approach would
respect the intent of ARC tasks while still expanding the variety of supervision
the model sees.
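The "same augmentation per task" idea can be sketched as follows. This is a hypothetical helper (not in the codebase) that samples the flip decisions once and applies them to every grid belonging to a task:

```python
import random
import torch

def augment_task(grids, seed=None):
    """Apply the SAME random flips to every grid in a task.

    `grids` is a list of 2-D tensors (all inputs and outputs of one task);
    sampling the flips once per task keeps relative geometry intact.
    """
    rng = random.Random(seed)
    flip_h = rng.random() < 0.5
    flip_v = rng.random() < 0.5
    out = []
    for g in grids:
        if flip_h:
            g = torch.flip(g, dims=[1])
        if flip_v:
            g = torch.flip(g, dims=[0])
        out.append(g)
    return out
```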

### Diffusion objective ignores task-specific geometry

During training, `compute_loss` embeds the target tokens, applies Gaussian noise,
and asks the model to predict that noise.【F:train_diffusion_arc.py†L187-L235】
While this is standard for diffusion models, the implementation does not supply
the true target mask to the sampler: `DiffusionTransformer.sample` always
constructs an all-ones `target_mask`, forcing the model to denoise a full
30×30 grid regardless of the original puzzle size.【F:dllm/diffusion_transformer.py†L120-L160】
Consequently the network must learn to hallucinate outputs for padded regions
that should remain unused, and the sampling procedure cannot take advantage of
the sparsity information available in the dataset.
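A hedged sketch of how the real `target_mask` could restrict the objective to genuine cells (and, analogously, restrict sampling): this is an illustrative masked noise-prediction loss, not the project's `compute_loss`.

```python
import torch

def masked_noise_loss(pred_noise, true_noise, target_mask):
    """MSE over real cells only; padded positions contribute nothing.

    pred_noise / true_noise: (batch, seq_len, d); target_mask: (batch, seq_len)
    with 1.0 on real cells and 0.0 on padding.
    """
    mask = target_mask.unsqueeze(-1).expand_as(pred_noise)
    sq_err = (pred_noise - true_noise) ** 2 * mask
    return sq_err.sum() / mask.sum().clamp(min=1.0)
```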
13 changes: 13 additions & 0 deletions pyproject.toml
@@ -0,0 +1,13 @@
[project]
name = "dllm"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"matplotlib>=3.10.7",
"numpy>=2.3.3",
"pytest>=8.4.2",
"torch>=2.8.0",
"transformers>=4.57.0",
]