5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
__pycache__/
data/
dllm/__pycache__/
outputs/
tests/__pycache__/
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
109 changes: 109 additions & 0 deletions README.md
@@ -0,0 +1,109 @@
# Diffusion Transformer for ARC

This repository contains an experimental diffusion transformer training pipeline for ARC-AGI style reasoning tasks.

## Dataset

The training script expects the canonical ARC task JSON files from the official [fchollet/ARC](https://github.com/fchollet/ARC) repository. Download or clone that repository and point the training command at the `data` directory inside it, which contains the `training/` and `evaluation/` folders:

```bash
# Example setup
mkdir -p data
cd data
curl -L https://github.com/fchollet/ARC/archive/refs/heads/master.zip -o arc.zip
unzip arc.zip 'ARC-AGI-master/data/*'
cd ..

# Run training with a YAML config (see the Configuration section below)
python train_diffusion_arc.py path/to/config.yaml
```

Any mirror with the same folder structure will also work. The `ARCTaskDataset` loader simply walks every `*.json` file inside the specified split directory.
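The walk-every-JSON-file behavior can be sketched as follows. This is a hypothetical, simplified stand-in for the real `ARCTaskDataset` constructor, not the actual implementation:

```python
import json
from pathlib import Path

def load_split(split_dir):
    """Collect every (input, output) pair from the JSON task files in a split."""
    examples = []
    for task_file in sorted(Path(split_dir).glob("*.json")):
        task = json.loads(task_file.read_text())
        for pair in task["train"]:
            examples.append((pair["input"], pair["output"]))
    return examples
```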

## Visualization

To inspect the batches used during training, including how the diffusion process corrupts the targets, run the visualization helper:

```bash
python batch_visualization.py data/ARC-AGI-master/data --checkpoint outputs/diffusion_arc/final_model.pt
```

This command saves `train_batches.png` and `val_batches.png` under `outputs/visualizations/`, each showing five batches of samples with the condition, target, and a randomly corrupted view at different diffusion timesteps (defaulting to a compact 0–99 range).

## Configuration

Training is configured through a YAML file validated by `DiffusionArcTrainingConfig` from `train_diffusion_arc.py` using [`pydantic_config`](https://github.com/samsja/pydantic_config). Install the dependency with:

```bash
pip install pydantic_config pyyaml
```

Create a YAML file describing your run. Every field has a sensible default except `data_dir`, which must point at the ARC dataset root. The available options are:

| Field | Description |
| --- | --- |
| `data_dir` | Path to the ARC dataset root containing `training/` and `evaluation/` folders. The directory must exist. |
| `output_dir` | Directory where checkpoints and the final model will be written (created automatically when missing). |
| `batch_size` | Batch size for both training and validation loaders (must be ≥ 1). |
| `epochs` | Number of full passes over the training set (must be ≥ 1). |
| `lr` / `weight_decay` | AdamW optimizer hyper-parameters (learning rate must be > 0). |
| `timesteps` | Number of diffusion steps in the schedule (must be ≥ 1). |
| `val_fraction` | Fraction of the dataset used for validation. Values > 0 reserve at least one example when possible and must be < 1. |
| `seed` | Random seed for Python, PyTorch and data splits. |
| `grad_clip` | Gradient clipping value (set to `0` to disable). |
| `device` | Device string understood by `torch.device`, defaults to `cuda` when available. |
| `ema` | Exponential moving average decay for model weights (`0` disables EMA, must be between `0` and `1`). |
| `duality_weight` | Weight applied to the clean target reconstruction loss term (must be ≥ 0). |
| `log_interval` | Number of training steps between log messages (must be ≥ 1). |
| `num_workers` | Data loader worker count (must be ≥ 0). |
| `save_interval` | Save a checkpoint every N epochs (must be ≥ 1). |
| `resume` | Optional path to a checkpoint to resume from. The file must exist when provided. |
| `augment` | Enable random grid flips during dataset loading. |
| `mixed_precision` | Enable automatic mixed precision training. |
| `max_grid_size`, `d_model`, `num_heads`, `num_layers`, `dim_feedforward`, `time_embed_dim` | Architectural parameters passed to `DiffusionTransformerConfig`. |

Relative paths are resolved from the directory that contains the YAML file, so a configuration can live alongside the data and checkpoints.
Absolute paths continue to work as usual.
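The resolution rule can be illustrated with a small hypothetical helper (`resolve_relative` is not a function in the codebase; it only mirrors the behavior described above):

```python
from pathlib import Path

def resolve_relative(config_path, value):
    """Resolve a path from the config against the YAML file's directory.

    Absolute paths pass through untouched; relative paths are anchored at
    the directory containing the configuration file.
    """
    p = Path(value)
    return p if p.is_absolute() else Path(config_path).parent / p
```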

Example configuration:

```yaml
data_dir: data/ARC-AGI-master/data
output_dir: outputs/diffusion_arc
batch_size: 32
epochs: 50
lr: 0.0003
weight_decay: 0.01
timesteps: 1000
val_fraction: 0.1
seed: 42
grad_clip: 1.0
device: cuda
ema: 0.0
duality_weight: 0.5
log_interval: 100
num_workers: 2
save_interval: 5
augment: false
mixed_precision: false
max_grid_size: 30
d_model: 288
num_heads: 8
num_layers: 7
dim_feedforward: 1152
time_embed_dim: 512
```

Run training by pointing the script at your YAML file:

```bash
python train_diffusion_arc.py path/to/config.yaml
```

## Tests

A minimal CPU smoke test is available via:

```bash
pytest tests/test_train_diffusion_arc.py -k tiny_cpu
```
146 changes: 146 additions & 0 deletions docs/arc_dataset.md
@@ -0,0 +1,146 @@
# ARC Dataset, DataLoader, and Known Problems

This document explains how the project loads ARC-AGI style tasks, how the
`torch.utils.data.Dataset` and `DataLoader` are configured, what tensors are
contained in each training batch, and the main problems with the current
implementation.

## Directory structure and input format

The dataset utilities expect the canonical directory layout distributed in the
[`fchollet/ARC`](https://github.com/fchollet/ARC) repository. When you download
that dataset the root directory contains the sub-folders:

```
<root>/training/
<root>/evaluation/
```

Each sub-folder stores multiple `*.json` task files. Every file contains a list
of training examples under the `"train"` key (the original ARC format also
provides a `"test"` list, which we do not consume during model training).

Within the JSON file each entry inside `"train"` is a dictionary with `"input"`
and `"output"` fields. Each field is a 2-D list of integers representing a color
grid. The integers fall in the range `[0, 9]` for the ten canonical ARC colors.
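As a hedged illustration (the grids here are invented, not taken from a real task), a file matching this description looks like:

```json
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 2]], "output": [[2]]}
  ],
  "test": [
    {"input": [[3]], "output": [[3, 3]]}
  ]
}
```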

## `ARCTaskDataset`

[`dllm/arc_dataset.py`](../dllm/arc_dataset.py) defines the
`ARCTaskDataset` class, which inherits from `torch.utils.data.Dataset`.
Key behaviors:

* **Initialization** – the constructor walks the chosen `training` or
`evaluation` split directory and loads every JSON file. For every pair inside
the `"train"` list the dataset stores an `ARCExample` dataclass with
`input_grid` and `output_grid` attributes.【F:dllm/arc_dataset.py†L39-L73】
* **Grid padding** – ARC tasks contain grids of varying size. Before they can
be fed to the model, each grid is padded to a fixed `max_grid_size ×
max_grid_size` square (default 30×30). Padding is handled by the private
`_pad_grid` helper, which returns both the flattened token tensor and a mask
that marks real (value `1.0`) versus padded (value `0.0`) cells. The padding
token defaults to `10`, which lies outside the normal color range so models
can distinguish padding from real pixels.【F:dllm/arc_dataset.py†L22-L63】
* **Samples** – calling `dataset[idx]` yields a dictionary with four keys:
`"condition"`, `"condition_mask"`, `"target"`, and `"target_mask"`. Each is a
1-D tensor of length `max_grid_size ** 2`. `condition` and `condition_mask`
correspond to the example’s input grid, while `target` and `target_mask`
describe the desired output grid. When `augment=True`, random horizontal and
vertical flips are applied to both grids (and masks) with independent
probability `0.5` each.【F:dllm/arc_dataset.py†L65-L116】

The dataset’s length equals the number of `train` pairs found across every JSON
file in the selected split. Importantly, ARC refers to each JSON file as a
single *task* that bundles several input/output demonstrations. The
`ARCTaskDataset` flattens those demonstrations so that every individual
`{"input": ..., "output": ...}` pair becomes its own dataset element. When a
`DataLoader` batches items together (often with `shuffle=True`), the batch may
contain examples originating from many different tasks. There is no special
grouping to keep demonstrations from the same task adjacent, because the
current training objective treats every demonstration independently.
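The padding behavior described above can be sketched as a standalone function. This is a hypothetical stand-in for the private `_pad_grid` helper, assuming the default 30×30 grid and pad token `10`:

```python
import torch

PAD_TOKEN = 10  # outside the 0-9 ARC color range

def pad_grid(grid, max_size=30, pad_token=PAD_TOKEN):
    """Pad a ragged 2-D color grid to max_size x max_size and flatten it.

    Returns the flattened token tensor plus a float mask that is 1.0 on
    real cells and 0.0 on padding.
    """
    tokens = torch.full((max_size, max_size), pad_token, dtype=torch.long)
    mask = torch.zeros(max_size, max_size)
    h, w = len(grid), len(grid[0])
    tokens[:h, :w] = torch.tensor(grid, dtype=torch.long)
    mask[:h, :w] = 1.0
    return tokens.flatten(), mask.flatten()
```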

## Collation and DataLoader configuration

Training scripts construct PyTorch `DataLoader` instances using the custom
`arc_collate` function defined alongside the dataset class.

* **`arc_collate`** – this function receives a list of per-item dictionaries and
stacks the `condition`, `condition_mask`, `target`, and `target_mask` tensors
into batched tensors with shape `(batch_size, max_grid_size**2)`. The output
is a dictionary with the same four keys expected by the model.【F:dllm/arc_dataset.py†L118-L128】
* **`DataLoader` setup** – for example, `train_diffusion_arc.py` creates the
dataset, randomly splits it into training and validation subsets, and then
wraps them with `DataLoader` objects that specify:
* `collate_fn=arc_collate`
* `shuffle=True` for the training loader and `False` for validation
  * `batch_size` taken from the training configuration (default `32`)
* `num_workers` and `pin_memory` tuned for efficient GPU feeding.【F:train_diffusion_arc.py†L69-L115】
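The collation step amounts to stacking per-item tensors key by key. A simplified stand-in for the repository's `arc_collate` (not the actual implementation):

```python
import torch

def arc_collate_sketch(items):
    """Stack per-item tensors into (batch_size, max_grid_size**2) batches."""
    keys = ("condition", "condition_mask", "target", "target_mask")
    return {k: torch.stack([item[k] for item in items]) for k in keys}
```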

## Batch contents

Each batch produced by the `DataLoader` is a dictionary with four entries:

| Key | Shape | DType | Description |
| ------------------ | -------------------------------- | --------------- | ------------------------------------------------------------------------ |
| `"condition"` | `(batch_size, max_grid_size**2)` | `torch.long` | Flattened input grid tokens with padding tokens (`10`) filling leftovers. |
| `"condition_mask"` | `(batch_size, max_grid_size**2)` | `torch.float32` | Binary mask (1.0 where the input grid is real, 0.0 on padding). |
| `"target"` | `(batch_size, max_grid_size**2)` | `torch.long` | Flattened output grid tokens padded to the same length. |
| `"target_mask"` | `(batch_size, max_grid_size**2)` | `torch.float32` | Binary mask for the output grid, matching the padding pattern. |

You can move the entire batch to a device using a simple comprehension, as done
in the training script’s `to_device` helper.【F:train_diffusion_arc.py†L57-L64】

These tensors supply the diffusion transformer with both the conditioning input
and the desired target, while the masks allow the loss function to ignore padded
cells when computing reconstruction errors.
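The device-moving comprehension mentioned above is a one-liner; this sketch mirrors the shape of the training script's `to_device` helper:

```python
import torch

def to_device(batch, device):
    """Move every tensor in a batch dictionary onto the given device."""
    return {k: v.to(device) for k, v in batch.items()}
```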

## Known problems

Although the sections above describe the intended pipeline, several issues in
the current codebase prevent the ARC loader from matching the canonical task
structure and modelling objective.

### Tasks are flattened into unrelated examples

`ARCTaskDataset` loads every `{"input", "output"}` pair independently and stores
them as separate items in the `examples` list.【F:dllm/arc_dataset.py†L44-L63】 In
effect, the dataset breaks the ARC convention that all demonstrations belonging
to a task should be seen together. When the training script later shuffles the
dataset and slices it with `random_split`, individual demonstrations from the
same task can land in different batches, and even in different train/validation
splits.【F:train_diffusion_arc.py†L74-L105】 This destroys the contextual signal
that ARC solvers rely on (observing multiple demonstrations before producing an
answer for a held-out input), and it introduces leakage where the validation set
may still expose partial information from training tasks.
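One way to avoid this leakage (a hypothetical fix, not present in the codebase) is to split at the task-file level before flattening, so that all demonstrations from one task land on the same side of the split:

```python
import random

def split_by_task(task_files, val_fraction=0.1, seed=42):
    """Assign whole task files to train/val so demonstrations never straddle the split."""
    files = sorted(task_files)
    rng = random.Random(seed)
    rng.shuffle(files)
    n_val = max(1, int(len(files) * val_fraction)) if val_fraction > 0 else 0
    return files[n_val:], files[:n_val]
```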

### Data augmentation corrupts the provided grids

When the `augment` option is enabled, `_augment` performs random horizontal and
vertical flips on both the input (`condition`) and the output (`target`) grids in
each sample.【F:dllm/arc_dataset.py†L86-L116】 ARC demonstrations are carefully
constructed; transforming the input grid changes the puzzle itself and can make
the paired output meaningless. Because the goal is to generate the output grid
given an unmodified input, these flips effectively corrupt the supervision
signal by altering the examples that should remain fixed.

A safer strategy would be to apply the **same** augmentation to every
demonstration belonging to a task so that relative relationships stay intact, or
to restrict augmentation to the generated output while leaving the conditioning
input untouched. Another promising idea is to treat all demonstrations in a task
as a candidate target: for a task that ships four examples, the loader could
pick one of the demonstrations as the "output" and repurpose the remaining three
as conditioning inputs, cycling this choice across epochs. Either approach would
respect the intent of ARC tasks while still expanding the variety of supervision
the model sees.
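The "same augmentation per task" idea can be sketched as follows. This is a hypothetical helper (not in the codebase) that samples the flip decisions once and applies them to every grid belonging to a task:

```python
import random
import torch

def augment_task(grids, seed=None):
    """Apply the SAME random flips to every grid in a task.

    `grids` is a list of 2-D tensors (all inputs and outputs of one task);
    sampling the flips once per task keeps relative geometry intact.
    """
    rng = random.Random(seed)
    flip_h = rng.random() < 0.5
    flip_v = rng.random() < 0.5
    out = []
    for g in grids:
        if flip_h:
            g = torch.flip(g, dims=[1])
        if flip_v:
            g = torch.flip(g, dims=[0])
        out.append(g)
    return out
```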

### Diffusion objective ignores task-specific geometry

During training, `compute_loss` embeds the target tokens, applies Gaussian noise,
and asks the model to predict that noise.【F:train_diffusion_arc.py†L187-L235】
While this is standard for diffusion models, the implementation does not supply
the true target mask to the sampler: `DiffusionTransformer.sample` always
constructs an all-ones `target_mask`, forcing the model to denoise a full
30×30 grid regardless of the original puzzle size.【F:dllm/diffusion_transformer.py†L120-L160】
Consequently the network must learn to hallucinate outputs for padded regions
that should remain unused, and the sampling procedure cannot take advantage of
the sparsity information available in the dataset.
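A hedged sketch of how the real `target_mask` could restrict the objective to genuine cells (and, analogously, restrict sampling): this is an illustrative masked noise-prediction loss, not the project's `compute_loss`.

```python
import torch

def masked_noise_loss(pred_noise, true_noise, target_mask):
    """MSE over real cells only; padded positions contribute nothing.

    pred_noise / true_noise: (batch, seq_len, d); target_mask: (batch, seq_len)
    with 1.0 on real cells and 0.0 on padding.
    """
    mask = target_mask.unsqueeze(-1).expand_as(pred_noise)
    sq_err = (pred_noise - true_noise) ** 2 * mask
    return sq_err.sum() / mask.sum().clamp(min=1.0)
```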
13 changes: 13 additions & 0 deletions pyproject.toml
@@ -0,0 +1,13 @@
[project]
name = "dllm"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
"matplotlib>=3.10.7",
"numpy>=2.3.3",
"pytest>=8.4.2",
"torch>=2.8.0",
"transformers>=4.57.0",
]