Official implementation of "Incomplete Data, Complete Dynamics: A Diffusion Approach" (ICLR 2026).
TL;DR: A principled diffusion framework for learning physical dynamics from incomplete observations, with theoretical convergence guarantees.
Real-world physical measurements are inherently sparse and incomplete—sensor networks observe only discrete locations, satellites suffer from cloud occlusion, and experimental measurements are constrained by instrumental limitations. This work proposes a theoretically principled diffusion-based framework that learns complete dynamics directly from such incomplete training data.
- Methodical Design: A novel conditional diffusion training paradigm with strategic context-query partitioning tailored for physical dynamics
- Theoretical Guarantee: First theoretical analysis proving diffusion training on incomplete data asymptotically recovers the true complete distribution
- Strong Results: Substantial improvements over baselines on synthetic PDEs and real-world ERA5 climate data, especially in sparse regimes (1-20% coverage)
Given a training dataset containing only partial observations (no complete samples available), learn a model that can reconstruct complete data from partial observations.
For each incomplete sample, we:
- Partition observed data into context (model input) and query (loss calculation) components
- Train a conditional diffusion model to reconstruct query portions given context
- Ensemble multiple context masks at inference for complete reconstruction
The training objective:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\big\| M_{\mathrm{qry}} \odot \big(f_\theta(t,\; M_{\mathrm{ctx}} \odot x_t,\; M_{\mathrm{ctx}}) - x\big) \big\|^2\right], \qquad x_t = \alpha_t x + \sigma_t \epsilon,$$

where the expectation is over data $x$, observation masks $M$, context-query partitions $(M_{\mathrm{ctx}}, M_{\mathrm{qry}})$ of $M$, diffusion time $t$, and noise $\epsilon$.

**Theorem (Key Insight):** The model learns a meaningful conditional expectation for every dimension that is queried with positive probability. In other words:

$$f_{\theta^*}(t,\; M_{\mathrm{ctx}} \odot x_t,\; M_{\mathrm{ctx}})_d = \mathbb{E}\big[x_d \mid M_{\mathrm{ctx}} \odot x_t\big] \quad \text{for every dimension } d \text{ with } \Pr(d \in \mathrm{query}) > 0.$$

This means the context-query partitioning strategy must match the observation pattern:
- Pixel-level observations → Pixel-level context sampling
- Block-wise observations → Block-wise context sampling
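To make the matching principle concrete, here is a minimal sketch of both partitioning strategies. The function names, the NumPy implementation, and the `p_ctx` split probability are illustrative assumptions, not the repo's actual API; the only property that matters is that every observed entry ends up in the query set with positive probability, and that blocks are kept intact under block-wise observation.

```python
import numpy as np

def pixel_context_query_masks(M, p_ctx=0.5, rng=np.random):
    """Pixel-level observations -> pixel-level partitioning (hypothetical helper).
    Each observed entry joins the context with probability p_ctx, otherwise the
    query, so every observed entry is queried with probability 1 - p_ctx > 0."""
    ctx = (rng.random(M.shape) < p_ctx) & (M > 0)
    qry = (M > 0) & ~ctx
    return ctx.astype(M.dtype), qry.astype(M.dtype)

def block_context_query_masks(M, block=8, p_ctx=0.5, rng=np.random):
    """Block-wise observations -> block-wise partitioning (hypothetical helper).
    Whole (block x block) tiles are assigned to context or query together."""
    H, W = M.shape
    coarse = rng.random((H // block + 1, W // block + 1)) < p_ctx
    ctx_grid = np.kron(coarse, np.ones((block, block)))[:H, :W]  # upsample tile decisions
    ctx = (ctx_grid > 0) & (M > 0)
    qry = (M > 0) & ~ctx
    return ctx.astype(M.dtype), qry.astype(M.dtype)
```

In both cases the two masks partition the observed entries exactly: `M_ctx + M_qry == M`, with no overlap.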
The method is straightforward to implement and requires no special tricks or hyperparameter tuning beyond what is described in the paper. The core algorithm can be summarized as:
```python
# Training loop (pseudocode)
for x_obs, M in dataloader:           # x_obs: partial observation, M: observation mask
    t = uniform(0, 1)                 # diffusion time
    noise = randn_like(x_obs)
    x_obs_t = M * (alpha_t * x_obs + sigma_t * noise)  # noise only the observed entries

    # Key: sample context/query masks following the SAME pattern as observation masks
    M_ctx, M_qry = sample_context_query_masks(M)  # must satisfy Principle 1!

    x_pred = model(t, M_ctx * x_obs_t, M_ctx)
    loss = ((M_qry * (x_pred - x_obs)) ** 2).mean()  # supervise only the query entries
    loss.backward()
```

The only critical requirement: your context-query sampling strategy must ensure that every dimension has a positive probability of being queried. Match the sampling structure to your observation pattern.
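One way to sanity-check this requirement is to estimate, by Monte Carlo, how often each observed entry lands in the query set under your sampler. The helper below and its `sample_masks` argument are hypothetical; it only assumes your sampler returns `(M_ctx, M_qry)` with `M_ctx + M_qry == M`.

```python
import numpy as np

def query_coverage(M, sample_masks, n_trials=200):
    """Estimate P(entry is queried) for each observed entry of M.
    `sample_masks(M)` is your context/query sampler (hypothetical signature:
    returns a pair (M_ctx, M_qry) partitioning the observed entries)."""
    counts = np.zeros_like(M, dtype=float)
    for _ in range(n_trials):
        _, M_qry = sample_masks(M)
        counts += M_qry
    freq = counts / n_trials
    # Every observed entry should be queried with positive probability
    return freq, bool(np.all(freq[M > 0] > 0))
```

If the returned flag is `False`, some observed dimension is never supervised and the theorem's guarantee does not apply to it.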
```python
# Single-step sampling (for well-constrained problems)
def impute(x_obs, M, model, K=10, delta=1e-3):
    # Lightly noise the observations at a small time delta
    x_delta = alpha_delta * x_obs + sigma_delta * randn_like(x_obs)
    predictions = []
    for _ in range(K):                    # ensemble over K random context masks
        M_ctx = sample_context_mask(M)
        pred = model(delta, M_ctx * x_delta, M_ctx)
        predictions.append(pred)
    return mean(predictions)              # average the ensemble
```

For questions, please open an issue or contact:
- Zihan Zhou: zihanzhou1@link.cuhk.edu.cn
- Tianshu Yu: yutianshu@cuhk.edu.cn
This work was supported by The Chinese University of Hong Kong, Shenzhen and Shanghai Artificial Intelligence Laboratory.
This project is licensed under the MIT License - see the LICENSE file for details.
