DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis

📝 Abstract

Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly rely on optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, which degrade the visual quality of the aligned result and compromise accuracy in downstream tasks. We present DMAligner, a diffusion-based framework that performs image alignment through alignment-oriented view synthesis. DMAligner tackles these challenges from a new perspective: it uses a generation-based solution that avoids the problems associated with flow-based warping. We propose a Dynamics-aware Diffusion Training approach with a Dynamics-aware Mask Producing (DMP) module that adaptively distinguishes dynamic foreground regions from static backgrounds. We also develop the DSIA dataset, with 1,033 scenes and 30K+ image pairs tailored for image alignment. Extensive experiments demonstrate the superiority of DMAligner on the DSIA, Sintel, and DAVIS benchmarks.

🔍 Motivation

DMAligner teaser: flow-based warping compared with diffusion-based view synthesis.

Classic alignment

Optical flow estimation plus warping moves pixels from one frame to another, but occluded and disoccluded regions have no valid source pixels, which often leads to ghosting artifacts.
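To make the ghosting failure concrete, here is a minimal sketch of backward warping on a toy 1-D "image". It is not code from the paper: the function name `backward_warp`, the nearest-neighbour sampling, and the character-based scene are all assumptions chosen for illustration. A foreground object `X` moves three pixels to the right between the reference frame and the source frame while the background stays still.

```python
def backward_warp(source, flow):
    """Nearest-neighbour backward warp of a 1-D signal (illustrative only).

    For every output position x, sample source[x + flow[x]],
    clamping the lookup to the valid index range.
    """
    n = len(source)
    warped = []
    for x, d in enumerate(flow):
        src_x = min(max(x + d, 0), n - 1)
        warped.append(source[src_x])
    return warped


# Toy scene: background letters with a foreground object "X".
reference = list("abXdefgh")   # X sits at index 2
source = list("abcdeXgh")      # X has moved to index 5 and now hides "f"

# Ideal flow from reference to source: only the object pixel moves (+3).
flow = [0, 0, 3, 0, 0, 0, 0, 0]

warped = backward_warp(source, flow)
print("".join(warped))         # abXdeXgh
```

Even with a perfect flow field, index 5 of the warped result shows a second copy of `X` instead of the background `f`, because `f` is hidden in the source frame and warping can only rearrange pixels that exist there.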

Our perspective

DMAligner synthesizes the aligned view directly, allowing the model to fill occluded regions and reconstruct dynamic content.

📊 DSIA Dataset

The Dynamic Scene Image Alignment (DSIA) dataset is the first large-scale dataset specifically designed for image alignment. Built with Blender, it includes 1,033 indoor and outdoor scenes with over 30,000 image pairs at 960×540 resolution.

1,033 Scenes

30K+ Image Pairs

Characters

100 Objects

DSIA renders the target aligned image at the reference camera pose and target time, enabling clean supervision.

1. Build dynamic scenes with characters, objects, and environments.

2. Move camera and foreground objects simultaneously.

3. Render paired inputs and the alignment-oriented ground truth.
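The relationship between the paired inputs and the ground truth can be sketched as a small data structure. This is only an illustration of the description above, not the dataset's actual file format: the class names `View` and `AlignmentSample`, the field names, and the pose labels are hypothetical, and it assumes the second input is captured at the target time from a moved camera.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    camera_pose: str  # hypothetical pose identifier, e.g. "cam_A"
    time: int         # hypothetical time index within the scene


@dataclass(frozen=True)
class AlignmentSample:
    reference: View   # input frame that defines the camera pose to align to
    source: View      # input frame at the target time, seen from a moved camera

    @property
    def ground_truth(self) -> View:
        # Rendered at the reference camera pose and the target time.
        return View(self.reference.camera_pose, self.source.time)


sample = AlignmentSample(reference=View("cam_A", 0), source=View("cam_B", 1))
print(sample.ground_truth)  # View(camera_pose='cam_A', time=1)
```

Because the scene is synthetic, this ground-truth view can be rendered exactly, which is what makes the supervision clean.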

⚙️ DMAligner Framework

Overview of DMAligner: dynamics-aware conditioning guides diffusion-based image alignment.

Latent Diffusion

Perform denoising in latent space and predict the aligned image representation directly.
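As a rough illustration of what denoising in latent space involves, the sketch below applies the standard DDPM forward noising formula to a toy latent vector and shows the clean latent being recovered from it. The function names, the toy values, and the reading of "predict the aligned image representation directly" as a clean-latent regression target are assumptions, not details confirmed by this page.

```python
import math
import random


def add_noise(z0, alpha_bar, rng):
    """Standard DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    zt = [a * z + b * e for z, e in zip(z0, eps)]
    return zt, eps


def recover_clean(zt, eps, alpha_bar):
    """Invert the forward step when the noise is known exactly."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - b * e) / a for z, e in zip(zt, eps)]


def clean_latent_loss(pred_z0, z0):
    """Mean squared error against the clean latent (assumed objective)."""
    return sum((p - z) ** 2 for p, z in zip(pred_z0, z0)) / len(z0)


rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]          # toy latent of the aligned image
alpha_bar = 0.9                # cumulative noise-schedule value at some step
zt, eps = add_noise(z0, alpha_bar, rng)
recovered = recover_clean(zt, eps, alpha_bar)
```

A network trained with a loss like `clean_latent_loss` would output an estimate of `z0` from `zt` and its conditioning, rather than an estimate of `eps`.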

DMP Mask

Estimate dynamic regions from cross-frame latent correlation to focus on hard areas.
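One simple way to picture a correlation-based dynamics mask is to compare latent feature vectors at the same location in the two frames and flag locations where they disagree. The sketch below does exactly that with cosine similarity. It is a toy stand-in, not the DMP module itself: the function names, the same-location comparison, and the `threshold` value are all assumptions, and the real module may correlate across locations to account for camera motion.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def dynamics_mask(ref_latents, tgt_latents, threshold=0.5):
    """Return 1.0 where the two frames disagree (assumed dynamic), else 0.0.

    Both inputs are lists of per-location feature vectors in the same order.
    """
    return [
        1.0 if cosine_similarity(r, t) < threshold else 0.0
        for r, t in zip(ref_latents, tgt_latents)
    ]


ref_latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_latents = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]  # only the middle location changed

mask = dynamics_mask(ref_latents, tgt_latents)
print(mask)  # [0.0, 1.0, 0.0]
```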

Condition Fusion

Combine static cues from the reference frame with dynamic cues from the target-time frame.
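A minimal way to combine the two sources of cues is a per-location blend driven by a dynamics mask: static locations keep the reference-frame features and dynamic locations take the target-time features. The sketch below assumes that blending rule and a hypothetical function name `fuse_conditions`. The actual fusion used by DMAligner is not specified on this page, and this toy version ignores the camera-pose difference between the two frames.

```python
def fuse_conditions(ref_latents, tgt_latents, mask):
    """Blend per-location features: (1 - m) * reference + m * target-time.

    mask holds one value in [0, 1] per location; 1 marks a dynamic location.
    """
    fused = []
    for ref_vec, tgt_vec, m in zip(ref_latents, tgt_latents, mask):
        fused.append([(1.0 - m) * r + m * t for r, t in zip(ref_vec, tgt_vec)])
    return fused


ref_latents = [[1.0, 1.0], [2.0, 2.0]]
tgt_latents = [[9.0, 9.0], [8.0, 8.0]]
mask = [0.0, 1.0]  # first location static, second location dynamic

condition = fuse_conditions(ref_latents, tgt_latents, mask)
print(condition)  # [[1.0, 1.0], [8.0, 8.0]]
```

A soft mask with values between 0 and 1 would interpolate between the two frames instead of switching hard between them.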

🎬 Experiments: GIF Comparisons

To make the qualitative results more intuitive and easier to inspect, each demo is shown as an animated GIF comparison rather than as static input/prediction images alone.

How to read the demos: the left GIF shows the raw input misalignment, while the two right GIFs alternate between one input frame and the DMAligner prediction. Stable regions and reduced ghosting indicate better alignment quality.

📚 Citation

@inproceedings{luo2026dmaligner,
  title={DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis},
  author={Luo, Xinglong and Luo, Ao and Wang, Zhengning and Yang, Yueqi and Feng, Chaoyu and Lei, Lei and Zeng, Bing and Liu, Shuaicheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month={June},
  year={2026},
  pages={16541--16550}
}