Enhancing Image Alignment via Diffusion Model Based View Synthesis
Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly rely on optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, which degrade the visual quality of the aligned images and reduce accuracy in downstream tasks. We present DMAligner, a diffusion-based framework that performs image alignment through alignment-oriented view synthesis. DMAligner approaches these challenges from a different angle: it uses a generation-based solution that avoids the problems associated with flow-based warping. We propose a Dynamics-aware Diffusion Training approach with a Dynamics-aware Mask Producing (DMP) module that adaptively distinguishes dynamic foreground regions from static backgrounds. We also develop the DSIA dataset, with 1,033 scenes and 30K+ image pairs tailored for image alignment. Extensive experiments demonstrate the superiority of DMAligner on the DSIA, Sintel, and DAVIS benchmarks.
Optical flow + warping moves pixels from one frame to another, but occlusions and disocclusions often lead to ghosting artifacts.
DMAligner synthesizes the aligned view directly, allowing the model to fill occluded regions and reconstruct dynamic content.
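The contrast above can be illustrated with a toy forward-warping sketch. This is not the baseline used in the paper; the function name `forward_warp`, the integer flow field, and the one-row scene are all assumptions made for illustration. It shows how moving pixels along a flow field leaves holes where content was uncovered (disocclusion) and ambiguous collisions where two sources land on the same pixel (occlusion).

```python
def forward_warp(image, flow):
    """Forward-warp a 2D grid of values using an integer flow (dx, dy) per pixel.

    Returns the warped grid and a per-pixel hit count:
    hits == 0 marks a hole (disocclusion), hits > 1 marks a collision (occlusion).
    """
    height, width = len(image), len(image[0])
    warped = [[None] * width for _ in range(height)]
    hits = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            tx, ty = x + dx, y + dy
            if 0 <= tx < width and 0 <= ty < height:
                # Without depth ordering, the last writer wins on collisions.
                warped[ty][tx] = image[y][x]
                hits[ty][tx] += 1
    return warped, hits


# Toy 1x5 scene: foreground "F" at x=1 moves two pixels right; the background is static.
image = [["a", "F", "b", "c", "d"]]
flow = [[(0, 0), (2, 0), (0, 0), (0, 0), (0, 0)]]
warped, hits = forward_warp(image, flow)
```

In this toy case the pixel that "F" vacated receives no source at all, and the pixel it moves to is claimed by both "F" and the background. A warping pipeline has to patch both cases heuristically, whereas a generative model can synthesize plausible content for them.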
The Dynamic Scene Image Alignment (DSIA) dataset is the first large-scale dataset specifically designed for image alignment. Built with Blender, it includes 1,033 indoor and outdoor scenes with over 30,000 image pairs at 960×540 resolution.
DSIA renders the target aligned image at the reference camera pose and target time, enabling clean supervision.
Overview of DMAligner: dynamics-aware conditioning guides diffusion-based image alignment.
Perform denoising in latent space and predict the aligned image representation directly.
Estimate dynamic regions from cross-frame latent correlation to focus on hard areas.
Combine static cues from the reference frame with dynamic cues from the target-time frame.
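The last two points can be sketched in a few lines. The page does not specify how the DMP module is built, so everything here is an assumption: the helper names `dynamics_mask` and `combine_cues`, the use of cosine similarity as the correlation measure, the hard threshold, and the simplification that both latents are compared at the same spatial location. Each latent is represented as a flat list of per-location feature vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dynamics_mask(ref_latent, tgt_latent, threshold=0.5):
    """Return 1.0 where the two frames correlate poorly (assumed dynamic), else 0.0."""
    return [
        1.0 if cosine_similarity(r, t) < threshold else 0.0
        for r, t in zip(ref_latent, tgt_latent)
    ]


def combine_cues(ref_latent, tgt_latent, mask):
    """Blend per location: target-time features where dynamic, reference features where static."""
    return [
        [m * t + (1 - m) * r for r, t in zip(ref_vec, tgt_vec)]
        for ref_vec, tgt_vec, m in zip(ref_latent, tgt_latent, mask)
    ]


# Three latent locations with 2-dimensional features (hypothetical values).
ref_latent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_latent = [[1.0, 0.0], [1.0, 0.0], [2.0, 2.0]]

mask = dynamics_mask(ref_latent, tgt_latent)
conditioning = combine_cues(ref_latent, tgt_latent, mask)
```

Only the middle location is flagged as dynamic, because its features are orthogonal across the two frames, so the blended conditioning takes target-time features there and reference features elsewhere. A trained module would more likely produce a soft, learned mask rather than a hard threshold.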
To make the qualitative results more intuitive and easier to inspect, each demo is shown as an animated GIF comparison rather than only as static input and prediction images.