ArtiFixer: Enhancing and Extending 3D Reconstruction with Autoregressive Diffusion Models

MipNeRF 360 Comparisons

We render novel orbit trajectories and compare ArtiFixer3D+ to its base 3DGUT rendering, GenFusion, and GSFixer on all scenes in MipNeRF 360's most challenging 3-view split. To our knowledge, our quality exceeds that of all previously published work, including the previous SOTA method CAT3D, as shown in the examples provided on their website.

DL3DV Comparisons

We compare our method to a variety of generative baselines on sparse reconstructions from the DL3DV-10K dataset.

We compare ArtiFixer3D+ on DL3DV to 3DGUT and two baselines that build upon bidirectional video diffusion models. GenFusion's base model generates 16 frames at a time, requiring an iterative distillation process that leads to blurry results, especially in empty areas. Gen3C's renderings are sharper but often do not respect the source content. Our method can reconstruct plausible and consistent geometry even when the initial rendering is highly degraded.

Nerfbusters Comparisons

As in the other datasets, our method is the only one that can generate plausible visuals in unobserved areas while respecting source fidelity.

Conditioning

We drop the initial rendering condition, forcing the model to reconstruct the scene from the reference views. Although fidelity drops somewhat, the high-level structure of the scene remains intact along with the correct camera motion.

Video panels: Prediction, Ground Truth, Reference Views

ArtiFixer retains strong generative ability thanks to opacity mixing and training dropout, and can generate videos from text prompts alone, as its base model does.

Prompt: A bronze statue of two children standing back to back on a stone pedestal inside a modern exhibition hall. The taller child, wearing a dress and carrying a backpack, has one hand resting on the smaller child's shoulder. The shorter child, dressed in shorts and a shirt, stands with hands behind their back. Both figures are detailed with realistic clothing folds and footwear. The surrounding space features a checkered floor, large display banners with Chinese text and images, and a red table with chairs in the background. The lighting is bright and even, highlighting the texture of the statue and the clean, organized environment. A medium shot captures the full composition of the statue and its immediate surroundings.
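The training dropout mentioned above is not specified in detail on this page. As a rough illustration only, the sketch below shows one common way such dropout is done: each conditioning signal is independently replaced by an empty value with some probability, so that the model also learns to work when a signal is missing. The function name, the condition names, and the probabilities are all hypothetical and are not taken from ArtiFixer.

```python
import random


def drop_conditions(conditions, drop_probs, rng):
    """Return a copy of `conditions` with some entries replaced by None.

    conditions: dict mapping a condition name to its value.
    drop_probs: dict mapping a condition name to a drop probability in [0, 1];
                names missing from drop_probs are never dropped.
    rng:        a random.Random instance, so results are reproducible.
    """
    kept = {}
    for name, value in conditions.items():
        p = drop_probs.get(name, 0.0)
        # rng.random() is in [0, 1), so p == 1.0 always drops and p == 0.0 never does.
        kept[name] = None if rng.random() < p else value
    return kept


# Hypothetical condition names and probabilities, for illustration only.
example = drop_conditions(
    {"rendering": "degraded_frames", "reference_views": "input_images", "text": "prompt"},
    {"rendering": 0.2, "reference_views": 0.2},
    random.Random(0),
)
```

Under this assumption, a training example in which both the rendering and the reference views are dropped leaves only the text prompt, which corresponds to the text-only generation setting shown here.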

ArtiFixer Variants

We evaluate three variants of our method: ArtiFixer, which directly renders novel views from the auto-regressive generator; ArtiFixer3D, which distills its outputs back into the underlying 3D representation; and ArtiFixer3D+, which re-applies the auto-regressive model as post-processing on top of ArtiFixer3D (as in Difix3D+). Although all variants produce similar renderings, ArtiFixer's are slightly sharper, while ArtiFixer3D's are even more consistent with the source images at the cost of some blurriness due to its explicit 3D representation. Re-applying the generator to the improved 3D reconstruction (ArtiFixer3D+) restores some of this sharpness, leading to renderings that are crisper than ArtiFixer3D's and slightly more consistent than ArtiFixer's.
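The relationship between the three variants can be sketched as function composition. This is only a schematic under stated assumptions: `render`, `generate`, and `distill` are hypothetical callables standing in for the 3D renderer, the auto-regressive generator, and the distillation step, and the same trajectory is used for distillation and final rendering to keep the example short.

```python
def artifixer(scene, trajectory, render, generate):
    """ArtiFixer: pass the base renderings straight through the generator."""
    return generate(render(scene, trajectory))


def artifixer3d(scene, trajectory, render, generate, distill):
    """ArtiFixer3D: distill generator outputs into the 3D scene, then render it.

    Returns the improved scene together with its renderings.
    """
    fixed_frames = generate(render(scene, trajectory))
    improved_scene = distill(scene, fixed_frames, trajectory)
    return improved_scene, render(improved_scene, trajectory)


def artifixer3d_plus(scene, trajectory, render, generate, distill):
    """ArtiFixer3D+: re-apply the generator to the ArtiFixer3D renderings."""
    _, frames = artifixer3d(scene, trajectory, render, generate, distill)
    return generate(frames)


# Toy stand-ins (hypothetical) that only track which stages were applied.
def toy_render(scene, trajectory):
    return [f"{scene}@{pose}" for pose in trajectory]


def toy_generate(frames):
    return [f"{frame}+gen" for frame in frames]


def toy_distill(scene, frames, trajectory):
    return scene + "*"
```

With the toy stand-ins, the outputs make the ordering visible: ArtiFixer3D's frames come from the distilled scene with no generator pass on top, while ArtiFixer3D+'s frames have one final generator pass applied.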

Denoising Steps

As our method starts from renderings instead of pure noise, it is able to generate plausible visuals in fewer than four steps in most cases, although sharpness and temporal consistency suffer somewhat in empty areas. To illustrate, we compare different denoising step counts on a trajectory slightly shifted from one previously explored and distilled into the representation (ArtiFixer3D). Rendering is generally stable across the different denoising step counts except for some minor changes near the previously unexplored periphery.
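To make the role of the step count concrete, here is a minimal sketch of a few-step deterministic refinement loop that starts from a rendering rather than from noise. It is not the authors' sampler: the linear level schedule, the interpolation rule, and the `denoise(x, level)` interface (a hypothetical callable returning an estimate of the clean signal) are assumptions chosen for illustration, and frames are represented as flat lists of floats.

```python
def few_step_refine(rendering, denoise, num_steps, start_level=1.0):
    """Refine `rendering` using `num_steps` calls to `denoise`.

    rendering:   list of floats used directly as the starting state.
    denoise:     hypothetical callable (x, level) -> clean-signal estimate.
    num_steps:   number of denoising steps, must be >= 1.
    start_level: assumed degradation level of the starting state, must be > 0.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if start_level <= 0:
        raise ValueError("start_level must be positive")
    x = list(rendering)
    for i in range(num_steps):
        # Levels decrease linearly from start_level down to 0.
        level = start_level * (1 - i / num_steps)
        next_level = start_level * (1 - (i + 1) / num_steps)
        clean = denoise(x, level)
        # Keep a shrinking fraction of the current state; the final step
        # (next_level == 0) lands exactly on the clean estimate.
        w = next_level / level
        x = [w * xi + (1 - w) * ci for xi, ci in zip(x, clean)]
    return x
```

In a loop of this shape, the step count trades compute for refinement: with a single step the output is simply the first clean estimate, while more steps let later estimates correct earlier ones. That is consistent with the observation above that low step counts mainly cost sharpness in regions where the rendering offers little to start from.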