
Figure 2. Examples of 16× super-resolution. (a) LR input. (b) ESRGAN [45], which trains a simple end-to-end GAN and loses inherent image information. (c) GLEAN [4], which achieves more realistic details through additional StyleGAN [16] priors but still produces unnatural textures and GAN-specific artifacts. (d) IDM, whose implicit continuous representation with a scale-adaptive conditioning mechanism generates output with high-fidelity details and retains the identity of the ground truth. (e) The ground truth.
Therefore, they resort to a complicated cascaded structure [13] or two-stage training strategies [10, 33, 34] to achieve multiple combined magnifications, or retrain the model for each specific resolution [35], which incurs extra training cost.
To address these issues, this paper presents a novel Implicit
Diffusion Model (IDM) for high-fidelity image SR across a
continuous range of resolutions. We exploit the strength of diffusion models in synthesizing fine image details to improve the fidelity of SR results, and introduce an implicit image function to overcome the fixed-resolution limitation. In particular, we formulate continuous image super-resolution as a denoising diffusion process and leverage the appealing property of implicit neural representations: encoding an image as a function over a continuous space. When incorporated into the diffusion model, this function is parameterized by a coordinate-based Multi-Layer Perceptron (MLP) to better capture resolution-continuous image representations.
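The idea of a coordinate-based MLP can be sketched as follows. This is a minimal, hypothetical illustration (the layer sizes, names, and the choice of ReLU/tanh are assumptions, not the paper's architecture): the network maps a continuous 2D pixel coordinate plus a conditioning feature vector to an RGB value, so the same representation can be queried on a grid of any resolution.

```python
import numpy as np

# Hypothetical sizes for illustration only.
FEAT_DIM, HIDDEN = 16, 32
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (2 + FEAT_DIM, HIDDEN))
W2 = rng.normal(0, 0.1, (HIDDEN, 3))

def implicit_mlp(coord, feat):
    """coord: (N, 2) pixel coordinates in [-1, 1]; feat: (N, FEAT_DIM) features."""
    x = np.concatenate([coord, feat], axis=-1)
    h = np.maximum(x @ W1, 0.0)   # ReLU hidden layer
    return np.tanh(h @ W2)        # RGB values in (-1, 1)

def grid(n):
    """Build an n x n coordinate grid in [-1, 1]^2, flattened to (n*n, 2)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n))
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

# Query the representation at a 32 x 32 resolution; any other grid size
# could be used with the same MLP, which is the point of the formulation.
feat32 = rng.normal(size=(32 * 32, FEAT_DIM))
rgb32 = implicit_mlp(grid(32), feat32)
print(rgb32.shape)  # (1024, 3)
```

Because the coordinates are continuous, changing `grid(32)` to `grid(100)` renders the same underlying function at a different resolution without retraining.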
At a high level, IDM iteratively leverages the denoising dif-
fusion model and the implicit image function, which is im-
plemented in the upsampling layers of the U-Net architecture.
Fig. 1(d) illustrates that IDM achieves continuously modu-
lated results within a wide range of resolutions. Accordingly,
we develop a scale-adaptive conditioning mechanism consist-
ing of an LR conditioning network and a scaling factor. The
LR conditioning network can encode LR images without pri-
ors and provide multi-resolution features for the iterative de-
noising steps. The scaling factor controls the output resolution continuously: through an adaptive MLP, it adjusts how strongly the encoded LR features and the generated features are expressed. It is worth noting that, unlike previous methods that rely on two-stage synthesis pipelines [9, 13, 33] or additional priors [4, 26, 44], IDM enjoys an elegant end-to-end training framework. As shown in Fig. 2, IDM outperforms previous works in synthesizing photographic image details.
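The gating behavior of the scale-adaptive conditioning can be sketched as below. All names and layer sizes here are hypothetical: a small "adaptive MLP" maps the scaling factor to per-channel gates in (0, 1), which then blend the encoded LR features with the generated diffusion features as a convex combination.

```python
import numpy as np

# Hypothetical channel count and MLP sizes for illustration only.
C = 8
rng = np.random.default_rng(1)
W1 = rng.normal(0, 0.5, (1, 16))
W2 = rng.normal(0, 0.5, (16, C))

def gates(s):
    """Map a scalar scaling factor s to per-channel gates in (0, 1)."""
    h = np.maximum(np.array([[s]]) @ W1, 0.0)   # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))      # sigmoid gate

def blend(lr_feat, gen_feat, s):
    """Convex combination of LR-encoded and generated features, gated by s."""
    w = gates(s)
    return w * lr_feat + (1.0 - w) * gen_feat

lr = rng.normal(size=(1, C))
gen = rng.normal(size=(1, C))
out_small = blend(lr, gen, s=2.0)    # different magnifications produce
out_large = blend(lr, gen, s=16.0)   # different feature mixtures
print(out_small.shape, out_large.shape)
```

Because the gate lies in (0, 1), each output channel stays between the corresponding LR and generated feature values, so the factor only redistributes how much each source is expressed.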
The main contributions of this paper are summarized as fol-
lows:
• We develop an Implicit Diffusion Model (IDM) for
continuous image super-resolution to reconstruct photo-
realistic images in an end-to-end manner. Iterative im-
plicit denoising diffusion is performed to learn resolution-
continuous representations that enhance the high-fidelity
details of SR images.
• We design a scale-adaptive conditioning mechanism that dynamically adjusts the ratio between the realistic information from LR features and the generated fine details in the diffusion process. An adaptive MLP realizes this adjustment when SR outputs of varying sizes are required.
• We conduct extensive experiments on key benchmarks for natural and facial image SR. IDM achieves state-of-the-art qualitative and quantitative results compared with previous works and yields high-fidelity, resolution-continuous outputs.
2. Related Work
Implicit Neural Representation. In recent years, implicit
neural representations have shown extraordinary capability in
modeling 3D object shapes, synthesizing 3D surfaces of the
scene, and capturing complicated 3D structures [3, 27–29, 36–
38]. Particularly, methods based on Neural Radiance Fields
(NeRF) [2, 28] utilize Multi-Layer Perceptrons (MLPs) to ren-
der 3D-consistent images with refined texture details. Owing to their outstanding performance in 3D tasks, implicit neural representations have been extended to 2D images. Instead of parameterizing 2D shapes with ReLU MLPs as in early works [31, 40], SIREN [37] employs periodic activation functions to model high-quality image representations with fast convergence. LIIF [6] significantly improves the representation of natural and complex images with local latent codes, which can restore images at arbitrary resolutions. However, the high-resolution results generated by LIIF are constrained by the prior LR information, resulting in over-smoothed outputs that lose high-frequency details.
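The periodic activation in SIREN [37] amounts to a linear layer followed by a scaled sine, sin(ω₀(Wx + b)). The sketch below is illustrative only: ω₀ = 30 follows the SIREN paper, but the weight initialization here is simplified rather than SIREN's principled scheme.

```python
import numpy as np

OMEGA_0 = 30.0  # frequency scale used by SIREN

def siren_layer(x, W, b):
    """A SIREN-style layer: sine activation applied to a scaled affine map."""
    return np.sin(OMEGA_0 * (x @ W + b))

rng = np.random.default_rng(2)
coords = np.linspace(-1, 1, 5).reshape(-1, 1)  # (5, 1) 1-D coordinates
W = rng.uniform(-1, 1, (1, 4))                 # simplified init, not SIREN's
b = rng.uniform(-1, 1, (4,))

out = siren_layer(coords, W, b)
print(out.shape)  # (5, 4), all values bounded in [-1, 1] by the sine
```

The bounded, infinitely differentiable sine activation is what lets such layers represent fine high-frequency image content that ReLU MLPs tend to smooth away.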