用核预测网络去噪.pdf

poPoq

289

11页

0次

2021-05-02

50墨值下载

Burst Denoising with Kernel Prediction Networks

Ben Mildenhall

1,2

∗

Jonathan T. Barron

Jiawen Chen

Dillon Sharlet

Ren Ng

Robert Carroll

UC Berkeley

Google Research

Abstract

We present a technique for jointly denoising bursts of im-

ages taken from a handheld camera. In particular, we pro-

pose a convolutional neural network architecture for pre-

dicting spatially varying kernels that can both align and de-

noise frames, a synthetic data generation approach based

on a realistic noise formation model, and an optimization

guided by an annealed loss function to avoid undesirable

local minima. Our model matches or outperforms the state-

of-the-art across a wide range of noise levels on both real

and synthetic data.

1. Introduction

The task of image denoising is foundational to the study

of imaging and computer vision. Traditionally, the prob-

lem of single-image denoising has been addressed as one

of statistical inference using analytical priors [20, 22], but

recent work has built on the success of deep learning by us-

ing convolutional neural networks that learn mappings from

noisy images to noiseless images by training on millions of

examples [29]. These networks appear to learn the likely

appearance of “ground truth” noiseless images in addition

to the statistical properties of the noise present in the input

images.

Multiple-image denoising has also traditionally been ap-

proached through the lens of classical statistical inference,

under the assumption that averaging multiple noisy and in-

dependent samples of a signal will result in a more accu-

rate estimate of the true underlying signal. However, when

denoising image bursts taken with handheld cameras, sim-

ple temporal averaging yields poor results because of scene

and camera motion. Many techniques attempt to ﬁrst align

the burst or include some notion of translation-invariance

within the denoising operator itself [8]. The idea of de-

noising by combining multiple aligned image patches is

also key to many of the most successful single image tech-

niques [3, 4], which rely on the self-similarity of a single

∗

Work done while interning at Google.

image to allow some degree of denoising via averaging.

We propose a method for burst denoising with the signal-

to-noise ratio beneﬁts of multi-image denoising and the

large capacity and generality of convolutional neural net-

works. Our model is capable of matching or outperforming

the state-of-the-art at all noise levels on both synthetic and

real data. Our contributions include:

1. A procedure for converting post-processed images

taken from the internet into data with the character-

istics of raw linear data captured by real cameras. This

lets us to train a model that generalizes to real images

and circumvents the difﬁculties in acquiring ground

truth data for our task from a camera.

2. A network architecture that outperforms the state-

of-the-art on synthetic and real data by predicting a

unique 3D denoising kernel to produce each pixel of

the output image. This provides both a performance

improvement over a network that synthesizes pixels di-

rectly, and a way to visually inspect how each burst

image is being used.

3. A training procedure for our kernel prediction network

that allows it to predict ﬁlter kernels that use infor-

mation from multiple images even in the presence of

small unknown misalignments.

4. A demonstration that a network that takes the noise

level of the input imagery as input during training and

testing generalizes to a much wider range of noise lev-

els than a blind denoising network.

2. Related work

Single-image denoising is a longstanding problem, origi-

nating with classical methods like anisotropic diffusion [20]

or total variation denoising [22], which used analytical pri-

ors and non-linear optimization to recover a signal from a

noisy image. These ideas were built upon to develop multi-

image or video denoising techniques such as VBM4D [17]

and non-local means [3, 14], which group similar patches

across time and jointly ﬁlter them under the assumption

that multiple noisy observations can be averaged to better

estimate the true underlying signal. Recently these ideas

arXiv:1712.02327v2 [cs.CV] 29 Mar 2018

(a) Reference (b) Average (c) HDR+ (d) NLM (e) VBM4D (f) Ours (KPN)Reference frame

Figure 1: A qualitative evaluation of our model on real image bursts from a handheld camera in a low-light environment.

The reference frame from the input burst (a) is sharp, but noisy. Noise can be reduced by simply averaging a burst of similar

images (b), but this can fail in the presence of motion (see Figure 8). Our approach (f) learns to use the information present

in the entire burst to denoise a single frame, producing lower noise and avoiding artifacts compared to baseline techniques (c

– e). See the supplement for full resolution images and more examples.

have been retargeted towards the task of denoising a burst

of noisy images captured from commodity mobile phones,

with an emphasis on energy efﬁciency and speed [8, 16].

These approaches ﬁrst align image patches to within a few

pixels and then perform joint denoising by robust averag-

ing (such as Wiener ﬁltering). Another line of work has

focused on achieving high quality by combining multiple

image formation steps with a single linear operator and us-

ing modern optimization techniques to solve the associated

inverse problem [11, 10]. These approaches generalize to

multiple image denoising but require calculating alignment

as part of the forward model.

The success of deep learning has yielded a number of

neural network approaches to multi-image denoising [29,

27], in addition to a wide range of similar tasks such as joint

denoising and demosaicking [7], deblurring [24], and su-

perresolution [25]. Similar in spirit to our method, Kernel-

Predicting Networks [2] denoise Monte Carlo renderings

with a network that generates a ﬁlter for every pixel in

the desired output, which constrains the output space and

thereby prevents artifacts. Similar ideas have been applied

successfully to both video interpolation [18, 19] and video

prediction [6, 15, 28, 5], where applying predicted optical

ﬂow vectors or ﬁlters to the input image data helps prevent

the blurry outputs often produced by direct pixel synthesis

networks.

3. Problem speciﬁcation

Our goal is to produce a single clean image from a noisy

burst of N images captured by a handheld camera. Fol-

lowing the design of recent work [8], we select one image

in the burst as the “reference” and denoise it with the

help of “alternate” frames X

, . . . , X

. It is not necessary

for X

to be the ﬁrst image acquired. All input images

are in the raw linear domain to avoid losing signal due to

the post-processing performed between capture and display

(e.g., demosaicking, sharpening, tone mapping, and com-

pression). Creating training examples for this task requires

careful consideration of the characteristics of raw sensor

data.

3.1. Characteristics of raw sensor data

Camera sensors output raw data in a linear color space,

where pixel measurements are proportional to the number

of photoelectrons collected. The primary sources of noise

are shot noise, a Poisson process with variance equal to the

signal level, and read noise, an approximately Gaussian pro-

cess caused by a variety of sensor readout effects. These ef-

fects are well-modeled by a signal-dependent Gaussian dis-

tribution [9]:

∼ N



, σ

+ σ



(1)

where x

is a noisy measurement of the true intensity y

at pixel p. The noise parameters σ

and σ

are ﬁxed for

of 11

50墨值下载

自动驾驶

关注

评论