2. We propose WGAN with gradient penalty, which does not suffer from the same issues.
3. We show that our method converges faster and generates higher-quality samples than standard WGAN.
4. We show that our method enables very stable GAN training: with almost no hyperparameter tuning, we can successfully train a wide variety of difficult GAN architectures for image generation and language modeling.
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator
network maps a source of noise to the input space. The discriminator network receives either a
generated sample or a true data sample and must distinguish between the two. The generator is
trained to fool the discriminator.
More formally, we can express the game between the generator G and the discriminator D with the minimax objective:
\[
\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[\log(D(x))\right] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[\log(1 - D(\tilde{x}))\right], \tag{1}
\]
where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution, such as the uniform distribution or a spherical Gaussian distribution).
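As a concrete illustration, a minimal PyTorch sketch of the losses implied by Eq. 1 might look as follows (the networks G and D and the stability constant eps are illustrative assumptions, not part of the original formulation):

```python
import torch

def gan_losses(D, G, x_real, z, eps=1e-8):
    """Discriminator and generator losses for the minimax game of Eq. 1.

    Assumes D maps inputs to real-valued logits and G maps noise z to samples.
    """
    x_fake = G(z)
    d_real = torch.sigmoid(D(x_real))   # D(x) in Eq. 1
    d_fake = torch.sigmoid(D(x_fake))   # D(x_tilde) in Eq. 1

    # The discriminator ascends E[log D(x)] + E[log(1 - D(x_tilde))];
    # negate it so a standard optimizer can minimize.
    d_loss = -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())
    # The generator descends E[log(1 - D(x_tilde))] (the saturating form in Eq. 1).
    g_loss = torch.log(1.0 - d_fake + eps).mean()
    return d_loss, g_loss
```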
If the discriminator is trained to optimality before each generator parameter update, then minimiz-
ing the value function amounts to minimizing the Jensen-Shannon divergence between the data and
model distributions on x. Doing so is expensive and often leads to vanishing gradients as the dis-
criminator saturates; in practice, this requirement is relaxed, and the generator and the discriminator
are updated simultaneously. The consequence of this relaxation is that generator updates minimize
a stochastic lower bound on the JS-divergence (Goodfellow, 2014; Poole et al., 2016). Minimizing a
lower bound can lead to meaningless gradient updates, since pushing down the lower bound doesn’t
imply that the loss is actually decreasing, even as the bound goes to 0. This inherent problem in
GANs of trading off unreliable updates and vanishing gradients is one of the main causes of GAN
instability, as thoroughly explored in Arjovsky & Bottou (2017). As shown in Arjovsky et al. (2017),
Wasserstein GANs don’t suffer from this inherent problem.
Again, the GAN value function by itself is hard to optimize: a discriminator that is confident in its predictions has a vanishing gradient with respect to its input, which is especially harmful early in training. This is why training the discriminator closer to optimality typically degrades the training procedure.
To circumvent this difficulty, the generator is usually trained to maximize $\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[\log(D(\tilde{x}))\right]$ instead. However, this loss function was shown to misbehave as well in the presence of a good discriminator (Arjovsky & Bottou, 2017).
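A sketch of this non-saturating heuristic, under the same illustrative PyTorch assumptions as above:

```python
import torch

def g_loss_nonsaturating(D, G, z, eps=1e-8):
    # Maximize E[log D(x_tilde)] by minimizing its negative; unlike the
    # saturating loss in Eq. 1, the gradient remains informative even when
    # the discriminator confidently rejects generated samples.
    d_fake = torch.sigmoid(D(G(z)))
    return -torch.log(d_fake + eps).mean()
```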
2.2 Wasserstein GANs
At a more fundamental level, Arjovsky et al. (2017) argue that the Jensen-Shannon divergence, along with other common distances and divergences, is potentially not continuous and thus does not provide a usable gradient for the generator.
An alternative is proposed in the form of the Earth-Mover (also called Wasserstein-1) distance
W (q, p), which is informally defined as the minimum cost of transporting mass in order to transform
the distribution q into the distribution p (where the cost is mass times transport distance). The Earth-
Mover distance is shown to have the desirable property that under mild assumptions it is continuous
everywhere and differentiable almost everywhere.
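For concreteness, the informal description above corresponds to the primal form given in Arjovsky et al. (2017):
\[
W(q, p) = \inf_{\gamma \in \Pi(q, p)} \mathbb{E}_{(x, y) \sim \gamma}\big[\, \|x - y\| \,\big],
\]
where $\Pi(q, p)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $q$ and $p$, respectively; intuitively, $\gamma(x, y)$ indicates how much mass must be transported from $x$ to $y$ to transform $q$ into $p$.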
The value function for a WGAN is constructed by applying the Kantorovich-Rubinstein duality
(Villani, 2008) to obtain
\[
\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] \tag{2}
\]
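A minimal PyTorch sketch of this value function, again with illustrative networks G and D (here D is a real-valued critic with no sigmoid; enforcing the constraint $D \in \mathcal{D}$ is handled separately and is not shown):

```python
import torch

def wgan_losses(D, G, x_real, z):
    """Critic and generator losses for the WGAN value function of Eq. 2."""
    x_fake = G(z)
    # The critic ascends E[D(x)] - E[D(x_tilde)]; written as a loss to minimize.
    d_loss = -(D(x_real).mean() - D(x_fake).mean())
    # The generator descends -E[D(x_tilde)], raising the critic's score on fakes.
    g_loss = -D(x_fake).mean()
    return d_loss, g_loss
```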