Improved Training of Wasserstein GANs

Ishaan Gulrajani¹, Faruk Ahmed¹, Martin Arjovsky², Vincent Dumoulin¹, Aaron Courville¹,³

¹ Montreal Institute for Learning Algorithms
² Courant Institute of Mathematical Sciences
³ CIFAR Fellow

igul222@gmail.com
{faruk.ahmed,vincent.dumoulin,aaron.courville}@umontreal.ca
ma4371@nyu.edu
Abstract
Generative Adversarial Networks (GANs) are powerful generative models, but
suffer from training instability. The recently proposed Wasserstein GAN (WGAN)
makes significant progress toward stable training of GANs, but can still generate
low-quality samples or fail to converge in some settings. We find that these train-
ing failures are often due to the use of weight clipping in WGAN to enforce a
Lipschitz constraint on the critic, which can lead to pathological behavior. We
propose an alternative method for enforcing the Lipschitz constraint: instead of
clipping weights, penalize the norm of the gradient of the critic with respect to its
input. Our proposed method converges faster and generates higher-quality sam-
ples than WGAN with weight clipping. Finally, our method enables very stable
GAN training: for the first time, we can train a wide variety of GAN architectures
with almost no hyperparameter tuning, including 101-layer ResNets and language
models over discrete data.¹
1 Introduction
Generative Adversarial Networks (GANs) are a powerful class of generative models that cast the
generative modeling problem as a game between two adversarial networks: the generator network
produces synthetic data given some noise source, and the discriminator network discriminates
between the generator's output and true data. GANs typically produce very visually appealing
samples, but are often very hard to train, and much of the recent work on the subject (Salimans et al.,
2016; Arjovsky et al., 2017; Nowozin et al., 2016; Poole et al., 2016; Metz et al., 2016) has been
devoted to finding ways of stabilizing training. Despite this, consistently stable training of GANs
remains an open problem.
In particular, Arjovsky & Bottou (2017) provide an insightful analysis of the convergence properties
of the value function being optimized by GANs. Their proposed alternative, named Wasserstein
GAN (WGAN) (Arjovsky et al., 2017), leverages the Wasserstein distance to produce a value func-
tion which has better theoretical properties than Jensen-Shannon divergence-based value functions.
This new value function gives rise to the additional requirement that the discriminator (referred to in
that work as the critic) must lie within the space of 1-Lipschitz functions, which the authors choose
to enforce through weight clipping.
In this paper, we propose an alternative to weight clipping in the WGAN discriminator. Our contri-
butions are as follows:
1. Through experiments on toy datasets, we outline the ways in which weight clipping in the
discriminator can lead to pathological behavior which hurts stability and performance.
¹ Code for all of our models is available at https://github.com/igul222/improved_wgan_training.

arXiv:1704.00028v1 [cs.LG] 31 Mar 2017
2. We propose WGAN with gradient penalty, which does not suffer from the same issues.
3. We show that our method converges faster and generates higher-quality samples than stan-
dard WGAN.
4. We show that our method enables very stable GAN training: with almost no hyperparame-
ter tuning, we can successfully train a wide variety of difficult GAN architectures for image
generation and language modeling.
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator
network maps a source of noise to the input space. The discriminator network receives either a
generated sample or a true data sample and must distinguish between the two. The generator is
trained to fool the discriminator.
More formally we can express the game between the generator G and the discriminator D with the
minimax objective:
\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}[\log(D(x))] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(1 - D(\tilde{x}))]. \qquad (1)

where \mathbb{P}_r is the data distribution and \mathbb{P}_g is the model distribution implicitly defined by \tilde{x} = G(z), z \sim p(z) (the input z to the generator is sampled from some simple noise distribution, such as the uniform distribution or a spherical Gaussian distribution).
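To make the objective concrete, here is a minimal NumPy sketch of the minibatch estimate of the value in Eq. 1. This is our own illustration rather than code from the paper, and `gan_value` and its inputs are hypothetical names:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Minibatch estimate of the GAN minimax objective (Eq. 1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    The discriminator ascends this value; the generator descends it.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# At the indifferent point D = 1/2 everywhere, the value is -2 log 2,
# the optimum of the inner maximization when P_g = P_r.
print(gan_value([0.5, 0.5], [0.5, 0.5]))  # ~= -1.386
```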
If the discriminator is trained to optimality before each generator parameter update, then minimiz-
ing the value function amounts to minimizing the Jensen-Shannon divergence between the data and
model distributions on x. Doing so is expensive and often leads to vanishing gradients as the dis-
criminator saturates; in practice, this requirement is relaxed, and the generator and the discriminator
are updated simultaneously. The consequence of this relaxation is that generator updates minimize
a stochastic lower-bound to the JS-divergence (Goodfellow, 2014; Poole et al., 2016). Minimizing a
lower bound can lead to meaningless gradient updates, since pushing down the lower bound doesn’t
imply that the loss is actually decreasing, even as the bound goes to 0. This inherent problem in
GANs of trading off unreliable updates and vanishing gradients is one of the main causes of GAN
instability, as thoroughly explored in Arjovsky & Bottou (2017). As shown in Arjovsky et al. (2017),
Wasserstein GANs don’t suffer from this inherent problem.
Again, the GAN value function by itself is hard to optimize: a discriminator confident in its predic-
tions sees its gradient with respect to its input vanish, which is especially hurtful early on in training.
This is why training the discriminator closer to optimality typically degrades the training procedure.
To circumvent this difficulty, the generator is usually trained to maximize \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[\log(D(\tilde{x}))] instead. However, this loss function was shown to misbehave as well in the presence of a good discriminator (Arjovsky & Bottou, 2017).
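The saturation argument above can be checked with a line of calculus per loss. This sketch (ours, with hypothetical names) compares the derivative, with respect to the discriminator's output, of the saturating term log(1 - D(x̃)) against the alternative -log(D(x̃)) when the discriminator confidently rejects a fake sample, i.e. D(x̃) near 0:

```python
def saturating_grad(d):
    # d/dD [log(1 - D)] = -1 / (1 - D): nearly flat when D(x~) is close to 0,
    # so a confident discriminator gives the generator almost no signal.
    return -1.0 / (1.0 - d)

def nonsaturating_grad(d):
    # d/dD [-log(D)] = -1 / D: grows without bound as D(x~) -> 0, so the
    # generator keeps receiving a strong gradient early in training.
    return -1.0 / d

# With D(x~) = 0.01 the alternative loss has roughly 100x the gradient magnitude.
print(saturating_grad(0.01), nonsaturating_grad(0.01))  # ~= -1.01, -100.0
```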
2.2 Wasserstein GANs
At a more fundamental level, Arjovsky et al. (2017) argue that the Jensen-Shannon divergence,
along with other common distances and divergences, is potentially not continuous and thus does not
provide a usable gradient for the generator.
An alternative is proposed in the form of the Earth-Mover (also called Wasserstein-1) distance
W (q, p), which is informally defined as the minimum cost of transporting mass in order to transform
the distribution q into the distribution p (where the cost is mass times transport distance). The Earth-
Mover distance is shown to have the desirable property that under mild assumptions it is continuous
everywhere and differentiable almost everywhere.
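In one dimension the Earth-Mover distance has a closed form that makes the "mass times transport distance" intuition easy to check: the optimal plan matches sorted samples. A small sketch under that assumption (the function is ours, not the paper's):

```python
import numpy as np

def wasserstein1_1d(a, b):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan pairs the i-th smallest point of one
    sample with the i-th smallest of the other, so W(q, p) reduces to the
    mean absolute difference of order statistics.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    assert a.shape == b.shape, "this sketch assumes equal sample sizes"
    return np.mean(np.abs(a - b))

# Shifting a distribution by c moves every unit of mass a distance c,
# so the distance between a sample and its shifted copy is c:
x = np.linspace(-1.0, 1.0, 1000)
print(wasserstein1_1d(x, x + 3.0))  # ~= 3.0
```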
The value function for a WGAN is constructed by applying the Kantorovich-Rubinstein duality
(Villani, 2008) to obtain
\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] \qquad (2)
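Eq. 2, together with the weight-clipping rule this paper argues against, can be sketched in a few lines. This is our illustration; the names are hypothetical, and the clipping constant c = 0.01 follows the original WGAN recipe rather than this paper's method:

```python
import numpy as np

def critic_objective(c_real, c_fake):
    """Minibatch estimate of Eq. 2, which the critic maximizes.

    Unlike Eq. 1, the critic outputs are unbounded real-valued scores,
    not probabilities, so no log is taken.
    """
    return float(np.mean(c_real) - np.mean(c_fake))

def clip_weights(weights, c=0.01):
    """Weight clipping from the original WGAN: after each critic update,
    clamp every parameter to [-c, c] as a crude way to keep the critic
    1-Lipschitz. This is the step the gradient penalty replaces."""
    return [np.clip(w, -c, c) for w in weights]
```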