2. We propose WGAN with gradient penalty, which does not suffer from the same issues.
3. We show that our method converges faster and generates higher-quality samples than standard WGAN.
4. We show that our method enables very stable GAN training: with almost no hyperparameter tuning, we can successfully train a wide variety of difficult GAN architectures for image generation and language modeling.
2 Background
2.1 Generative adversarial networks
The GAN training strategy is to define a game between two competing networks. The generator
network maps a source of noise to the input space. The discriminator network receives either a
generated sample or a true data sample and must distinguish between the two. The generator is
trained to fool the discriminator.
More formally, we can express the game between the generator G and the discriminator D with the minimax objective:
\[
\min_G \max_D \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[\log(D(x))\right] + \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[\log(1 - D(\tilde{x}))\right], \tag{1}
\]
where $\mathbb{P}_r$ is the data distribution and $\mathbb{P}_g$ is the model distribution implicitly defined by $\tilde{x} = G(z)$, $z \sim p(z)$ (the input $z$ to the generator is sampled from some simple noise distribution, such as the uniform distribution or a spherical Gaussian distribution).
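As a concrete illustration, a minimal PyTorch sketch of the losses implied by Eq. 1 might look as follows (the networks G and D and the stability constant eps are illustrative assumptions, not part of the original formulation):

```python
import torch

def gan_losses(D, G, x_real, z, eps=1e-8):
    """Discriminator and generator losses for the minimax game of Eq. 1.

    Assumes D maps inputs to real-valued logits and G maps noise z to samples.
    """
    x_fake = G(z)
    d_real = torch.sigmoid(D(x_real))   # D(x) in Eq. 1
    d_fake = torch.sigmoid(D(x_fake))   # D(x_tilde) in Eq. 1

    # The discriminator ascends E[log D(x)] + E[log(1 - D(x_tilde))];
    # negate it so a standard optimizer can minimize.
    d_loss = -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())
    # The generator descends E[log(1 - D(x_tilde))] (the saturating form in Eq. 1).
    g_loss = torch.log(1.0 - d_fake + eps).mean()
    return d_loss, g_loss
```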
If the discriminator is trained to optimality before each generator parameter update, then minimiz-
ing the value function amounts to minimizing the Jensen-Shannon divergence between the data and
model distributions on x. Doing so is expensive and often leads to vanishing gradients as the dis-
criminator saturates; in practice, this requirement is relaxed, and the generator and the discriminator
are updated simultaneously. The consequence of this relaxation is that generator updates minimize
a stochastic lower bound on the JS-divergence (Goodfellow, 2014; Poole et al., 2016). Minimizing a
lower bound can lead to meaningless gradient updates, since pushing down the lower bound doesn’t
imply that the loss is actually decreasing, even as the bound goes to 0. This inherent problem in
GANs of trading off unreliable updates and vanishing gradients is one of the main causes of GAN
instability, as thoroughly explored in Arjovsky & Bottou (2017). As shown in Arjovsky et al. (2017),
Wasserstein GANs don’t suffer from this inherent problem.
Again, the GAN value function by itself is hard to optimize: a discriminator that is confident in its predictions has a vanishing gradient with respect to its input, which is especially harmful early in training. This is why training the discriminator closer to optimality typically degrades the training procedure.
To circumvent this difficulty, the generator is usually trained to maximize $\mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[\log(D(\tilde{x}))\right]$ instead. However, this loss function was shown to misbehave as well in the presence of a good discriminator (Arjovsky & Bottou, 2017).
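A sketch of this non-saturating heuristic, under the same illustrative PyTorch assumptions as above:

```python
import torch

def g_loss_nonsaturating(D, G, z, eps=1e-8):
    # Maximize E[log D(x_tilde)] by minimizing its negative; unlike the
    # saturating loss in Eq. 1, the gradient remains informative even when
    # the discriminator confidently rejects generated samples.
    d_fake = torch.sigmoid(D(G(z)))
    return -torch.log(d_fake + eps).mean()
```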
2.2 Wasserstein GANs
At a more fundamental level, Arjovsky et al. (2017) argue that the Jensen-Shannon divergence, along with other common distances and divergences, is potentially not continuous and thus does not provide a usable gradient for the generator.
An alternative is proposed in the form of the Earth-Mover (also called Wasserstein-1) distance
W (q, p), which is informally defined as the minimum cost of transporting mass in order to transform
the distribution q into the distribution p (where the cost is mass times transport distance). The Earth-
Mover distance is shown to have the desirable property that under mild assumptions it is continuous
everywhere and differentiable almost everywhere.
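For concreteness, the informal description above corresponds to the primal form given in Arjovsky et al. (2017):
\[
W(q, p) = \inf_{\gamma \in \Pi(q, p)} \mathbb{E}_{(x, y) \sim \gamma}\big[\, \|x - y\| \,\big],
\]
where $\Pi(q, p)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $q$ and $p$, respectively; intuitively, $\gamma(x, y)$ indicates how much mass must be transported from $x$ to $y$ to transform $q$ into $p$.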
The value function for a WGAN is constructed by applying the Kantorovich-Rubinstein duality
(Villani, 2008) to obtain
\[
\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\left[D(x)\right] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\left[D(\tilde{x})\right] \tag{2}
\]
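A minimal PyTorch sketch of this value function, again with illustrative networks G and D (here D is a real-valued critic with no sigmoid; enforcing the constraint $D \in \mathcal{D}$ is handled separately and is not shown):

```python
import torch

def wgan_losses(D, G, x_real, z):
    """Critic and generator losses for the WGAN value function of Eq. 2."""
    x_fake = G(z)
    # The critic ascends E[D(x)] - E[D(x_tilde)]; written as a loss to minimize.
    d_loss = -(D(x_real).mean() - D(x_fake).mean())
    # The generator descends -E[D(x_tilde)], raising the critic's score on fakes.
    g_loss = -D(x_fake).mean()
    return d_loss, g_loss
```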