responsible for this gap: its slower training and inference speed, ineffectiveness in dealing with rare words,
and sometimes failure to translate all words in the source sentence. Firstly, it generally takes a considerable
amount of time and computational resources to train an NMT system on a large-scale translation dataset,
thus slowing the rate of experimental turnaround and innovation. At inference time, NMT systems are generally
much slower than phrase-based systems due to the large number of parameters used. Secondly, NMT lacks
robustness in translating rare words. Though this can be addressed in principle by training a “copy model” to
mimic a traditional alignment model [31], or by using the attention mechanism to copy rare words [37], these
approaches are both unreliable at scale, since the quality of the alignments varies across languages, and the
latent alignments produced by the attention mechanism are unstable when the network is deep. Also, simple
copying may not always be the best strategy to cope with rare words, for example when a transliteration
is more appropriate. Finally, NMT systems sometimes produce output sentences that do not translate all
parts of the input sentence – in other words, they fail to completely “cover” the input, which can result in
surprising translations.
This work presents the design and implementation of GNMT, a production NMT system at Google, that
aims to provide solutions to the above problems. In our implementation, the recurrent networks are Long
Short-Term Memory (LSTM) RNNs [23, 17]. Our LSTM RNNs have 8 layers, with residual connections
between layers to encourage gradient flow [21]. For parallelism, we connect the attention from the bottom
layer of the decoder network to the top layer of the encoder network. To improve inference time, we
employ low-precision arithmetic during inference, which is further accelerated by special hardware (Google’s
Tensor Processing Unit, or TPU). To effectively deal with rare words, we use sub-word units (also known as
“wordpieces”) [35] for inputs and outputs in our system. Using wordpieces gives a good balance between the
flexibility of single characters and the efficiency of full words for decoding, and also sidesteps the need for
special treatment of unknown words. Our beam search technique includes a length normalization procedure
to deal efficiently with the problem of comparing hypotheses of different lengths during decoding, and a
coverage penalty to encourage the model to translate all of the provided input.
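As a concrete illustration of this rescoring, the sketch below shows how a candidate hypothesis might be scored with length normalization and a coverage penalty. It is only a sketch: the function name, the parameters alpha and beta, and the particular functional forms of the two terms are illustrative assumptions, not necessarily the exact formulas used in the production system.

```python
import math

def score_hypothesis(log_prob, target_len, attention_probs, alpha=0.6, beta=0.2):
    """Illustrative beam-search rescoring (not the exact GNMT formulas).

    log_prob        -- accumulated log P(Y|X) of the candidate translation Y
    target_len      -- number of target tokens emitted so far
    attention_probs -- attention_probs[i][j] is the attention placed on
                       source token j when emitting target token i
    alpha, beta     -- strengths of length normalization and coverage penalty
    """
    # Length normalization: divide the log-probability by a function of the
    # target length so that hypotheses of different lengths compare fairly.
    length_penalty = ((5.0 + target_len) / 6.0) ** alpha

    # Coverage penalty: reward hypotheses whose attention has covered every
    # source token, discouraging translations that ignore parts of the input.
    num_source = len(attention_probs[0])
    coverage = 0.0
    for j in range(num_source):
        attn_sum = sum(row[j] for row in attention_probs)
        coverage += math.log(min(max(attn_sum, 1e-9), 1.0))  # clamp to avoid log(0)

    return log_prob / length_penalty + beta * coverage
```

During beam search, candidates would be ranked by this combined score rather than by raw log-probability alone, so that shorter hypotheses are not favored merely for accumulating less negative log-probability and hypotheses that ignore parts of the source are penalized.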
Our implementation is robust, and performs well on a range of datasets across many pairs of languages
without the need for language-specific adjustments. Using the same implementation, we are able to achieve
results comparable to or better than previous state-of-the-art systems on standard benchmarks, while
delivering great improvements over Google’s phrase-based production translation system. Specifically, on
WMT’14 English-to-French, our single model scores 38.95 BLEU, an improvement of 7.5 BLEU over a
single model without an external alignment model reported in [31] and an improvement of 1.2 BLEU over
a single model without an external alignment model reported in [45]. Our single model is also comparable
to a single model in [45], while not making use of the external alignment model used in [45]. Likewise, on
WMT’14 English-to-German, our single model scores 24.17 BLEU, which is 3.4 BLEU better than a previous
competitive baseline [6]. On production data, our implementation is even more effective. Human evaluations
show that GNMT has reduced translation errors by 60% compared to our previous phrase-based system
on many pairs of languages: English ↔ French, English ↔ Spanish, and English ↔ Chinese. Additional
experiments suggest the quality of the resulting translation system gets closer to that of average human
translators.
2 Related Work
Statistical Machine Translation (SMT) has been the dominant translation paradigm for decades [3, 4, 5].
Practical implementations of SMT are generally phrase-based systems (PBMT) which translate sequences of
words or phrases where the lengths may differ [26].
Even prior to the advent of direct Neural Machine Translation, neural networks were used as a
component within SMT systems with some success. Perhaps one of the most notable attempts involved the use
of a joint language model to learn phrase representations [13], which yielded an impressive improvement when
combined with phrase-based translation. This approach, however, still makes use of phrase-based translation
systems at its core, and therefore inherits their shortcomings. Other proposed approaches for learning phrase
representations [7] or learning end-to-end translation with neural networks [24] offered encouraging hints, but
ultimately delivered worse overall accuracy compared to standard phrase-based systems.
The concept of end-to-end learning for machine translation has been attempted in the past (e.g., [8]) with