THE MICROSOFT 2016 CONVERSATIONAL SPEECH RECOGNITION SYSTEM
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig
Microsoft Research
ABSTRACT
We describe Microsoft’s conversational speech recognition system,
in which we combine recent developments in neural-network-based
acoustic and language modeling to advance the state of the art on the
Switchboard recognition task. Inspired by machine learning ensem-
ble techniques, the system uses a range of convolutional and recur-
rent neural networks. I-vector modeling and lattice-free MMI train-
ing provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward run-
ning RNNLMs, and word posterior-based system combination pro-
vide a 20% boost. The best single system uses a ResNet architecture
acoustic model with RNNLM rescoring, and achieves a word error
rate of 6.9% on the NIST 2000 Switchboard task. The combined
system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
Index Terms— Conversational speech recognition, convolu-
tional neural networks, recurrent neural networks, VGG, ResNet,
LACE, BLSTM.
1. INTRODUCTION
Recent years have seen a rapid reduction in speech recognition error
rates as a result of careful engineering and optimization of convo-
lutional and recurrent neural networks. While the basic structures
have been well known for a long period [1, 2, 3, 4, 5, 6, 7], it is only
recently that they have dominated the field as the best models for
speech recognition. Surprisingly, this is the case for both acoustic
modeling [8, 9, 10, 11, 12, 13] and language modeling [14, 15]. In
comparison to standard feed-forward MLPs or DNNs, these acoustic
models have the ability to model a large amount of acoustic context
with temporal invariance, and in the case of convolutional models,
with frequency invariance as well. In language modeling, recurrent
models appear to improve over classical N-gram models through the
generalization ability of continuous word representations [16]. In the
meantime, ensemble learning has become commonly used in several
neural models [17, 18, 15], to improve robustness by reducing bias
and variance.
In this paper, we use ensembles of models extensively, as well
as improvements to individual component models, to advance
the state-of-the-art in conversational telephone speech recognition
(CTS), which has been a benchmark speech recognition task since
the 1990s. The main features of this system are:
1. An ensemble of two fundamental acoustic model architec-
tures, convolutional neural nets (CNNs) and long short-term
memory nets (LSTMs), with multiple variants of each
2. An attention mechanism in the LACE CNN which differen-
tially weights distant context [19]
3. Lattice-free MMI training [20, 21]
4. The use of i-vector based adaptation [22] in all models
5. Language model (LM) rescoring with multiple, recurrent
neural net LMs [14] running in both forward and reverse
direction
6. Confusion network system combination [23] coupled with
search for best system subset, as necessitated by the large
number of candidate systems.
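The subset search in item 6 can be pictured as a greedy forward selection over candidate systems: repeatedly add whichever system most improves the combined score on a development set, and stop when no single addition helps. A minimal sketch, where `dev_score` is a hypothetical stand-in for scoring a confusion-network combination (the paper does not specify the exact search procedure):

```python
def greedy_subset_search(systems, dev_score):
    # systems: list of candidate system identifiers.
    # dev_score: callable mapping a subset to its combination score on the
    # dev set (higher is better, e.g. negative WER); hypothetical here.
    chosen, best = [], float("-inf")
    while True:
        candidates = [chosen + [s] for s in systems if s not in chosen]
        if not candidates:
            break
        scored = [(dev_score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda p: p[0])
        if top_score <= best:
            break  # no single addition improves the combination further
        best, chosen = top_score, top
    return chosen, best
```

Greedy selection is not guaranteed to find the globally best subset, but it keeps the number of dev-set combinations linear rather than exponential in the number of candidate systems.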
The remainder of this paper describes our system in detail. Sec-
tion 2 describes the CNN and LSTM models. Section 3 describes
our implementation of i-vector adaptation. Section 4 presents our
lattice-free MMI training process. Language model rescoring is a
significant part of our system, and is described in Section 5. Experi-
mental results are presented in Section 6, followed by a discussion
of related work and conclusions.
2. CONVOLUTIONAL AND LSTM NEURAL NETWORKS
We use three CNN variants. The first is the VGG architecture of [24].
Compared to the networks used previously in image recognition, this
network uses small (3x3) filters, is deeper, and applies up to five con-
volutional layers before pooling. The second network is modeled on
the ResNet architecture [25], which adds highway connections [26],
i.e. a linear transform of each layer’s input to the layer’s output
[26, 27]. The only difference is that we move the Batch Normalization node to immediately before each ReLU activation.
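The resulting ordering is the pre-activation pattern: each transform sees a normalized, rectified input, and the block's input is added back through the identity pass-through. A minimal numpy sketch, with fully connected layers standing in for the paper's convolutions and a simplified batch normalization without learned scale and shift (both simplifications are ours, for illustration):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-feature normalization over the batch axis; no learned
    # scale/shift parameters (illustration only).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, W1, W2):
    # Batch norm placed right before each ReLU, then a linear transform;
    # the input is added back via the identity (highway-style) connection.
    h = np.maximum(batch_norm(x), 0.0) @ W1
    h = np.maximum(batch_norm(h), 0.0) @ W2
    return x + h
```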
The last CNN variant is the LACE (layer-wise context expan-
sion with attention) model [19]. LACE is a TDNN [3] variant in
which each higher layer is a weighted sum of nonlinear transforma-
tions of a window of lower layer frames. In other words, each higher
layer exploits broader context than lower layers. Lower layers fo-
cus on extracting simple local patterns while higher layers extract
complex patterns that cover broader contexts. Since not all frames
in a window carry the same importance, an attention mask is ap-
plied. The LACE model differs from the earlier TDNN models e.g.
[3, 28] in the use of a learned attention mask and ResNet like lin-
ear pass-through. As illustrated in detail in Figure 1, the model is
composed of 4 blocks, each with the same architecture. Each block
starts with a convolution layer with stride 2 which sub-samples the
input and increases the number of channels. This layer is followed
by 4 ReLU-convolution layers with jump links similar to those used
in ResNet. Table 1 compares the layer structure and parameters of
the three CNN architectures.
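The core LACE computation per layer can be sketched as follows: nonlinearly transform each lower-layer frame, take an attention-weighted sum over a window of frames, and add the linear pass-through. This is a simplification of the actual model, which is convolutional and learns a full attention mask rather than the scalar per-offset weights assumed here:

```python
import numpy as np

def lace_layer(x, W, attn):
    # x: (T, d) lower-layer frames; W: (d, d) shared transform applied to
    # each frame; attn: (k,) weights over a window of k frames (k odd).
    T, d = x.shape
    k = attn.shape[0]
    z = np.maximum(x @ W, 0.0)  # nonlinear transform of each frame
    pad = np.zeros((k // 2, d))
    zp = np.vstack([pad, z, pad])  # zero-pad so the output keeps length T
    out = np.empty_like(x)
    for t in range(T):
        # attention-weighted sum over the window centered at frame t
        out[t] = (attn[:, None] * zp[t:t + k]).sum(axis=0)
    return out + x  # ResNet-like linear pass-through
```

Each application widens the context seen by a frame by roughly a factor of the window size, which is how higher layers come to cover broader context than lower ones.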
While our best performing models are convolutional, the use
of long short-term memory networks is a close second. We use a
bidirectional architecture [29] without frame-skipping [9]. The core
model structure is the LSTM defined in [8]. We found that using net-
works with more than six layers did not improve the word error rate
on the development set, and chose 512 hidden units, per direction,
per layer, as that provided a reasonable trade-off between training
time and final model accuracy. Network parameters for different
configurations of the LSTM architecture are summarized in Table 2.
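A single bidirectional layer runs one LSTM pass forward and one backward over the utterance and concatenates the two hidden-state streams; stacking such layers gives the deep BLSTM. A minimal numpy sketch of one layer (gate ordering and the absence of peephole connections are our simplifying assumptions, not a claim about the exact recipe of [8]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b):
    # One unidirectional pass. x: (T, d_in); W: (4h, d_in); U: (4h, h);
    # b: (4h,). Stacked gate order: input, forget, output, cell candidate.
    h = U.shape[1]
    h_t, c_t = np.zeros(h), np.zeros(h)
    outs = []
    for x_t in x:
        z = W @ x_t + U @ h_t + b
        i, f, o, g = z[:h], z[h:2*h], z[2*h:3*h], z[3*h:]
        c_t = sigmoid(f) * c_t + sigmoid(i) * np.tanh(g)
        h_t = sigmoid(o) * np.tanh(c_t)
        outs.append(h_t)
    return np.stack(outs)

def blstm_layer(x, fwd_params, bwd_params):
    # Forward and time-reversed passes, concatenated per frame, so each
    # output frame sees both past and future context.
    fwd = lstm_forward(x, *fwd_params)
    bwd = lstm_forward(x[::-1], *bwd_params)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

With 512 hidden units per direction as in the paper, each frame's output from one such layer is a 1024-dimensional vector feeding the next layer.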
arXiv:1609.03528v2 [cs.CL] 25 Jan 2017