THE MICROSOFT 2016 CONVERSATIONAL SPEECH RECOGNITION SYSTEM
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig
Microsoft Research
ABSTRACT
We describe Microsoft’s conversational speech recognition system,
in which we combine recent developments in neural-network-based
acoustic and language modeling to advance the state of the art on the
Switchboard recognition task. Inspired by machine learning ensem-
ble techniques, the system uses a range of convolutional and recur-
rent neural networks. I-vector modeling and lattice-free MMI train-
ing provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward run-
ning RNNLMs, and word posterior-based system combination pro-
vide a 20% boost. The best single system uses a ResNet architecture
acoustic model with RNNLM rescoring, and achieves a word error
rate of 6.9% on the NIST 2000 Switchboard task. The combined
system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
Index Terms: Conversational speech recognition, convolutional
neural networks, recurrent neural networks, VGG, ResNet,
LACE, BLSTM.
1. INTRODUCTION
Recent years have seen a rapid reduction in speech recognition error
rates as a result of careful engineering and optimization of convo-
lutional and recurrent neural networks. While the basic structures
have been well known for a long period [1, 2, 3, 4, 5, 6, 7], it is only
recently that they have dominated the field as the best models for
speech recognition. Surprisingly, this is the case for both acoustic
modeling [8, 9, 10, 11, 12, 13] and language modeling [14, 15]. In
comparison to standard feed-forward MLPs or DNNs, these acoustic
models have the ability to model a large amount of acoustic context
with temporal invariance, and in the case of convolutional models,
with frequency invariance as well. In language modeling, recurrent
models appear to improve over classical N-gram models through the
generalization ability of continuous word representations [16]. In the
meantime, ensemble learning has become commonly used in several
neural models [17, 18, 15], to improve robustness by reducing bias
and variance.
In this paper, we use ensembles of models extensively, as well
as improvements to individual component models, to advance
the state of the art in conversational telephone speech recognition
(CTS), which has been a benchmark speech recognition task since
the 1990s. The main features of this system are:
1. An ensemble of two fundamental acoustic model architec-
tures, convolutional neural nets (CNNs) and long-short-term
memory nets (LSTMs), with multiple variants of each
2. An attention mechanism in the LACE CNN which differen-
tially weights distant context [19]
3. Lattice-free MMI training [20, 21]
4. The use of i-vector based adaptation [22] in all models
5. Language model (LM) rescoring with multiple, recurrent
neural net LMs [14] running in both forward and reverse
direction
6. Confusion network system combination [23] coupled with
search for best system subset, as necessitated by the large
number of candidate systems.
The remainder of this paper describes our system in detail. Sec-
tion 2 describes the CNN and LSTM models. Section 3 describes
our implementation of i-vector adaptation. Section 4 presents our
lattice-free MMI training process. Language model rescoring is a
significant part of our system and is described in Section 5. Experi-
mental results are presented in Section 6, followed by a discussion
of related work and conclusions.
2. CONVOLUTIONAL AND LSTM NEURAL NETWORKS
We use three CNN variants. The first is the VGG architecture of [24].
Compared to the networks used previously in image recognition, this
network uses small (3x3) filters, is deeper, and applies up to five con-
volutional layers before pooling. The second network is modeled on
the ResNet architecture [25], which adds highway connections [26],
i.e. a linear transform of each layer’s input to the layer’s output
[26, 27]. The only difference is that we move the Batch Normaliza-
tion node so that it is applied immediately before each ReLU activation.
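For concreteness, the following PyTorch sketch shows one plausible reading of such a residual unit, with every ReLU immediately preceded by a Batch Normalization node; the channel counts, kernel sizes, and exact placement of the second normalization are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Residual unit with Batch Normalization placed right before
    each ReLU and an identity (highway-style) connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))  # BN right before ReLU
        h = self.conv2(h)
        return torch.relu(self.bn2(h + x))       # BN before the final ReLU

x = torch.randn(2, 32, 16, 16)   # dummy (batch, channels, time, freq)
print(ResidualUnit(32)(x).shape)
```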
The last CNN variant is the LACE (layer-wise context expan-
sion with attention) model [19]. LACE is a TDNN [3] variant in
which each higher layer is a weighted sum of nonlinear transforma-
tions of a window of lower layer frames. In other words, each higher
layer exploits broader context than lower layers. Lower layers fo-
cus on extracting simple local patterns while higher layers extract
complex patterns that cover broader contexts. Since not all frames
in a window carry the same importance, an attention mask is ap-
plied. The LACE model differs from earlier TDNN models, e.g.
[3, 28], in its use of a learned attention mask and ResNet-like lin-
ear pass-through. As illustrated in detail in Figure 1, the model is
composed of 4 blocks, each with the same architecture. Each block
starts with a convolution layer with stride 2 which sub-samples the
input and increases the number of channels. This layer is followed
by 4 ReLU-convolution layers with jump links similar to those used
in ResNet. Table 1 compares the layer structure and parameters of
the three CNN architectures.
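The block structure just described can be sketched as follows. This is a rough PyTorch approximation only: the learned attention mask of [19] is omitted, and the channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LACEBlock(nn.Module):
    """One LACE block: a stride-2 convolution that sub-samples the
    input and increases the channel count, followed by 4 convolution
    layers with ReLU activations and ResNet-style jump links."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Sub-sampling convolution: stride 2, widens the channels.
        self.downsample = nn.Conv2d(in_channels, out_channels, 3,
                                    stride=2, padding=1)
        # Four convolution layers, each wrapped by an identity jump link.
        self.convs = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, 3, padding=1)
             for _ in range(4)])

    def forward(self, x):
        x = torch.relu(self.downsample(x))
        for conv in self.convs:
            x = x + torch.relu(conv(x))  # jump link around each layer
        return x

# Four blocks with the same architecture, as in Figure 1.
lace = nn.Sequential(LACEBlock(1, 32), LACEBlock(32, 64),
                     LACEBlock(64, 128), LACEBlock(128, 256))
feats = torch.randn(1, 1, 64, 40)   # dummy (batch, channel, time, freq)
print(lace(feats).shape)
```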
While our best performing models are convolutional, the use
of long short-term memory networks is a close second. We use a
bidirectional architecture [29] without frame-skipping [9]. The core
model structure is the LSTM defined in [8]. We found that using net-
works with more than six layers did not improve the word error rate
on the development set, and chose 512 hidden units, per direction,
per layer, as that provided a reasonable trade-off between training
time and final model accuracy. Network parameters for different
configurations of the LSTM architecture are summarized in Table 2.
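For reference, a minimal PyTorch sketch of the Table 2 configuration (6 bidirectional layers, 512 hidden units per direction, a senone output layer) follows; the input feature dimension shown here and the use of a single stacked nn.LSTM are assumptions for illustration, not the authors' exact implementation.

```python
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, num_senones=9000):
        super().__init__()
        self.blstm = nn.LSTM(input_size=feat_dim, hidden_size=512,
                             num_layers=6, bidirectional=True,
                             batch_first=True)
        # Forward and backward outputs are concatenated: 2 * 512.
        self.output = nn.Linear(2 * 512, num_senones)

    def forward(self, frames):            # frames: (batch, time, feat_dim)
        hidden, _ = self.blstm(frames)
        return self.output(hidden)        # per-frame senone logits
```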
Fig. 1. LACE network architecture
3. SPEAKER ADAPTIVE MODELING
Speaker adaptive modeling in our system is based on conditioning
the network on an i-vector [30] characterization of each speaker
[22, 31]. A 100-dimensional i-vector is generated for each conversation
side. For the LSTM system, the conversation-side i-vector $v_s$ is
appended to each frame of input. For convolutional networks, this
approach is inappropriate because we do not expect to see spatially
contiguous patterns in the input. Instead, for the CNNs, we add a
learnable weight matrix $W_l$ to each layer, and add $W_l v_s$ to the
activation of the layer before the nonlinearity. Thus, in the CNN, the
i-vector essentially serves as an additional bias to each layer. Note
that the i-vectors are estimated using MFCC features; by using them
subsequently in systems based on log-filterbank features, we may
benefit from a form of feature combination.
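A small sketch of the two conditioning schemes just described, in PyTorch; tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

def append_ivector(frames, v_s):
    """LSTM-style conditioning: tile v_s and append it to every frame."""
    # frames: (time, feat_dim); v_s: (ivec_dim,)
    tiled = v_s.unsqueeze(0).expand(frames.size(0), -1)
    return torch.cat([frames, tiled], dim=1)

class IVectorBias(nn.Module):
    """CNN-style conditioning: add W_l v_s to a layer's activation
    before the nonlinearity, i.e. an i-vector-dependent bias."""
    def __init__(self, channels, ivec_dim=100):
        super().__init__()
        self.W_l = nn.Linear(ivec_dim, channels, bias=False)

    def forward(self, activation, v_s):
        # activation: (batch, channels, time, freq); v_s: (batch, ivec_dim)
        bias = self.W_l(v_s)                     # (batch, channels)
        return activation + bias[:, :, None, None]
```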
4. LATTICE-FREE SEQUENCE TRAINING
After standard cross-entropy training, we optimize the model param-
eters using the maximum mutual information (MMI) objective func-
tion. Denoting a word sequence by w and its corresponding acoustic
realization by a, the training criterion is
$$\sum_{(w,a) \in \text{data}} \log \frac{P(w)\, P(a \mid w)}{\sum_{w'} P(w')\, P(a \mid w')} .$$
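As a toy numerical reading of this criterion: the loss for one utterance is the negative log-posterior of the reference word sequence, i.e. its joint log-score minus the log-sum-exp of all hypothesis scores. The sketch below assumes explicit per-hypothesis scores; the actual system evaluates the denominator with an alpha-beta recursion over a finite state acceptor, as discussed next, not by enumeration.

```python
import torch

def mmi_loss(ref_score, all_scores):
    # ref_score: joint log-score log P(w) + log P(a|w) of the reference;
    # all_scores: scores of all hypotheses, including the reference.
    return -(ref_score - torch.logsumexp(all_scores, dim=0))

scores = torch.tensor([-10.2, -11.5, -12.0])  # hypothetical log-scores
print(mmi_loss(scores[0], scores))            # reference is hypothesis 0
```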
As noted in [32, 33], the necessary gradient for use in backpropa-
gation is a simple function of the posterior probability of a particu-
lar acoustic model state at a given time, as computed by summing
over all possible word sequences in an unconstrained manner. As
first done in [20], and more recently in [21], this can be accom-
plished with a straightforward alpha-beta computation over the finite
state acceptor representing the decoding search space. In [20], the
search space is taken to be an acceptor representing the composition
$H \circ C \circ L \circ G$ for a unigram language model $L$ on words.
In [21], a language model on phonemes is used instead.

Table 1. Comparison of CNN architectures

Table 2. Bidirectional LSTM configurations
Hidden-size  Output-size  i-vectors  Depth  Parameters
512          9000         N          6      43.0M
512          9000         Y          6      43.4M
512          27000        N          6      61.4M
512          27000        Y          6      61.8M
In our implementation, we use a mixed-history acoustic unit lan-
guage model. In this model, the probability of transitioning into a
new context-dependent phonetic state (senone) is conditioned on both
the senone and phone history. We found this model to perform bet-
ter than either purely word-based or phone-based models. Based on
a set of initial experiments, we developed the following procedure:
1. Perform a forced alignment of the training data to select lexi-
cal variants and determine frame-aligned senone sequences.
2. Compress consecutive framewise occurrences of a single
senone into a single occurrence.
3. Estimate an unsmoothed, variable-length N-gram language
model from this data, where the history state consists of
the previous phone and previous senones within the current
phone.
To illustrate this, consider the sample senone sequence {s_s2.1288,
s_s3.1061, s_s4.1096}, {eh_s2.527, eh_s3.128, eh_s4.66}, {t_s2.729,
t_s3.572, t_s4.748}. When predicting the state following eh_s4.66,
the history consists of (s, eh_s2.527, eh_s3.128, eh_s4.66), and fol-
lowing t_s2.729, the history is (eh, t_s2.729).
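The following sketch reproduces this history construction, assuming senone names of the form phone_state.id as in the example; the helper names are hypothetical.

```python
def phone_of(senone):
    """Extract the phone from a senone name like 'eh_s2.527'."""
    return senone.split("_")[0]

def prediction_histories(senones):
    """For each senone, yield the (previous phone, senones-so-far in
    the current phone) history used to predict its successor."""
    prev_phone, current = None, []
    for s in senones:
        if current and phone_of(s) != phone_of(current[-1]):
            prev_phone, current = phone_of(current[-1]), []
        current.append(s)
        yield s, (prev_phone, tuple(current))

seq = ["s_s2.1288", "s_s3.1061", "s_s4.1096",
       "eh_s2.527", "eh_s3.128", "eh_s4.66",
       "t_s2.729",  "t_s3.572",  "t_s4.748"]
for s, hist in prediction_histories(seq):
    print(f"after {s}: history {hist}")
# e.g. after eh_s4.66: history ('s', ('eh_s2.527', 'eh_s3.128', 'eh_s4.66'))
```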
We construct the denominator graph from this language model,
and HMM transition probabilities as determined by transition-
counting in the senone sequences found in the training data. Our