THE MICROSOFT 2016 CONVERSATIONAL SPEECH RECOGNITION SYSTEM
W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig
Microsoft Research
ABSTRACT
We describe Microsoft’s conversational speech recognition system,
in which we combine recent developments in neural-network-based
acoustic and language modeling to advance the state of the art on the
Switchboard recognition task. Inspired by machine learning ensem-
ble techniques, the system uses a range of convolutional and recur-
rent neural networks. I-vector modeling and lattice-free MMI train-
ing provide significant gains for all acoustic model architectures.
Language model rescoring with multiple forward and backward run-
ning RNNLMs, and word posterior-based system combination pro-
vide a 20% boost. The best single system uses a ResNet architecture
acoustic model with RNNLM rescoring, and achieves a word error
rate of 6.9% on the NIST 2000 Switchboard task. The combined
system has an error rate of 6.2%, representing an improvement over
previously reported results on this benchmark task.
Index Terms— Conversational speech recognition, convolu-
tional neural networks, recurrent neural networks, VGG, ResNet,
LACE, BLSTM.
1. INTRODUCTION
Recent years have seen a rapid reduction in speech recognition error
rates as a result of careful engineering and optimization of convo-
lutional and recurrent neural networks. While the basic structures
have been well known for a long period [1, 2, 3, 4, 5, 6, 7], it is only
recently that they have dominated the field as the best models for
speech recognition. Surprisingly, this is the case for both acoustic
modeling [8, 9, 10, 11, 12, 13] and language modeling [14, 15]. In
comparison to standard feed-forward MLPs or DNNs, these acoustic
models have the ability to model a large amount of acoustic context
with temporal invariance, and in the case of convolutional models,
with frequency invariance as well. In language modeling, recurrent
models appear to improve over classical N-gram models through the
generalization ability of continuous word representations [16]. In the
meantime, ensemble learning has become commonly used in several
neural models [17, 18, 15], to improve robustness by reducing bias
and variance.
In this paper, we use ensembles of models extensively, as well
as improvements to individual component models, to advance
the state-of-the-art in conversational telephone speech recognition
(CTS), which has been a benchmark speech recognition task since
the 1990s. The main features of this system are:
1. An ensemble of two fundamental acoustic model architec-
tures, convolutional neural nets (CNNs) and long short-term
memory nets (LSTMs), with multiple variants of each
2. An attention mechanism in the LACE CNN which differen-
tially weights distant context [19]
3. Lattice-free MMI training [20, 21]
4. The use of i-vector based adaptation [22] in all models
5. Language model (LM) rescoring with multiple, recurrent
neural net LMs [14] running in both forward and reverse
direction
6. Confusion network system combination [23] coupled with
search for best system subset, as necessitated by the large
number of candidate systems.
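The subset search in item 6 can be pictured as a greedy forward selection over candidate systems: repeatedly add whichever system most improves the combined score on a development set, and stop when no single addition helps. A minimal sketch, where `dev_score` is a hypothetical stand-in for scoring a confusion-network combination (the paper does not specify the exact search procedure):

```python
def greedy_subset_search(systems, dev_score):
    # systems: list of candidate system identifiers.
    # dev_score: callable mapping a subset to its combination score on the
    # dev set (higher is better, e.g. negative WER); hypothetical here.
    chosen, best = [], float("-inf")
    while True:
        candidates = [chosen + [s] for s in systems if s not in chosen]
        if not candidates:
            break
        scored = [(dev_score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda p: p[0])
        if top_score <= best:
            break  # no single addition improves the combination further
        best, chosen = top_score, top
    return chosen, best
```

Greedy selection is not guaranteed to find the globally best subset, but it keeps the number of dev-set combinations linear rather than exponential in the number of candidate systems.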
The remainder of this paper describes our system in detail. Sec-
tion 2 describes the CNN and LSTM models. Section 3 describes
our implementation of i-vector adaptation. Section 4 presents our
lattice-free MMI training process. Language model rescoring is a
significant part of our system, and is described in Section 5. Experi-
mental results are presented in Section 6, followed by a discussion
of related work and conclusions.
2. CONVOLUTIONAL AND LSTM NEURAL NETWORKS
We use three CNN variants. The first is the VGG architecture of [24].
Compared to the networks used previously in image recognition, this
network uses small (3x3) filters, is deeper, and applies up to five con-
volutional layers before pooling. The second network is modeled on
the ResNet architecture [25], which adds highway connections [26],
i.e. a linear transform of each layer’s input to the layer’s output
[26, 27]. The only difference is that we move the Batch Normalization node to immediately before each ReLU activation.
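The resulting ordering is the pre-activation pattern: each transform sees a normalized, rectified input, and the block's input is added back through the identity pass-through. A minimal numpy sketch, with fully connected layers standing in for the paper's convolutions and a simplified batch normalization without learned scale and shift (both simplifications are ours, for illustration):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Per-feature normalization over the batch axis; no learned
    # scale/shift parameters (illustration only).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def residual_block(x, W1, W2):
    # Batch norm placed right before each ReLU, then a linear transform;
    # the input is added back via the identity (highway-style) connection.
    h = np.maximum(batch_norm(x), 0.0) @ W1
    h = np.maximum(batch_norm(h), 0.0) @ W2
    return x + h
```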
The last CNN variant is the LACE (layer-wise context expan-
sion with attention) model [19]. LACE is a TDNN [3] variant in
which each higher layer is a weighted sum of nonlinear transforma-
tions of a window of lower layer frames. In other words, each higher
layer exploits broader context than lower layers. Lower layers fo-
cus on extracting simple local patterns while higher layers extract
complex patterns that cover broader contexts. Since not all frames
in a window carry the same importance, an attention mask is ap-
plied. The LACE model differs from the earlier TDNN models e.g.
[3, 28] in the use of a learned attention mask and ResNet like lin-
ear pass-through. As illustrated in detail in Figure 1, the model is
composed of 4 blocks, each with the same architecture. Each block
starts with a convolution layer with stride 2 which sub-samples the
input and increases the number of channels. This layer is followed
by 4 ReLU-convolution layers with jump links similar to those used
in ResNet. Table 1 compares the layer structure and parameters of
the three CNN architectures.
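The core LACE computation per layer can be sketched as follows: nonlinearly transform each lower-layer frame, take an attention-weighted sum over a window of frames, and add the linear pass-through. This is a simplification of the actual model, which is convolutional and learns a full attention mask rather than the scalar per-offset weights assumed here:

```python
import numpy as np

def lace_layer(x, W, attn):
    # x: (T, d) lower-layer frames; W: (d, d) shared transform applied to
    # each frame; attn: (k,) weights over a window of k frames (k odd).
    T, d = x.shape
    k = attn.shape[0]
    z = np.maximum(x @ W, 0.0)  # nonlinear transform of each frame
    pad = np.zeros((k // 2, d))
    zp = np.vstack([pad, z, pad])  # zero-pad so the output keeps length T
    out = np.empty_like(x)
    for t in range(T):
        # attention-weighted sum over the window centered at frame t
        out[t] = (attn[:, None] * zp[t:t + k]).sum(axis=0)
    return out + x  # ResNet-like linear pass-through
```

Each application widens the context seen by a frame by roughly a factor of the window size, which is how higher layers come to cover broader context than lower ones.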
While our best performing models are convolutional, the use
of long short-term memory networks is a close second. We use a
bidirectional architecture [29] without frame-skipping [9]. The core
model structure is the LSTM defined in [8]. We found that using net-
works with more than six layers did not improve the word error rate
on the development set, and chose 512 hidden units, per direction,
per layer, as that provided a reasonable trade-off between training
time and final model accuracy. Network parameters for different
configurations of the LSTM architecture are summarized in Table 2.
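A single bidirectional layer runs one LSTM pass forward and one backward over the utterance and concatenates the two hidden-state streams; stacking such layers gives the deep BLSTM. A minimal numpy sketch of one layer (gate ordering and the absence of peephole connections are our simplifying assumptions, not a claim about the exact recipe of [8]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, W, U, b):
    # One unidirectional pass. x: (T, d_in); W: (4h, d_in); U: (4h, h);
    # b: (4h,). Stacked gate order: input, forget, output, cell candidate.
    h = U.shape[1]
    h_t, c_t = np.zeros(h), np.zeros(h)
    outs = []
    for x_t in x:
        z = W @ x_t + U @ h_t + b
        i, f, o, g = z[:h], z[h:2*h], z[2*h:3*h], z[3*h:]
        c_t = sigmoid(f) * c_t + sigmoid(i) * np.tanh(g)
        h_t = sigmoid(o) * np.tanh(c_t)
        outs.append(h_t)
    return np.stack(outs)

def blstm_layer(x, fwd_params, bwd_params):
    # Forward and time-reversed passes, concatenated per frame, so each
    # output frame sees both past and future context.
    fwd = lstm_forward(x, *fwd_params)
    bwd = lstm_forward(x[::-1], *bwd_params)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

With 512 hidden units per direction as in the paper, each frame's output from one such layer is a 1024-dimensional vector feeding the next layer.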
arXiv:1609.03528v2 [cs.CL] 25 Jan 2017