individual architecture. In this paper, we attempt
to answer this question by proposing a transformer-BLSTM joint modeling framework. Our major contribution in this paper is twofold: 1) We propose
the TRANS-BLSTM model architectures, which
combine the transformer and BLSTM into a single modeling framework, leveraging the modeling
capability from both the transformer and BLSTM.
2) We show that the TRANS-BLSTM models can
effectively boost the accuracy of BERT baseline
models on SQuAD 1.1 and GLUE NLP benchmark
datasets.
2 Related work
2.1 BERT
Our work focuses on improving the transformer ar-
chitecture (Vaswani et al., 2017), which motivated
the recent breakthrough in language representa-
tion, BERT (Devlin et al., 2018). Our work builds
on top of the transformer architecture, integrating
each transformer block with a bidirectional LSTM
(Hochreiter and Schmidhuber, 1997). Related to
our work, XLNet (Yang et al., 2019) proposes two-
stream self-attention, as opposed to the single-stream self-attention used in classic transformers. With
two-stream attention, XLNet can be treated as a
general language model that does not suffer from
the pretrain-finetune discrepancy (the mask tokens
are seen during pretraining but not during finetun-
ing) thanks to its autoregressive formulation. Our
method overcomes this limitation with a different
approach, using single-stream self-attention with
an integrated BLSTM layer for each transformer
layer.
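To make this contrast concrete, below is a minimal sketch (assuming PyTorch, which this section does not specify) of one plausible way to pair a standard transformer encoder layer with a bidirectional LSTM. The class name, dimensions, and the choice to run the BLSTM on the transformer output are illustrative assumptions only; the actual TRANS-BLSTM variants are defined in Section 3.

import torch
import torch.nn as nn

# Illustrative only: a standard transformer encoder layer followed by a
# BLSTM whose concatenated forward/backward states match the model size.
class TransformerBlockWithBLSTM(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            batch_first=True)
        self.blstm = nn.LSTM(d_model, d_model // 2,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        h = self.transformer(x)            # self-attention + feed-forward
        h, _ = self.blstm(h)               # (batch, seq_len, d_model)
        return h

x = torch.randn(2, 16, 768)                # toy batch of hidden states
print(TransformerBlockWithBLSTM()(x).shape)   # torch.Size([2, 16, 768])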
2.2 Bidirectional LSTM
The LSTM network (Hochreiter and Schmidhu-
ber, 1997) has demonstrated powerful modeling
capability in sequential learning tasks including
named entity tagging (Huang et al., 2015; Chiu
and Nichols, 2016), machine translation (Bahdanau
et al., 2015; Wu et al., 2016) and speech recogni-
tion (Graves et al., 2013; Sak et al., 2014). The
motivation of this paper is to integrate bidirectional LSTM layers into the transformer model to further improve transformer performance. Tang et al. (2019) attempt to distill a BERT model into a single-layer bidirectional LSTM model. This is relevant to our work, as both utilize bidirectional LSTMs. However, their approach leads to inferior accuracy compared to BERT baseline models. Similar
to their observation, we show that in our experi-
ments, the use of a BLSTM model alone (even with
multiple stacked BLSTM layers) leads to signif-
icantly worse results compared to BERT models.
However, our proposed joint modeling framework,
TRANS-BLSTM, is able to boost the accuracy of
the transformer BERT models.
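For reference, the following is a minimal sketch (again assuming PyTorch) of the kind of stacked BLSTM-only encoder discussed above; the vocabulary size, hidden dimension, and number of layers are assumptions for illustration, not a configuration reported here.

import torch
import torch.nn as nn

# Illustrative only: token embeddings fed into several stacked
# bidirectional LSTM layers, with no self-attention.
class StackedBLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blstm = nn.LSTM(d_model, d_model // 2, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        h, _ = self.blstm(self.embed(token_ids))
        return h                           # (batch, seq_len, d_model)

ids = torch.randint(0, 30522, (2, 16))
print(StackedBLSTMEncoder()(ids).shape)    # torch.Size([2, 16, 768])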
2.3 Combining Recurrent Networks and Transformers
Previous work has explored combining recurrent networks and transformers. For example, Lei et al. (2018) substituted the feed-forward network in the transformer with a simple recurrent unit (SRU) and achieved better accuracy in machine translation. This is similar to one of the models proposed in this paper; the difference is that our paper investigates the gain of this combination in the BERT pre-training context, while their paper focuses on the parallelization speedup of the SRU in the machine translation encoder and decoder context.
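As an illustration of this line of work, the sketch below (assuming PyTorch) replaces the position-wise feed-forward sub-layer of a transformer layer with a recurrent sub-layer. An LSTM stands in for the SRU, and the dimensions and the residual/layer-normalization arrangement are assumptions, not the configuration of Lei et al. (2018).

import torch
import torch.nn as nn

# Illustrative only: self-attention sub-layer followed by a recurrent
# sub-layer in place of the usual position-wise feed-forward network.
class AttentionWithRecurrentSublayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.recurrent = nn.LSTM(d_model, d_model // 2,
                                 batch_first=True, bidirectional=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)              # residual around self-attention
        r, _ = self.recurrent(x)
        return self.norm2(x + r)           # residual around recurrent sub-layer

x = torch.randn(2, 16, 512)
print(AttentionWithRecurrentSublayer()(x).shape)   # torch.Size([2, 16, 512])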
3 TRANS and Proposed
TRANS-BLSTM Architectures
In this section, we first review the transformer archi-
tecture, then propose the transformer bidirectional
LSTM network architectures (TRANS-BLSTM),
which integrate the BLSTM into either the transformer encoder or decoder.
3.1 Transformer architecture (TRANS)
The BERT model consists of a transformer en-
coder (Vaswani et al., 2017) as shown in Figure
1. The original transformer architecture uses mul-
tiple stacked self-attention layers and point-wise
fully connected layers for both the encoder and de-
coder. However, BERT only leverages the encoder to generate hidden representations; the original transformer decoder (used for generating text in neural machine translation, etc.) is replaced by a linear layer followed by a softmax layer, as shown in Figure 1, for both sequential classification tasks (e.g., named entity tagging, question answering) and sentence classification tasks (e.g., sentiment classification). The encoder is composed of a stack of N = 12 or N = 24 layers for the BERT-base and BERT-large models, respectively. Each layer consists of two
sub-layers. The first sub-layer is a multi-head self-
attention mechanism, and the second sub-layer is a
simple, position-wise fully connected feed-forward