TRANS-BLSTM: Transformer with Bidirectional LSTM for Language
Understanding
Zhiheng Huang
Amazon AWS AI
zhiheng@amazon.com
Peng Xu
Amazon AWS AI
pengx@amazon.com
Davis Liang
Amazon AWS AI
liadavis@amazon.com
Ajay Mishra
Amazon AWS AI
misaja@amazon.com
Bing Xiang
Amazon AWS AI
bxiang@amazon.com
Abstract
Bidirectional Encoder Representations from Transformers (BERT) has recently achieved state-of-the-art performance on a broad range of NLP tasks including sentence classification, machine translation, and question answering. The BERT model architecture is derived primarily from the transformer. Prior to the transformer era, bidirectional Long Short-Term Memory (BLSTM) was the dominant modeling architecture for neural machine translation and question answering. In this paper, we investigate how these two modeling techniques can be combined to create a more powerful model architecture. We propose a new architecture, denoted Transformer with BLSTM (TRANS-BLSTM), which integrates a BLSTM layer into each transformer block, leading to a joint modeling framework for the transformer and BLSTM. We show that TRANS-BLSTM models consistently improve accuracy over BERT baselines in GLUE and SQuAD 1.1 experiments. Our TRANS-BLSTM model obtains an F1 score of 94.01% on the SQuAD 1.1 development dataset, which is comparable to the state-of-the-art result.
1 Introduction
Learning representations of natural language (Mikolov et al., 2013) and language model pre-training (Devlin et al., 2018; Radford et al., 2019) have shown promising results recently. These pre-trained models serve as generic upstream models and can be used to improve downstream applications such as natural language inference, paraphrasing, named entity recognition, and question answering. The innovation of BERT (Devlin et al., 2018) comes from the "masked language model" pre-training objective, inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens in the input, and the objective is to predict the original token based only on its context.
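To make this concrete, the following is a minimal sketch of the masking step, assuming the standard recipe reported by Devlin et al. (2018): 15% of input tokens are selected for prediction, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The [MASK] id and vocabulary size below are illustrative placeholders.

import random

MASK_ID = 103        # hypothetical [MASK] token id
VOCAB_SIZE = 30522   # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (masked inputs, labels); labels are -100 at unmasked positions."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                               # predict the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels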
Follow-up work including RoBERTa (Liu et al., 2019b) investigated hyper-parameter design choices and suggested longer model training. In addition, XLNet (Yang et al., 2019) was proposed to address the discrepancy between BERT pre-training and fine-tuning, in which masked tokens appear in the former but not in the latter. Nearly all existing work suggests that a large network is crucial to achieving state-of-the-art performance. For example, (Devlin et al., 2018) has shown that across natural language understanding tasks, using a larger hidden layer size, more hidden layers, and more attention heads always leads to better performance; however, they stopped at a hidden layer size of 1024. ALBERT (Lan et al., 2019) showed that simply increasing the model size does not necessarily lead to better accuracy; in fact, they observed that simply increasing the hidden layer size of a model such as BERT-large can lead to significantly worse performance. On the other hand, model distillation (Hinton et al., 2015; Tang et al., 2019; Sun et al., 2019; Sanh et al., 2019) has been proposed to reduce the BERT model size while maintaining high performance.
In this paper, we attempt to improve the performance of BERT via architecture enhancement. BERT is based on the encoder of the transformer model (Vaswani et al., 2017), which has been shown to obtain state-of-the-art accuracy across a broad range of NLP applications (Devlin et al., 2018). Prior to BERT, bidirectional LSTMs (BLSTMs) dominated sequential modeling for many tasks including machine translation (Chiu and Nichols, 2016) and speech recognition (Graves et al., 2013). Given that both models have demonstrated superior accuracy on various benchmarks, it is natural to ask whether a combination of the transformer and BLSTM can outperform each
individual architecture. In this paper, we attempt to answer this question by proposing a joint transformer-BLSTM modeling framework. Our major contribution is twofold: 1) we propose the TRANS-BLSTM model architectures, which combine the transformer and BLSTM into a single modeling framework, leveraging the modeling capability of both; and 2) we show that the TRANS-BLSTM models can effectively boost the accuracy of BERT baseline models on the SQuAD 1.1 and GLUE benchmark datasets.
2 Related work
2.1 BERT
Our work focuses on improving the transformer architecture (Vaswani et al., 2017), which motivated the recent breakthrough in language representation, BERT (Devlin et al., 2018). Our work builds on top of the transformer architecture, integrating each transformer block with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). Related to our work, XLNet (Yang et al., 2019) proposes two-stream self-attention as opposed to the single-stream self-attention used in classic transformers. With two-stream attention, XLNet can be treated as a general language model that does not suffer from the pretrain-finetune discrepancy (mask tokens are seen during pretraining but not during finetuning) thanks to its autoregressive formulation. Our method overcomes this limitation with a different approach, using single-stream self-attention with a BLSTM layer integrated into each transformer layer.
2.2 Bidirectional LSTM
The LSTM network (Hochreiter and Schmidhuber, 1997) has demonstrated powerful modeling capability in sequential learning tasks including named entity tagging (Huang et al., 2015; Chiu and Nichols, 2016), machine translation (Bahdanau et al., 2015; Wu et al., 2016), and speech recognition (Graves et al., 2013; Sak et al., 2014). The motivation of this paper is to integrate bidirectional LSTM layers into the transformer model to further improve transformer performance. The work of (Tang et al., 2019) attempts to distill a BERT model into a single-layer bidirectional LSTM model. It is relevant to our work as both utilize bidirectional LSTMs. However, their approach leads to inferior accuracy compared to BERT baseline models. Similar to their observation, we show in our experiments that a BLSTM model alone (even with multiple stacked BLSTM layers) leads to significantly worse results than BERT models. However, our proposed joint modeling framework, TRANS-BLSTM, is able to boost the accuracy of the transformer BERT models.
2.3 Combining Recurrent Networks and Transformers
Previous work has explored combining recurrent networks with the transformer. For example, (Lei et al., 2018) substituted the feed-forward network in the transformer with a simple recurrent unit (SRU) implementation and achieved better accuracy in machine translation. This is similar to one of the models proposed in this paper. The difference is that our paper investigates the gain of the combination in the BERT pre-training context, while their paper focused on the parallelization speedup of the SRU in the machine translation encoder and decoder context.
3 TRANS and Proposed TRANS-BLSTM Architectures
In this section, we first review the transformer architecture and then propose the transformer bidirectional LSTM network architectures (TRANS-BLSTM), which integrate a BLSTM into either the transformer encoder or decoder.
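As a rough preview before the detailed definitions, the sketch below pairs a standard transformer encoder layer with a BLSTM over the same input and sums their outputs. This is only an illustrative sketch: PyTorch's nn.TransformerEncoderLayer stands in for a BERT transformer block, and the elementwise-sum combination is an assumption made here for clarity rather than the exact formulation of the TRANS-BLSTM variants.

import torch
import torch.nn as nn

class TransBlstmBlock(nn.Module):
    """Illustrative joint block: a transformer layer and a BLSTM over the same input."""
    def __init__(self, hidden_size=768, num_heads=12, ffn_size=3072):
        super().__init__()
        # Standard transformer encoder layer (self-attention + feed-forward).
        self.transformer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=ffn_size, batch_first=True)
        # Bidirectional LSTM; the two directions concatenate back to hidden_size.
        self.blstm = nn.LSTM(hidden_size, hidden_size // 2,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, seq_len, hidden_size)
        trans_out = self.transformer(x)
        blstm_out, _ = self.blstm(x)
        return trans_out + blstm_out           # combined representation (assumed sum)

# Usage: stack N such blocks in place of the N transformer layers.
block = TransBlstmBlock()
hidden = block(torch.randn(2, 16, 768))        # -> (2, 16, 768)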
3.1 Transformer architecture (TRANS)
The BERT model consists of a transformer encoder (Vaswani et al., 2017), as shown in Figure 1. The original transformer architecture uses multiple stacked self-attention layers and point-wise fully connected layers for both the encoder and decoder. However, BERT only leverages the encoder to generate hidden representations; the original transformer decoder (used for generating text in neural machine translation, etc.) is replaced by a linear layer followed by a softmax layer, as shown in Figure 1, both for sequential classification tasks (named entity tagging, question answering) and sentence classification tasks (sentiment classification, etc.). The encoder is composed of a stack of N = 12 or N = 24 layers for the BERT-base and BERT-large cases, respectively. Each layer consists of two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
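To spell this out, here is a minimal sketch of one such encoder layer, assuming the BERT-base configuration (hidden size 768, 12 attention heads, feed-forward size 3072, GELU activation); the residual connections and layer normalization follow Vaswani et al. (2017), and the class and parameter names below are our own, not the authors'.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One transformer encoder layer: self-attention and feed-forward sub-layers."""
    def __init__(self, hidden_size=768, num_heads=12, ffn_size=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(                   # position-wise feed-forward
            nn.Linear(hidden_size, ffn_size),
            nn.GELU(),
            nn.Linear(ffn_size, hidden_size))
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                           # x: (batch, seq_len, hidden_size)
        attn_out, _ = self.attn(x, x, x)            # sub-layer 1: multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))     # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))  # sub-layer 2: feed-forward
        return x

# BERT stacks N = 12 (base) or N = 24 (large) such layers; a linear layer with
# softmax on top of the final hidden states replaces the original decoder.
layer = EncoderLayer()
out = layer(torch.randn(2, 16, 768))                # -> (2, 16, 768)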