individual architecture. In this paper, we attempt
to answer this question by proposing a transformer-BLSTM joint modeling framework. Our major contribution in this paper is twofold: 1) We propose
the TRANS-BLSTM model architectures, which
combine the transformer and BLSTM into a single modeling framework, leveraging the modeling
capability from both the transformer and BLSTM.
2) We show that the TRANS-BLSTM models can
effectively boost the accuracy of BERT baseline
models on SQuAD 1.1 and GLUE NLP benchmark
datasets.
2 Related work
2.1 BERT
Our work focuses on improving the transformer ar-
chitecture (Vaswani et al., 2017), which motivated
the recent breakthrough in language representa-
tion, BERT (Devlin et al., 2018). Our work builds
on top of the transformer architecture, integrating
each transformer block with a bidirectional LSTM
(Hochreiter and Schmidhuber, 1997). Related to
our work, XLNet (Yang et al., 2019) proposes two-
stream self-attention, as opposed to the single-stream self-attention used in classic transformers. With
two-stream attention, XLNet can be treated as a
general language model that does not suffer from
the pretrain-finetune discrepancy (the mask tokens
are seen during pretraining but not during finetun-
ing) thanks to its autoregressive formulation. Our
method overcomes this limitation with a different
approach, using single-stream self-attention with
an integrated BLSTM layer for each transformer
layer.
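To make this contrast concrete, below is a minimal sketch (assuming PyTorch, which this section does not specify) of one plausible way to pair a standard transformer encoder layer with a bidirectional LSTM. The class name, dimensions, and the choice to run the BLSTM on the transformer output are illustrative assumptions only; the actual TRANS-BLSTM variants are defined in Section 3.

import torch
import torch.nn as nn

# Illustrative only: a standard transformer encoder layer followed by a
# BLSTM whose concatenated forward/backward states match the model size.
class TransformerBlockWithBLSTM(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            batch_first=True)
        self.blstm = nn.LSTM(d_model, d_model // 2,
                             batch_first=True, bidirectional=True)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        h = self.transformer(x)            # self-attention + feed-forward
        h, _ = self.blstm(h)               # (batch, seq_len, d_model)
        return h

x = torch.randn(2, 16, 768)                # toy batch of hidden states
print(TransformerBlockWithBLSTM()(x).shape)   # torch.Size([2, 16, 768])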
2.2 Bidirectional LSTM
The LSTM network (Hochreiter and Schmidhu-
ber, 1997) has demonstrated powerful modeling
capability in sequential learning tasks including
named entity tagging (Huang et al., 2015; Chiu
and Nichols, 2016), machine translation (Bahdanau
et al., 2015; Wu et al., 2016) and speech recogni-
tion (Graves et al., 2013; Sak et al., 2014). The
motivation of this paper is to integrate bidirectional LSTM layers into the transformer model to further improve transformer performance. Tang et al. (2019) attempt to distill a BERT model into a single-layer bidirectional LSTM model. This is relevant to our work, as both utilize bidirectional LSTMs. However, their approach leads to inferior accuracy compared to BERT baseline models. Similar
to their observation, we show that in our experi-
ments, the use of a BLSTM model alone (even with
multiple stacked BLSTM layers) leads to signif-
icantly worse results compared to BERT models.
However, our proposed joint modeling framework,
TRANS-BLSTM, is able to boost the accuracy of
the transformer BERT models.
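For reference, the following is a minimal sketch (again assuming PyTorch) of the kind of stacked BLSTM-only encoder discussed above; the vocabulary size, hidden dimension, and number of layers are assumptions for illustration, not a configuration reported here.

import torch
import torch.nn as nn

# Illustrative only: token embeddings fed into several stacked
# bidirectional LSTM layers, with no self-attention.
class StackedBLSTMEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, num_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.blstm = nn.LSTM(d_model, d_model // 2, num_layers=num_layers,
                             batch_first=True, bidirectional=True)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        h, _ = self.blstm(self.embed(token_ids))
        return h                           # (batch, seq_len, d_model)

ids = torch.randint(0, 30522, (2, 16))
print(StackedBLSTMEncoder()(ids).shape)    # torch.Size([2, 16, 768])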
2.3 Combining Recurrent Networks and Transformers
Previous work has explored combining recurrent networks and transformers. For example, Lei et al. (2018) substituted the feed-forward network in the transformer with a simple recurrent unit (SRU) and achieved better accuracy in machine translation. This is similar to one of the models proposed in this paper; the difference is that our paper investigates the gain of this combination in the BERT pre-training context, while their paper focuses on the parallelization speedup of the SRU in the machine translation encoder and decoder context.
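As an illustration of this line of work, the sketch below (assuming PyTorch) replaces the position-wise feed-forward sub-layer of a transformer layer with a recurrent sub-layer. An LSTM stands in for the SRU, and the dimensions and the residual/layer-normalization arrangement are assumptions, not the configuration of Lei et al. (2018).

import torch
import torch.nn as nn

# Illustrative only: self-attention sub-layer followed by a recurrent
# sub-layer in place of the usual position-wise feed-forward network.
class AttentionWithRecurrentSublayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.recurrent = nn.LSTM(d_model, d_model // 2,
                                 batch_first=True, bidirectional=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        a, _ = self.attn(x, x, x)
        x = self.norm1(x + a)              # residual around self-attention
        r, _ = self.recurrent(x)
        return self.norm2(x + r)           # residual around recurrent sub-layer

x = torch.randn(2, 16, 512)
print(AttentionWithRecurrentSublayer()(x).shape)   # torch.Size([2, 16, 512])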
3 TRANS and Proposed
TRANS-BLSTM Architectures
In this section, we first review the transformer archi-
tecture, then propose the transformer bidirectional
LSTM network architectures (TRANS-BLSTM),
which integrate the BLSTM into either the transformer encoder or decoder.
3.1 Transformer architecture (TRANS)
The BERT model consists of a transformer en-
coder (Vaswani et al., 2017) as shown in Figure
1. The original transformer architecture uses mul-
tiple stacked self-attention layers and point-wise
fully connected layers for both the encoder and de-
coder. However, BERT only leverages the encoder to generate hidden representations; the original transformer decoder (used for generating text in neural machine translation, etc.) is replaced by a linear layer followed by a softmax layer, as shown in Figure 1, for both sequential classification tasks (e.g., named entity tagging, question answering) and sentence classification tasks (e.g., sentiment classification). The encoder is composed of a stack of N = 12 or N = 24 layers for the BERT-base and BERT-large models, respectively. Each layer consists of two
sub-layers. The first sub-layer is a multi-head self-
attention mechanism, and the second sub-layer is a
simple, position-wise fully connected feed-forward