Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling


Chuhan Wu†  Fangzhao Wu‡  Tao Qi†  Yongfeng Huang†
†Department of Electronic Engineering & BNRist, Tsinghua University, Beijing 100084, China
‡Microsoft Research Asia, Beijing 100080, China
{wuchuhan15, wufangzhao, taoqi.qt}@gmail.com
yfhuang@tsinghua.edu.cn

Abstract

Transformer is important for text modeling. However, it has difficulty in handling long documents due to the quadratic complexity with respect to the input text length. In order to handle this problem, we propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling. Hi-Transformer models documents in a hierarchical way, i.e., it first learns sentence representations and then learns document representations. It can effectively reduce the complexity and meanwhile capture the global document context in the modeling of each sentence. More specifically, we first use a sentence Transformer to learn the representation of each sentence. Then we use a document Transformer to model the global document context from these sentence representations. Next, we use another sentence Transformer to enhance sentence modeling using the global document context. Finally, we use a hierarchical pooling method to obtain the document embedding. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.

1 Introduction

Transformer (Vaswani et al., 2017) is an effective architecture for text modeling, and has been an essential component in many state-of-the-art NLP models like BERT (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Wu et al., 2021). The standard Transformer needs to compute a dense self-attention matrix based on the interactions between each pair of tokens in the text, where the computational complexity is proportional to the square of the text length (Vaswani et al., 2017; Wu et al., 2020b). Thus, it is difficult for Transformer to model long documents efficiently (Child et al., 2019).

There are several methods to accelerate Transformer for long document modeling (Wu et al., 2019; Kitaev et al., 2019; Wang et al., 2020; Qiu et al., 2020). One direction is using Transformer in a hierarchical manner to reduce the sequence length, e.g., first learning sentence representations and then learning document representations from these sentence representations (Zhang et al., 2019; Yang et al., 2020). However, the modeling of sentences is then agnostic to the global document context, which may be suboptimal because the local context within a sentence is usually insufficient. Another direction is using a sparse self-attention matrix instead of a dense one. For example, Beltagy et al. (2020) proposed to combine local self-attention with a dilated sliding window and sparse global attention. Zaheer et al. (2020) proposed to incorporate a random sparse attention mechanism to model the interactions between a random set of tokens. However, these methods cannot fully model the global context of the document (Tay et al., 2020).

In this paper, we propose a hierarchical interactive Transformer (Hi-Transformer)1 for efficient and effective long document modeling, which models documents in a hierarchical way to effectively reduce the complexity and at the same time capture the global document context for sentence modeling. In Hi-Transformer, we first use a sentence Transformer to learn the representation of each sentence within a document. Next, we use a document Transformer to model the global document context from these sentence representations. Then, we use another sentence Transformer to further improve the modeling of each sentence with the help of the global document context. Finally, we use a hierarchical pooling method to obtain the document representation. Extensive experiments are conducted on three benchmark datasets. The results show that Hi-Transformer is both efficient and effective in long document modeling.

1 https://github.com/wuch15/HiTransformer

[Figure 1: The architecture of Hi-Transformer.]

2 Hi-Transformer

In this section, we introduce our hierarchical interactive Transformer (Hi-Transformer) approach for efficient and effective long document modeling. Its framework is shown in Fig. 1. It uses a hierarchical architecture that first models the contexts within each sentence, next models the document contexts by capturing the interactions between sentences, then employs the global document contexts to enhance sentence modeling, and finally uses hierarchical pooling techniques to obtain document embeddings. In this way, the input sequence length of each Transformer is much shorter than directly taking the word sequence of the whole document as input, and the global contexts can still be fully modeled. The details of Hi-Transformer are introduced as follows.

2.1 Model Architecture

Hi-Transformer mainly contains three modules, i.e., sentence context modeling, document context modeling, and global document context-enhanced sentence modeling. The sentence-level context is first modeled by a sentence Transformer. Assume a document contains M sentences, and the words in the i-th sentence are denoted as [w_{i,1}, w_{i,2}, ..., w_{i,K}] (K is the sentence length). We insert a "[CLS]" token (denoted as w_s) after the end of each sentence. This token is used to convey the contextual information within this sentence. The sequence of words in each sentence is first converted into a word embedding sequence via a word and position embedding layer. Denote the word embedding sequence for the i-th sentence as [e_{i,1}, e_{i,2}, ..., e_{i,K}, e_s]. Since sentences are usually short, we apply a sentence Transformer to each sentence to fully model the interactions between the words within this sentence. It takes the word embedding sequence as input and outputs the contextual representations of the words, which are denoted as [h_{i,1}, h_{i,2}, ..., h_{i,K}, h^s_i]. Specially, the representation h^s_i of the "[CLS]" token is regarded as the sentence representation.
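To make this input format concrete, the following toy sketch turns a raw document into M sentences of K words each, with a "[CLS]" token appended to every sentence. It is our own illustration: the paper does not specify its sentence splitter, tokenizer, or padding scheme, so the period-based splitting, the "[PAD]" token, and the function name prepare_document are assumptions.

```python
# Illustrative preprocessing into the M x (K+1) token grid described above.
# Sentence/word splitting here is deliberately naive.
def prepare_document(text, max_sents=8, max_words=20, pad="[PAD]", cls="[CLS]"):
    sentences = [s.strip() for s in text.split(".") if s.strip()][:max_sents]
    prepared = []
    for sent in sentences:
        words = sent.split()[:max_words]
        words += [pad] * (max_words - len(words))   # pad every sentence to length K
        prepared.append(words + [cls])              # append the sentence-level "[CLS]"
    return prepared

doc = "The royals attended the event. Prince Philip was present."
for row in prepare_document(doc, max_sents=4, max_words=6):
    print(row)
```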
Next, the document-level context is modeled by a document Transformer from the representations of the sentences within this document. Denote the embedding sequence of sentences in this document as [h^s_1, h^s_2, ..., h^s_M]. We add a sentence position embedding (denoted as p_i for the i-th sentence) to the sentence representations to capture sentence order. We then apply a document Transformer to these sentence representations to capture the global context of the document, and further learn document context-aware sentence representations, which are denoted as [r^s_1, r^s_2, ..., r^s_M].

Then, we use the document context-aware sentence representations to further improve the sentence context modeling by propagating the global document context to each sentence. Motivated by Guo et al. (2019), we apply another sentence Transformer to the hidden word representations and the document context-aware sentence representation of each sentence. It outputs a document context-aware word representation sequence for each sentence, which is denoted as [d_{i,1}, d_{i,2}, ..., d_{i,K}, d^s_i]. In this way, the contextual representations of words can benefit from both the local sentence context and the global document context.

By stacking multiple layers of Hi-Transformer, the contexts within a document can be fully modeled. Finally, we use hierarchical pooling (Wu et al., 2020a) techniques to obtain the document embedding. We first aggregate the document context-aware word representations in each sentence into a global context-aware sentence embedding s_i, and then aggregate the global context-aware embeddings of the sentences within a document into a unified document embedding d, which is further used for downstream tasks.
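The pipeline above can be summarized in a short sketch. The code below is a minimal PyTorch illustration of one Hi-Transformer layer plus an attentive pooling module in the spirit of Yang et al. (2016); it is not the authors' implementation (their released code uses Keras with a TensorFlow backend), and all module and variable names are our own assumptions, with standard encoder layers standing in for the paper's sentence and document Transformers.

```python
# Minimal sketch of one Hi-Transformer layer (illustrative only).
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Attention-based pooling, roughly following Yang et al. (2016)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.query = nn.Linear(dim, 1, bias=False)

    def forward(self, x):                                   # x: (batch, seq, dim)
        scores = self.query(torch.tanh(self.proj(x)))       # (batch, seq, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * x).sum(dim=1)                     # (batch, dim)

class HiTransformerLayer(nn.Module):
    def __init__(self, dim=256, heads=8, max_sents=64):
        super().__init__()
        # Sentence Transformer: words (plus "[CLS]") within each sentence.
        self.sent_tf = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Document Transformer: interactions between the M sentence vectors.
        self.doc_tf = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        # Second sentence Transformer: propagates document context back to words.
        self.sent_tf2 = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.sent_pos = nn.Embedding(max_sents, dim)  # sentence position embedding p_i

    def forward(self, x):
        # x: (batch, M, K+1, dim); the last token of every sentence is "[CLS]".
        B, M, L, D = x.shape
        h = self.sent_tf(x.reshape(B * M, L, D))             # [h_{i,1..K}, h_i^s]
        h_s = h[:, -1, :].reshape(B, M, D)                    # sentence representations
        r_s = self.doc_tf(h_s + self.sent_pos(torch.arange(M, device=x.device)))
        # Swap each sentence's "[CLS]" state for its document-aware version r_i^s,
        # then re-encode the sentence so every word sees the document context.
        h = torch.cat([h[:, :-1, :], r_s.reshape(B * M, 1, D)], dim=1)
        d = self.sent_tf2(h)                                  # [d_{i,1..K}, d_i^s]
        return d.reshape(B, M, L, D)

layer = HiTransformerLayer()
x = torch.randn(2, 8, 21, 256)   # 2 documents, 8 sentences, 20 words + "[CLS]"
out = layer(x)                   # (2, 8, 21, 256)
```

Stacking N such layers and then applying AttentivePooling first over the words of each sentence and then over the resulting sentence embeddings would yield the document embedding d used for downstream tasks.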
2.2 Efficiency Analysis

In this section, we provide some discussion of the computational complexity of Hi-Transformer. In sentence context modeling and document context propagation, the total computational complexity is O(M·K^2·d), where M is the number of sentences in a document, K is the sentence length, and d is the hidden dimension. In document context modeling, the computational complexity is O(M^2·d). Thus, the total computational cost is O(M·K^2·d + M^2·d).2 Compared with the standard Transformer, whose computational complexity is O(M^2·K^2·d), Hi-Transformer is much more efficient.

2 Note that Hi-Transformer can be combined with other existing efficient Transformer techniques to further improve the efficiency of long document modeling.
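To make the gap concrete, here is a back-of-the-envelope sketch counting only the dominant attention terms. The values of M and K are our own illustrative choices, roughly resembling a MIND-style document, and d = 256 matches the hidden size used in Section 3.1; the counts are not exact FLOPs.

```python
# Rough count of the dominant attention operations (illustrative numbers only).
M, K, d = 25, 20, 256

hi_transformer = M * K**2 * d + M**2 * d   # sentence-level + document-level attention
full_transformer = M**2 * K**2 * d         # dense attention over all M*K tokens

print(f"Hi-Transformer  ~ {hi_transformer:,} ops")
print(f"Transformer     ~ {full_transformer:,} ops")
print(f"ratio           ~ {full_transformer / hi_transformer:.1f}x")
```

The longer the document (the larger M), the larger this ratio becomes, which is consistent with the training-time trend reported in Section 3.4.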
 3
Transformer is much more efficient. https://jmcauley.ucsd.edu/data/amazon/
 4
 https://github.com/nihalb/JMARS
 5
 https://msnews.github.io/
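For quick reference, the settings above can be gathered into a single configuration sketch. The values are taken from this section; the dictionary layout and key names are our own and do not mirror the authors' code.

```python
# Hyperparameters reported in Section 3.1, collected in one place (key names are illustrative).
CONFIG = {
    "word_embedding": {"source": "GloVe", "dim": 300},
    "hidden_dim": 256,            # 8 heads x 32 dims per head
    "num_heads": 8,
    "num_layers": 2,              # Hi-Transformer layers (baselines also use 2 layers)
    "max_length": {"transformer": 512, "long_document_variants": 2048},
    "dropout": 0.2,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "max_epochs": 3,
    "repeats": 5,                 # each experiment repeated 5 times
    "metrics": ["accuracy", "macro-F"],
}
```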
3.2 Performance Evaluation

We compare Hi-Transformer with several baselines, including: (1) Transformer (Vaswani et al., 2017), the vanilla Transformer architecture; (2) Longformer (Beltagy et al., 2020), a variant of Transformer with local and global attention for long documents; (3) BigBird (Zaheer et al., 2020), which extends Longformer with random attention; and (4) HI-BERT (Zhang et al., 2019), which uses Transformers at both the word and sentence levels.

Dataset #Train #Val #Test Avg. #word Avg. #sent #Class
 Amazon 40.0k 5.0k 5.0k 133.38 6.17 5
 IMDB 108.5k 13.6k 13.6k 385.70 15.29 10
 MIND 128.8k 16.1k 16.1k 505.46 25.14 18

 Table 1: Statistics of datasets.

Methods          Amazon                      IMDB                        MIND
                 Accuracy    Macro-F         Accuracy    Macro-F         Accuracy    Macro-F
Transformer      65.23±0.38  42.23±0.37      51.98±0.48  42.76±0.49      80.96±0.22  59.97±0.24
Longformer       65.35±0.44  42.45±0.41      52.33±0.40  43.51±0.42      81.42±0.25  62.68±0.26
BigBird          66.05±0.48  42.89±0.46      52.87±0.51  43.79±0.50      81.81±0.29  63.44±0.31
HI-BERT          66.56±0.32  42.65±0.34      52.96±0.46  43.84±0.46      81.89±0.23  63.63±0.20
Hi-Transformer   67.24±0.35  43.69±0.32      53.78±0.49  44.54±0.47      82.51±0.25  64.22±0.22

Table 2: The results of different methods on different datasets.

Method            Complexity
Transformer       O(M^2·K^2·d)
Longformer        O(T·M·K·d)
BigBird           O(T·M·K·d)
HI-BERT           O(M·K^2·d + M^2·d)
Hi-Transformer    O(M·K^2·d + M^2·d)

Table 3: Complexity of different methods. K is the sentence length, M is the number of sentences in a document, T is the number of positions for sparse attention, and d is the hidden dimension.

The results of these methods on the three datasets are shown in Table 2. We find that Transformers designed for long documents, like Hi-Transformer and BigBird, outperform the vanilla Transformer. This is because the vanilla Transformer cannot handle long sequences due to the restriction of computation resources, and truncating the input sequence leads to the loss of much useful contextual information. In addition, Hi-Transformer and HI-BERT outperform Longformer and BigBird. This is because the sparse attention mechanisms used in Longformer and BigBird cannot fully model the global contexts within a document. Besides, Hi-Transformer achieves the best performance, and the t-test results show that the improvements over the baselines are significant. This is because Hi-Transformer can incorporate global document contexts to enhance sentence modeling.

We also compare the computational complexity of these methods in Table 3. The complexity of Hi-Transformer is much lower than that of the vanilla Transformer and is comparable with other Transformer variants designed for long documents. These results indicate the efficiency and effectiveness of Hi-Transformer.

3.3 Model Effectiveness

Next, we verify the effectiveness of the global document contexts for enhancing sentence modeling in Hi-Transformer. We compare Hi-Transformer and its variant without global document contexts in Fig. 2. We find that the performance consistently declines when the global document contexts are not encoded into the sentence representations. This is because the local contexts within a single sentence may be insufficient for accurate sentence modeling, and the global contexts of the entire document can provide rich complementary information for sentence understanding. Thus, propagating the document contexts to enhance sentence modeling can improve long document modeling.

3.4 Influence of Text Length

Then, we study the influence of text length on the model performance and computational cost. Since the documents in the MIND dataset are the longest, we conduct experiments on MIND to compare the model performance as well as the training time per layer of Transformer and Hi-Transformer under different input text lengths7, and the results are shown in Fig. 3. We find that the performance of both methods improves when longer text sequences are used. This is intuitive because more information can be incorporated when longer text is input to the model for document modeling. However, the computational cost of Transformer grows very fast, which limits its maximal input text length.

7 The maximum length of Transformer is 512 due to the GPU memory limitation.
[Figure 2: Effectiveness of global document context propagation in Hi-Transformer. Panels: (a) Amazon, (b) IMDB, (c) MIND; each reports Accuracy and Macro-F for Hi-Transformer and its variant without the global document context.]

[Figure 3: Influence of input text length on performance and training time on the MIND dataset. Panels: (a) Accuracy, (b) Macro-F, (c) Training time per layer, plotted against text lengths from 64 to 2048.]

Different from Transformer, Hi-Transformer is much more efficient and meanwhile can achieve better performance with longer sequence lengths. These results further verify the efficiency and effectiveness of Hi-Transformer in long document modeling.

4 Conclusion

In this paper, we propose Hi-Transformer, an approach for both efficient and effective long document modeling. It incorporates a hierarchical architecture that first learns sentence representations and then learns document representations. It can effectively reduce the computational complexity and meanwhile be aware of the global document contexts in sentence modeling to help understand the document content accurately. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant numbers U1936216, U1936208, U1836204, and U1705261. We are grateful to Xing Xie, Shaoyu Zhou, Dan Shen, and Zhisong Wang for their insightful comments and suggestions on this work.
References

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Yoshua Bengio and Yann LeCun. 2015. Adam: A method for stochastic optimization. In ICLR.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In KDD, pages 193–202.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. In NAACL-HLT, pages 1315–1325.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. Reformer: The efficient transformer. In ICLR.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise self-attention for long document understanding. In EMNLP: Findings, pages 2555–2565.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. 2021. DA-Transformer: Distance-aware transformer. In NAACL-HLT, pages 2059–2068.

Chuhan Wu, Fangzhao Wu, Tao Qi, Xiaohui Cui, and Yongfeng Huang. 2020a. Attentive pooling with learnable norms for text representation. In ACL, pages 2961–2970.

Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2020b. Improving attention mechanism with query-value interaction. arXiv preprint arXiv:2010.03766.

Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020c. MIND: A large-scale dataset for news recommendation. In ACL, pages 3597–3606.

Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2019. Lite Transformer with long-short range attention. In ICLR.

Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. 2020. Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching. In CIKM, pages 1725–1734.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pages 5753–5763.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL-HLT, pages 1480–1489.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pages 5059–5069.