Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling
Chuhan Wu†, Fangzhao Wu‡, Tao Qi†, Yongfeng Huang†
†Department of Electronic Engineering & BNRist, Tsinghua University, Beijing 100084, China
‡Microsoft Research Asia, Beijing 100080, China
{wuchuhan15, wufangzhao, taoqi.qt}@gmail.com, yfhuang@tsinghua.edu.cn

Abstract

Transformer is important for text modeling. However, it has difficulty in handling long documents due to its quadratic complexity with respect to input text length. To handle this problem, we propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling. Hi-Transformer models documents in a hierarchical way, i.e., it first learns sentence representations and then learns document representations. It can effectively reduce the complexity and meanwhile capture the global document context in the modeling of each sentence. More specifically, we first use a sentence Transformer to learn the representation of each sentence. Then we use a document Transformer to model the global document context from these sentence representations. Next, we use another sentence Transformer to enhance sentence modeling with the global document context. Finally, we use a hierarchical pooling method to obtain the document embedding. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.

1 Introduction

Transformer (Vaswani et al., 2017) is an effective architecture for text modeling and has become an essential component of many state-of-the-art NLP models such as BERT (Devlin et al., 2019; Radford et al., 2019; Yang et al., 2019; Wu et al., 2021). The standard Transformer computes a dense self-attention matrix over the interactions between every pair of tokens in the text, so its computational complexity grows with the square of the text length (Vaswani et al., 2017; Wu et al., 2020b). It is therefore difficult for Transformer to model long documents efficiently (Child et al., 2019).

There are several methods to accelerate Transformer for long document modeling (Wu et al., 2019; Kitaev et al., 2019; Wang et al., 2020; Qiu et al., 2020). One direction is to use Transformer in a hierarchical manner to reduce the sequence length, e.g., first learning sentence representations and then learning document representations from the sentence representations (Zhang et al., 2019; Yang et al., 2020). However, the modeling of sentences is agnostic to the global document context, which may be suboptimal because the local context within a sentence is usually insufficient. Another direction is to use a sparse self-attention matrix instead of a dense one. For example, Beltagy et al. (2020) proposed to combine local self-attention within a dilated sliding window with sparse global attention. Zaheer et al. (2020) proposed a random sparse attention mechanism that models the interactions between a random set of tokens. However, these methods cannot fully model the global context of a document (Tay et al., 2020).

In this paper, we propose a hierarchical interactive Transformer (Hi-Transformer)[1] for efficient and effective long document modeling, which models documents in a hierarchical way to effectively reduce the complexity while capturing the global document context for sentence modeling. In Hi-Transformer, we first use a sentence Transformer to learn the representation of each sentence within a document. Next, we use a document Transformer to model the global document context from these sentence representations. Then, we use another sentence Transformer to further improve the modeling of each sentence with the help of the global document context. Finally, we use a hierarchical pooling method to obtain the document representation. Extensive experiments are conducted on three benchmark datasets. The results show that Hi-Transformer is both efficient and effective in long document modeling.

[1] https://github.com/wuch15/HiTransformer
[Figure 1: The architecture of Hi-Transformer.]

2 Hi-Transformer

In this section, we introduce our hierarchical interactive Transformer (Hi-Transformer) approach for efficient and effective long document modeling. Its framework is shown in Fig. 1. It uses a hierarchical architecture that first models the contexts within each sentence, next models the document context by capturing the interactions between sentences, then employs the global document context to enhance sentence modeling, and finally uses hierarchical pooling techniques to obtain the document embedding. In this way, the input sequence of each Transformer is much shorter than directly taking the word sequence of the whole document as input, while the global context can still be fully modeled. The details of Hi-Transformer are introduced as follows.

2.1 Model Architecture

Hi-Transformer mainly contains three modules, i.e., sentence context modeling, document context modeling, and global document context-enhanced sentence modeling. The sentence-level context is first modeled by a sentence Transformer. Assume a document contains $M$ sentences, and the words in the $i$-th sentence are denoted as $[w_{i,1}, w_{i,2}, ..., w_{i,K}]$ ($K$ is the sentence length). We insert a "[CLS]" token (denoted as $w_s$) after the end of each sentence. This token is used to convey the contextual information within the sentence. The sequence of words in each sentence is first converted into a word embedding sequence via a word and position embedding layer. Denote the word embedding sequence for the $i$-th sentence as $[e_{i,1}, e_{i,2}, ..., e_{i,K}, e_s]$. Since sentence length is usually short, we apply a sentence Transformer to each sentence to fully model the interactions between the words within this sentence. It takes the word embedding sequence as input and outputs the contextual representations of the words, denoted as $[h_{i,1}, h_{i,2}, ..., h_{i,K}, h^s_i]$. Specifically, the representation $h^s_i$ of the "[CLS]" token is regarded as the sentence representation.
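To make the sentence-level step concrete, the following sketch shows one way such a sentence Transformer could be implemented. It is only an illustrative sketch under stated assumptions, not the authors' released Keras/TensorFlow code: the PyTorch modules, the class name `SentenceEncoder`, the assumed maximum sentence length of 512, and the single-layer encoder are choices made here for brevity.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Sketch of the first sentence Transformer: each sentence (with a
    trailing [CLS] token) is encoded independently of the other sentences."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(512, d_model)  # word position embedding (assumed max length)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, tokens):
        # tokens: (num_sentences, K + 1) word ids, with [CLS] appended at the end.
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.word_emb(tokens) + self.pos_emb(positions)  # (num_sentences, K+1, d)
        h = self.encoder(x)                                   # contextual word representations
        return h[:, :-1], h[:, -1]  # word reps and the sentence rep taken from the [CLS] slot
```

Here the output at the [CLS] position plays the role of $h^s_i$ in the description above.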
Next, the document-level context is modeled by a document Transformer from the representations of the sentences within the document. Denote the embedding sequence of the sentences in this document as $[h^s_1, h^s_2, ..., h^s_M]$. We add a sentence position embedding (denoted as $p_i$ for the $i$-th sentence) to the sentence representations to capture sentence order. We then apply a document Transformer to these sentence representations to capture the global context of the document and learn document context-aware sentence representations, denoted as $[r^s_1, r^s_2, ..., r^s_M]$.

Then, we use the document context-aware sentence representations to further improve sentence context modeling by propagating the global document context to each sentence. Motivated by Guo et al. (2019), we apply another sentence Transformer to the hidden word representations together with the document context-aware sentence representation of each sentence. It outputs a document context-aware word representation sequence for each sentence, denoted as $[d_{i,1}, d_{i,2}, ..., d_{i,K}, d^s_i]$. In this way, the contextual representations of words can benefit from both the local sentence context and the global document context.

By stacking multiple Hi-Transformer layers, the contexts within a document can be fully modeled. Finally, we use hierarchical pooling (Wu et al., 2020a) techniques to obtain the document embedding. We first aggregate the document context-aware word representations in each sentence into a global context-aware sentence embedding $s_i$, and then aggregate the global context-aware sentence embeddings within a document into a unified document embedding $d$, which is further used for downstream tasks.
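Putting the three modules together, one Hi-Transformer layer could be sketched as below. Again, this is a hedged illustration rather than the released implementation: the PyTorch encoders, the choice to write the document-aware sentence representation $r^s_i$ back into the [CLS] slot before the second sentence Transformer, and the tensor shapes are assumptions that merely follow the description above.

```python
import torch
import torch.nn as nn

def encoder(d_model=256, n_heads=8):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=1)

class HiTransformerLayer(nn.Module):
    """Sketch: sentence Transformer -> document Transformer -> context-enhanced sentence Transformer."""

    def __init__(self, d_model=256, n_heads=8, max_sents=64):
        super().__init__()
        self.sent_encoder = encoder(d_model, n_heads)   # models words within each sentence
        self.doc_encoder = encoder(d_model, n_heads)    # models interactions between sentences
        self.sent_encoder2 = encoder(d_model, n_heads)  # propagates document context back to words
        self.sent_pos_emb = nn.Embedding(max_sents, d_model)

    def forward(self, x):
        # x: (B, M, K+1, d) word embeddings per sentence, [CLS] slot last.
        B, M, L, d = x.shape
        h = self.sent_encoder(x.reshape(B * M, L, d))             # (B*M, K+1, d)
        h_s = h[:, -1].reshape(B, M, d)                           # sentence reps h_i^s from [CLS]
        pos = self.sent_pos_emb(torch.arange(M, device=x.device))
        r_s = self.doc_encoder(h_s + pos)                         # document context-aware sentence reps
        # Feed the document-aware sentence representation back with the word
        # representations and re-encode each sentence.
        h = torch.cat([h.reshape(B, M, L, d)[:, :, :-1], r_s.unsqueeze(2)], dim=2)
        return self.sent_encoder2(h.reshape(B * M, L, d)).reshape(B, M, L, d)
```

Stacking two such layers and pooling the outputs as described above yields the document embedding $d$.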
2.2 Efficiency Analysis

In this section, we provide some discussion of the computational complexity of Hi-Transformer. In sentence context modeling and document context propagation, the total computational complexity is $O(M \cdot K^2 \cdot d)$, where $M$ is the number of sentences in a document, $K$ is the sentence length, and $d$ is the hidden dimension. In document context modeling, the computational complexity is $O(M^2 \cdot d)$. Thus, the total computational cost is $O(M \cdot K^2 \cdot d + M^2 \cdot d)$.[2] Compared with the standard Transformer, whose computational complexity is $O(M^2 \cdot K^2 \cdot d)$, Hi-Transformer is much more efficient.

[2] Note that Hi-Transformer can be combined with other existing efficient Transformer techniques to further improve the efficiency of long document modeling.
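As a rough, back-of-the-envelope illustration of this gap, the short snippet below plugs representative (assumed, not reported) values into the two complexity expressions; only the scaling matters, since constant factors are ignored.

```python
# Assumed illustrative values: M sentences per document, K words per sentence, hidden size d.
M, K, d = 25, 20, 256

standard = (M * K) ** 2 * d             # full self-attention over all M*K tokens
hierarchical = M * K**2 * d + M**2 * d  # Hi-Transformer: sentence-level + document-level terms

print(f"standard:     {standard:,}")    # 64,000,000
print(f"hierarchical: {hierarchical:,}")  # 2,720,000
print(f"ratio:        {standard / hierarchical:.1f}x")  # ~23.5x
```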
3 Experiments

3.1 Datasets and Experimental Settings

Our experiments are conducted on three benchmark document modeling datasets. The first one is Amazon Electronics (He and McAuley, 2016) (denoted as Amazon), which is used for product review rating prediction.[3] The second one is IMDB (Diao et al., 2014), a widely used dataset for movie review rating prediction.[4] The third one is the MIND dataset (Wu et al., 2020c), a large-scale dataset for news intelligence.[5] We use the content-based news topic classification task on this dataset. The detailed dataset statistics are shown in Table 1.

In our experiments, we use the 300-dimensional pre-trained GloVe (Pennington et al., 2014) embeddings to initialize the word embeddings. We use two Hi-Transformer layers in our approach and two Transformer layers in the baseline methods.[6] We use attentive pooling (Yang et al., 2016) to implement the hierarchical pooling module. The hidden dimension is set to 256, i.e., 8 self-attention heads in total with an output dimension of 32 per head. Due to the limitation of GPU memory, the input sequence lengths of the vanilla Transformer and of its variants designed for long documents are 512 and 2048, respectively. The dropout (Srivastava et al., 2014) ratio is 0.2. The optimizer is Adam (Kingma and Ba, 2015), and the learning rate is 1e-4. The maximum number of training epochs is 3. The models are implemented using the Keras library with the TensorFlow backend. The GPU we used is a GeForce GTX 1080 Ti with 11 GB of memory. We use accuracy and macro-F scores as the performance metrics. We repeat each experiment 5 times and report both the average results and the standard deviations.

[3] https://jmcauley.ucsd.edu/data/amazon/
[4] https://github.com/nihalb/JMARS
[5] https://msnews.github.io/
[6] We also tried more Transformer layers for the baseline methods but did not observe significant performance improvements in our experiments.

Dataset   #Train   #Val    #Test   Avg. #word   Avg. #sent   #Class
Amazon    40.0k    5.0k    5.0k    133.38       6.17         5
IMDB      108.5k   13.6k   13.6k   385.70       15.29        10
MIND      128.8k   16.1k   16.1k   505.46       25.14        18
Table 1: Statistics of datasets.
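Section 3.1 states that the hierarchical pooling module is implemented with attentive pooling (Yang et al., 2016). A minimal sketch of such a module, applied first over the words of each sentence and then over the sentence embeddings, might look as follows; the class name, shapes, and PyTorch implementation are assumptions for illustration rather than the authors' code.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Minimal attentive pooling: a learned query scores each element and
    the output is the attention-weighted sum."""

    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.query = nn.Linear(d_model, 1, bias=False)

    def forward(self, x):
        # x: (..., N, d) -> (..., d)
        scores = self.query(torch.tanh(self.proj(x)))  # (..., N, 1)
        weights = torch.softmax(scores, dim=-2)
        return (weights * x).sum(dim=-2)

# Hierarchical use: pool words into sentence embeddings s_i, then sentences into d.
word_pool, sent_pool = AttentivePooling(), AttentivePooling()
word_reps = torch.randn(2, 8, 32, 256)  # dummy (batch, sentences, words, hidden) input
sentence_emb = word_pool(word_reps)     # (2, 8, 256)
doc_emb = sent_pool(sentence_emb)       # (2, 256)
```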
3.2 Performance Evaluation

We compare Hi-Transformer with several baselines, including: (1) Transformer (Vaswani et al., 2017), the vanilla Transformer architecture; (2) Longformer (Beltagy et al., 2020), a Transformer variant with local and global attention for long documents; (3) BigBird (Zaheer et al., 2020), which extends Longformer with random attention; and (4) HI-BERT (Zhang et al., 2019), which uses Transformers at both the word and sentence levels. The results of these methods on the three datasets are shown in Table 2. We find that Transformers designed for long documents, such as Hi-Transformer and BigBird, outperform the vanilla Transformer. This is because the vanilla Transformer cannot handle long sequences due to the restriction of computational resources, and truncating the input sequence leads to the loss of much useful contextual information. In addition, Hi-Transformer and HI-BERT outperform Longformer and BigBird. This is because the sparse attention mechanism used in Longformer and BigBird cannot fully model the global contexts within a document. Besides, Hi-Transformer achieves the best performance, and t-test results show that its improvements over the baselines are significant. This is because Hi-Transformer can incorporate global document contexts to enhance sentence modeling.

We also compare the computational complexity of these methods in Table 3. The complexity of Hi-Transformer is much lower than that of the vanilla Transformer and is comparable with the other Transformer variants designed for long documents. These results indicate the efficiency and effectiveness of Hi-Transformer.

                 Amazon                   IMDB                     MIND
Methods          Accuracy    Macro-F      Accuracy    Macro-F      Accuracy    Macro-F
Transformer      65.23±0.38  42.23±0.37   51.98±0.48  42.76±0.49   80.96±0.22  59.97±0.24
Longformer       65.35±0.44  42.45±0.41   52.33±0.40  43.51±0.42   81.42±0.25  62.68±0.26
BigBird          66.05±0.48  42.89±0.46   52.87±0.51  43.79±0.50   81.81±0.29  63.44±0.31
HI-BERT          66.56±0.32  42.65±0.34   52.96±0.46  43.84±0.46   81.89±0.23  63.63±0.20
Hi-Transformer   67.24±0.35  43.69±0.32   53.78±0.49  44.54±0.47   82.51±0.25  64.22±0.22
Table 2: The results of different methods on different datasets.

Method           Complexity
Transformer      O(M^2 · K^2 · d)
Longformer       O(T · M · K · d)
BigBird          O(T · M · K · d)
HI-BERT          O(M · K^2 · d + M^2 · d)
Hi-Transformer   O(M · K^2 · d + M^2 · d)
Table 3: Complexity of different methods. K is the sentence length, M is the number of sentences in a document, T is the number of positions for sparse attention, and d is the hidden dimension.

3.3 Model Effectiveness

Next, we verify the effectiveness of the global document contexts for enhancing sentence modeling in Hi-Transformer. We compare Hi-Transformer with its variant without global document contexts in Fig. 2. We find that the performance consistently declines when the global document contexts are not encoded into the sentence representations. This is because the local contexts within a single sentence may be insufficient for accurate sentence modeling, while the global contexts of the entire document can provide rich complementary information for sentence understanding. Thus, propagating the document contexts to enhance sentence modeling can improve long document modeling.

3.4 Influence of Text Length

Then, we study the influence of text length on model performance and computational cost. Since the documents in the MIND dataset are the longest, we conduct experiments on MIND to compare the model performance as well as the training time per layer of Transformer and Hi-Transformer under different input text lengths[7], and the results are shown in Fig. 3. We find that the performance of both methods improves when longer text sequences are used. This is intuitive because more information can be incorporated when longer text is input to the model. However, the computational cost of Transformer grows very fast, which limits its maximal input text length. Different from Transformer, Hi-Transformer is much more efficient and meanwhile achieves better performance with longer sequence lengths. These results further verify the efficiency and effectiveness of Hi-Transformer in long document modeling.

[7] The maximum length of Transformer is 512 due to the GPU memory limitation.
[Figure 2: Effectiveness of global document context propagation in Hi-Transformer. Accuracy and Macro-F of Hi-Transformer with and without the global document context on (a) Amazon, (b) IMDB, and (c) MIND.]

[Figure 3: Influence of input text length on performance and training time on the MIND dataset: (a) accuracy, (b) Macro-F, and (c) training time per layer of Transformer and Hi-Transformer, for input text lengths from 64 to 2048.]

4 Conclusion

In this paper, we propose a Hi-Transformer approach for both efficient and effective long document modeling. It incorporates a hierarchical architecture that first learns sentence representations and then learns document representations. It can effectively reduce the computational complexity and meanwhile be aware of the global document contexts in sentence modeling to help understand document content accurately. Extensive experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant numbers U1936216, U1936208, U1836204, and U1705261. We are grateful to Xing Xie, Shaoyu Zhou, Dan Shen, and Zhisong Wang for their insightful comments and suggestions on this work.
References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In KDD, pages 193–202.

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. 2019. Star-Transformer. In NAACL-HLT, pages 1315–1325.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2019. Reformer: The efficient transformer. In ICLR.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Jiezhong Qiu, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. 2020. Blockwise self-attention for long document understanding. In EMNLP: Findings, pages 2555–2565.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Sinong Wang, Belinda Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.

Chuhan Wu, Fangzhao Wu, and Yongfeng Huang. 2021. DA-Transformer: Distance-aware transformer. In NAACL-HLT, pages 2059–2068.

Chuhan Wu, Fangzhao Wu, Tao Qi, Xiaohui Cui, and Yongfeng Huang. 2020a. Attentive pooling with learnable norms for text representation. In ACL, pages 2961–2970.

Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2020b. Improving attention mechanism with query-value interaction. arXiv preprint arXiv:2010.03766.

Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu, et al. 2020c. MIND: A large-scale dataset for news recommendation. In ACL, pages 3597–3606.

Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2019. Lite Transformer with long-short range attention. In ICLR.

Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. 2020. Beyond 512 tokens: Siamese multi-depth transformer-based hierarchical encoder for long-form document matching. In CIKM, pages 1725–1734.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pages 5753–5763.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In NAACL-HLT, pages 1480–1489.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pages 5059–5069.