RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder

 Zheng Liu1†, Yingxia Shao2
 1: Huawei Technologies Ltd. Co., Shenzhen, China
 2: Beijing University of Posts and Telecommunications, Beijing, China

arXiv:2205.12035v1 [cs.CL] 24 May 2022

†. Part of the work was done when Zheng Liu was affiliated with Microsoft Research Asia.

Abstract

Pre-trained models have demonstrated superior power on many important tasks. However, designing effective pre-training strategies that promote the models' usability on dense retrieval is still an open problem. In this paper, we propose a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder, known as RetroMAE. Our proposed framework is highlighted by the following critical designs: 1) an MAE-based pre-training workflow, where the input sentence is polluted on both the encoder and decoder sides with different masks, and the original sentence is reconstructed based on both the sentence embedding and the masked sentence; 2) asymmetric model architectures, with a large-scale expressive transformer for sentence encoding and an extremely simplified transformer for sentence reconstruction; 3) asymmetric masking ratios, with a moderate masking ratio on the encoder side (15%) and an aggressive masking ratio on the decoder side (50∼90%). We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, where it notably outperforms the existing pre-trained models on a wide range of dense retrieval benchmarks, like MS MARCO, Open-domain Question Answering, and BEIR.

1 Introduction

Dense retrieval plays a fundamental role in many important web applications, like search engines and recommender systems (Xiong et al., 2020; Qu et al., 2020). By representing semantically correlated queries and documents as spatially close embeddings with dual-encoders, dense retrieval can be efficiently conducted via approximate nearest neighbour search, e.g., product quantization (Jegou et al., 2010) and HNSW (Malkov and Yashunin, 2018). In recent years, pre-trained language models have been widely utilized as the backbone of dual-encoders (Karpukhin et al., 2020; Xiong et al., 2020; Luan et al., 2021). However, the mainstream pre-trained models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), and T5 (Raffel et al., 2019), are typically pre-trained with token-level tasks, like masked language modeling and sequence-to-sequence. As a result, the sentence-level representation capability is not properly developed, which will probably restrict the quality of dense retrieval.

Aware of the above defect, continuous effort has been made to better prepare pre-trained models for dense retrieval tasks. One popular strategy is to take advantage of self-contrastive learning (SCL) (Chang et al., 2020; Izacard et al., 2021; Ni et al., 2021; Xu et al., 2022), where the models are pre-trained to recognize manually created positive samples from data augmentation. However, the SCL based methods can be limited by the data augmentation quality, and usually call for massive amounts of negative samples (Xu et al., 2022; He et al., 2020). Another popular strategy leverages auto-encoding (AE), which is free from the requirements on data augmentation and negative sampling. The AE based strategy is differentiated in terms of reconstruction tasks: the existing methods leverage masked language modeling (MLM) (Gao and Callan, 2021), replaced token detection (RTD) (Chuang et al., 2022), and auto-regression (Lu et al., 2021; Wang et al., 2021), whose impacts on reconstruction difficulty and data efficiency are highly different. So far, exploring more effective AE based pre-training algorithms is still an open problem.

In this paper, we propose a novel masked auto-encoding (MAE) framework to pre-train retrieval-oriented language models, known as RetroMAE (Figure 1). The proposed pre-training framework not only simplifies the existing AE based methods, but also gives rise to surprisingly competitive performances on downstream dense retrieval tasks. In particular, RetroMAE is featured for the following critical components and strategies.

 • Masked auto-encoding. The pre-training follows a novel masked auto-encoding process. The input sentence is polluted twice with two different masks. One masked input is used by the encoder, where the sentence embedding is generated. The other masked input is used by the decoder; together with the sentence embedding, it is used to reconstruct the original sentence.

 • Asymmetric structure. RetroMAE adopts an asymmetric model structure. The encoder is a large-scale transformer, e.g., BERT, which learns to generate a discriminative embedding for the input sentence. In contrast, the decoder follows an extremely simplified structure, e.g., a single transformer layer, which learns to reconstruct the masked sentence.

 • Asymmetric masking ratios. The encoder's input sentence is masked at a moderate ratio: 15%, which is the same as the typical MLM strategies. However, the decoder's input sentence is masked at a much more aggressive ratio: 50∼90%.

[Figure 1 diagram; brackets mark masked tokens. Encoder input (moderate mask: 15%): Norwegian Forest cat is a [breed] of domestic cat [originating] in Northern Europe. Decoder input (aggressive mask: 70%), joined with the sentence embedding: [Norwegian Forest cat] is [a breed] of domestic [cat originating] in [Northern Europe]. Decoder output (original sentence): Norwegian Forest cat is a breed of domestic cat originating in Northern Europe.]

Figure 1: RetroMAE. The encoder adopts a large-scale network, whose input is moderately masked. The decoder is extremely simple in structure (one-layer transformer); the input is aggressively masked; the original sentence is reconstructed based on the sentence embedding and the aggressively masked input.

The above designs of RetroMAE are favorable to pre-training effectiveness for the following reasons. Firstly, the auto-encoding task becomes much more challenging compared with the existing methods. Auto-regression may attend to the prefix during the decoding process, and the conventional MLM merely has a small portion (15%) of the input tokens masked. By comparison, RetroMAE aggressively masks most of the input during the decoding process, forcing the in-depth semantics to be encoded within the sentence embedding so as to ensure the reconstruction quality. Besides, the decoder is merely a one-layer transformer; the extremely simplified network further increases the difficulty of auto-encoding. Secondly, it ensures that training signals are fully generated from each pre-training sentence. For typical MLM style methods, the training signals may only be derived from 15% of the input tokens, whereas for RetroMAE, the training signals can be derived from the majority of the tokens. Besides, knowing that the decoder only contains one single layer, we propose the enhanced decoding with two-stream attention (Yang et al., 2019) and position-specific attention mask (Dong et al., 2019), where the training signals can be derived from all of the input tokens.

We perform comprehensive experimental studies with popular dense retrieval benchmarks, such as MS MARCO (Nguyen et al., 2016) and BEIR (Thakur et al., 2021). According to the evaluation results, RetroMAE notably improves both in-domain and out-of-domain performance in comparison with the existing generic and retrieval-oriented pre-trained language models.

2 Related works

The related works are reviewed from two aspects: dense retrieval and pre-trained language models.

Dense retrieval has become increasingly popular in recent years. It represents query and document as embeddings within the same latent space, where the semantic relationship between query and document can be measured based on the embedding similarity. As a result, dense retrieval can be efficiently conducted by leveraging approximate nearest neighbour search, like HNSW (Malkov and Yashunin, 2018) and Product Quantization (Jegou et al., 2010). The encoding model, i.e., the dual encoder, is fundamental to the retrieval quality. With the progress of deep learning, the model architecture has been continuously evolving, from simple linear transformations (Huang et al., 2013) to CNNs (Shen et al., 2014), RNNs (Kiros et al., 2015), etc. The advent of large-scale pre-trained language models, e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), T5 (Raffel et al., 2019), brings about a major leap forward for dense retrieval. Thanks to the equipment of such big expressive models, the retrieval quality
has been substantially enhanced (Karpukhin et al., 2020; Luan et al., 2021; Lin et al., 2021).

One important feature of the pre-trained language models is that the pre-training tasks are highly differentiated. One common practice is masked language modeling (MLM), as in BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), where the masked tokens are predicted based on the rest of the context. The basic MLM is extended by tasks like entity masking, phrase masking and span masking (Sun et al., 2019; Joshi et al., 2020), which substantially contribute to sequence labeling applications, such as entity resolution and question answering. Besides, tasks like auto-regression (Radford et al., 2018; Yang et al., 2019) and sequence-to-sequence (Raffel et al., 2019; Lewis et al., 2019) are also utilized, where the pre-trained models can be better fit into NLG related applications. However, all these generic approaches focus on token-level language modeling, where the sentence representation capability is not effectively developed (Chang et al., 2020). As a result, it usually calls for a large scale of labeled data (Nguyen et al., 2016; Kwiatkowski et al., 2019) and sophisticated fine-tuning strategies (Xiong et al., 2020; Qu et al., 2020) to guarantee the pre-trained models' performance on dense retrieval.

To mitigate the above problem, recent works target pre-training retrieval-oriented language models. The mainstream approaches are based on two different strategies: self-contrastive learning (SCL) and auto-encoding (AE). The SCL based methods (Chang et al., 2020; Guu et al., 2020; Xu et al., 2022) require specific data augmentation, e.g., inverse cloze task (ICT), where positive samples are generated for each anchor sentence; then, the language model is trained to discriminate the positive samples from the negative ones via contrastive learning. However, self-contrastive learning usually calls for huge amounts of negative samples, which is computationally challenging; besides, the pre-training effect can be severely restricted by the quality of data augmentation. The AE based methods are free from these restrictions: the language models learn to reconstruct the input sentence based on the sentence embedding. The existing methods utilize various reconstruction tasks, such as MLM (Gao and Callan, 2021), auto-regression (Lu et al., 2021; Wang et al., 2021), and replaced token detection (Chuang et al., 2022), etc., which are quite differentiated in terms of data efficiency and reconstruction difficulty. For example, MLM learns from the masked tokens, which only constitute 15% of each sentence; in contrast, the training signals can be derived from all the input tokens with auto-regression. Besides, the conventional auto-encoding leverages a large-scale decoder (Li et al., 2020; Wang et al., 2021), while in (Lu et al., 2021), a smaller decoder is used, which increases the reconstruction difficulty. Ideally, the auto-encoding should be data efficient, which ensures that the pre-training corpus is fully leveraged; meanwhile, it should also be made sufficiently challenging, which forces sentence embeddings to capture the in-depth semantics about the input sentences.

3 Methodology

We propose the following masked auto-encoding workflow for the retrieval-oriented pre-training. The model consists of two components: a BERT-like transformer $\Phi_{enc}(\cdot)$ for sentence representation, and a light-weight transformer $\Phi_{dec}(\cdot)$ for sentence reconstruction. An input sentence $X$ is masked as $\tilde{X}_{enc}$ and encoded for the sentence embedding $h_{\tilde{X}}$; the sentence is masked once again as $\tilde{X}_{dec}$, and together with $h_{\tilde{X}}$, the original sentence $X$ is decoded. Detailed elaborations about the above workflow are given as follows.

3.1 Encoding

The input sentence $X$ is polluted as $\tilde{X}_{enc}$ for the encoding stage, where a small fraction of its tokens are randomly replaced by the special token [M] (Figure 2. A). We apply a moderate masking ratio, 15% by default, where the majority of information about the original sentence can be preserved. Then, the encoder $\Phi_{enc}(\cdot)$ is used to map the polluted input to the sentence embedding $h_{\tilde{X}}$:

$$h_{\tilde{X}} \leftarrow \Phi_{enc}(\tilde{X}_{enc}). \tag{1}$$

We leverage a BERT-like transformer for the encoding operation, i.e., 12 layers and 768 hidden dimensions, where the in-depth semantics about the input may be effectively captured. Following the common practice, we select the [CLS] token's last-layer hidden state as the sentence embedding.

3.2 Decoding

The input sentence $X$ is polluted once again as $\tilde{X}_{dec}$ for the decoding stage (Figure 2. B). The
[Figure 2 diagram. Panels: (A) Encoding, (B) Decoding, (C) Enhanced decoding.]

Figure 2: Masked auto-encoding workflow. (A) Encoding: the input is moderately masked and encoded as the sentence embedding (the green rectangle). (B) Decoding: the input is aggressively masked, and joined with the sentence embedding to reconstruct the masked tokens (the shadowed tokens). (C) Enhanced decoding: all input tokens are reconstructed based on the sentence embedding and the visible context in each row (defined in Eq. 7); the main diagonal positions are filled with −∞ (grey), and positions for the visible context are filled with 0 (blue).
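
The two masking steps in panels (A) and (B) and the attention mask in panel (C) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, the helper names `mask_tokens` and `position_specific_mask`, the random seed, and the number of visible positions per row (`n_visible=4`) are all hypothetical choices made for the example.

```python
import math
import random

MASK = "[M]"


def mask_tokens(tokens, ratio, rng):
    """Replace a random `ratio` of the tokens with the [M] placeholder."""
    n_mask = max(1, round(ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else tok
            for i, tok in enumerate(tokens)]


def position_specific_mask(n_tokens, n_visible, rng):
    """Build an n_tokens x n_tokens additive attention mask.

    Row i holds 0 at the positions sampled as visible context for token i
    and -inf elsewhere. The diagonal is always -inf, so a token never
    attends to itself. `n_visible` is an assumed sample size.
    """
    mask = [[-math.inf] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        candidates = [j for j in range(n_tokens) if j != i]
        for j in rng.sample(candidates, n_visible):
            mask[i][j] = 0.0
    return mask


rng = random.Random(0)  # fixed seed, only for reproducibility of the demo
sentence = ("Norwegian Forest cat is a breed of domestic cat "
            "originating in Northern Europe").split()

encoder_input = mask_tokens(sentence, ratio=0.15, rng=rng)  # moderate mask
decoder_input = mask_tokens(sentence, ratio=0.70, rng=rng)  # aggressive mask
attention_mask = position_specific_mask(len(sentence), n_visible=4, rng=rng)

print(encoder_input)
print(decoder_input)
```

The two calls to `mask_tokens` draw independent masks, which mirrors the idea that the encoder and decoder see differently polluted versions of the same sentence; only the ratio differs.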

masking ratio is much more aggressive compared with the one used by the encoder, where 50∼90% of the input tokens will be masked. The masked input is joined with the sentence embedding, based on which the original input sentence can be reconstructed by the decoder. Particularly, the sentence embedding and the masked input are combined into the following sequence:

$$h_{\tilde{X}} \oplus E_{\tilde{X}_{dec}} = [h_{\tilde{X}}, e_{x_1}, \ldots, e_{x_N}]. \tag{2}$$

In the above equation, $e_{*}$ denotes the word embedding; $x_i$ equals the original token value if it is not masked, otherwise [M]. We add extra position embeddings $P$ to the above sequence, which forms the decoder's input. Finally, the decoder $\Phi_{dec}$ learns to reconstruct the original sentence $X$ by optimizing the following objective:

$$\min \sum_{x_i \in \text{masked}} \mathrm{CE}(x_i \mid h_{\tilde{X}} \oplus E_{\tilde{X}_{dec}} + P), \tag{3}$$

where CE indicates the cross-entropy loss. We adopt a light-weight network, i.e., a one-layer transformer by default, for the decoding operation. Given the aggressively masked input and the extremely simplified decoding network, it will force the generation of a high-quality sentence embedding such that the original input can be reconstructed with good fidelity.

3.3 Enhanced Decoding

One shortcoming of the above decoding process is that the training signals, i.e., the cross-entropy loss, may only be derived from the masked tokens, instead of all the input tokens. Besides, each masked token is reconstructed based on the same context, i.e., $h_{\tilde{X}} \oplus E_{\tilde{X}_{dec}} + P$. We argue that the pre-training effect can be further enhanced if 1) more training signals can be derived from the input, and 2) the reconstruction task can be performed based on different contexts. To this end, we propose the following enhanced decoding inspired by two-stream attention (Yang et al., 2019) and position-specific attention mask (Dong et al., 2019). Particularly, we will have the following two input sequences, $H_1$ and $H_2$, for decoding (Figure 2. C):

$$H_1 \leftarrow h_{\tilde{X}} + P, \quad H_2 \leftarrow E_X + P, \tag{4}$$

where $h_{\tilde{X}}$ is the sentence embedding, $E_X$ is the word embedding for the original tokens, and $P$ is the position embedding. Then, we introduce the position-specific attention mask $M \in \mathbb{R}^{L \times L}$, based on which the self-attention is performed as:

$$
\begin{gathered}
Q = H_1 W_Q, \quad K = H_2 W_K, \quad V = H_2 W_V; \\
M_{ij} = \begin{cases} 0, & \text{can be attended}, \\ -\infty, & \text{masked}; \end{cases} \\
A = \mathrm{softmax}\left(\frac{Q^{T}K}{\sqrt{d}} + M\right) V.
\end{gathered} \tag{5}
$$

The output $A$, together with $h_{\tilde{X}}$ (because of the residual connection), is used to reconstruct the original input (other operations in transformers, like layer-norm and FFN, are omitted for simplicity). Finally, the following objective will be optimized:

$$\min \sum_{x_i \in X} \mathrm{CE}(x_i \mid A, h_{\tilde{X}}). \tag{6}$$

Knowing that the decoder only consists of one single transformer layer, each token $x_i$ can only be
reconstructed based on the information that is visible to the $i$-th row of matrix $M$. Here, the following rules are applied to generate the attention mask matrix $M$:

$$M_{ij} = \begin{cases} 0, & x_j \in s(X_{\neq i}), \\ -\infty, & \text{otherwise}. \end{cases} \tag{7}$$

In the above equation, $s(X_{\neq i})$ represents the random sampling of the input tokens. The sampled tokens will be visible when reconstructing $x_i$. The diagonal elements, i.e., $x_i$ for the $i$-th row, will always be excluded, which means they will always be masked; therefore, each token cannot attend to itself during the reconstruction stage.

Algorithm 1: RetroMAE
  begin
    $\tilde{X}_{enc} \leftarrow \mathrm{mask}(X)$;
    $h_{\tilde{X}} \leftarrow \Phi_{enc}(\tilde{X}_{enc})$;
    $H_1 \leftarrow h_{\tilde{X}} + P$;   % for Q
    $H_2 \leftarrow E_X + P$;   % for K and V
    $M \leftarrow$ Eq. 7;
    $A \leftarrow$ based on $H_1$, $H_2$, $M$ as Eq. 5;
    model update w.r.t. Eq. 6;
  end

We summarize the pre-training workflow of the encoding and enhanced decoding as Algorithm 1, where the following features need to be emphasized. Firstly, the reconstruction task is challenging given the aggressive masking ratio and the lightweight network of the decoder. Secondly, we may derive abundant pre-training signals from the unsupervised corpus since all tokens within each input sentence can be used for the reconstruction task. Finally, the pre-training is made simple and efficient: 1) there are no requirements on sophisticated data augmentation and negative sampling, and 2) the training cost remains comparable to the conventional BERT/RoBERTa style pre-training due to the simplicity of the decoder.

4 Experiment

We explore the following issues in our experimental studies. 1) RetroMAE's impact on dense retrieval, in comparison with the generic pre-trained language models and the retrieval-oriented pre-trained models. 2) The impact resulting from the four key factors in RetroMAE: the enhanced decoding, the size of decoder, the decoder's masking ratio, and the encoder's masking ratio.

4.1 Experiment Settings

The following datasets are utilized for the pre-training and evaluation of RetroMAE.

 • Pre-training. We reuse the same pre-training corpus as the one utilized by BERT (Devlin et al., 2019), i.e., the English Wikipedia and BookCorpus. Both datasets are also frequently leveraged by previous works on retrieval-oriented pre-training, such as SEED (Lu et al., 2021) and Condenser (Gao and Callan, 2021).

 • Supervised evaluation. We make use of the MS MARCO passage retrieval dataset (Nguyen et al., 2016) to evaluate RetroMAE's performance after supervision. It is one of the large-scale datasets for dense retrieval evaluation, which consists of real-world questions from Bing search. The questions are paired with their corresponding passages from web documents, where human annotated ground-truth answers to the questions are included. RetroMAE is fine-tuned on its training set and evaluated on its dev set.

 • Zero-shot evaluation. We evaluate RetroMAE's zero-shot retrieval performance on the recently released BEIR benchmark (Thakur et al., 2021). It contains a total of 18 datasets, covering dense retrieval tasks across different domains, such as question answering, fact checking, bio-medical retrieval, news retrieval, and social media retrieval.

We consider three categories of baseline methods (footnote 1) in our experimental studies.

Footnote 1. We use the original checkpoints released by the authors.

 • Generic models. The following generic pre-trained language models are included in our experiments. 1) BERT (Devlin et al., 2019), which is the most popular pre-trained language model in practice. It is also widely used as the backbone for the fine-tuning of dense retrievers (Karpukhin et al., 2020; Xiong et al., 2020). 2) RoBERTa (Liu et al., 2019), which is an enhanced replication of BERT with improved training settings and substantially augmented training data. 3) ELECTRA (Clark et al., 2020), which introduces the generator-discriminator framework and the token replacement prediction task to further improve the pre-training effect.

 • Contrastive learning. The following contrastive learning based methods are considered. 1) SimCSE (Gao et al., 2021), where the language model learns to discriminate different views of the anchor sentence augmented by drop-
Methods MRR@10 MRR@100 Recall@10 Recall@100 Recall@1000
 BERT 0.3170 0.3278 0.5801 0.8570 0.9598
 RoBERTa 0.3139 0.3245 0.5595 0.8155 0.9351
 ELECTRA 0.3136 0.3258 0.5638 0.8478 0.9579
 SimCSE 0.3191 0.3307 0.5833 0.8537 0.9602
 LaPraDoR 0.3193 0.3307 0.5907 0.8653 0.9699
 SEED 0.3264 0.3374 0.5913 0.8535 0.9539
 DiffCSE 0.3202 0.3311 0.5832 0.8561 0.9607
 Condenser 0.3357 0.3471 0.6082 0.8770 0.9683
 RetroMAE 0.3501 0.3606 0.6306 0.8890 0.9757

 Table 1: Supervised dense retrieval performance on MS MARCO.
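
For readers less familiar with the metrics in Table 1, the following is a minimal sketch of how MRR@k and Recall@k are commonly computed for one query and then averaged over queries. The function names, passage ids, and toy rankings are hypothetical, and the official MS MARCO evaluation script may differ in details such as tie handling.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)


# Toy retrieval runs: (ranked passage ids, set of relevant passage ids).
runs = [
    (["p3", "p1", "p7"], {"p1"}),
    (["p9", "p4", "p2"], {"p2", "p8"}),
]

mean_mrr = sum(mrr_at_k(r, rel, k=10) for r, rel in runs) / len(runs)
mean_recall_at_2 = sum(recall_at_k(r, rel, k=2) for r, rel in runs) / len(runs)

print(f"MRR@10   = {mean_mrr:.4f}")
print(f"Recall@2 = {mean_recall_at_2:.4f}")
```

Under this reading, the differences between rows of Table 1 are absolute differences in these averaged scores.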

out. Despite its simplicity, SimCSE achieves quite promising results on semantic textual similarity tasks. 2) LaPraDoR (footnote 2) (Xu et al., 2022), which is a recently proposed unsupervised retriever based on contrastive learning. It notably enhances the previous ICT based methods (Guu et al., 2020; Chang et al., 2020) with the alternating training of query and document encoders, where the scale of negative samples can be greatly increased.

Footnote 2. The original LaPraDoR is an ensemble of dense retriever and BM25. We preserve its released dense retriever for our experimental studies.

 • Auto-encoding. We also make comparison with three pre-trained models following the auto-encoding framework. 1) Condenser (Gao and Callan, 2021), where the sentence embedding is joined with the intermediate hidden states from the encoder for the masked language modeling task. 2) SEED, in which the sentence embedding is used to reconstruct the original sentence via auto-regression. 3) DiffCSE (Chuang et al., 2022), which is a combination of SimCSE and auto-encoding based pre-training: on the one hand, the sentence embedding is learned by contrastive learning as in SimCSE; on the other hand, the sentence embedding is applied to the ELECTRA-style pre-training, i.e., the prediction of replaced tokens generated by a generator.

The implementation settings are specified as follows. The encoder backbone of RetroMAE is the same as BERT base, i.e., with 12 layers, 768 hidden dimensions, and a vocabulary of 30522 tokens. The decoder is a one-layer transformer. The encoder masking ratio is 0.15, and the decoder masking ratio is 0.5. The model is trained for 8 epochs, with the AdamW optimizer, a batch size of 32 for each device, and a learning rate of 1e-4. The training is performed on a cluster of 8× Nvidia A100 (40GB) GPUs. The model is implemented with PyTorch 1.8 and HuggingFace transformers 4.16. The evaluation settings are specified as follows. Both RetroMAE and the baselines are fine-tuned via DPR (Karpukhin et al., 2020) for their supervised performance on MS MARCO. The fine-tuned models on MS MARCO are directly applied to the BEIR benchmark for their zero-shot retrieval performances.

4.2 Experiment Analysis

4.2.1 Main Results

The supervised dense retrieval performances on MS MARCO are measured by MRR and Recall, and the results are shown in Table 1. The baseline methods are partitioned into three groups according to whether they are generic pre-trained models, contrastive learning based models, or auto-encoding based models. It can be observed that RetroMAE achieves much better retrieval performance than the rest of the baseline methods; for example, RetroMAE outperforms the strongest baseline (Condenser) by +1.44% on MRR@10 (0.3501 vs. 0.3357), and by +2.24% on Recall@10 (0.6306 vs. 0.6082), both in absolute terms. The zero-shot dense retrieval performances on the BEIR benchmark are measured by NDCG@10, and the results are shown in Table 2. Similar to what we find in Table 1, RetroMAE maintains notable advantages over the baseline methods on almost every evaluation dataset. All these observations verify RetroMAE's effectiveness as a retrieval-oriented pre-training framework.

Besides the overall observations, more detailed analyses of Tables 1 and 2 are made as follows.

Firstly, the auto-encoding based methods, both RetroMAE and baselines like SEED and Con-
Datasets BERT LaPraDoR SimCSE DiffCSE SEED Condenser RetroMAE
 TREC-COVID 0.649 0.495 0.524 0.492 0.612 0.754 0.756
 BioASQ 0.262 0.239 0.264 0.258 0.297 0.317 0.383
 NFCorpus 0.257 0.283 0.250 0.259 0.256 0.278 0.301
 NQ 0.438 0.415 0.412 0.412 0.425 0.459 0.490
 HotpotQA 0.478 0.488 0.502 0.499 0.528 0.537 0.638
 FiQA-2018 0.237 0.266 0.240 0.229 0.244 0.261 0.301
 Signal-1M(RT) 0.216 0.245 0.264 0.260 0.246 0.258 0.278
 TREC-NEWS 0.362 0.206 0.368 0.363 0.335 0.353 0.398
 Robust04 0.364 0.310 0.353 0.343 0.348 0.352 0.433
 ArguAna 0.357 0.503 0.436 0.468 0.347 0.375 0.481
 Touche-2020 0.270 0.178 0.178 0.168 0.180 0.223 0.243
 CQADupStack 0.284 0.326 0.295 0.305 0.285 0.316 0.382
 Quora 0.782 0.843 0.848 0.850 0.849 0.855 0.856
 DBPedia 0.298 0.328 0.304 0.303 0.324 0.331 0.385
 SCIDOCS 0.115 0.145 0.125 0.125 0.117 0.136 0.150
 FEVER 0.684 0.518 0.651 0.641 0.653 0.682 0.719
 Climate-FEVER 0.205 0.172 0.222 0.200 0.176 0.199 0.214
 SciFact 0.504 0.483 0.545 0.523 0.556 0.570 0.648
 Avg. Performance 0.376 0.358 0.377 0.372 0.377 0.403 0.448

 Table 2: Zero-shot dense retrieval performances on BEIR benchmark (measured by NDCG@10).
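
As a reference for the metric in Table 2, the following is a minimal sketch of NDCG@10 for a single query. It assumes linear gains with a log2 rank discount; the function names, document ids, and graded labels are hypothetical, and the BEIR tooling may differ in implementation details.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """DCG of the ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy graded judgments for one query: doc id -> relevance label.
qrels = {"d1": 2, "d2": 1}

perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)    # ideal ordering
imperfect = ndcg_at_k(["d3", "d1", "d2"], qrels)  # irrelevant doc ranked first

print(f"perfect ranking:   {perfect:.4f}")
print(f"imperfect ranking: {imperfect:.4f}")
```

The per-dataset scores in Table 2 would then be averages of such per-query values, and the last row is the mean over the 18 datasets.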

denser, are generally better than other methods. Such an observation indicates that auto-encoding is probably a more suitable paradigm for the pre-training of retrieval-oriented language models. We would also attribute RetroMAE's advantage over other auto-encoding based methods to its higher data efficiency and reconstruction difficulty. More analysis about this point will be made in the following discussions.

Secondly, the contrastive learning based methods merely bring very limited improvements over the generic pre-trained models, as can be observed from the comparison between SimCSE, LaPraDoR and BERT in Tables 1 and 2. In fact, similar observations are also made by a previous study (Gao and Callan, 2021): although contrastive learning may equip the pre-trained models with preliminary capability on dense retrieval, the advantage is almost wiped out when the models are fine-tuned with labeled data.

Thirdly, although RoBERTa and ELECTRA have proved to be more effective than BERT on generic NLU tasks, like GLUE and MRC, they are no better than BERT in dense retrieval scenarios. Such an observation validates once again that the conventional token-level pre-training contributes little to the models' dense retrieval capability; thus, retrieval-oriented pre-training is needed.

4.2.2 Ablation Studies

We ablate RetroMAE based on MS MARCO, where the following factors are analyzed: 1) decoding method, 2) decoder's size, 3) decoder's masking ratio, 4) encoder's masking ratio. The experiment results are shown in Table 3, where the following observations can be made.

First of all, we analyze the impact of the decoding method. It can be observed that the enhanced decoding outperforms the basic decoding with notable advantages. Such an empirical advantage can be explained by the higher data efficiency of the enhanced decoding. Particularly, the basic decoding (Section 3.2) samples 50% of the tokens (the default masking ratio for the decoder) for reconstruction, and all of the masked tokens are predicted based on the same context. In contrast, the enhanced decoding (Section 3.3) may use all of the input tokens for reconstruction, and each of the tokens is predicted based on a unique context sampled as in Eq. 7. Therefore, the enhanced decoding may obtain augmented and diverse training signals from the input data.
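
As a rough illustration of this data-efficiency argument, consider a sentence of N = 128 tokens. The length is a hypothetical value chosen for the arithmetic, not a setting reported in this paper. The number of tokens that contribute to the cross-entropy loss per sentence would then be approximately:

```latex
\begin{aligned}
\text{MLM (15\% masked):} \quad & 0.15 \times 128 \approx 19 \\
\text{Basic decoding (50\% masked):} \quad & 0.50 \times 128 = 64 \\
\text{Enhanced decoding (all tokens):} \quad & 1.00 \times 128 = 128
\end{aligned}
```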
Factor        Setting      MRR@10   MRR@100   Recall@10   Recall@100   Recall@1000
Decoding      Basic        0.3434   0.3548    0.6166      0.8832       0.9722
Decoding      Enhanced     0.3501   0.3606    0.6306      0.8890       0.9757
Height (De)   Hde = 1      0.3455   0.3568    0.6212      0.8818       0.9716
Height (De)   Hde = 2      0.3434   0.3548    0.6166      0.8832       0.9722
Mask (De)     γde = 0.3    0.3479   0.3596    0.6242      0.8890       0.9739
Mask (De)     γde = 0.5    0.3501   0.3606    0.6306      0.8890       0.9757
Mask (De)     γde = 0.7    0.3490   0.3604    0.6289      0.8910       0.9755
Mask (De)     γde = 0.9    0.3491   0.3602    0.6272      0.8870       0.9745
Mask (En)     γen = 0.15   0.3501   0.3606    0.6306      0.8890       0.9757
Mask (En)     γen = 0.3    0.3553   0.3665    0.6356      0.8922       0.9763
Mask (En)     γen = 0.5    0.3537   0.3649    0.6337      0.8910       0.9742

 Table 3: Ablation studies of RetroMAE based on MS MARCO.

Secondly, we analyze the impact of the size of the decoder. Here, we use two different decoders for comparison: 1) the decoder with one single transformer layer (Hde = 1), and 2) the decoder with two transformer layers (Hde = 2). It can be found that the smaller decoder, which increases the difficulty of input reconstruction, gives rise to better empirical performances.

Thirdly, we further analyze the impact of different masking ratios of the decoder (γde), which is increased from 0.3 to 0.9. It can be observed that the decoder with an aggressive masking ratio, i.e., 0.5∼0.7, results in a relatively better empirical performance. However, further increasing the masking ratio, i.e., γde = 0.9, does not bring in additional improvements. It is probably because an overly aggressive masking ratio will discard too much of the information needed to reconstruct the original sentence.

Lastly, we also analyze the encoder's masking ratio (γen). It is quite interesting that a slightly increased masking ratio, i.e., γen = 0.3, achieves better performances than the default one, γen = 0.15. Similar to the decoder's situation, an increased masking ratio on the encoder side will also increase the reconstruction difficulty. However, the empirical performance will not benefit from an even larger masking ratio, and the ideal value of γen is smaller than that of γde. This is because too large a γen will prevent the generation of a high-quality sentence embedding, considering that too much useful information about the input sentence will be discarded.

4.2.3 Discussion

We draw the following conclusions based on our experimental findings in this paper. Firstly, the auto-encoding framework demonstrates strong potential in pre-training retrieval-oriented language models, and RetroMAE brings substantial improvements over the existing auto-encoding based methods. Secondly, RetroMAE's performance is optimized by the enhanced decoding strategy, the simplified network of the decoder, and the proper setting of the masking ratios.

5 Conclusion

In this paper, we propose RetroMAE, which pre-trains retrieval-oriented language models based on masked auto-encoding: the input sentence is randomly masked twice for encoder and decoder, respectively; the sentence embedding from the encoder is joined with the masked input of the decoder to reconstruct the original sentence. We take advantage of an asymmetric model structure (full-scale encoder and simplified decoder) and asymmetric masking ratios (a moderate one for the encoder and an aggressive one for the decoder), which increase the difficulty of reconstruction. We also introduce the enhanced decoding with two-stream attention and position-specific attention mask, which increases the data efficiency of pre-training. The experimental studies on MS MARCO and the BEIR benchmark validate RetroMAE's effectiveness, as significant improvements in retrieval quality can be achieved against the existing methods.
