2023 Unifier: A Unified Retriever for Large-Scale Retrieval

From GM-RKB

Subject Headings: RAG Algorithm.

Notes

Cited By

Quotes

Abstract

Large-scale retrieval aims to recall relevant documents from a huge collection given a query. It relies on representation learning to embed documents and queries into a common semantic encoding space. According to the encoding space, recent retrieval methods based on pre-trained language models (PLMs) can be coarsely categorized into either the dense-vector or the lexicon-based paradigm. These two paradigms unveil the PLMs' representation capability at different granularities, i.e., global sequence-level compression and local word-level contexts, respectively. Inspired by their complementary global-local contextualization and distinct representing views, we propose a new learning framework, Unifier, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability. Experiments on passage retrieval benchmarks verify its effectiveness in both paradigms. A uni-retrieval scheme is further presented with even better retrieval quality. We lastly evaluate the model on the BEIR benchmark to verify its transferability.

KEYWORDS

deep representation learning; pre-trained language model; neural encoder; hybrid retrieval

1 INTRODUCTION

Large-scale retrieval aims to efficiently fetch all relevant documents for a given query from a large-scale collection with millions or billions of entries [1]. It plays an indispensable role as a prerequisite for a broad spectrum of downstream tasks, e.g., information retrieval [2] and open-domain question answering [3]. To make online large-scale retrieval possible, the common practice is to represent queries and documents with an encoder in a Siamese manner (i.e., a Bi-Encoder, BE) [39]. Hence, its success depends heavily on a powerful encoder obtained via effective representation learning.
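To make the Bi-Encoder setup concrete, the following is a minimal sketch (not the paper's model) of Siamese encoding with an off-the-shelf PLM: queries and documents are encoded independently by the same encoder, and relevance is a simple dot product, which is what makes offline indexing and large-scale online retrieval feasible. The [CLS] pooling choice here is only one common convention.

```python
# Minimal Bi-Encoder (Siamese) retrieval sketch: queries and documents are
# encoded independently by the SAME PLM, then scored by dot product.
# Illustrative only; pooling and model choice are assumptions, not the paper's.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]          # [CLS] pooling -> (batch, hidden)

queries = ["what is dense retrieval"]
documents = ["Dense retrieval embeds queries and documents into one vector space.",
             "BM25 is a classic lexical ranking function."]

q_vecs, d_vecs = encode(queries), encode(documents)
scores = q_vecs @ d_vecs.T                       # relevance = dot product
print(scores)                                    # higher score = more relevant
```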

Advanced by pre-trained language models (PLMs), e.g., BERT [9], recent works propose to learn PLM-based encoders for large-scale retrieval, which are coarsely grouped into two paradigms in light of their encoding spaces with different focuses of representation granularity. That is, dense-vector encoding methods leverage sequence-level compressive representations that are embedded into a dense semantic space [14, 24, 51, 54], whereas lexicon-based encoding methods make the best of word-level contextual representations by considering either high concurrence [36] or coordinate terms [12] in PLMs. To gather the powers of both worlds, some pioneering works propose hybrid methods to strike a sweet spot between dense-vector and lexicon-based methods for better retrieval quality. They focus on interactions between the predicted scores of the two paradigms.

Nonetheless, such surface interactions – score aggregations [25], direct co-training [16], and logits distillations [5] – cannot fully exploit the benefits of the two paradigms, as they disregard their complementary contextual features and distinct representation views. Specifically, as for contextual features, the dense-vector models focus more on sequence-level global embeddings against the information bottleneck [13, 14, 31], whereas the lexicon-based models focus on word-level local contextual embeddings for precise lexicon weighting [10, 11, 36]. Aligning the two retrieval paradigms more closely is likely to benefit both, since global-local contexts are proven complementary in general representation learning [1, 44]. As for representing views, relying on distinct encoding spaces, the two retrieval paradigms are proven to provide different views of query-document relevance [15, 16, 25]. Such 'dual views' have been proven pivotal in many previous cooperative learning works [4, 17, 18, 27], which provides a great opportunity to bridge the two retrieval paradigms. Consequently, without any in-depth interactions, neither a single (dense/lexicon) nor a hybrid retrieval model can be optimal.

Motivated by the above, we propose a brand-new learning framework, Unified Retriever (Unifier), for in-depth mutual benefits of both dense-vector and lexicon-based retrieval. On the one hand, we present a neural encoder with dual representing modules for Unifier, which is compatible with both retrieval paradigms. Built upon an underlying tied contextualization that empowers consistent semantics sharing, a local-enhanced sequence representation module is presented to learn a dense-vector representation model. Meanwhile, a global-aware lexicon weighting module considering both the global and local context is proposed for a lexicon-based representation. On the other hand, we propose a new self-learning strategy, called dual-consistency learning, upon our unified encoder. Besides a basic contrastive learning objective, we first exploit the unified dual representing modules by mining diverse hard negatives for self-adversarial learning within Unifier. Furthermore, we present a self-regularization method based on list-wise agreements between the dual views for better consistency and generalization.
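To illustrate the general shape of such a dual-representing encoder, the sketch below shares one PLM backbone between a sequence-level dense head and a word-level lexicon-weighting head. It is not the authors' exact architecture: the [CLS] pooling for the dense view and the SPLADE-style log-saturated, max-pooled term weights for the lexicon view are common stand-ins used here purely for illustration.

```python
# Generic dual-representation encoder sketch: one shared backbone, two heads.
# NOT the paper's exact local-enhanced / global-aware modules.
import torch
import torch.nn as nn
from transformers import AutoModelForMaskedLM

class DualEncoder(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.backbone = AutoModelForMaskedLM.from_pretrained(name)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        hidden = out.hidden_states[-1]                 # (B, L, H) shared contextualization
        dense = hidden[:, 0]                           # (B, H) sequence-level dense vector
        logits = out.logits                            # (B, L, |V|) per-token lexicon logits
        weights = torch.log1p(torch.relu(logits))      # saturate term-importance weights
        mask = attention_mask.unsqueeze(-1)
        lexicon = (weights * mask).max(dim=1).values   # (B, |V|) sparse lexicon vector
        return dense, lexicon
```

Both outputs can then be trained with a contrastive objective against the same relevance labels, which is the setting in which dual-consistency learning (hard-negative sharing and list-wise agreement) would apply.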

After being trained, Unifier performs large-scale retrieval via either its lexicon representations through an efficient inverted index or its dense vectors through parallelizable dot-products. Moreover, empowered by our Unifier, we present a fast yet effective retrieval scheme, uni-retrieval, to gather the powers of both worlds, where lexicon retrieval is followed by candidate-constrained dense scoring. Empirically, we evaluate Unifier on passage retrieval benchmarks to check its effectiveness and on the BEIR benchmark [48] with twelve datasets (e.g., Natural Questions, HotpotQA) to verify its transferability.
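The two-stage idea behind such a scheme can be sketched as follows: a cheap sparse (lexicon) pass proposes a candidate set, and only those candidates receive dense scoring. This is an illustrative sketch under assumed inputs (precomputed dense and lexicon vectors held as arrays); the simple additive fusion and candidate count are assumptions, not the paper's exact procedure, and a production system would serve the sparse pass from an inverted index.

```python
# Two-stage lexicon-then-dense retrieval sketch (illustrative fusion rule).
import numpy as np

def uni_retrieve(q_lex, q_dense, doc_lex, doc_dense, k_candidates=1000, k_final=10):
    # Stage 1: sparse lexicon matching (in practice served by an inverted index).
    lex_scores = doc_lex @ q_lex                       # (N,)
    cand = np.argsort(-lex_scores)[:k_candidates]
    # Stage 2: dense scoring restricted to the candidate set only.
    dense_scores = doc_dense[cand] @ q_dense           # (k_candidates,)
    fused = lex_scores[cand] + dense_scores            # illustrative score fusion
    order = np.argsort(-fused)[:k_final]
    return cand[order], fused[order]                   # doc ids and fused scores
```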

2 RELATED WORK

PLM-based Retriever. Built upon PLMs, recent works propose to learn encoders for large-scale retrieval, which are coarsely grouped into two paradigms in light of their encoding spaces with different focuses of representation granularity: (i) Dense-vector encoding methods directly represent a document/query as a low-dimensional sequence-level dense vector u ∈ R^e (e is the embedding size and usually small, e.g., 768). The relevance score between a document and a query is then calculated by dot-product or cosine similarity [14, 24, 51, 54]. (ii) Lexicon-based encoding methods make the best of word-level contextualization by considering either high concurrence [36] or coordinate terms [12] in PLMs. They first weight all vocabulary lexicons for each word of a document/query based on the contexts, leading to a high-dimensional sparse vector v ∈ R^|V| (|V| is the vocabulary size and usually large, e.g., 30k). The text is then denoted by aggregating over all the lexicons in a sparse manner. Lastly, the relevance is calculated by lexicon-based matching metrics (e.g., BM25 [42]). In contrast, we unify the two paradigms into one sophisticated encoder for better consistency within PLMs, leading to complementary information and superior performance.
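A toy contrast of the two relevance functions just described: dense scoring is a dot product of low-dimensional vectors, while lexicon-based scoring sums products of term weights over shared terms, the sparse analogue of BM25-style lexical matching. The numbers and terms below are made up for illustration.

```python
# Dense vs. lexicon-based relevance scoring, side by side (toy values).
import numpy as np

# Dense view: u in R^e, e.g., e = 768.
q_dense = np.random.rand(768)
d_dense = np.random.rand(768)
dense_score = float(q_dense @ d_dense)

# Lexicon view: v in R^{|V|}, stored sparsely as {term: weight}.
q_lex = {"neural": 1.2, "retrieval": 0.9}
d_lex = {"retrieval": 1.1, "index": 0.4, "neural": 0.7}
lex_score = sum(w * d_lex[t] for t, w in q_lex.items() if t in d_lex)

print(dense_score, lex_score)   # only overlapping terms contribute to lex_score
```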

Hybrid Retriever. Some works propose to bridge the gap between dense and lexicon retrieval for a sweet spot between performance and efficiency. A direct method is to aggregate the scores of the two paradigms [25], but this results in standalone learning and sub-optimal quality. Similar to our work, CLEAR [16] uses a dense-vector model to complement the lexicon-based BM25 model, but without feature interactions or sophisticated learning. Sharing inspiration with our uni-retrieval scheme, COIL [15] equips simple lexicon-based retrieval with dense operations over word-level contextual embeddings. Unifier differs in that its lexicon representations are jointly learned for in-depth mutual benefits and its dense operations are performed at the sequence level for memory and computation efficiency. Lastly, SPARC [26] distills ranking orders from a lexicon model (BM25) into a dense model as a companion of the original dense vector, which is distinct from our motivation.
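The "surface" score-aggregation style of hybrid retrieval criticized above amounts to a simple interpolation of the two score lists. The sketch below is a generic version of that idea; the min-max normalization and the weight alpha are illustrative assumptions, not the method of any cited work.

```python
# Generic score-level hybrid fusion: interpolate normalized dense and lexicon scores.
import numpy as np

def hybrid_scores(dense_scores, lex_scores, alpha=0.5):
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    # alpha trades off the dense view against the lexicon view.
    return alpha * norm(dense_scores) + (1 - alpha) * norm(lex_scores)
```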

Bottleneck-based Learning. In terms of neural design, our encoder is similar to several recent representation learning works, e.g., SEED-Encoder [31], Condenser [13], coCondenser [14], and DiffCSE [6], but they focus on the bottleneck of sequence-level dense vectors. For example, SEED-Encoder, Condenser, and coCondenser enhance their dense capabilities by emphasizing the sequence-level bottleneck vector and weakening the word-level language modeling heads, while DiffCSE makes the learned sentence embedding sensitive to the difference between the original sentence and an edited sentence via a word-level discriminator. With distinct motivations and targets, we fully exploit both the dense-vector bottleneck and the word-level representation learning in a PLM for their mutual benefits. These rest not only on shared neural modules but also on structure-facilitated self-learning strategies (see the next section). Still, as discussed in our experiments, our model can benefit from these prior works via parameter initialization.

Instance-dependent Prompt. Our model also shares high-level inspiration with recent instance-dependent prompt learning methods [22, 50]. They introduce a trainable component to generate prompts based on each input example. Such generated prompts can provide complementary features to the original input for better prediction quality. Analogously, our sequence-level dense vector can be seen as a sort of 'soft prompt' for the sparse lexicon-based representation module, resulting in the superiority of our lexicon-based retrieval, which will be discussed in the experiments. In addition, the 'soft prompt' in our Unifier also serves as a crucial output in a unified retrieval system.

Reranker-taught Retriever. Distilling scores from a reranker into a retriever is proven promising [10, 19, 20]. In light of this, recent works propose to jointly optimize a retriever and a reranker: RocketQAv2 [41] achieves their agreement with reranker-filtered hard negatives, while AR2 [56] learns them in an adversarial fashion where the retriever is regarded as a generator and the reranker as a discriminator. In contrast to reranker-retriever co-training, we resort to in-depth sharing from the bottom (i.e., features) to the top (i.e., self-learning) merely within a retriever, with no need for the extra overhead of reranker training. Meanwhile, our unified structure also uniquely enables the model to learn from more diverse hard negatives mined by its dual representing modules.
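For context, reranker-to-retriever score distillation over a candidate list is commonly implemented as a KL divergence between the two softened score distributions. The sketch below shows that generic formulation; the temperature and the candidate list are illustrative, and this is not the specific objective of any single cited method.

```python
# List-wise score distillation: push the retriever's ranking distribution over
# candidates toward the reranker's (teacher) distribution via KL divergence.
import torch
import torch.nn.functional as F

def listwise_distill_loss(retriever_scores, reranker_scores, tau=1.0):
    student = F.log_softmax(retriever_scores / tau, dim=-1)   # retriever (to be trained)
    teacher = F.softmax(reranker_scores / tau, dim=-1)        # reranker (frozen teacher)
    return F.kl_div(student, teacher, reduction="batchmean")
```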

3 METHODOLOGY

...

References

  1. (Beltagy et al., 2020) ⇒ Iz Beltagy, Matthew E. Peters, and Arman Cohan. (2020). “Longformer: The Long-Document Transformer.” In: arXiv preprint arXiv:2004.05150. \[arXiv]2004.05150
  2. (Cai et al., 2021) ⇒ Yinqiong Cai, Yixing Fan, Jiafeng Guo, Fei Sun, Ruqing Zhang, and Xueqi Cheng. (2021). “Semantic Models for the First-stage Retrieval: A Comprehensive Review.” In: arXiv preprint arXiv:2103.04831. \[arXiv]2103.04831
  3. (Chen et al., 2017) ⇒ Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. (2017). “Reading Wikipedia to Answer Open-Domain Questions.” In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. \[1]
  4. (Chen et al., 2021b) ⇒ Jiaao Chen, Dinghan Shen, Weizhu Chen, and Diyi Yang. (2021). “HiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalizability.” In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing. \[2]
  5. (Chen et al., 2021a) ⇒ Xilun Chen, Kushal Lakhotia, Barlas Oguz, Anchit Gupta, Patrick S. H. Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. (2021). “Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?” In: arXiv preprint arXiv:2110.06918. \[arXiv]2110.06918
  6. (Chuang et al., 2022) ⇒ Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljacic, Shang-Wen Li, Wen-tau Yih, Yoon Kim, and James R. Glass. (2022). “DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings.” In: arXiv preprint arXiv:2204.10298. \[arXiv]2204.10298
  7. (Craswell et al., 2020) ⇒ Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. (2020). “Overview of the TREC 2019 deep learning track.” In: arXiv preprint arXiv:2003.07820. \[arXiv]2003.07820
  8. (Dai & Callan, 2019) ⇒ Zhuyun Dai and Jamie Callan. (2019). “Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval.” In: arXiv preprint arXiv:1910.10687. \[arXiv]1910.10687
  9. (Devlin et al., 2019) ⇒ Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. \[3]
  10. (Formal et al., 2021a) ⇒ Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. (2021). “SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.” In: arXiv preprint arXiv:2109.10086. \[arXiv]2109.10086
  11. (Formal et al., 2022) ⇒ Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. (2022). “From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective.” In: arXiv preprint arXiv:2205.04733. \[arXiv]2205.04733
  12. (Formal et al., 2021b) ⇒ Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. (2021). “SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.” In: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. \[4]
  13. (Gao & Callan, 2021a) ⇒ Luyu Gao and Jamie Callan. (2021). “Condenser: a Pre-training Architecture for Dense Retrieval.” In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. \[5]
  14. (Gao & Callan, 2021b) ⇒ Luyu Gao and Jamie Callan. (2021). “Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval.” In: arXiv preprint arXiv:2108.05540. \[arXiv]2108.05540
  15. (Gao et al., 2021a) ⇒ Luyu Gao, Zhuyun Dai, and Jamie Callan. (2021). “COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List.” In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. \[6]
  16. (Gao et al., 2021b) ⇒ Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan. (2021). “Complement Lexical Retrieval Model with Semantic Residual Embeddings.” In: Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021. \[7]
  17. (Gao et al., 2021c) ⇒ Tianyu Gao, Xingcheng Yao, and Danqi Chen. (2021). “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. \[8]
  18. (Han et al., 2018) ⇒ Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. (2018). “Co-teaching: Robust training of deep neural networks with extremely noisy labels.” In: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018. \[9]
  19. (Hofstätter et al., 2020) ⇒ Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. (2020). “Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation.” In: arXiv preprint arXiv:2010.02666. \[arXiv]2010.02666
  20. (Hofstätter et al., 2021) ⇒ Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. (2021). “Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling.” In: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. \[10]
  21. (Gautier et al., 2021) ⇒ Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. (2021). “Towards Unsupervised Dense Information Retrieval with Contrastive Learning.” In: arXiv preprint arXiv:2112.09118. \[arXiv]2112.09118
  22. (Jin et al., 2022) ⇒ Feihu Jin, Jinliang Lu, Jiajun Zhang, and Chengqing Zong. (2022). “Instance-aware Prompt Learning for Language Understanding and Generation.” In: arXiv preprint arXiv:2201.07126. \[arXiv]2201.07126
  23. (Karpukhin et al., 2020) ⇒ Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. \[11]
  24. (Khattab & Zaharia, 2020) ⇒ Omar Khattab and Matei Zaharia. (2020). “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.” In: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. \[12]
  25. (Kuzi et al., 2020) ⇒ Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, and Marc Najork. (2020). “Leveraging Semantic and Lexical Matching to Improve the Recall of Document Retrieval Systems: A Hybrid Approach.” In: arXiv preprint arXiv:2010.01195. \[arXiv]2010.01195
  26. (Lee et al., 2020) ⇒ Jinhyuk Lee, Min Joon Seo, Hannaneh Hajishirzi, and Jaewoo Kang. (2020). “Contextualized Sparse Representations for Real-Time Open-Domain Question Answering.” In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. \[13]
  27. (Lin et al., 2021) ⇒ Xiaobo Liang, Lijun Wu, Juntao Li, Yue Wang, Qi Meng, Tao Qin, Wei Chen, Min Zhang, and Tie-Yan Liu. (2021). “R-Drop: Regularized Dropout for Neural Networks.” In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021. \[14]
  28. (Lin & Ma, 2021) ⇒ Jimmy Lin and Xueguang Ma. (2021). “A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.” In: arXiv preprint arXiv:2106.14807. \[arXiv]2106.14807
  29. (Lin et al., 2021) ⇒ Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. (2021). “In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval.” In: Proceedings of the 6th Workshop on Representation Learning for NLP. \[15]
  30. (Liu et al., 2019) ⇒ Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. (2019). “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” In: arXiv preprint arXiv:1907.11692. \[arXiv]1907.11692
  31. (Lu et al., 2021) ⇒ Shuqi Lu, Chenyan Xiong, Di He, Guolin Ke, Waleed Malik, Zhicheng Dou, Paul Bennett, Tie-Yan Liu, and Arnold Overwijk. (2021). “Less is More: Pre-training a Strong Siamese Encoder Using a Weak Decoder.” In: arXiv preprint arXiv:2102.09206. \[arXiv]2102.09206
  32. (Menon et al., 2021) ⇒ Aditya Krishna Menon, Sadeep Jayasumana, Seungyeon Kim, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. (2021). “In defense of dual-encoders for neural ranking.”
  33. (Muennighoff, 2022) ⇒ Niklas Muennighoff. (2022). “SGPT: GPT Sentence Embeddings for Semantic Search.” In: arXiv preprint arXiv:2202.08904. \[arXiv]2202.08904
  34. (Nguyen et al., 2016) ⇒ Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. (2016). “MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.” In: Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016. http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf
  35. (Ni et al., 2021) ⇒ Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. (2021). “Large Dual Encoders Are Generalizable Retrievers.” In: arXiv preprint arXiv:2112.07899. \[arXiv]2112.07899
  36. (Nogueira et al., 2019) ⇒ Rodrigo Nogueira, Jimmy Lin, and AI Epistemic. (2019). “From doc2query to docTTTTTquery.”
  37. (Paria et al., 2020) ⇒ Biswajit Paria, Chih-Kuan Yeh, Ian En-Hsu Yen, Ning Xu, Pradeep Ravikumar, and Barnabás Póczos. (2020). “Minimizing FLOPs to Learn Efficient Sparse Representations.” In: 8th International Conference on Learning Representations. \[16]
  38. (Qu et al., 2021) ⇒ Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. (2021). “RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering.” In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. \[17]
  39. (Reimers & Gurevych, 2019) ⇒ Nils Reimers and Iryna Gurevych. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. \[18]
  40. (Ren et al., 2021a) ⇒ Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. (2021). “PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval.” In: Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021. \[19]
  41. (Ren et al., 2021b) ⇒ Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. (2021). “RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking.” In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. \[20]
  42. (Robertson & Zaragoza, 2009) ⇒ Stephen E. Robertson and Hugo Zaragoza. (2009). “The Probabilistic Relevance Framework: BM25 and Beyond.” In: Foundations and Trends in Information Retrieval. \[21]
  43. (Santhanam et al., 2021) ⇒ Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. (2021). “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” In: arXiv preprint arXiv:2112.01488. \[arXiv]2112.01488
  44. (Shen et al., 2019) ⇒ Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. (2019). “Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together.” In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. \[22]
  45. (Song et al., 2021) ⇒ Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, and Denny Zhou. (2021). “Fast WordPiece Tokenization.” In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. \[23]
  46. (Sun et al., 2019) ⇒ Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. (2019). “RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.” In: 7th International Conference on Learning Representations. \[24]
  47. (Sutskever et al., 2014) ⇒ Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. (2014). “Sequence to Sequence Learning with Neural Networks.” In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014. \[25]
  48. (Thakur et al., 2021) ⇒ Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. (2021). “BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models.” In: arXiv preprint arXiv:2104.08663. \[arXiv]2104.08663
  49. (Vaswani et al., 2017) ⇒ Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). “Attention is All you Need.” In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. \[26]
  50. (Wu et al., 2022) ⇒ Zhuofeng Wu, Sinong Wang, Jiatao Gu, Rui Hou, Yuxiao Dong, V. G. Vinod Vydiswaran, and Hao Ma. (2022). “IDPG: An Instance-Dependent Prompt Generation Method.” In: arXiv preprint arXiv:2204.04497. \[arXiv]2204.04497
  51. (Xiong et al., 2021) ⇒ Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. (2021). “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.” In: 9th International Conference on Learning Representations. \[27]
  52. (Yang, Ma et al., 2021) ⇒ Jheng-Hong Yang, Xueguang Ma, and Jimmy Lin. (2021). “Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking.” In: arXiv preprint arXiv:2112.09628. \[arXiv]2112.09628
  53. (Yang, Fang et al., 2017) ⇒ Peilin Yang, Hui Fang, and Jimmy Lin. (2017). “Anserini: Enabling the Use of Lucene for Information Retrieval Research.” In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. \[28]
  54. (Zhan et al., 2021) ⇒ Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. (2021). “Optimizing Dense Retrieval Model Training with Hard Negatives.” In: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. \[29]
  55. (Zhan et al., 2022) ⇒ Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. (2022). “Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval.” In: WSDM '22: The Fifteenth ACM International Conference on Web Search and Data Mining. \[30]
  56. (Zhang et al., 2022) ⇒ Hang Zhang, Yeyun Gong, Yelong Shen, Jiancheng Lv, Nan Duan, and Weizhu Chen. (2022). “Adversarial Retriever-Ranker for Dense Text Retrieval.” In: International Conference on Learning Representations. \[31]
  57. (Zhou, Li et al., 2022) ⇒ Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Lan Luo, Ke Zhan, Enrui Hu, Xinyu Zhang, Hao Jiang, Zhao Cao, Fan Yu, Xin Jiang, Qun Liu, and Lei Chen. (2022). “Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering.” In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. \[32]


Author(s): Daxin Jiang, Kai Zhang, Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, and Guodong Long
Title: Unifier: A Unified Retriever for Large-scale Retrieval
Year: 2023
DOI: 10.1145/3580305.3599927
  1. An entry can be a passage, a document, etc.; we use documents for demonstration.