References
[1] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong
Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir
Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS
MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv
preprint arXiv:1611.09268 (2018).
[2] Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2023. An
approximate algorithm for maximum inner product search over streaming sparse
vectors. ACM Transactions on Information Systems 42, 2 (2023), 1–43.
[3] Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2024.
Bridging dense and sparse maximum inner product search. ACM Transactions on
Information Systems (2024).
[4] Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini.
2024. Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse
Representations. In Proceedings of the 47th International ACM SIGIR Conference
on Research and Development in Information Retrieval. 152–162.
SIGIR-AP ’24, December 9–12, 2024, Tokyo, Japan
[5] Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini.
2024. Pairing Clustered Inverted Indexes with kNN Graphs for Fast Approximate
Retrieval over Learned Sparse Representations. arXiv preprint arXiv:2408.04443
(2024).
[6] Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Jose,
Alexander Toshev, Yantao Zheng, Jonathon Shlens, Ruoming Pang, and Yinfei
Yang. 2023. STAIR: Learning Sparse Text and Image Representation in Grounded
Tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natu-
ral Language Processing. Association for Computational Linguistics, Singapore,
15079–15094.
[7] Xilun Chen, Kushal Lakhotia, Barlas Oguz, Anchit Gupta, Patrick Lewis, Stan
Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. 2022. Salient Phrase
Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?. In Findings
of the Association for Computational Linguistics: EMNLP 2022. Association for
Computational Linguistics, Abu Dhabi, United Arab Emirates, 250–262.
[8] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M.
Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning
Track. In TREC.
[9] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M.
Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv preprint
arXiv:2003.07820 (2020). https://arxiv.org/abs/2003.07820
[10] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M.
Voorhees. 2021. Overview of the TREC 2020 deep learning track. arXiv preprint
arXiv:2102.07662 (2021). https://arxiv.org/abs/2102.07662
[11] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Cam-
pos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2023. Overview of the
TREC 2023 Deep Learning Track. In TREC.
[12] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin.
2021. Overview of the TREC 2021 Deep Learning Track. In TREC.
[13] Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell.
1998. “Is this document relevant?... probably”: a survey of probabilistic models
in information retrieval. ACM Comput. Surv. 30, 4 (1998), 528–552.
[14] W. Bruce Croft. 1981. Document representation in probabilistic models of infor-
mation retrieval. Journal of the American Society for Information Science 32, 6
(1981), 451–457.
[15] W. Bruce Croft and David J. Harper. 1979. Probabilistic models of document
retrieval with relevance information. Journal of Documentation 35, 4 (1979),
285–295.
[16] Zhuyun Dai and Jamie Callan. 2020. Context-Aware Document Term Weighting
for Ad-Hoc Search. In Proceedings of The Web Conference 2020 (Taipei, Taiwan)
(WWW ’20). Association for Computing Machinery, 1897–1907.
[17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT:
Pre-training of Deep Bidirectional Transformers for Language Understanding.
arXiv preprint arXiv:1810.04805 (2018).
[18] Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. Opti-
mizing top-k document retrieval strategies for block-max indexes. In Proceedings
of the sixth ACM international conference on Web search and data mining. 113–122.
[19] Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-
max indexes. In Proceedings of the 34th international ACM SIGIR conference on
Research and development in Information Retrieval. 993–1002.
[20] Miles Efron, Peter Organisciak, and Katrina Fenlon. 2012. Improving retrieval of
short texts through document expansion (SIGIR ’12). Association for Computing
Machinery, 911–920.
[21] Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant.
2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval.
arXiv preprint arXiv:2109.10086 (2021).
[22] Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE:
Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings
of the 44th International ACM SIGIR Conference on Research and Development
in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for
Computing Machinery, 2288–2292.
[23] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. 1987. The vocabu-
lary problem in human-system communication. Commun. ACM 30, 11 (1987),
964–971.
[24] Zhichao Geng, Xinyuan Lu, Dagney Braun, Charlie Yang, and Fanit Kolchina. 2023.
Improving document retrieval with sparse semantic encoders.
https://opensearch.org/blog/improving-document-retrieval-with-sparse-semantic-encoders/
[25] Seraphina Goldfarb-Tarrant, Pedro Rodriguez, Jane Dwivedi-Yu, and Patrick
Lewis. 2024. MultiContrievers: Analysis of Dense Retrieval Representations.
arXiv preprint arXiv:2402.15925 (2024).
[26] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation
of IR techniques. ACM Trans. Inf. Syst. 20, 4 (2002), 422–446.
[27] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey
Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-
Domain Question Answering. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP). 6769–6781.
[28] Carlos Lassance and Stéphane Clinchant. 2022. An Efficiency Study for SPLADE
Models. In Proceedings of the 45th International ACM SIGIR Conference on Research
and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association
for Computing Machinery, 2220–2226.
[29] Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and
Zhicheng Dou. 2024. From matching to generation: A survey on generative
information retrieval. arXiv preprint arXiv:2404.14851 (2024).
[30] Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and
a Conceptual Framework for Information Retrieval Techniques. arXiv preprint
arXiv:2106.14807 (2021).
[31] Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers
for Text Ranking: BERT and Beyond. arXiv preprint arXiv:2010.06467 (2020).
https://arxiv.org/abs/2010.06467
[32] Jimmy Lin and Andrew Trotman. 2015. Anytime ranking for impact-ordered
indexes. In Proceedings of the 2015 International Conference on The Theory of
Information Retrieval. 301–304.
[33] Sheng-Chieh Lin and Jimmy Lin. 2023. A Dense Representation Framework for
Lexical and Semantic Matching. ACM Trans. Inf. Syst. (2023).
[34] Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma,
Qingwei Lin, and Daxin Jiang. 2023. LexLIP: Lexicon-Bottlenecked Language-
Image Pre-Training for Large-Scale Image-Text Sparse Retrieval. In Proceedings of
the IEEE/CVF International Conference on Computer Vision (ICCV). 11206–11217.
[35] Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli
Goharian, and Ophir Frieder. 2020. Expansion via Prediction of Importance with
Contextualization. In Proceedings of the 43rd International ACM SIGIR Conference
on Research and Development in Information Retrieval (Virtual Event, China)
(SIGIR ’20). Association for Computing Machinery, 1573–1576.
[36] Joel Mackenzie, Matthias Petri, and Luke Gallagher. 2022. IOQP: A simple Impact-
Ordered Query Processor written in Rust. In Proceedings of the Third International
Conference on Design of Experimental Search & Information REtrieval Systems, San
Jose, CA, USA, August 30-31, 2022 (CEUR Workshop Proceedings, Vol. 3480), Omar
Alonso, Ricardo Baeza-Yates, Tracy Holloway King, and Gianmaria Silvello (Eds.).
CEUR-WS.org, 22–34. https://ceur-ws.org/Vol-3480/paper-03.pdf
[37] Yury Malkov, Alexander Ponomarenko, Andrey Logvinov, and Vladimir Krylov.
2014. Approximate nearest neighbor algorithm based on navigable small world
graphs. Inf. Syst. 45 (2014), 61–68. https://doi.org/10.1016/J.IS.2013.10.006
[38] Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learn-
ing Passage Impacts for Inverted Indexes. In Proceedings of the 44th Interna-
tional ACM SIGIR Conference on Research and Development in Information Re-
trieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery,
1723–1727.
[39] Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano
Venturini. 2017. Faster BlockMax WAND with variable-sized blocks. In Proceed-
ings of the 40th international ACM SIGIR conference on research and development
in information retrieval. 625–634.
[40] Antonio Mallia and Elia Porciani. 2019. Faster BlockMax WAND with longer
skipping. In Advances in Information Retrieval: 41st European Conference on IR
Research, ECIR 2019, Cologne, Germany, April 14–18, 2019, Proceedings, Part I 41.
Springer, 771–778.
[41] Antonio Mallia, Torsten Suel, and Nicola Tonellotto. 2024. Faster learned sparse
retrieval with block-max pruning. In Proceedings of the 47th International ACM
SIGIR Conference on Research and Development in Information Retrieval. 2411–
2415.
[42] Suraj Nair, Eugene Yang, Dawn Lawrie, James Mayfield, and Douglas W. Oard.
2023. BLADE: Combining Vocabulary Pruning and Intermediate Pretraining
for Scaleable Neural CLIR. In Proceedings of the 46th International ACM SIGIR
Conference on Research and Development in Information Retrieval (Taipei, Taiwan)
(SIGIR ’23). Association for Computing Machinery, 1219–1229.
[43] Suraj Nair, Eugene Yang, Dawn J Lawrie, James Mayfield, and Douglas W. Oard.
2022. Learning a Sparse Representation Model for Neural CLIR. In Biennial
Conference on Design of Experimental Search & Information Retrieval Systems.
[44] Thong Nguyen, Mariya Hendriksen, Andrew Yates, and Maarten de Rijke. 2024.
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control. In
Advances in Information Retrieval: 46th European Conference on Information Re-
trieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part II (Glasgow,
United Kingdom). Springer-Verlag, 448–464.
[45] Thong Nguyen, Sean MacAvaney, and Andrew Yates. 2023. A Unified Framework
for Learned Sparse Retrieval. In Advances in Information Retrieval: 45th European
Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023,
Proceedings, Part III (Dublin, Ireland). Springer-Verlag, 101–116.
[46] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document
Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
[47] Ronak Pradeep, Kai Hui, Jai Gupta, Adam Lelkes, Honglei Zhuang, Jimmy Lin,
Donald Metzler, and Vinh Tran. 2023. How Does Generative Retrieval Scale to
Millions of Passages?. In Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali
(Eds.). Association for Computational Linguistics, Singapore, 1305–1321.
https://doi.org/10.18653/v1/2023.emnlp-main.83
[48] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu,
and Mike Gatford. 1995. Okapi at TREC-3. NIST Special Publication 500-225 (1995),
109.
[49] Gerard Salton and Chris Buckley. 1988. Term-Weighting Approaches in Automatic
Text Retrieval. Inf. Process. Manag. 24 (1988), 513–523.
[50] G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic
indexing. Commun. ACM 18, 11 (1975), 613–620.
[51] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT,
a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint
arXiv:1910.01108 (2019).
[52] Tefko Saracevic. 2016. The Notion of Relevance in Information Science. Morgan
& Claypool Publishers.
[53] Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang,
and Daxin Jiang. 2023. UnifieR: A Unified Retriever for Large-Scale Retrieval. In
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery,
4787–4799.
[54] Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin
Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investi-
gating Large Language Models as Re-Ranking Agents. In Proceedings of the 2023
Conference on Empirical Methods in Natural Language Processing, Houda Bouamor,
Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Sin-
gapore, 14918–14937. https://doi.org/10.18653/v1/2023.emnlp-main.923
[55] Yubao Tang, Ruqing Zhang, Jiafeng Guo, and Maarten de Rijke. 2023. Recent
Advances in Generative Information Retrieval. In Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval in the
Asia Pacific Region, SIGIR-AP 2023, Beijing, China, November 26-28, 2023, Qingyao
Ai, Yiqin Liu, Alistair Moffat, Xuanjing Huang, Tetsuya Sakai, and Justin Zobel
(Eds.). ACM, 294–297. https://doi.org/10.1145/3624918.3629547
[56] Yi Tay, Vinh Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta,
Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, et al. 2022. Transformer memory as a
differentiable search index. Advances in Neural Information Processing Systems
35 (2022), 21831–21843.
[57] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna
Gurevych. 2021. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of
Information Retrieval Models. arXiv preprint arXiv:2104.08663 (2021).
[58] Ellen M. Voorhees and Dawn M. Tice. 1999. The TREC-8 Question Answering
Track Report. In Proceedings of the Eighth Text REtrieval Conference (TREC-8).
[59] Ikuya Yamada, Akari Asai, and Hannaneh Hajishirzi. 2021. Efficient Passage
Retrieval with Hashing for Open-domain Question Answering. In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and
the 11th International Joint Conference on Natural Language Processing (Volume 2:
Short Papers), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.).
Association for Computational Linguistics, Online, 979–986.
[60] Yingrui Yang, Parker Carlson, Shanxiu He, Yifan Qiao, and Tao Yang. 2024.
Cluster-based Partial Dense Retrieval Fused with Sparse Text Retrieval. In Pro-
ceedings of the 47th International ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval (Washington DC, USA) (SIGIR ’24). Association for
Computing Machinery, 2327–2331.
[61] Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap
Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse
Representation for Inverted Indexing. In Proceedings of the 27th ACM International
Conference on Information and Knowledge Management (Torino, Italy) (CIKM ’18).
Association for Computing Machinery, 497–506.
[62] Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, and Lei Chen. 2024.
Retrieval-based Disentangled Representation Learning with Natural Language
Supervision. In The Twelfth International Conference on Learning Representations.
[63] Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuc-
con. 2024. PromptReps: Prompting Large Language Models to Generate Dense
and Sparse Representations for Zero-Shot Document Retrieval. arXiv preprint
arXiv:2404.18424 (2024).
[64] Shengyao Zhuang and Guido Zuccon. 2021. Fast Passage Re-ranking with Con-
textualized Exact Term Matching and Efficient Passage Expansion. arXiv preprint
arXiv:2108.08513 (2021).