Neural Lexical Search with Learned Sparse Retrieval

University of Amsterdam, Cohere, University of Glasgow, ISTI-CNR, Johns Hopkins University

Abstract

Learned Sparse Retrieval (LSR) techniques use neural machinery to represent queries and documents as learned bags of words. In contrast to other neural retrieval techniques, such as generative retrieval and dense retrieval, LSR has been shown to be a remarkably robust, transferable, and efficient family of methods for retrieving high-quality search results. This half-day tutorial aims to provide an extensive overview of LSR, ranging from its fundamentals to the latest emerging techniques. By the end of the tutorial, attendees will be familiar with the important design decisions of an LSR system, know how to apply LSR to text and other modalities, and understand the latest techniques for retrieving with learned sparse representations efficiently. Website: https://lsr-tutorial.github.io

Tutorial Topics and Slides

Part 1: Fundamentals

Section                      Duration
Introduction to LSR          20 min
Datasets and Evaluation      10 min
LSR Framework                30 min
Text LSR (Colab)             30 min

Part 2: Emerging Topics

Section                        Duration
Multilingual LSR               20 min
Multimodal LSR (Colab)         20 min
Indexing & Efficiency          30 min
Hybrid Dense-Sparse Retrieval  15 min
Conclusion                      5 min

References

Part 1 – Fundamentals

1. Introduction to LSR

  • Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant?… probably”: a survey of probabilistic models in information retrieval. ACM Comput. Surv. 30, 4 (1998), 528–552.
  • W. Bruce Croft. 1981. Document representation in probabilistic models of information retrieval. Journal of the American Society for Information Science 32, 6 (1981), 451–457.
  • W. Bruce Croft and David J. Harper. 1979. Probabilistic models of document retrieval with relevance information. Journal of Documentation 35, 4 (1979), 285–295.
  • Miles Efron, Peter Organisciak, and Katrina Fenlon. 2012. Improving retrieval of short texts through document expansion (SIGIR ’12). 911–920.
  • Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). Association for Computing Machinery, 2288–2292.
  • G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. 1987. The vocabulary problem in human-system communication. Commun. ACM 30, 11 (1987), 964–971.
  • Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. 2020. Expansion via Prediction of Importance with Contextualization. In SIGIR ’20.
  • Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
  • Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, and Mike Gatford. 1995. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3). NIST Special Publication 500-225, 109–126.
  • Gerard Salton and Chris Buckley. 1988. Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag. 24 (1988), 513–523.
  • G. Salton, A. Wong, and C. S. Yang. 1975. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
  • Tefko Saracevic. 2016. The Notion of Relevance in Information Science. Morgan & Claypool Publishers.
  • Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). Association for Computing Machinery, 497–506.
  • Shengyao Zhuang and Guido Zuccon. 2021. Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion. arXiv preprint arXiv:2108.08513 (2021).

2. Datasets and Evaluation

  • Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. 2018. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268 (2018).
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. In TREC.
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 Deep Learning Track. arXiv preprint arXiv:2003.07820 (2020).
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2021. Overview of the TREC 2020 Deep Learning Track. arXiv preprint arXiv:2102.07662 (2021).
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2023. Overview of the TREC 2023 Deep Learning Track. In TREC.
  • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin. 2021. Overview of the TREC 2021 Deep Learning Track. In TREC.
  • Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint arXiv:2109.10086 (2021).
  • Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021).

3. LSR Framework

  • Thong Nguyen, Sean MacAvaney, and Andrew Yates. 2023. A Unified Framework for Learned Sparse Retrieval. In ECIR ’23. Springer-Verlag, 101–116.

4. Text LSR

  • Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, and Jaap Kamps. 2018. From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM ’18). Association for Computing Machinery, 497–506.
  • Antonio Mallia, Omar Khattab, Torsten Suel, and Nicola Tonellotto. 2021. Learning Passage Impacts for Inverted Indexes. In SIGIR ’21.
  • Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. 2020. Expansion via Prediction of Importance with Contextualization. In SIGIR ’20.
  • Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). Association for Computing Machinery, 2288–2292.
  • Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv preprint arXiv:2106.14807 (2021).
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. arXiv preprint arXiv:2010.06467 (2020).
  • Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
  • Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval. arXiv preprint arXiv:2109.10086 (2021).
  • Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant. 2024. SPLADE-v3: New baselines for SPLADE. arXiv preprint arXiv:2403.06789 (2024).
  • Zhichao Geng, Dongyu Ru, and Yang Yang. 2024. Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers. arXiv preprint arXiv:2411.04403 (2024).
  • Franco Maria Nardini, Thong Nguyen, Cosimo Rulli, Rossano Venturini, and Andrew Yates. 2025. Effective Inference-Free Retrieval for Learned Sparse Representations. In SIGIR ’25.
  • Emmanouil Georgios Lionis and Jia-Huei Ju. 2025. On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents. In ECIR ’25. 64–78.
  • Thong Nguyen, Sean MacAvaney, and Andrew Yates. 2023. Adapting Learned Sparse Retrieval for Long Documents (SIGIR ’23). 1781–1785.
  • Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Jaydeep Sen, et al. 2024. Mistral-SPLADE: LLMs for better learned sparse retrieval. arXiv preprint arXiv:2408.11119 (2024).
  • Yibin Lei, Tao Shen, Yu Cao, and Andrew Yates. 2025. Enhancing Lexicon-Based Text Embeddings with Large Language Models. arXiv preprint arXiv:2501.09749 (2025).
  • Jingfen Qiao, Thong Nguyen, Evangelos Kanoulas, and Andrew Yates. 2025. Leveraging Decoder Architectures for Learned Sparse Retrieval. arXiv preprint arXiv:2504.18151 (2025).
  • Zhichao Xu, Aosong Feng, Yijun Tian, Haibo Ding, and Lin Lee Cheong. 2025. CSPLADE: Learned Sparse Retrieval with Causal Language Models. arXiv preprint arXiv:2504.10816 (2025).
  • Hansi Zeng, Julian Killingback, and Hamed Zamani. 2025. Scaling Sparse and Dense Retrieval in Decoder-Only LLMs. arXiv preprint arXiv:2502.15526 (2025).
  • Jeffrey M. Dudek, Weize Kong, Cheng Li, Mingyang Zhang, and Michael Bendersky. 2023. Learning Sparse Lexical Representations Over Specified Vocabularies for Retrieval. In CIKM ’23. 3865–3869.
  • Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, and Andrew Yates. 2024. DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities. arXiv preprint arXiv:2410.07722 (2024).
  • Puxuan Yu, Antonio Mallia, and Matthias Petri. 2024. Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies. In ECIR ’24. 181–194.

5. Training LSR Models

  • Zhichao Geng, Dongyu Ru, and Yang Yang. 2024. Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers. arXiv preprint arXiv:2411.04403 (2024).
  • Carlos Lassance and Stéphane Clinchant. 2022. An Efficiency Study for SPLADE Models. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22). Association for Computing Machinery, 2220–2226.
  • Carlos Lassance, Hervé Dejean, and Stéphane Clinchant. 2023. An Experimental Study on Pretraining Transformers from Scratch for IR. In 45th European Conference on Information Retrieval (ECIR ’23). Springer-Verlag, 504–520.
  • Suraj Nair, Eugene Yang, Dawn Lawrie, James Mayfield, and Douglas W. Oard. 2023. BLADE: Combining Vocabulary Pruning and Intermediate Pretraining for Scaleable Neural CLIR. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23). Association for Computing Machinery, 1219–1229.
  • Suraj Nair, Eugene Yang, Dawn J. Lawrie, James Mayfield, and Douglas W. Oard. 2022. Learning a Sparse Representation Model for Neural CLIR. In Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2022). 53–64.
  • Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preotiuc-Pietro, Sean MacAvaney, and Pengxiang Cheng. 2025. An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-doc. In SIGIR ’25.

Part 2 – Emerging Topics

1. Multilingual LSR

  • Suraj Nair, Eugene Yang, Dawn Lawrie, James Mayfield, and Douglas W. Oard. 2023. BLADE: Combining Vocabulary Pruning and Intermediate Pretraining for Scaleable Neural CLIR. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23). Association for Computing Machinery, 1219–1229.
  • Suraj Nair, Eugene Yang, Dawn J. Lawrie, James Mayfield, and Douglas W. Oard. 2022. Learning a Sparse Representation Model for Neural CLIR. In Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES 2022). 53–64.

2. Multimodal LSR

  • Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Jose, Alexander Toshev, Yantao Zheng, Jonathon Shlens, Ruoming Pang, and Yinfei Yang. 2023. STAIR: Learning Sparse Text and Image Representation in Grounded Tokens. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, 15079–15094.
  • Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 11206–11217.
  • Thong Nguyen, Mariya Hendriksen, Andrew Yates, and Maarten de Rijke. 2024. Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control. In 46th European Conference on Information Retrieval (ECIR ’24). Springer-Verlag, 448–464.
  • Jiawei Zhou, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, and Lei Chen. 2024. Retrieval-based Disentangled Representation Learning with Natural Language Supervision. In The Twelfth International Conference on Learning Representations.

3. Indexing & Efficient LSR

  • Sebastian Bruch, Franco Maria Nardini, Amir Ingber, and Edo Liberty. 2023. An approximate algorithm for maximum inner product search over streaming sparse vectors. ACM Transactions on Information Systems 42, 2 (2023), 1–43.
  • Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. 2024. Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 152–162.
  • Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, and Rossano Venturini. 2024. Pairing Clustered Inverted Indexes with kNN Graphs for Fast Approximate Retrieval over Learned Sparse Representations. arXiv preprint arXiv:2408.04443 (2024).
  • Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, and Leonardo Venuta. 2025. Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets. arXiv preprint arXiv:2501.11628 (2025).
  • Parker Carlson, Wentai Xie, Shanxiu He, and Tao Yang. 2025. Dynamic Superblock Pruning for Fast Learned Sparse Retrieval. arXiv preprint arXiv:2504.17045 (2025).
  • Constantinos Dimopoulos, Sergey Nepomnyachiy, and Torsten Suel. 2013. Optimizing top-k document retrieval strategies for block-max indexes. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. 113–122.
  • Shuai Ding and Torsten Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 993–1002.
  • Carlos Lassance, Simon Lupart, Hervé Déjean, Stéphane Clinchant, and Nicola Tonellotto. 2023. A Static Pruning Study on Sparse Neural Retrievers. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23). 1771–1775.
  • Joel Mackenzie, Andrew Trotman, and Jimmy Lin. 2023. Efficient document-at-a-time and score-at-a-time query evaluation for learned sparse representations. ACM Transactions on Information Systems 41, 4 (2023), 1–28.
  • Joel Mackenzie, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, and Rossano Venturini. 2017. Faster BlockMax WAND with variable-sized blocks. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 625–634.
  • Antonio Mallia and Elia Porciani. 2019. Faster BlockMax WAND with longer skipping. In 41st European Conference on IR Research (ECIR ’19). Springer, 771–778.
  • Antonio Mallia, Torsten Suel, and Nicola Tonellotto. 2024. Faster learned sparse retrieval with block-max pruning. In SIGIR ’24.
  • Yifan Qiao, Parker Carlson, Shanxiu He, Yingrui Yang, and Tao Yang. 2024. Threshold-driven Pruning with Segmented Maximum Term Weights for Approximate Cluster-based Sparse Retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 19742–19757.
  • Yifan Qiao, Yingrui Yang, Shanxiu He, and Tao Yang. 2023. Representation sparsification with hybrid thresholding for fast SPLADE-based document retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2329–2333.
  • Yifan Qiao, Yingrui Yang, Haixin Lin, and Tao Yang. 2023. Optimizing guided traversal for fast learned sparse retrieval. In Proceedings of the ACM Web Conference 2023. 3375–3385.

4. Hybrid Dense-Sparse Retrieval

  • Xilun Chen, Kushal Lakhotia, Barlas Oguz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. 2022. Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? In Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 250–262.
  • Jimmy Lin and Xueguang Ma. 2021. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv preprint arXiv:2106.14807 (2021).
  • Sheng-Chieh Lin and Jimmy Lin. 2023. A Dense Representation Framework for Lexical and Semantic Matching. ACM Transactions on Information Systems (2023).
  • Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Guodong Long, Kai Zhang, and Daxin Jiang. 2023. UnifieR: A Unified Retriever for Large-Scale Retrieval. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23). 4787–4799.
  • Yingrui Yang, Parker Carlson, Shanxiu He, Yifan Qiao, and Tao Yang. 2024. Cluster-based Partial Dense Retrieval Fused with Sparse Text Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24). 2327–2331.
  • Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, and Guido Zuccon. 2024. PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval. arXiv preprint arXiv:2404.18424 (2024).