2026 · Learning-to-rank retrieval benchmark

hybrid-search-bench

BM25, SPLADE, and dense retrieval fused and reranked by a LambdaMART learning-to-rank model, measured honestly on a public BEIR dataset.

Built 2026

Why I built this

Most retrieval portfolios stop at 'I called an embedding model.' I wanted the honest version of a search stack: three different retrievers measured on the same public qrels, a genuine learning-to-rank reranker on top, and every number anchored against published BEIR figures so the gains are credible rather than cherry-picked.

0.778 nDCG@10 (LambdaMART)

+6.9% Lift over RRF fusion

4 Retrieval methods compared

Architecture

Three retrieval legs, one reranker

BM25 · lexical baseline via bm25s
SPLADE · learned-sparse retrieval
Dense · bi-encoder over FAISS
Fusion + rerank · reciprocal-rank fusion, then a LightGBM LambdaMART reranker over per-leg scores, ranks, and agreement features
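
The fusion step can be sketched in a few lines. This is a minimal illustration of reciprocal-rank fusion, not the project's actual code (the stack lists ranx for fusion). The constant k = 60 is the commonly used default and is an assumption here, as are the toy document ids and the function name.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-id lists: score(d) = sum over legs of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by doc id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical top-3 lists for one query, one list per retrieval leg.
legs = {
    "bm25":   ["d1", "d2", "d3"],
    "splade": ["d2", "d1", "d4"],
    "dense":  ["d2", "d5", "d1"],
}
fused = reciprocal_rank_fusion(legs)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

RRF only looks at rank positions, so the three legs need no score normalization before fusing, which is one reason it makes a sturdy baseline for the reranker to beat.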

Tech stack

Technologies used

core

Python · bm25s (BM25) · SPLADE (learned sparse) · FAISS (dense)

infra

LightGBM (LambdaMART) · ranx (fusion + eval)

tools

BEIR / ir_datasets · Reciprocal Rank Fusion

Key highlights

Proof points

01
Three legs on the same BEIR qrels: BM25 (bm25s), SPLADE learned-sparse (sentence-transformers), and a dense bi-encoder over FAISS.
02
Reciprocal-rank-fusion baseline, then a LightGBM LambdaMART reranker over the fused candidates (per-leg scores, ranks, and agreement features).
03
BEIR SciFact: LambdaMART nDCG@10 0.778 versus 0.728 for RRF fusion, a 6.9 percent lift; MRR 0.761, a 9.1 percent lift.
04
BM25 at nDCG@10 0.686 matches the published BEIR figure, anchoring the pipeline as correct rather than flattering.
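
As a rough illustration of the reranker inputs named above (per-leg scores, ranks, and agreement features), the sketch below builds one feature row per candidate document. The feature layout, the sentinel rank for documents a leg did not retrieve, the top_k cutoff for "agreement", and all names are assumptions made for the example, not the project's actual feature set.

```python
def build_features(candidates, leg_runs, top_k=10):
    """Return {doc_id: feature row} for one query.

    leg_runs maps leg name -> {doc_id: score}. A document that a leg did
    not retrieve gets score 0.0 and a sentinel rank of len(run) + 1.
    """
    legs = sorted(leg_runs)  # fixed column order across queries
    ranks = {}
    for leg in legs:
        run = leg_runs[leg]
        ordered = sorted(run, key=lambda d: (-run[d], d))
        ranks[leg] = {d: i for i, d in enumerate(ordered, start=1)}

    features = {}
    for doc_id in candidates:
        row, hits = [], 0
        for leg in legs:
            run = leg_runs[leg]
            rank = ranks[leg].get(doc_id, len(run) + 1)
            row.append(run.get(doc_id, 0.0))  # per-leg score
            row.append(float(rank))           # per-leg rank
            if doc_id in run and rank <= top_k:
                hits += 1
        row.append(float(hits))               # agreement: legs ranking it in top_k
        features[doc_id] = row
    return features

# Hypothetical scores for one query.
leg_runs = {
    "bm25":   {"d1": 12.0, "d2": 9.5},
    "dense":  {"d2": 0.83, "d3": 0.80},
    "splade": {"d1": 21.0, "d2": 18.0, "d3": 4.0},
}
feats = build_features(["d1", "d2", "d3"], leg_runs)
print(feats["d2"])
```

Rows like these, grouped by query and labelled from the qrels, are the kind of input a LightGBM ranker with the lambdarank objective (for example `lightgbm.LGBMRanker(objective="lambdarank")` fitted with per-query `group` sizes) would consume.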

Benchmark results

nDCG@10 (LambdaMART) · 0.778 · vs 0.728 RRF, +6.9%
MRR (LambdaMART) · 0.761 · +9.1% over RRF
BM25 nDCG@10 · 0.686 · matches published BEIR
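
For readers who want to check how such figures are derived, here is a small sketch of per-query nDCG@10 and reciprocal rank, plus the relative-lift arithmetic behind the +6.9%. It uses linear gain (equivalent to the exponential form on binary labels such as SciFact's) and hypothetical toy data. The project itself lists ranx for evaluation, so this is an illustration rather than its code.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> relevance grade."""
    gains = [qrels.get(d, 0) for d in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked_ids, qrels):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for i, d in enumerate(ranked_ids, start=1):
        if qrels.get(d, 0) > 0:
            return 1.0 / i
    return 0.0

def relative_lift(new, baseline):
    """Percent improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline

# Toy query: the single relevant document is ranked second.
qrels = {"d2": 1}
ranking = ["d1", "d2", "d3"]
ndcg = ndcg_at_k(ranking, qrels)
rr = reciprocal_rank(ranking, qrels)

# Reported nDCG@10 figures: LambdaMART 0.778 vs RRF 0.728.
lift = round(relative_lift(0.778, 0.728), 1)
print(round(ndcg, 4), rr, lift)
```

The reported benchmark numbers are averages of these per-query values over the test queries, so MRR is simply the mean of `reciprocal_rank` across queries.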

Focus areas

Learning-to-rank · LambdaMART · BM25 · SPLADE · Dense retrieval · FAISS · BEIR · Information retrieval
