2026 · Learning-to-rank retrieval benchmark
hybrid-search-bench
BM25, SPLADE, and dense retrieval fused and reranked by a LambdaMART learning-to-rank model, measured honestly on a public BEIR dataset.
Why I built this
Most retrieval portfolios stop at 'I called an embedding model.' I wanted the honest version of a search stack: three different retrievers measured on the same public qrels, a genuine learning-to-rank reranker on top, and every number anchored against published BEIR figures so the gains are credible rather than cherry-picked.
Architecture
Three retrieval legs, one reranker
- BM25 · lexical baseline via bm25s
- SPLADE · learned-sparse retrieval
- Dense · bi-encoder over FAISS
- Fusion + rerank · reciprocal-rank fusion, then a LightGBM LambdaMART reranker over per-leg scores, ranks, and agreement features
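The fusion and feature-building step can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the names `rrf_fuse` and `leg_features`, the constant `k=60` (a commonly used RRF default), and the `missing_rank` sentinel are all assumptions introduced here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal-rank fusion.

    rankings: dict mapping leg name -> list of doc ids, best first.
    Returns (doc_id, fused_score) pairs, best first.
    k=60 is an assumed default, not a value taken from the project.
    """
    fused = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending, then doc id for a stable order on ties.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


def leg_features(rankings, scores, doc_id, legs, missing_rank=1000):
    """Build one feature row for a (query, doc) pair.

    For each leg: raw score and rank, using 0.0 and missing_rank
    (hypothetical sentinels) when the leg did not retrieve the doc.
    The final feature is the agreement count: how many legs retrieved it.
    """
    row = []
    agreement = 0
    for leg in legs:
        ranked_ids = rankings.get(leg, [])
        if doc_id in ranked_ids:
            row.append(float(scores[leg][doc_id]))
            row.append(ranked_ids.index(doc_id) + 1)
            agreement += 1
        else:
            row.append(0.0)
            row.append(missing_rank)
    row.append(agreement)
    return row


# Toy example with made-up doc ids and scores.
example_rankings = {
    "bm25": ["d1", "d2", "d3"],
    "splade": ["d2", "d1", "d4"],
    "dense": ["d2", "d3", "d1"],
}
example_fused = rrf_fuse(example_rankings)
```

Rows from `leg_features` for the fused candidates, grouped by query, are the kind of input a LambdaMART-style ranker takes, for example LightGBM's `LGBMRanker` with `objective="lambdarank"`. The exact feature set and training setup used in the benchmark are not shown here.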
Key highlights
Proof points
- 01 · Three legs on the same BEIR qrels: BM25 (bm25s), SPLADE learned-sparse (sentence-transformers), and a dense bi-encoder over FAISS.
- 02 · Reciprocal-rank-fusion baseline, then a LightGBM LambdaMART reranker over the fused candidates (per-leg scores, ranks, and agreement features).
- 03 · BEIR SciFact: LambdaMART reaches nDCG@10 0.778 versus 0.728 for RRF fusion, a 6.9 percent lift, and MRR 0.761, a 9.1 percent lift over RRF.
- 04 · BM25 at nDCG@10 0.686 matches the published BEIR figure, which anchors the pipeline as correct rather than flattering.
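For readers who want to check figures like these, the metrics can be sketched as follows. This is an assumption-laden illustration rather than the benchmark's evaluation code: it uses the linear-gain nDCG form (gain equal to the qrel grade, log2 discount), applies no rank cutoff to MRR because the page does not state one, and the function names are hypothetical.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_ids: doc ids in ranked order, best first.
    qrels: dict doc id -> relevance grade (0 or missing = not relevant).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked_ids, qrels):
    """1 / rank of the first relevant doc, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def relative_lift(new, base):
    """Relative improvement of new over base, as a fraction."""
    return (new - base) / base


# Reproduce the reported nDCG@10 lift from the two stated values.
ndcg_lift_pct = round(100 * relative_lift(0.778, 0.728), 1)
```

Averaging `ndcg_at_k` and `reciprocal_rank` over all test queries gives the corpus-level nDCG@10 and MRR. `ndcg_lift_pct` is the relative lift of 0.778 over 0.728, which is the 6.9 percent figure quoted above.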
Benchmark results
- nDCG@10 (LambdaMART) · 0.778 (vs 0.728 RRF, +6.9%)
- MRR (LambdaMART) · 0.761 (+9.1% over RRF)
- BM25 nDCG@10 · 0.686 (matches published BEIR)