Hybrid search combines dense vector retrieval with sparse keyword retrieval to get the best of both approaches. In theory, it should outperform either method alone. In practice, it still misses obvious relevant passages with surprising regularity. The problem is usually in the fusion step, not in either retrieval method individually.
Analysis Briefing
- Topic: Hybrid search failure modes, score fusion, and retrieval pipeline diagnosis
- Analyst: Mike D (@MrComputerScience)
- Context: A structured investigation kicked off by Claude Sonnet 4.6
- Source: Pithy Cyborg | AI News Made Simple
- Key Question: If both dense and sparse retrieval are working, why does combining them sometimes perform worse than either alone?
How Hybrid Search Is Supposed to Work
Dense retrieval uses vector embeddings to find semantically similar documents. It handles synonyms, paraphrases, and conceptually related content well. It fails when exact terminology matters, such as product codes, version numbers, or specific names.
Sparse retrieval (BM25) uses term frequency and inverse document frequency to find documents containing the query terms. It handles exact keyword matching well. It fails when the query uses different vocabulary than the documents.
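To make the term-frequency/IDF interaction concrete, here is a toy BM25 (Okapi) scorer over pre-tokenized documents. It is a simplified sketch for illustration, not a production implementation; real systems handle tokenization, stemming, and index structures that are omitted here:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query_terms with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # BM25 contributes nothing for absent terms
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Note the failure mode the article describes falls directly out of the `continue` line: a passage containing none of the query's surface terms scores exactly zero, no matter how relevant it is.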
Hybrid search runs both, then fuses the ranked lists into a single result. The most common fusion method is Reciprocal Rank Fusion (RRF), which assigns a score to each document based on its rank in each list and sums them. Documents that appear high in both lists score highest.
This is robust when both retrievers agree. It breaks when they disagree on the most relevant document.
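The fusion step itself is only a few lines. A minimal sketch of RRF, using the common smoothing constant k = 60:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # each list contributes 1/(k + rank) for this document
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked first by both retrievers gets 2/(k + 1), the maximum possible, which is why agreement dominates the fused list.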
Why Fusion Fails on Obvious Queries
Consider a query for “maximum file upload size.” The best passage in your corpus says “Files larger than 50MB are not supported.” Dense retrieval ranks this passage highly because the embedding captures the semantic relationship between “maximum upload size” and “files larger than supported limit.” BM25 ranks it poorly because the query terms “maximum,” “file,” “upload,” and “size” do not all appear in that passage.
RRF sums the reciprocal of a high dense rank with the reciprocal of a low BM25 rank. The result is a middling combined score. Meanwhile, a less relevant passage that happens to contain all four query terms scores highly in BM25, adequately in dense, and lands at the top of the fused result. The best passage is not retrieved. The query was obvious. The failure was invisible.
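The arithmetic makes this concrete. Suppose the best passage sits at dense rank 1 but BM25 rank 50, while the keyword-stuffed passage sits at dense rank 5 and BM25 rank 1 (ranks chosen for illustration):

```python
def rrf_score(*ranks, k=60):
    """Combined RRF score for one document given its rank in each list."""
    return sum(1.0 / (k + r) for r in ranks)

best = rrf_score(1, 50)      # dense loves it, BM25 buries it: ~0.0255
stuffed = rrf_score(5, 1)    # contains all four query terms: ~0.0318
```

The keyword-stuffed passage wins the fused ranking even though dense retrieval correctly put the best passage first.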
Semantic search returning confidently wrong results covers the dense-only version of this failure. Hybrid search adds BM25 to the stack but does not eliminate the failure mode when BM25 and dense disagree on the correct answer.
The Fusion Weighting Problem and Its Fix
RRF treats both retrievers as equally informative. They are not. For a given query and corpus, one retriever will be more reliable than the other. The right approach is to tune the weighting between dense and sparse scores based on your specific query distribution.
For corpora with precise technical terminology, BM25 should be weighted higher because exact term matching matters more than semantic similarity. For corpora with varied language expressing similar concepts, dense retrieval should be weighted higher.
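One way to implement this is to put a convex weight on each retriever's reciprocal-rank contribution. The 0.7/0.3 defaults below are illustrative placeholders; the article's point is precisely that these should be tuned on your own query distribution:

```python
def weighted_rrf(dense_ranking, sparse_ranking,
                 w_dense=0.7, w_sparse=0.3, k=60):
    """RRF variant where each retriever's contribution is weighted."""
    scores = {}
    for weight, ranking in ((w_dense, dense_ranking),
                            (w_sparse, sparse_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

With equal weights this reduces to plain RRF; shifting weight toward the retriever that wins your offline evaluation is the tuning knob.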
A reranker trained on your domain adds a third stage that resolves disagreements between retrievers. Cross-encoder rerankers evaluate query-passage relevance directly rather than comparing embedding distances, and they substantially outperform fusion-only approaches on queries where dense and sparse retrievers disagree.
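The reranking stage slots in after fusion and is structurally simple. In the sketch below, `score_fn` is a stand-in for a real cross-encoder (for example, a `CrossEncoder` model from the sentence-transformers library, which scores query-passage pairs directly); the word-overlap scorer is purely illustrative and has none of a trained model's semantic judgment:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-order fused candidates by direct query-passage relevance."""
    scored = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return scored[:top_k]

def toy_overlap_score(query, passage):
    # Placeholder only: a production system would call a trained
    # cross-encoder here instead of counting shared words.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)
```

The key property is that the reranker sees the full query and full passage together, so it can overrule a fused ranking that buried the right answer.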
What This Means For You
- Measure retrieval performance by retriever separately before blaming fusion. Run dense-only and BM25-only evaluations on your query set to understand where each retriever succeeds and fails before tuning the fusion step.
- Add a reranker after fusion for any production RAG application, because rerankers resolve the disagreements between retrievers that fusion cannot handle by construction.
- Weight your fusion toward the retriever that performs better on your specific corpus and query distribution, rather than using equal weights as a default, because equal weighting is a starting point, not an optimal configuration.
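The first recommendation, measuring each retriever separately, can be as simple as recall@k over a labeled query set. The harness below is generic; `run_fn` would be your dense-only or BM25-only retriever, and the query/label data in any real run comes from your own corpus:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant docs that appear in the top k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def evaluate(run_fn, queries, labels, k=10):
    """Average recall@k of one retriever across a labeled query set."""
    # run_fn maps a query string to a ranked list of doc ids
    return sum(recall_at_k(run_fn(q), labels[q], k) for q in queries) / len(queries)
```

Running this once for dense-only and once for BM25-only tells you which retriever to favor in the fusion weights, and whether fusion is actually the stage losing your relevant passages.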
Enjoyed this? Subscribe for more clear thinking on AI:
- Pithy Cyborg | AI News Made Simple → AI news made simple without hype.
