How does an SEO listicle outrank my actual customers in my own vector search — permanently, no matter which database I use or how I tune k? In a word: vocabulary. Cosine similarity pays out on word overlap, and a page titled "12 Best Web Scraping Tools for Price Monitoring" uses every word my ideal customer uses.
One disclosure before anything else: the headline numbers are slots until I paste the readout from my eval harness — [BASELINE LEAKAGE %] → [POST-COMPILE LEAKAGE %] across [N EVAL QUERIES]. The weights and thresholds in this post are priors I'd defend in a design review, not benchmark results. The harness exists so your numbers replace mine. A retrieval post that hides that distinction is selling you something.
Why hard-negative leakage is the metric #
Reverse-engineer what each layer of a retrieval system actually optimizes, and the debugging map draws itself.
The embedding model optimizes exactly one thing: putting similar text near similar text. It does not reward intent, trust, or usefulness — it has no access to them. It faithfully preserves whatever meaning you hand it. So the system's reward function is set upstream, at index time, by the text you choose to embed. That's the lever. Everything downstream — hybrid retrieval, reranking, diversity — is working with whatever meaning you compiled in. Garbage in, geometry out.
The way I think about it: the LLM pass I'm going to describe is a compiler. Raw scraped evidence is source code. The semantic card it produces is object code, written deliberately for the target machine — the embedding model. The rest of this post is compiler engineering: the spec, the two compiler bugs that will bite you, and the test suite that tells you whether any of it worked.
And the economics are why this is newly viable. Index-time compute is cheap, parallel, and amortized — paid once per document, offline, no latency budget. Query-time compute is expensive, serial, and repeated, forever. Rough math for a 1M-document corpus at ~2K input / ~400 output tokens per document through a small model puts the whole compile pass in the low hundreds of dollars, one time — [YOUR ACTUAL ENRICHMENT BILL] goes here when you run it. Compare that to what teams happily spend on a reranker that runs on every single query. The cheapest place to add intelligence to a retrieval system is the index. Almost everyone adds it to the query instead, because the query side is where the demo lives.
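The rough math above is easy to rerun with your own numbers. A minimal sketch of the arithmetic follows; the corpus shape comes from the paragraph above, while the per-million-token prices are placeholder assumptions, not any vendor's actual rates.

```python
def compile_cost_usd(n_docs, in_tokens_per_doc, out_tokens_per_doc,
                     usd_per_m_input, usd_per_m_output):
    """One-time cost of the compile pass, in dollars."""
    input_cost = n_docs * in_tokens_per_doc / 1_000_000 * usd_per_m_input
    output_cost = n_docs * out_tokens_per_doc / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


# Corpus shape from the text: 1M documents, ~2K input / ~400 output tokens each.
# The two prices are HYPOTHETICAL placeholders for a small model; swap in your own.
estimate = compile_cost_usd(
    n_docs=1_000_000,
    in_tokens_per_doc=2_000,
    out_tokens_per_doc=400,
    usd_per_m_input=0.10,
    usd_per_m_output=0.40,
)
print(f"estimated one-time compile cost: ${estimate:,.0f}")
```

Output tokens usually cost more per token than input tokens, so card length is the first knob to check if the estimate comes back higher than expected.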
scrape → COMPILE (LLM: evidence → semantic objects) → embed → retrieve → rerank → explain
My running example #
My system is a lead recommender built on scraped company pages. The goal is not "find similar pages." The goal is: surface companies that plausibly need product-page scraping, price monitoring, or data-enrichment workflows — ranked by strength of evidence.
The corpus is the usual scraped sludge: homepages, pricing pages, docs, job posts, blog content, directory listings. The villain is the affiliate listicle. In raw-text cosine space, "12 Best Web Scraping Tools for Price Monitoring" sits closer to my ideal customer than many actual customers do. The similarity is real — the text genuinely is similar. The meaning is wrong. No vector database setting fixes that.
Starting point / Results #
Don't take the method on faith. Here's the shape of the before/after, with the slots my harness fills:
Before (naive embed-everything, recipe v0, [DATE]):
- Corpus: [N DOCUMENTS]
- precision@5: [X]
- recall@50: [X]
- hard-negative leakage@10: [X%] — fraction of top-10 results matching the avoid-set
- Worst single query: [QUERY], with [N] listicles in the top 10
After (compile pass, recipe v1, [DATE]):
- precision@5: [X]
- recall@50: [X] ← guardrail; if this fell, the rest doesn't count
- hard-negative leakage@10: [X%]
- One-time compile cost: [INVOICE $]
One scoping note before the steps. Don't compile the whole corpus on day one. Eyeball your current top-10s across ten real queries and tally which junk class shows up most — for me it was tool listicles, and one class did most of the damage. Build the compiler against that class first, prove the leakage drop on that segment, then scale the recipe. Solve the structure narrow; scale the structure, not the effort.
Step 1 — Split every page into evidence, meaning, and trust #
- Store verbatim quotes and timestamps; never paraphrase them, never embed them.
- Embed only the LLM-written card — nothing else goes into the vector index.
- Keep source type, confidence, and crawl date in metadata, out of the embeddings.
A scraped page is not one object. It's three, and conflating them is the original sin of naive pipelines. Evidence is what the page literally said. Meaning is what my system believes the entity is — written by the compiler, designed for the embedding model, the only thing that gets embedded. Trust is how much I believe it and how fresh it is — it lives in metadata and decides what survives reranking.
Here's the shape, with an illustrative entity standing in until I paste a real one — [PASTE: one actual card from your index, evidence quotes intact, entity redacted]:
{
  "entity_id": "co_8841",
  "evidence": {
    "source_url": "https://example.com/careers/data-ops-analyst",
    "quotes": [
      "You'll own our competitor price tracking pipeline across 40k SKUs",
      "Experience with web scraping frameworks a plus"
    ],
    "captured_at": "2026-06-04"
  },
  "card": {
    "need": "Operates an internal competitor price-tracking pipeline at 40k SKU scale.",
    "buyer": "Data operations team; actively hiring for this function.",
    "trigger": "Job post implies the pipeline exists, is painful, and is under-staffed.",
    "exclusions": "Not an agency, not a tooling vendor, not content about scraping."
  },
  "trust": {
    "source_type": "job_post",
    "evidence_count": 2,
    "confidence": 0.84,
    "last_seen": "2026-06-04"
  }
}
Notice what the job post gave me that the homepage never would: this company doesn't sell scraping, it suffers scraping. That vendor-vs-sufferer distinction is invisible in raw-text similarity and is the entire business value of my recommender. The card makes it explicit.
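One way to keep the three layers from bleeding into each other is to make the split structural, so the only text that can reach the embedding model is the card. The sketch below mirrors the JSON shape above; the class names and the `embedding_text` / `index_metadata` helpers are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass, asdict


@dataclass
class Evidence:
    source_url: str
    quotes: list[str]      # verbatim quotes: stored, never embedded
    captured_at: str


@dataclass
class Trust:
    source_type: str
    evidence_count: int
    confidence: float
    last_seen: str


@dataclass
class EntityRecord:
    entity_id: str
    evidence: Evidence
    card: dict[str, str]   # LLM-written fields: need, buyer, trigger, exclusions
    trust: Trust

    def embedding_text(self) -> str:
        """Serialize only the card; evidence and trust never enter the vector."""
        return "\n".join(f"{name}: {text}" for name, text in self.card.items())

    def index_metadata(self) -> dict:
        """Trust fields travel as filterable metadata next to the vector."""
        return {"entity_id": self.entity_id, **asdict(self.trust)}


record = EntityRecord(
    entity_id="co_8841",
    evidence=Evidence(
        source_url="https://example.com/careers/data-ops-analyst",
        quotes=["You'll own our competitor price tracking pipeline across 40k SKUs"],
        captured_at="2026-06-04",
    ),
    card={
        "need": "Operates an internal competitor price-tracking pipeline at 40k SKU scale.",
        "buyer": "Data operations team; actively hiring for this function.",
    },
    trust=Trust("job_post", 2, 0.84, "2026-06-04"),
)
print(record.embedding_text())
print(record.index_metadata())
```

With this shape, "scraper problem, compiler problem, or scoring problem?" maps to inspecting `evidence`, `card`, or `trust` respectively.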
Decision rule: when a recommendation looks wrong, you must be able to answer "scraper problem, compiler problem, or scoring problem?" with one query. If you can't, stop adding features and split the layers first. In a blob-embedding system those three failure classes are fused into one inscrutable cosine score.
Step 2 — Make the compiler write contrastively, or it will sabotage you #
- Ban the beige words inside the prompt itself.
- Force a `differs_from` field naming the 2–3 nearest plausible confusions.
- Allow `INSUFFICIENT`; never let the model fill thin evidence.
Here's the dead end, and it's the most important part of this post. Prompt an LLM to "summarize this company" and it normalizes everything into the same competent, beige register. Every entity becomes "a platform that helps teams streamline workflows." You haven't compiled meaning — you've laundered it. The geometric consequence is brutal: when every card is written in the same template prose, pairwise similarities compress toward each other and your entities get harder to separate than the raw scrapes were. I've watched teams add a card pass, see retrieval quality drop, and conclude the whole approach is snake oil. The approach was fine. The cards were homogenized. [PASTE: two homogenized cards from your own v1 prompt, side by side — this artifact is worth more than any argument.]
The fix is a compiler whose job is not to describe the entity but to encode what separates it from its nearest confusions:
You are writing embedding material for a retrieval system, not a summary for a human.
Entity evidence: {quotes, fields, source_type}
Write a semantic card with these fields:
- need: the specific operational problem, in the buyer's vocabulary, not the vendor's
- buyer: who inside the company owns this pain
- trigger: the evidence that this need is active NOW (hiring, scale, pricing pressure)
- differs_from: name the 2-3 nearest things this could be confused with,
and state explicitly why this entity is not those things
- exclusions: categories this entity must never be retrieved alongside
Rules:
- Use concrete nouns from the evidence (SKU counts, tools named, team names).
- Ban these words: platform, solution, streamline, leverage, innovative, seamless.
- If the evidence is too thin to fill a field, write "INSUFFICIENT" — do not invent.
The differs_from field writes the decision boundary into the card itself — "looks like a scraping vendor because the job post mentions scraping frameworks; is actually a retailer that consumes scraping" is exactly the sentence that pushes this entity away from the vendor cluster. And INSUFFICIENT matters because confabulation is the silent miscompile: a model that invents "enterprise customers" onto a two-person startup has poisoned the index with confident fiction, and embedded fiction is indistinguishable from embedded fact forever after.
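Prompt rules are requests, not guarantees, so it is worth checking each card mechanically before it is embedded. A minimal lint sketch follows, assuming cards arrive as plain dicts with the five fields from the prompt; `lint_card` and `insufficient_fields` are hypothetical helper names, and the banned-word match is a simple prefix check rather than real stemming.

```python
import re

REQUIRED_FIELDS = ("need", "buyer", "trigger", "differs_from", "exclusions")
BANNED_WORDS = ("platform", "solution", "streamline", "leverage",
                "innovative", "seamless")
# Matches each banned word plus simple suffixes ("platforms", "streamlined").
_BANNED_RE = re.compile(r"\b(?:" + "|".join(BANNED_WORDS) + r")\w*", re.IGNORECASE)


def lint_card(card):
    """Return a list of human-readable problems; an empty list means the card passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = str(card.get(name, "")).strip()
        if not value:
            problems.append(f"{name}: empty, expected text or INSUFFICIENT")
            continue
        for match in _BANNED_RE.finditer(value):
            problems.append(f"{name}: banned word '{match.group(0).lower()}'")
    return problems


def insufficient_fields(card):
    """Fields the model declined to fill; useful for routing to a re-scrape."""
    return [n for n in REQUIRED_FIELDS
            if str(card.get(n, "")).strip() == "INSUFFICIENT"]


beige_card = {
    "need": "A platform that helps teams streamline workflows.",
    "buyer": "INSUFFICIENT",
    "trigger": "INSUFFICIENT",
    "differs_from": "",
    "exclusions": "INSUFFICIENT",
}
for problem in lint_card(beige_card):
    print(problem)
```

A lint like this catches register collapse at the word level only. It cannot catch confabulation, which still needs the human audit against evidence.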
Decision rule: after any prompt change, pull 30 random card pairs. If mean pairwise similarity between cards is higher than between the raw chunks they came from, roll the prompt back — you compressed the geometry. Separately, audit 50 random cards against their evidence; track the confabulation rate as the guardrail on compile-cost savings.
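The card-pair check in that rule can be sketched in a few lines. The version below assumes you already have embedding vectors for each card and for the raw chunk it came from, stored in the same entity order; it samples the same entity pairs on both sides so the comparison is like for like. The function names are hypothetical, and the toy vectors stand in for real embeddings.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors, n_pairs=30, seed=0):
    """Mean cosine similarity over n_pairs randomly sampled pairs of vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)  # two distinct entities
        total += cosine(vectors[i], vectors[j])
    return total / n_pairs


def geometry_compressed(card_vecs, raw_vecs, n_pairs=30, seed=0):
    """True means roll the prompt back: cards sit closer together than raw chunks."""
    if len(card_vecs) != len(raw_vecs):
        raise ValueError("expected one raw vector per card, in the same order")
    # Same seed and same length give the same sampled index pairs on both sides.
    return (mean_pairwise_similarity(card_vecs, n_pairs, seed)
            > mean_pairwise_similarity(raw_vecs, n_pairs, seed))


# Toy 2-d vectors: homogenized cards nearly coincide, raw chunks are spread out.
homogenized_cards = [[1.0, 0.10], [1.0, 0.12], [1.0, 0.09]]
raw_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(geometry_compressed(homogenized_cards, raw_chunks))
```

Thirty pairs is a smoke test, not a statistic; if the two means are close, rerun with more pairs or a different seed before acting on the result.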
Step 3 — Compile the queries and the negative space #
- Generate a probe set per entity: literal, persona, and workflow phrasings.
- Generate trap probes — queries that should retrieve the near-miss, not the target.
- Generate ~12 hard negatives per target concept as a named avoid-set.
Companies write marketing language; users type need language. "AI-powered data operations platform" and "who needs competitor price scraping" point at the same entity and share almost no vocabulary. HyDE [S1] attacks this from the query side at search time; the compiler attacks it from the index side, once, offline — generate the queries that should find each entity, embed those probes, attach them.
Then compile the negative space, which is what finally kills the listicle:
Target: companies that need ecommerce product-page scraping and price monitoring.
Generate 12 hard negatives. Each must share surface vocabulary with the target
and fail for one nameable reason. Return: label, why_it_looks_close, why_rejected.
You'll get the SEO listicle, the scraping agency (sells the service, doesn't need it), the affiliate directory, the CRM page that says "enrichment" eight times, the developer tutorial. Embed these as a named avoid-set.
Decision rule: trap probes are eval material, never index material. If a trap probe like "best web scraping tools 2026" retrieves your buyers instead of listicles, that's a leakage failure no other metric is allowed to excuse.
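As a sketch of how that rule becomes a number: run every trap probe through the retriever and count the probes whose top-k contains any labeled target. The `retrieve(query, k)` callable is an assumed interface returning ranked document IDs, the canned rankings are toy data, and the second probe string is invented for illustration.

```python
def trap_probe_failures(trap_probes, retrieve, target_ids, k=10):
    """Return (failure_rate, failing_probes).

    A trap probe fails when any labeled target appears in its top-k results.
    """
    failing = [probe for probe in trap_probes
               if any(doc_id in target_ids for doc_id in retrieve(probe, k))]
    rate = len(failing) / len(trap_probes) if trap_probes else 0.0
    return rate, failing


# Toy stand-in for a real retriever: canned rankings keyed by query string.
CANNED_RANKINGS = {
    "best web scraping tools 2026": ["listicle_01", "listicle_02", "directory_07"],
    "top price monitoring software": ["listicle_03", "co_8841", "listicle_01"],
}


def toy_retrieve(query, k):
    return CANNED_RANKINGS.get(query, [])[:k]


rate, failing = trap_probe_failures(
    list(CANNED_RANKINGS), toy_retrieve, target_ids={"co_8841"}
)
print(f"trap-probe failure rate: {rate:.0%}")
print("failing probes:", failing)
```

Returning the failing probes alongside the rate keeps the metric debuggable: each failure is a concrete query to inspect.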
Step 4 — Embed facets, score with arithmetic you can audit #
- Store one vector per facet: problem, buyer, probe, avoid.
- Subtract avoid matches in scoring, not by vector arithmetic.
- Return per-surface score components on every result.
The beginner move is one entity, one vector, and it collapses too much. An entity matches a query for a reason — its problem, its buyer, its product — and different queries should hit different reasons:
| entity_id | surface | text |
|---|---|---|
| co_8841 | problem | competitor price tracking across 40k SKUs |
| co_8841 | buyer | data operations team, actively hiring |
| co_8841 | probe | who needs product-page price monitoring |
| co_8841 | avoid | SEO listicles and scraping-tool roundups |
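def entity_score(query_vec, surfaces, w=dict(problem=1.0, buyer=0.5,
                                             probe=0.4, avoid=-0.8)):
    # max_sim (not shown) is assumed to return the best cosine similarity
    # between the query and any vector stored on that surface.
    s = {k: max_sim(query_vec, surfaces[k]) for k in surfaces}
    return sum(w[k] * s[k] for k in s), s  # return components for audit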
Those weights are priors; the harness in Step 6 is how they become findings. But here's the second dead end, on principle: I tried the elegant version first — bake the negative directly into the profile vector (profile = avg(want) − λ·avg(avoid)). It's clean math and completely opaque when results look wrong. Separate positive and negative scores mean every buried result comes with a stated reason: "matched avoid-cluster: scraping-tool listicle, 0.74." That sentence is the difference between a recommender I can debug and one I can only apologize for.
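For readers who want to run the explicit-scoring idea end to end, here is a self-contained sketch. It assumes `max_sim` means "best cosine similarity against any vector on a surface", uses toy 2-d vectors in place of real embeddings, and adds a hypothetical `explain` helper with an illustrative `avoid_threshold` of 0.5 to produce the stated reason.

```python
import math

WEIGHTS = {"problem": 1.0, "buyer": 0.5, "probe": 0.4, "avoid": -0.8}  # priors from the text


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_sim(query_vec, vectors):
    """Best similarity between the query and any vector on one surface."""
    return max((cosine(query_vec, v) for v in vectors), default=0.0)


def entity_score(query_vec, surfaces, weights=WEIGHTS):
    components = {name: max_sim(query_vec, vecs) for name, vecs in surfaces.items()}
    total = sum(weights[name] * sim for name, sim in components.items())
    return total, components  # components are kept so every score can be audited


def explain(components, avoid_threshold=0.5):
    """Turn score components into a one-line reason (threshold is illustrative)."""
    avoid = components.get("avoid", 0.0)
    if avoid >= avoid_threshold:
        return f"matched avoid-cluster, {avoid:.2f}"
    positives = {n: s for n, s in components.items() if n != "avoid"}
    if not positives:
        return "no positive surface matched"
    best = max(positives, key=positives.get)
    return f"best surface: {best}, {positives[best]:.2f}"


query = [1.0, 0.0]
buyer_surfaces = {"problem": [[0.9, 0.1]], "buyer": [[0.7, 0.3]],
                  "probe": [[1.0, 0.0], [0.5, 0.5]], "avoid": [[0.0, 1.0]]}
listicle_surfaces = {"problem": [[0.6, 0.4]], "buyer": [[0.2, 0.8]],
                     "probe": [[0.5, 0.5]], "avoid": [[1.0, 0.05]]}

buyer_total, buyer_parts = entity_score(query, buyer_surfaces)
listicle_total, listicle_parts = entity_score(query, listicle_surfaces)
print(round(buyer_total, 3), explain(buyer_parts))
print(round(listicle_total, 3), explain(listicle_parts))
```

Because the avoid term is a separate component rather than folded into one vector, the listicle's low rank arrives with its reason attached.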
This is the same instinct that makes ColBERT's late interaction beat single-vector compression [S2]: don't crush a multi-faceted object into one point and hope. Facet-level granularity captures most of the win without token-level cost. Two related rules while you're here: when averaging vectors into centroids, cluster first and build one centroid per interest — a user who saves scraping posts, crypto posts, and AI tools doesn't have a taste, they have three, and their average is a point corresponding to nothing. And weight sources by representativeness — a uniform average of homepage, docs, and three blog posts is blob embedding with extra steps.
Decision rule: graduate to vector arithmetic only after the explicit scoring behavior is understood and stable on the harness — not before.
Step 5 — Cascade: recall wide, judge narrow #
- Recall ~500: merge dense top-k, sparse/keyword top-k, and metadata-filtered candidates.
- Collapse ~100: dedupe by merging evidence, never deleting copies.
- Score ~25 with trust metadata, then have an LLM judge cut them to a top 10 with diversity enforced.
The vector index never makes the final call. Hybrid recall isn't optional: dense embeddings soften exactly the tokens that matter most in B2B retrieval — "SOC 2," "Shopify," SKU formats, weird acronyms. SPLADE [S3] formalizes why lexical precision survives the embedding era; you don't need SPLADE itself to honor the principle.
Dedupe deserves its own line: if a pricing page, a job post, and a docs page all point at the same company need, the right output is one candidate with three pieces of evidence and a confidence bump — not three diluted rows. Scraping redundancy, merged correctly, is ranking signal. Scoring is where the trust metadata from Step 1 finally earns its keep: semantic fit + trigger strength + freshness + source trust − avoid-match. Meaning was the embedding's job; trust is this stage's job. Mixing them was never going to work. The final LLM judge reads 25 candidates with evidence, rejects weak fits with reasons, and enforces diversity — relevance gets things into the pool, but nobody wants ten clones from one cluster, which is the MMR insight [S4].
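The merge-don't-delete dedupe can be sketched as a group-by on entity. The version below assumes each candidate row is a dict with `entity_id`, `source_type`, `quote`, and `confidence`; the size of the confidence bump and its cap are illustrative assumptions to be tuned on the harness, not recommended values.

```python
from collections import defaultdict


def merge_by_entity(candidates, bump_per_extra_source=0.05, cap=0.99):
    """Collapse rows that point at the same entity into one candidate.

    Evidence is concatenated, never dropped; each additional corroborating
    row raises confidence by a small, capped bump.
    """
    grouped = defaultdict(list)
    for row in candidates:
        grouped[row["entity_id"]].append(row)

    merged = []
    for entity_id, rows in grouped.items():
        base = max(r["confidence"] for r in rows)
        confidence = min(cap, base + bump_per_extra_source * (len(rows) - 1))
        merged.append({
            "entity_id": entity_id,
            "evidence": [{"source_type": r["source_type"], "quote": r["quote"]}
                         for r in rows],
            "confidence": round(confidence, 4),
        })
    return merged


rows = [
    {"entity_id": "co_8841", "source_type": "job_post",
     "quote": "competitor price tracking pipeline across 40k SKUs", "confidence": 0.84},
    {"entity_id": "co_8841", "source_type": "pricing_page",
     "quote": "(toy quote)", "confidence": 0.60},
    {"entity_id": "co_8841", "source_type": "docs",
     "quote": "(toy quote)", "confidence": 0.70},
    {"entity_id": "co_1200", "source_type": "homepage",
     "quote": "(toy quote)", "confidence": 0.55},
]
for candidate in merge_by_entity(rows):
    print(candidate["entity_id"], len(candidate["evidence"]), candidate["confidence"])
```

Keeping every quote on the merged candidate is what lets the later judge stage cite evidence when it accepts or rejects a fit.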
Decision rule: if a stage can't state why it buried a candidate, it doesn't ship.
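The diversity step can be sketched as a greedy MMR selection in the spirit of [S4]: at each step, pick the candidate with the best trade-off between relevance and similarity to what has already been selected. The `lam` value of 0.7 and the cluster-based toy similarity are assumptions for illustration; a real pipeline would likely use embedding similarity between cards.

```python
def mmr_select(candidates, relevance, similarity, k=10, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    score(c) = lam * relevance[c] - (1 - lam) * max similarity to anything selected
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: two near-clones from one cluster and one weaker but distinct candidate.
CLUSTER = {"retail_a1": "retail", "retail_a2": "retail", "travel_b1": "travel"}
RELEVANCE = {"retail_a1": 0.90, "retail_a2": 0.85, "travel_b1": 0.60}


def same_cluster_similarity(a, b):
    return 1.0 if CLUSTER[a] == CLUSTER[b] else 0.0


top2 = mmr_select(list(CLUSTER), RELEVANCE, same_cluster_similarity, k=2)
print(top2)
```

With `lam=1.0` this reduces to plain relevance ranking, which makes it easy to A/B the diversity term on the harness.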
Step 6 — Build the harness before trusting anything above #
- Label ~25 real queries with ~20 positives each — one afternoon.
- Run the naive pipeline first; that readout is your baseline forever.
- Tag every vector with `recipe_version`; shadow-test new recipes before mixing.
EVAL = {
    "queries": load("eval/queries.jsonl"),                # ~25 real intents
    "positives": load("eval/labeled_positives.jsonl"),    # hand-labeled, ~20 per query
    "hard_negatives": load("eval/hard_negatives.jsonl"),  # from the compiler, Step 3
    "trap_probes": load("eval/trap_probes.jsonl"),        # should NOT retrieve targets
    "near_dupes": load("eval/dupe_traps.jsonl"),
}

def evaluate(pipeline):
    return {
        "precision@5": ...,
        "recall@50": ...,
        "hard_negative_leakage": pct_of_top10_matching_avoid_set(),  # the metric
        "trap_probe_failures": pct_of_trap_probes_retrieving_targets(),
        "dupe_rate@10": ...,
        "cluster_diversity@10": distinct_named_clusters_in_top10(),
    }
Hard-negative leakage is the win metric because it directly measures the listicle problem — it's the number raw-embedding pipelines can't move and compiled pipelines can. Recall@50 is its guardrail, and here's my override rule: if recall@50 drops more than [YOUR FLOOR — I use 3 points], the leakage win doesn't count; the compiler is over-filtering, usually because cards omitted a real but unusual angle on an entity. The mitigation is keeping a raw-chunk dense index in stage-1 recall as a backstop.
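A minimal sketch of the win metric and its guardrail, assuming results are ranked lists of document IDs and that "matching the avoid-set" is decided by membership in a labeled set of hard-negative IDs (a similarity-threshold match would be a reasonable alternative). The 3-point floor is the author's stated default; all other names and numbers are illustrative.

```python
def leakage_at_k(ranked_ids, avoid_ids, k=10):
    """Fraction of the top-k results that are labeled hard negatives."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in avoid_ids) / len(top) if top else 0.0


def recall_at_k(ranked_ids, positive_ids, k=50):
    """Fraction of labeled positives that appear in the top-k results."""
    if not positive_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(positive_ids)) / len(positive_ids)


def leakage_win_counts(before, after, recall_floor_points=3.0):
    """A leakage drop only counts if recall@50 fell by no more than the floor."""
    recall_drop_points = (before["recall@50"] - after["recall@50"]) * 100
    leakage_improved = after["leakage@10"] < before["leakage@10"]
    return leakage_improved and recall_drop_points <= recall_floor_points


# Toy ranking for one query: three hard negatives in the top 10.
ranking = ["listicle_01", "co_8841", "listicle_02", "co_1200", "agency_04",
           "co_3310", "co_0007", "co_5521", "co_9090", "co_4412"]
avoid = {"listicle_01", "listicle_02", "agency_04"}
positives = {"co_8841", "co_1200", "co_7777", "co_3310"}

print(leakage_at_k(ranking, avoid))
print(recall_at_k(ranking, positives))
```

In practice both metrics would be averaged over the full query set before the guardrail comparison is made.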
Respect the sample size out loud: don't conclude anything from fewer than ~25 queries × ~20 labeled positives. Below that, a leakage delta is noise wearing a suit.
One more cheap test that pays for itself: cluster the corpus and have the LLM name each cluster from its members. If a cluster comes back as "Retail catalog monitoring for pricing and availability changes," the compiler and geometry are working. If it comes back as "AI-powered business platforms," that cluster is mush, and the mush almost always traces back to homogenized cards from Step 2. One LLM call per cluster — the cheapest integration test in the system.
Decision rule: never let two recipes' vectors mix silently in one index. "Did it get better or just different?" should be a query, not a debate.
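One cheap way to enforce that rule is a guard that runs before any evaluation or index swap. The sketch assumes each stored vector carries a `recipe_version` key in its metadata dict; `assert_single_recipe` is a hypothetical helper name.

```python
from collections import Counter


def assert_single_recipe(vector_rows):
    """Raise if an index mixes recipes; otherwise return the one version found."""
    versions = Counter(row["recipe_version"] for row in vector_rows)
    if len(versions) > 1:
        raise ValueError(f"mixed recipes in one index: {dict(versions)}")
    return next(iter(versions), None)


clean_index = [{"id": "co_8841:problem", "recipe_version": "v1"},
               {"id": "co_8841:buyer", "recipe_version": "v1"}]
print(assert_single_recipe(clean_index))
```

Shadow-testing then means building the new recipe into its own index, running the harness against both, and comparing the two readouts by version.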
When you should not do any of this #
| Situation | Why raw embedding is fine |
|---|---|
| Corpus is already clean, single-register prose (docs, papers, support articles) | The compile step's value is proportional to input chaos; here it's ~zero |
| Queries are lexical lookups ("error code 4012") | BM25 wins; meaning isn't the bottleneck |
| Corpus is tiny (<5k docs) | Hand-curate or stuff context; pipelines are overhead |
| Documents mutate hourly | Compile cost recurs per change; the amortization argument inverts |
| You can't afford the eval set | Then you can't see what the compiler does, and unmeasured complexity is pure risk |
And budget for the compiler's own bugs, all covered above, restated as a checklist: register collapse (homogenized cards — fight with contrastive prompts and banned words), confident confabulation (fight with INSUFFICIENT and the 50-card audit), recall ceiling (cards omit real angles — keep the raw-chunk backstop), silent recipe drift (fight with recipe_version).
Where this lands #
The result, restated with its slots showing: hard-negative leakage [BASELINE %] → [POST-COMPILE %], precision@5 [BEFORE] → [AFTER], recall@50 held within [GUARDRAIL DELTA], for a one-time compile cost of [INVOICE $] on [N] documents. Honest time accounting: the labeling was an afternoon; the full build was [ACTUAL CALENDAR TIME], and the contrastive prompt took [N] revisions before the card-pair check passed.
The last generation of retrieval advice asked which vector database and which embedding model. Mostly commoditized answers now. The question that separates systems is what, exactly, you embed — because LLMs made meaning manufacturable at index time for pennies, and almost nobody has rebuilt their pipeline around that fact yet. LLMs compile evidence into meaning. Embeddings turn meaning into geometry. Scoring turns geometry into judgment. The harness tells you whether any of it is true.
If you run the harness on your own corpus, I want to see your leakage numbers — especially if they don't move. That's the most useful failure report you can send me.
Sources #
- [S1] Gao et al., "Precise Zero-Shot Dense Retrieval without Relevance Labels" (HyDE) — generate hypothetical text, embed it, retrieve real documents. The query-time mirror of index-time probe generation. https://arxiv.org/abs/2212.10496
- [S2] Khattab & Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" — why preserving multiple interaction surfaces beats single-vector compression. https://arxiv.org/abs/2004.12832
- [S3] Formal et al., "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking" — lexical precision survives the dense era; hybrid retrieval is principled, not nostalgic. https://arxiv.org/abs/2107.05720
- [S4] Carbonell & Goldstein, "The Use of MMR, Diversity-Based Reranking" — relevance alone produces redundant rankings; select for relevance plus novelty. https://aclanthology.org/X98-1025/