Reranking
Cross-encoder reranking for better retrieval. Plug-and-play with embedding pipelines.
Reranking takes a query and a list of candidate documents and returns them re-ordered by relevance. Embedding models give you fast approximate retrieval (use embed for the query); reranking gives you precise top-K ordering with a cross-encoder pass. Combined, they're the standard recipe for high-quality RAG.
Quickstart
```ts
import { rerank } from "ai"

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  query: "How do I rotate an API key?",
  documents: [
    "API keys can be rotated in the dashboard under Keys → Rotate.",
    "Synapse Garden is a proxy for 100+ language models.",
    "Spend caps return HTTP 402 when exceeded.",
    "To rotate a key, click the rotate button next to the key.",
  ],
  topK: 3,
})

for (const r of results) {
  console.log(r.relevanceScore.toFixed(3), r.document)
}
```

Output:
```
0.927 To rotate a key, click the rotate button next to the key.
0.864 API keys can be rotated in the dashboard under Keys → Rotate.
0.131 Spend caps return HTTP 402 when exceeded.
```

The score is a calibrated relevance probability between 0 and 1. Documents not in the top-K are dropped from the result.
Two-stage retrieval (recommended)
The standard pattern: embed → ANN retrieval → rerank → generate.
```ts
import { embed, rerank, generateText } from "ai"

const userQuestion = "How do I rotate an API key?"

// 1. Embed the query
const { embedding: queryVec } = await embed({
  model: "openai/text-embedding-3-large",
  value: userQuestion,
})

// 2. ANN retrieval — pull top 50 candidates from the vector DB
// (db: your SQL client with a pgvector-style <=> distance operator)
const candidates = await db.query`
  SELECT text FROM docs
  ORDER BY vector <=> ${queryVec}
  LIMIT 50
`
const docs = candidates.map((r) => r.text)

// 3. Rerank — narrow down to the most relevant 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: docs,
  topK: 5,
})
const top5 = results.map((r) => r.document)

// 4. Generate the answer
const { text } = await generateText({
  model: "openai/gpt-5.4-mini",
  system:
    "Answer using only the context provided. If the context doesn't contain the answer, say so.",
  prompt: `Context:\n${top5.join("\n---\n")}\n\nQuestion: ${userQuestion}`,
})
console.log(text)
```

Why both stages?
- Embeddings are fast (one query embed plus an ANN lookup typically comes in under 10 ms), but their similarity score is approximate.
- Reranking is slower (up to a few hundred ms for 50 candidates), but the score is far more accurate.
- Pulling 50 candidates with ANN and reranking down to 5 typically matches the recall of pulling 200+ with embeddings alone.
Available models
Filter the catalog by the Reranking modality on /models. Top picks:
| Model | Best for |
|---|---|
| cohere/rerank-english-v3.0 | English; the workhorse |
| cohere/rerank-multilingual-v3.0 | 100+ languages |
| voyage/rerank-2.5 | Strong on code, legal, biomedical |
Reranking models are billed per query (not per token) — check the model detail page for the live rate.
Score interpretation
Reranking scores are calibrated probabilities. Rough ranges:
| Score | Meaning |
|---|---|
| > 0.9 | Highly relevant — almost certainly answers the query |
| 0.5 – 0.9 | Likely relevant — useful context |
| 0.1 – 0.5 | Tangentially related |
| < 0.1 | Unrelated noise |
A common pattern: drop everything below 0.3 before passing to the LLM, even if it's in topK. Saves tokens, improves answer quality.
```ts
const filtered = results.filter((r) => r.relevanceScore > 0.3)
```

Long-document reranking
For long candidates, the model truncates each document to its max input length (typically 4–8K tokens per document). For documents longer than that, chunk first and rerank the chunks:
```ts
// chunkText is a placeholder for your own chunking helper
// (e.g. split each document into ~600-token windows).
const chunks = candidates.flatMap((doc) => chunkText(doc, { size: 600 }))

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: chunks,
  topK: 10,
})
```

For chunk-aware retrieval, store the chunks separately in your vector DB and dedupe parent documents when building the LLM context, as sketched below.
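One way to dedupe, assuming each chunk carries a parentId pointing back to its source document (the Chunk shape and dedupeByParent helper here are illustrative, not part of the API):

```ts
// Hypothetical chunk shape: each chunk remembers its parent document.
type Chunk = { text: string; parentId: string }
type Ranked = { document: string; relevanceScore: number }

// Keep only the best-scoring chunk per parent. `results` comes back
// best-first from rerank, so the first chunk seen for a parent wins.
function dedupeByParent(results: Ranked[], chunks: Chunk[]): Ranked[] {
  const parentOf = new Map(chunks.map((c) => [c.text, c.parentId] as const))
  const seen = new Set<string>()
  return results.filter((r) => {
    const parent = parentOf.get(r.document) ?? r.document
    if (seen.has(parent)) return false
    seen.add(parent)
    return true
  })
}
```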
Reranking without embeddings (BM25 + rerank)
Embedding models cost money; for some apps, a classic BM25 keyword search is plenty for stage 1, with a rerank pass as stage 2. Postgres tsvector, Elasticsearch, and OpenSearch all do this well.
```ts
const candidates = await db.query`
  SELECT text FROM docs
  WHERE to_tsvector('english', text) @@ websearch_to_tsquery(${userQuestion})
  ORDER BY ts_rank_cd(to_tsvector('english', text), websearch_to_tsquery(${userQuestion})) DESC
  LIMIT 50
`

// Rerank top 50 → top 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: candidates.map((r) => r.text),
  topK: 5,
})
```

Hybrid (BM25 + embeddings) gives the best results — see your vector DB's docs for the merge strategy.
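If you'd rather merge the two candidate lists yourself, reciprocal rank fusion (RRF) is a common tuning-free strategy. A minimal sketch; the rrfMerge helper and the k = 60 constant are conventional choices, not part of this API:

```ts
// Reciprocal rank fusion: a document's fused score is the sum of
// 1 / (k + rank) over every list it appears in. k = 60 is the usual default.
function rrfMerge(lists: string[][], k = 60, limit = 50): string[] {
  const scores = new Map<string, number>()
  for (const list of lists) {
    list.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([doc]) => doc)
}

// Usage: const merged = rrfMerge([bm25Docs, annDocs]), where bm25Docs and
// annDocs are the string[] candidate lists from each stage-1 query; then
// pass `merged` to rerank() as `documents`.
```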
Caveats
- Don't rerank thousands of candidates. Latency is roughly linear in candidate count. Stay under ~100 candidates per call; if you need more, batch (see the first sketch after this list).
- Cache aggressively. Rerank is deterministic — same query + same documents → same scores. Hash and cache (second sketch below).
- Rerank only when ranking matters. If the LLM only needs some relevant context (not the most relevant), embeddings alone are often enough.
- Languages must match. Use cohere/rerank-multilingual-v3.0 if any of your docs aren't English. Otherwise scores are unreliable.
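The batching pattern from the first caveat, as a sketch; rerankLarge and the batch size of 100 are illustrative, not part of the API:

```ts
import { rerank } from "ai"

// Split an oversized candidate set into batches, rerank each batch,
// and merge on score. Scores are calibrated probabilities, so they
// are comparable across batches.
async function rerankLarge(query: string, documents: string[], topK: number) {
  const batches: string[][] = []
  for (let i = 0; i < documents.length; i += 100) {
    batches.push(documents.slice(i, i + 100))
  }
  const responses = await Promise.all(
    batches.map((batch) =>
      rerank({
        model: "cohere/rerank-english-v3.0",
        query,
        documents: batch,
        topK: Math.min(topK, batch.length),
      }),
    ),
  )
  return responses
    .flatMap((res) => res.results)
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topK)
}
```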
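And the caching pattern from the second caveat; the in-memory Map and the cachedRerank helper are illustrative (use Redis or similar if you run multiple instances), with the result shape taken from the quickstart:

```ts
import { createHash } from "node:crypto"
import { rerank } from "ai"

type Ranked = { document: string; relevanceScore: number }
const cache = new Map<string, Ranked[]>()

async function cachedRerank(query: string, documents: string[], topK: number) {
  // Same query + same documents + same topK gives the same scores,
  // so a content hash of the inputs is a safe cache key.
  const key = createHash("sha256")
    .update(JSON.stringify({ query, documents, topK }))
    .digest("hex")
  const hit = cache.get(key)
  if (hit) return hit
  const { results } = await rerank({
    model: "cohere/rerank-english-v3.0",
    query,
    documents,
    topK,
  })
  cache.set(key, results)
  return results
}
```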