Reranking

Cross-encoder reranking for better retrieval. Plug-and-play with embedding pipelines.

FIG. 00 · RERANKING · RANK BY SCORE

Reranking takes a query and a list of candidate documents and returns them re-ordered by relevance. Embedding models give you fast approximate retrieval (use embed for the query); reranking gives you precise top-K ordering with a cross-encoder pass. Combined, they're the standard recipe for high-quality RAG.

FIG. 01 · TWO-STAGE RETRIEVAL · SCHEMATIC
Stage one is embedding-driven ANN over your corpus to pull ~50 candidates. Stage two is a cross-encoder rerank that scores each candidate against the query and returns the top-K. The LLM only sees the reranked few.

Quickstart

import { rerank } from "ai"

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  query: "How do I rotate an API key?",
  documents: [
    "API keys can be rotated in the dashboard under Keys → Rotate.",
    "Synapse Garden is a proxy for 100+ language models.",
    "Spend caps return HTTP 402 when exceeded.",
    "To rotate a key, click the rotate button next to the key.",
  ],
  topK: 3,
})

for (const r of results) {
  console.log(r.relevanceScore.toFixed(3), r.document)
}

Output:

0.927 To rotate a key, click the rotate button next to the key.
0.864 API keys can be rotated in the dashboard under Keys → Rotate.
0.131 Spend caps return HTTP 402 when exceeded.

The score is a calibrated relevance probability between 0 and 1. Documents not in the top-K are dropped from the result.
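For reference, the result shape the examples above rely on; a hand-written type sketch covering only the fields used in this guide:

type RerankResult = {
  document: string        // the candidate text, echoed back
  relevanceScore: number  // calibrated probability in [0, 1]
}

type RerankResponse = {
  results: RerankResult[] // ordered best-first, truncated to topK
}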

The standard pattern: embed → ANN retrieval → rerank → generate.

import { embed, rerank, generateText } from "ai"

const userQuestion = "How do I rotate an API key?"

// 1. Embed the query
const { embedding: queryVec } = await embed({
  model: "openai/text-embedding-3-large",
  value: userQuestion,
})

// 2. ANN retrieval — pull top 50 candidates from the vector DB
const candidates = await db.query`
  SELECT text FROM docs
  ORDER BY vector <=> ${queryVec}
  LIMIT 50
`
const docs = candidates.map((r) => r.text)

// 3. Rerank — narrow down to the most relevant 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: docs,
  topK: 5,
})
const top5 = results.map((r) => r.document)

// 4. Generate the answer
const { text } = await generateText({
  model: "openai/gpt-5.4-mini",
  system: "Answer using only the context provided. If the context doesn't contain the answer, say so.",
  prompt: `Context:\n${top5.join("\n---\n")}\n\nQuestion: ${userQuestion}`,
})

console.log(text)

Why both stages?

  • Embeddings are fast (a query embed plus an ANN lookup typically lands under 10 ms), but the similarity score is only approximate.
  • Reranking is slower (up to a few hundred ms for 50 candidates) but the score is much more accurate.
  • Pulling 50 with ANN and reranking to 5 typically gives you the same recall as pulling 200+ with embeddings alone.
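To see where the time goes in your own stack, time the two stages separately. A minimal sketch; annSearch is a hypothetical stand-in for your stage-one retrieval:

import { rerank } from "ai"

// hypothetical helper wrapping your vector DB's ANN query
declare function annSearch(query: string, limit: number): Promise<string[]>

const userQuestion = "How do I rotate an API key?"

const t0 = performance.now()
const docs = await annSearch(userQuestion, 50) // stage 1: fast, approximate
const t1 = performance.now()

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: docs,
  topK: 5, // stage 2: slower, precise
})
const t2 = performance.now()

console.log(`ANN: ${Math.round(t1 - t0)}ms · rerank: ${Math.round(t2 - t1)}ms`)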

Available models

Filter the catalog by the Reranking modality on /models. Top picks:

Model                              Best for
cohere/rerank-english-v3.0         English; the workhorse
cohere/rerank-multilingual-v3.0    100+ languages
voyage/rerank-2.5                  Strong on code, legal, biomedical

Reranking models are billed per query (not per token) — check the model detail page for the live rate.

Score interpretation

Reranking scores are calibrated probabilities. Rough ranges:

Score        Meaning
> 0.9        Highly relevant: almost certainly answers the query
0.5 – 0.9    Likely relevant: useful context
0.1 – 0.5    Tangentially related
< 0.1        Unrelated noise

A common pattern: drop everything below 0.3 before passing to the LLM, even if it's in topK. Saves tokens, improves answer quality.

const filtered = results.filter((r) => r.relevanceScore > 0.3)

Long-document reranking

For long candidates, the model truncates to its max input length (typically 4–8K tokens per document). For documents longer than that, chunk first and rerank chunks:

const chunks = docs.flatMap((doc) => chunkText(doc, { size: 600 })) // chunkText: your own splitter

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: chunks,
  topK: 10,
})

For chunk-aware retrieval, store the chunks separately in your vector DB and dedupe parent documents when building the LLM context.
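A minimal dedupe sketch, assuming each chunk in your vector DB carries a parentId alongside its text (both names are hypothetical, not part of the API):

// `results` is the chunk-level rerank output from the block above
type ChunkRow = { text: string; parentId: string }
declare const chunkRows: ChunkRow[]

const parentOf = new Map(chunkRows.map((c) => [c.text, c.parentId] as const))

// results arrive best-first, so keeping the first chunk seen per parent
// keeps each parent document's best-scoring chunk
const seen = new Set<string>()
const contextChunks: string[] = []
for (const r of results) {
  const parent = parentOf.get(r.document)
  if (parent === undefined || seen.has(parent)) continue
  seen.add(parent)
  contextChunks.push(r.document)
}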

Reranking without embeddings (BM25 + rerank)

Embedding models cost money; for some apps a classic BM25 keyword search is plenty for stage one, with the cross-encoder rerank as stage two. Postgres tsvector, Elasticsearch, and OpenSearch all handle the keyword stage well.

const candidates = await db.query`
  SELECT text FROM docs
  WHERE to_tsvector('english', text) @@ websearch_to_tsquery(${userQuestion})
  ORDER BY ts_rank_cd(to_tsvector('english', text), websearch_to_tsquery(${userQuestion})) DESC
  LIMIT 50
`

// Rerank top 50 → top 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: candidates.map((r) => r.text),
  topK: 5,
})

Hybrid retrieval (BM25 + embeddings) typically gives the best stage-one recall; see your vector DB's docs for the merge strategy, or the sketch below.
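One widely used merge strategy is reciprocal rank fusion (RRF); a minimal sketch over two ranked lists of document texts (k = 60 is the conventional RRF constant, not anything this API prescribes):

function rrfMerge(bm25Docs: string[], annDocs: string[], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const list of [bm25Docs, annDocs]) {
    // earlier ranks contribute more: 1 / (k + rank + 1)
    list.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc)
}

Feed the merged list into rerank() as before for the stage-two cut.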

Caveats

  • Don't rerank thousands of candidates. Latency is roughly linear in candidate count. Stay under ~100 candidates per call; if you need more, batch (first sketch after this list).
  • Cache aggressively. Rerank is deterministic: the same query and documents always produce the same scores. Hash and cache (second sketch after this list).
  • Rerank only when ranking matters. If the LLM only needs some relevant context (not the most relevant), embeddings alone are often enough.
  • Languages must match. Use rerank-multilingual-v3.0 if any of your docs aren't English. Otherwise scores are unreliable.
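A batching sketch for the first caveat. Because scores are calibrated probabilities, results from separate calls can be merged on relevanceScore directly; the batch size of 100 mirrors the guideline above, not a documented hard limit:

import { rerank } from "ai"

async function rerankInBatches(query: string, documents: string[], topK: number, batchSize = 100) {
  const batches: string[][] = []
  for (let i = 0; i < documents.length; i += batchSize) {
    batches.push(documents.slice(i, i + batchSize))
  }
  // score each batch independently, then merge on the calibrated score
  const perBatch = await Promise.all(
    batches.map((docs) =>
      rerank({
        model: "cohere/rerank-english-v3.0",
        query,
        documents: docs,
        topK: docs.length, // keep everything; cut to topK after merging
      }).then((r) => r.results)
    )
  )
  return perBatch
    .flat()
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topK)
}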
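And a caching sketch for the second: hash the query, documents, and topK together and reuse the scores on a hit. An in-memory Map is shown; swap in Redis or similar for anything multi-process:

import { createHash } from "node:crypto"
import { rerank } from "ai"

type Scored = { document: string; relevanceScore: number }
const cache = new Map<string, Scored[]>()

async function cachedRerank(query: string, documents: string[], topK: number): Promise<Scored[]> {
  const key = createHash("sha256")
    .update(JSON.stringify({ query, documents, topK }))
    .digest("hex")
  const hit = cache.get(key)
  if (hit) return hit
  const { results } = await rerank({
    model: "cohere/rerank-english-v3.0",
    query,
    documents,
    topK,
  })
  cache.set(key, results)
  return results
}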