Reranking

Cross-encoder reranking for better retrieval. Plug-and-play with embedding pipelines.

FIG. 00 · RERANKING · RANK BY SCORE

Reranking takes a query and a list of candidate documents and returns them re-ordered by relevance. Embedding models give you fast approximate retrieval (use embed for the query); reranking gives you precise top-K ordering with a cross-encoder pass. Combined, they're the standard recipe for high-quality RAG.

FIG. 01 · TWO-STAGE RETRIEVAL · SCHEMATIC
Stage one is embedding-driven ANN over your corpus to pull ~50 candidates. Stage two is a cross-encoder rerank that scores each candidate against the query and returns the top-K. The LLM only sees the reranked few.

Quickstart

import { rerank } from "ai"

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  query: "How do I rotate an API key?",
  documents: [
    "API keys can be rotated in the dashboard under Keys → Rotate.",
    "Synapse Garden is a proxy for 100+ language models.",
    "Spend caps return HTTP 402 when exceeded.",
    "To rotate a key, click the rotate button next to the key.",
  ],
  topK: 3,
})

for (const r of results) {
  console.log(r.relevanceScore.toFixed(3), r.document)
}

Output:

0.927 To rotate a key, click the rotate button next to the key.
0.864 API keys can be rotated in the dashboard under Keys → Rotate.
0.131 Spend caps return HTTP 402 when exceeded.

The score is a calibrated relevance probability between 0 and 1. Documents not in the top-K are dropped from the result.
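For reference, the result shape the examples above rely on; a hand-written type sketch covering only the fields used in this guide:

type RerankResult = {
  document: string        // the candidate text, echoed back
  relevanceScore: number  // calibrated probability in [0, 1]
}

type RerankResponse = {
  results: RerankResult[] // ordered best-first, truncated to topK
}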

The standard pattern: embed → ANN retrieval → rerank → generate.

import { embed, rerank, generateText } from "ai"

const userQuestion = "How do I rotate an API key?"

// 1. Embed the query
const { embedding: queryVec } = await embed({
  model: "openai/text-embedding-3-large",
  value: userQuestion,
})

// 2. ANN retrieval — pull top 50 candidates from the vector DB
const candidates = await db.query`
  SELECT text FROM docs
  ORDER BY vector <=> ${queryVec}
  LIMIT 50
`
const docs = candidates.map((r) => r.text)

// 3. Rerank — narrow down to the most relevant 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: docs,
  topK: 5,
})
const top5 = results.map((r) => r.document)

// 4. Generate the answer
const { text } = await generateText({
  model: "openai/gpt-5.4-mini",
  system: "Answer using only the context provided. If the context doesn't contain the answer, say so.",
  prompt: `Context:\n${top5.join("\n---\n")}\n\nQuestion: ${userQuestion}`,
})

console.log(text)

Why both stages?

  • Embeddings are fast (a query embed plus an ANN lookup typically lands under 10 ms), but the similarity score is only approximate.
  • Reranking is slower (up to a few hundred ms for 50 candidates) but the score is much more accurate.
  • Pulling 50 with ANN and reranking to 5 typically gives you the same recall as pulling 200+ with embeddings alone.
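To see where the time goes in your own stack, time the two stages separately. A minimal sketch; annSearch is a hypothetical stand-in for your stage-one retrieval:

import { rerank } from "ai"

// hypothetical helper wrapping your vector DB's ANN query
declare function annSearch(query: string, limit: number): Promise<string[]>

const userQuestion = "How do I rotate an API key?"

const t0 = performance.now()
const docs = await annSearch(userQuestion, 50) // stage 1: fast, approximate
const t1 = performance.now()

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: docs,
  topK: 5, // stage 2: slower, precise
})
const t2 = performance.now()

console.log(`ANN: ${Math.round(t1 - t0)}ms · rerank: ${Math.round(t2 - t1)}ms`)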

Available models

Filter the catalog by the Reranking modality on /models. Top picks:

Model                              Best for
cohere/rerank-english-v3.0         English; the workhorse
cohere/rerank-multilingual-v3.0    100+ languages
voyage/rerank-2.5                  Strong on code, legal, biomedical

Reranking models are billed per query (not per token) — check the model detail page for the live rate.

Score interpretation

Reranking scores are calibrated probabilities. Rough ranges:

Score        Meaning
> 0.9        Highly relevant: almost certainly answers the query
0.5 – 0.9    Likely relevant: useful context
0.1 – 0.5    Tangentially related
< 0.1        Unrelated noise

A common pattern: drop everything below 0.3 before passing to the LLM, even if it's in topK. Saves tokens, improves answer quality.

const filtered = results.filter((r) => r.relevanceScore > 0.3)

Long-document reranking

For long candidates, the model truncates to its max input length (typically 4–8K tokens per document). For documents longer than that, chunk first and rerank chunks:

const chunks = docs.flatMap((doc) => chunkText(doc, { size: 600 })) // chunkText: your own splitter

const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: chunks,
  topK: 10,
})

For chunk-aware retrieval, store the chunks separately in your vector DB and dedupe parent documents when building the LLM context.
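A minimal dedupe sketch, assuming each chunk in your vector DB carries a parentId alongside its text (both names are hypothetical, not part of the API):

// `results` is the chunk-level rerank output from the block above
type ChunkRow = { text: string; parentId: string }
declare const chunkRows: ChunkRow[]

const parentOf = new Map(chunkRows.map((c) => [c.text, c.parentId] as const))

// results arrive best-first, so keeping the first chunk seen per parent
// keeps each parent document's best-scoring chunk
const seen = new Set<string>()
const contextChunks: string[] = []
for (const r of results) {
  const parent = parentOf.get(r.document)
  if (parent === undefined || seen.has(parent)) continue
  seen.add(parent)
  contextChunks.push(r.document)
}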

Reranking without embeddings (BM25 + rerank)

Embedding models cost money; for some apps a classic BM25 keyword search is plenty for stage one, with the cross-encoder rerank as stage two. Postgres tsvector, Elasticsearch, and OpenSearch all handle the keyword stage well.

const candidates = await db.query`
  SELECT text FROM docs
  WHERE to_tsvector('english', text) @@ websearch_to_tsquery(${userQuestion})
  ORDER BY ts_rank_cd(to_tsvector('english', text), websearch_to_tsquery(${userQuestion})) DESC
  LIMIT 50
`

// Rerank top 50 → top 5
const { results } = await rerank({
  model: "cohere/rerank-english-v3.0",
  query: userQuestion,
  documents: candidates.map((r) => r.text),
  topK: 5,
})

Hybrid retrieval (BM25 + embeddings) typically gives the best stage-one recall; see your vector DB's docs for the merge strategy, or the sketch below.
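One widely used merge strategy is reciprocal rank fusion (RRF); a minimal sketch over two ranked lists of document texts (k = 60 is the conventional RRF constant, not anything this API prescribes):

function rrfMerge(bm25Docs: string[], annDocs: string[], k = 60): string[] {
  const scores = new Map<string, number>()
  for (const list of [bm25Docs, annDocs]) {
    // earlier ranks contribute more: 1 / (k + rank + 1)
    list.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc)
}

Feed the merged list into rerank() as before for the stage-two cut.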

Caveats

  • Don't rerank thousands of candidates. Latency is roughly linear in candidate count. Stay under ~100 candidates per call; if you need more, batch (first sketch after this list).
  • Cache aggressively. Rerank is deterministic: the same query and documents always produce the same scores. Hash and cache (second sketch after this list).
  • Rerank only when ranking matters. If the LLM only needs some relevant context (not the most relevant), embeddings alone are often enough.
  • Languages must match. Use rerank-multilingual-v3.0 if any of your docs aren't English. Otherwise scores are unreliable.
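A batching sketch for the first caveat. Because scores are calibrated probabilities, results from separate calls can be merged on relevanceScore directly; the batch size of 100 mirrors the guideline above, not a documented hard limit:

import { rerank } from "ai"

async function rerankInBatches(query: string, documents: string[], topK: number, batchSize = 100) {
  const batches: string[][] = []
  for (let i = 0; i < documents.length; i += batchSize) {
    batches.push(documents.slice(i, i + batchSize))
  }
  // score each batch independently, then merge on the calibrated score
  const perBatch = await Promise.all(
    batches.map((docs) =>
      rerank({
        model: "cohere/rerank-english-v3.0",
        query,
        documents: docs,
        topK: docs.length, // keep everything; cut to topK after merging
      }).then((r) => r.results)
    )
  )
  return perBatch
    .flat()
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topK)
}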
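And a caching sketch for the second: hash the query, documents, and topK together and reuse the scores on a hit. An in-memory Map is shown; swap in Redis or similar for anything multi-process:

import { createHash } from "node:crypto"
import { rerank } from "ai"

type Scored = { document: string; relevanceScore: number }
const cache = new Map<string, Scored[]>()

async function cachedRerank(query: string, documents: string[], topK: number): Promise<Scored[]> {
  const key = createHash("sha256")
    .update(JSON.stringify({ query, documents, topK }))
    .digest("hex")
  const hit = cache.get(key)
  if (hit) return hit
  const { results } = await rerank({
    model: "cohere/rerank-english-v3.0",
    query,
    documents,
    topK,
  })
  cache.set(key, results)
  return results
}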