Video generation
Generate video from text, images, or video. Veo, Kling, Wan, Seedance, Grok Imagine.
Video generation works through `experimental_generateVideo` from the AI SDK; for the broader text and streaming surface, see the AI SDK reference. The function is async and can take minutes, so set generous timeouts and run it from a background job in production.
Quickstart
```ts
import { experimental_generateVideo as generateVideo } from "ai"
import fs from "node:fs"

const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  prompt: "A serene mountain landscape at sunset, clouds drifting by.",
  aspectRatio: "16:9",
  duration: 8,
})

fs.writeFileSync("output.mp4", result.videos[0].uint8Array)
```

A single 8-second clip typically takes 1–4 minutes. Don't await it inside an HTTP request; kick off a background job (Vercel Queue, Inngest, BullMQ) and notify the user when it's done.
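If you do call it inline (a script or a one-off CLI), bound the wait. The AI SDK's generate calls accept an `abortSignal`; assuming `experimental_generateVideo` does too, a minimal sketch:

```ts
// Sketch: cap the wait at 10 minutes. Assumes experimental_generateVideo
// accepts an abortSignal like the SDK's other generate calls.
const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A serene mountain landscape at sunset.",
  duration: 8,
  abortSignal: AbortSignal.timeout(10 * 60 * 1000), // built into Node 17.3+
})
```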
Modes
Video models support several input shapes. Each mode has slightly different parameters.
Text-to-video (T2V)
The simplest case — a prompt becomes a video.
```ts
generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A red panda eating bamboo, cinematic close-up.",
  aspectRatio: "16:9",
  duration: 8,
})
```

Image-to-video (I2V)
An image animates with motion described in the prompt.
```ts
generateVideo({
  model: "alibaba/wan-v2.6-i2v",
  prompt: {
    image: "https://your-host/cat.png", // or a buffer
    text: "The scene comes to life with gentle movement.",
  },
  duration: 5,
})
```

First-and-last frame
Two images bookend a video; the model interpolates motion between them.
```ts
generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: fs.readFileSync("start.png"),
    text: "Smooth transition between the two scenes.",
  },
  providerOptions: {
    klingai: {
      imageTail: fs.readFileSync("end.png"),
      mode: "pro",
    },
  },
})
```

Motion control (character animation)
A reference video provides motion; a character image is animated to perform those motions.
```ts
generateVideo({
  model: "klingai/kling-v2.6-motion-control",
  prompt: { image: fs.readFileSync("character.png") },
  providerOptions: {
    klingai: {
      videoUrl: "https://your-host/dance-reference.mp4",
      characterOrientation: "video",
      mode: "std",
    },
  },
})
```

Reference-to-video (R2V)
Multiple reference images provide character / style; the model generates a new scene.
```ts
generateVideo({
  model: "alibaba/wan-v2.6-r2v",
  prompt: "character1 and character2 have a friendly conversation in a cozy cafe",
  resolution: "1920x1080",
  duration: 4,
  providerOptions: {
    alibaba: {
      referenceUrls: [
        "https://your-host/cat.png",
        "https://your-host/dog.png",
      ],
      shotType: "single",
    },
  },
})
```

URL inputs vs buffer inputs
Some models accept buffers directly; others require URLs. The model detail page tells you which. For URL-only models, the workflow is:
- Upload your input image or video to a public-readable bucket (Vercel Blob, S3, R2).
- Pass the URL into `prompt.image` or `providerOptions.<provider>.<field>`.
Example with Vercel Blob:
```ts
import { put } from "@vercel/blob"
import fs from "node:fs"

const { url } = await put("input.png", fs.readFileSync("input.png"), {
  access: "public",
  contentType: "image/png",
})

const result = await generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: url,
    text: "The scene gently animates.",
  },
})
```

Models
Filter the catalog by the Video modality on /models. Common picks:
| Model | Mode | Notes |
|---|---|---|
| `google/veo-3.1-generate-001` | T2V | Google's flagship; best overall realism |
| `klingai/kling-v3.0-t2v` | T2V | Cinematic; strong character work |
| `klingai/kling-v3.0-i2v` | I2V | Animate any still |
| `klingai/kling-v2.6-motion-control` | Motion control | Transfer dance / movement onto a character |
| `bytedance/seedance-2.0` | T2V | High-quality dance / motion specialist |
| `bytedance/seedance-2.0-fast` | T2V | Cheaper, faster Seedance |
| `alibaba/wan-v2.6-i2v` | I2V | Open weights; strong realism |
| `alibaba/wan-v2.6-r2v` | R2V | Reference-driven scene generation |
| `xai/grok-imagine-video` | T2V | xAI's video model |
Common parameters
```ts
generateVideo({
  model: "...",
  prompt: "...",
  duration: 8, // seconds; typically 4, 5, or 8
  aspectRatio: "16:9", // "16:9" | "9:16" | "1:1" | model-specific
  resolution: "1920x1080", // alternative to aspectRatio; varies per model
  fps: 24, // some models accept this
  seed: 42, // some models accept this for reproducibility
})
```

Per-model knobs go in `providerOptions.<provider>.<field>`.
Polling vs streaming
The AI SDK's `experimental_generateVideo` resolves when the video is ready (it polls internally). For very long-running jobs, run it in a worker:
```ts
// background-job.ts
import { Queue } from "bullmq"

const queue = new Queue("video-generation")
await queue.add("generate", { prompt, model, userId })
```

```ts
// worker.ts
import { Worker } from "bullmq"
import { experimental_generateVideo as generateVideo } from "ai"

new Worker("video-generation", async (job) => {
  const { prompt, model, userId } = job.data
  const result = await generateVideo({ model, prompt, duration: 8 })
  // uploadAndNotify: your own helper that stores the file and pings the user
  await uploadAndNotify(userId, result.videos[0].uint8Array)
})
```

If you need finer control, drop down to the OpenResponses-style HTTP API and poll the job ID yourself.
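As a rough illustration of that pattern, the sketch below submits a job and polls its status. The `/videos` paths, request body, and `status`/`url` fields are hypothetical placeholders, not the gateway's documented API; check the API reference for the real shapes.

```ts
// Hypothetical submit-then-poll sketch. Endpoint paths and response
// fields are placeholders; consult the actual API reference.
const base = "https://synapse.garden/api/v1"
const headers = {
  Authorization: `Bearer ${process.env.MG_KEY}`,
  "Content-Type": "application/json",
}

// 1. Submit the generation job.
const submitted = await fetch(`${base}/videos`, {
  method: "POST",
  headers,
  body: JSON.stringify({
    model: "google/veo-3.1-generate-001",
    prompt: "A serene mountain landscape at sunset.",
    duration: 8,
  }),
}).then((r) => r.json())

// 2. Poll the job ID every 10 seconds until it settles.
let job = submitted
while (job.status === "queued" || job.status === "processing") {
  await new Promise((resolve) => setTimeout(resolve, 10_000))
  job = await fetch(`${base}/videos/${submitted.id}`, { headers }).then((r) => r.json())
}

if (job.status !== "completed") throw new Error(`Generation failed: ${job.status}`)
// job.url (placeholder field) would point at the finished MP4.
```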
Pricing
Video generation is billed per second of output. Rough ranges:
| Tier | $/second |
|---|---|
| Cheap (Wan, Seedance fast) | $0.10–$0.30 |
| Mid (Veo 2, Kling Standard) | $0.30–$0.50 |
| Premium (Veo 3, Kling Pro, Seedance Pro) | $0.50–$1.50 |
A typical 8-second clip is $1–$12. Live rates on /models.
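Because billing is linear in output seconds, estimating a clip's cost is just duration times the per-second rate; an 8-second clip at $0.40/s is $3.20. A trivial helper using illustrative midpoints of the ranges above (not live prices):

```ts
// Ballpark cost estimator. Rates are rough tier midpoints from the
// table above; check /models for live per-model pricing.
const tierRates = { cheap: 0.2, mid: 0.4, premium: 1.0 } as const

function estimateCostUSD(durationSeconds: number, tier: keyof typeof tierRates): number {
  return durationSeconds * tierRates[tier]
}

estimateCostUSD(8, "premium") // 8.0: an 8s premium clip, roughly $8
```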
Video models are silent by default. A few (Veo 3.1 with audio, KlingAI Pro audio variants) generate synchronized soundtracks; look for the audio capability flag on the model detail page.
Caveats
- Latency is unpredictable. The same prompt can take 60 seconds or 4 minutes depending on queue depth at the upstream provider. Always run generation async.
- Content policies are stricter than text. Faces of public figures, copyrighted characters, and explicit content are uniformly refused.
- Watermarks — Veo and Imagen embed C2PA provenance metadata. Most other models add a small visible watermark unless you're on a provider's higher tier.
- Audio quality — for models that generate sound, expect "good enough for prototype" not "broadcast-ready." Bring your own audio post-production for final cuts.
- Reference video length — motion control inputs typically cap at 10–15 seconds. Longer references are truncated, so check and trim locally before uploading (see the sketch below).
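A quick way to check a reference clip's length before uploading, assuming `ffprobe` (bundled with FFmpeg) is installed and on your PATH; the 15-second cap here is the upper end of the typical range above:

```ts
import { execFileSync } from "node:child_process"

// Returns the duration of a local video file in seconds.
// Requires ffprobe (ships with FFmpeg) on the PATH.
function videoDurationSeconds(path: string): number {
  const out = execFileSync("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "csv=p=0",
    path,
  ])
  return parseFloat(out.toString())
}

if (videoDurationSeconds("dance-reference.mp4") > 15) {
  throw new Error("Reference video too long; trim to 15s or less before uploading.")
}
```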