Video generation

Generate video from text, images, or video. Veo, Kling, Wan, Seedance, Grok Imagine.

FIG. 00 · VIDEO GENERATION · TIMELINE OF FRAMES

Video generation works through `experimental_generateVideo` from the AI SDK; for the broader text and streaming surface, see the AI SDK reference. The function is async and can take minutes, so set generous timeouts and, in production, run it from a background job.

FIG. 01 · ASYNC JOB SCHEMATIC
Your call to `experimental_generateVideo` triggers an upstream job, which the SDK polls until the bytes are ready. Plan for 1–4 minutes per clip; don't await it inside an HTTP request. Kick it off from a queue and notify on completion.
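
If you do call it inline (scripts, one-off jobs), bound the wait. A minimal sketch, assuming `experimental_generateVideo` accepts the AI SDK's standard `abortSignal` option (verify against the SDK reference):

import { experimental_generateVideo as generateVideo } from "ai"

// Abort the call if the upstream job takes longer than 10 minutes.
const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A serene mountain landscape at sunset.",
  duration: 8,
  abortSignal: AbortSignal.timeout(10 * 60 * 1000),
})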

Quickstart

import { experimental_generateVideo as generateVideo } from "ai"
import fs from "node:fs"

const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  prompt: "A serene mountain landscape at sunset, clouds drifting by.",
  aspectRatio: "16:9",
  duration: 8,
})

fs.writeFileSync("output.mp4", result.videos[0].uint8Array)

Video generation is slow

A single 8-second clip typically takes 1–4 minutes. Don't await it inside an HTTP request; kick off a background job (Vercel Queue, Inngest, BullMQ) and notify the user when it's done (see the worker example under Polling vs streaming below).

Modes

Video models support several input shapes. Each mode has slightly different parameters.

Text-to-video (T2V)

The simplest case — a prompt becomes a video.

generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A red panda eating bamboo, cinematic close-up.",
  aspectRatio: "16:9",
  duration: 8,
})

Image-to-video (I2V)

An image animates with motion described in the prompt.

generateVideo({
  model: "alibaba/wan-v2.6-i2v",
  prompt: {
    image: "https://your-host/cat.png", // or a buffer
    text: "The scene comes to life with gentle movement.",
  },
  duration: 5,
})

First-and-last frame

Two images bookend a video; the model interpolates motion between them.

generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: fs.readFileSync("start.png"),
    text: "Smooth transition between the two scenes.",
  },
  providerOptions: {
    klingai: {
      imageTail: fs.readFileSync("end.png"),
      mode: "pro",
    },
  },
})

Motion control (character animation)

A reference video provides motion; a character image is animated to perform those motions.

generateVideo({
  model: "klingai/kling-v2.6-motion-control",
  prompt: { image: fs.readFileSync("character.png") },
  providerOptions: {
    klingai: {
      videoUrl: "https://your-host/dance-reference.mp4",
      characterOrientation: "video",
      mode: "std",
    },
  },
})

Reference-to-video (R2V)

Multiple reference images provide character / style; the model generates a new scene.

generateVideo({
  model: "alibaba/wan-v2.6-r2v",
  prompt: "character1 and character2 have a friendly conversation in a cozy cafe",
  resolution: "1920x1080",
  duration: 4,
  providerOptions: {
    alibaba: {
      referenceUrls: [
        "https://your-host/cat.png",
        "https://your-host/dog.png",
      ],
      shotType: "single",
    },
  },
})

URL inputs vs buffer inputs

Some models accept buffers directly; others require URLs. The model detail page tells you which. For URL-only models, the workflow is:

  1. Upload your input image or video to a publicly readable bucket (Vercel Blob, S3, R2).
  2. Pass the URL into prompt.image or providerOptions.<provider>.<field>.

Example with Vercel Blob:

import { put } from "@vercel/blob"
import { experimental_generateVideo as generateVideo } from "ai"
import fs from "node:fs"

const { url } = await put("input.png", fs.readFileSync("input.png"), {
  access: "public",
  contentType: "image/png",
})

const result = await generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: url,
    text: "The scene gently animates.",
  },
})

Models

Filter the catalog by the Video modality on /models. Common picks:

| Model | Mode | Notes |
| --- | --- | --- |
| google/veo-3.1-generate-001 | T2V | Google's flagship; best overall realism |
| klingai/kling-v3.0-t2v | T2V | Cinematic; strong character work |
| klingai/kling-v3.0-i2v | I2V | Animate any still |
| klingai/kling-v2.6-motion-control | Motion control | Transfer dance / movement onto a character |
| bytedance/seedance-2.0 | T2V | High-quality dance / motion specialist |
| bytedance/seedance-2.0-fast | T2V | Cheaper, faster Seedance |
| alibaba/wan-v2.6-i2v | I2V | Open weights; strong realism |
| alibaba/wan-v2.6-r2v | R2V | Reference-driven scene generation |
| xai/grok-imagine-video | T2V | xAI's video model |

Common parameters

generateVideo({
  model: "...",
  prompt: "...",
  duration: 8,           // seconds — typically 4, 5, or 8
  aspectRatio: "16:9",   // "16:9" | "9:16" | "1:1" | model-specific
  resolution: "1920x1080", // alternative; varies per model
  fps: 24,               // some models accept this
  seed: 42,              // some models accept this for reproducibility
})

Per-model knobs go in providerOptions.<provider>.<field>.

Polling vs streaming

The AI SDK's experimental_generateVideo resolves when the video is ready (polling internally). For very long-running jobs, run it in a worker:

// background-job.ts
import { Queue } from "bullmq"
const queue = new Queue("video-generation")

await queue.add("generate", { prompt, model, userId })

// worker.ts
import { Worker } from "bullmq"
import { experimental_generateVideo as generateVideo } from "ai"

new Worker("video-generation", async (job) => {
  const { prompt, model, userId } = job.data
  const result = await generateVideo({ model, prompt, duration: 8 })
  await uploadAndNotify(userId, result.videos[0].uint8Array) // your own storage + notification helper
})

If you need finer control, drop down to the OpenResponses-style HTTP API and poll the job ID yourself.
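
A minimal polling loop for that path follows. The endpoint paths (`POST /videos`, `GET /videos/{id}`), status values, and response fields here are hypothetical placeholders; take the real shapes from the HTTP API reference:

const BASE = "https://synapse.garden/api/v1"
const headers = { Authorization: `Bearer ${process.env.MG_KEY}` }

// 1. Create the job (hypothetical endpoint).
const { id } = await fetch(`${BASE}/videos`, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "google/veo-3.1-generate-001",
    prompt: "A serene mountain landscape at sunset.",
    duration: 8,
  }),
}).then((r) => r.json())

// 2. Poll by job ID until it settles (hypothetical status values).
let job: { status: string; url?: string } = { status: "pending" }
while (job.status === "pending" || job.status === "processing") {
  await new Promise((r) => setTimeout(r, 10_000)) // 10 s between polls
  job = await fetch(`${BASE}/videos/${id}`, { headers }).then((r) => r.json())
}

// 3. Download the finished clip.
if (job.status === "completed" && job.url) {
  const bytes = new Uint8Array(await fetch(job.url).then((r) => r.arrayBuffer()))
  // ...store the bytes and notify the user
}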

Pricing

Video generation is billed per second of output. Rough ranges:

| Tier | $/second |
| --- | --- |
| Cheap (Wan, Seedance fast) | $0.10–$0.30 |
| Mid (Veo 2, Kling Standard) | $0.30–$0.50 |
| Premium (Veo 3, Kling Pro, Seedance Pro) | $0.50–$1.50 |

A typical 8-second clip therefore runs from about $0.80 (8 × $0.10) to $12 (8 × $1.50). Live rates on /models.
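
Since billing is per second of output, a back-of-envelope estimate is just rate × duration. The rates below are illustrative midpoints, not live pricing:

// Illustrative mid-range $/second figures per tier; see /models for live rates.
const ratePerSecond = { cheap: 0.2, mid: 0.4, premium: 1.0 }

function estimateCost(seconds: number, tier: keyof typeof ratePerSecond): number {
  return seconds * ratePerSecond[tier]
}

estimateCost(8, "premium") // 8.0 — about $8 for an 8-second premium clip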

Audio?

Video models are silent by default. A few (Veo 3.1 with audio, KlingAI Pro audio variants) generate synchronized soundtracks; look for the audio capability flag on the model detail page.
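
Where a model does support sound, the switch lives in providerOptions like any other per-model knob. A sketch, assuming a `generateAudio` flag under `providerOptions.google` (the flag name is an assumption; confirm it on the model detail page):

const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "Waves crashing on a rocky shore, seagulls calling.",
  duration: 8,
  providerOptions: {
    google: {
      generateAudio: true, // assumed flag name; confirm on the model detail page
    },
  },
})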

Caveats

  • Latency is unpredictable. The same prompt can take 60 seconds or 4 minutes depending on queue depth at the upstream provider. Always run it async.
  • Content policies are stricter than text. Faces of public figures, copyrighted characters, and explicit content are uniformly refused.
  • Watermarks — Veo and Imagen embed C2PA provenance metadata. Most other models add a small visible watermark unless you're on a provider's higher tier.
  • Audio quality — for models that generate sound, expect "good enough for prototype" not "broadcast-ready." Bring your own audio post-production for final cuts.
  • Reference video length — motion control inputs typically cap at 10–15 seconds. Longer references are truncated.