Video generation

Generate video from text, images, or video. Veo, Kling, Wan, Seedance, Grok Imagine.

FIG. 00 · VIDEO GENERATION · TIMELINE OF FRAMES

Video generation works through `experimental_generateVideo` from the AI SDK; for the broader text and streaming surface, see the AI SDK reference. The function is async and can take minutes, so set generous timeouts and, in production, run it from a background job.

FIG. 01 · ASYNC JOB SCHEMATIC
Your call to `experimental_generateVideo` triggers an upstream job, which the SDK polls until the bytes are ready. Plan for 1–4 minutes per clip; don't await it inside an HTTP request. Kick it off from a queue and notify on completion.
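
If you do call it inline (scripts, one-off jobs), bound the wait. A minimal sketch, assuming `experimental_generateVideo` accepts the AI SDK's standard `abortSignal` option (verify against the SDK reference):

import { experimental_generateVideo as generateVideo } from "ai"

// Abort the call if the upstream job takes longer than 10 minutes.
const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A serene mountain landscape at sunset.",
  duration: 8,
  abortSignal: AbortSignal.timeout(10 * 60 * 1000),
})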

Quickstart

import { experimental_generateVideo as generateVideo } from "ai"
import fs from "node:fs"

const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  baseURL: "https://synapse.garden/api/v1",
  apiKey: process.env.MG_KEY,
  prompt: "A serene mountain landscape at sunset, clouds drifting by.",
  aspectRatio: "16:9",
  duration: 8,
})

fs.writeFileSync("output.mp4", result.videos[0].uint8Array)

Video generation is slow

A single 8-second clip typically takes 1–4 minutes. Don't await it inside an HTTP request; kick off a background job (Vercel Queue, Inngest, BullMQ) and notify the user when it's done (see the worker example under Polling vs streaming below).

Modes

Video models support several input shapes. Each mode has slightly different parameters.

Text-to-video (T2V)

The simplest case — a prompt becomes a video.

generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "A red panda eating bamboo, cinematic close-up.",
  aspectRatio: "16:9",
  duration: 8,
})

Image-to-video (I2V)

An image animates with motion described in the prompt.

generateVideo({
  model: "alibaba/wan-v2.6-i2v",
  prompt: {
    image: "https://your-host/cat.png", // or a buffer
    text: "The scene comes to life with gentle movement.",
  },
  duration: 5,
})

First-and-last frame

Two images bookend a video; the model interpolates motion between them.

generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: fs.readFileSync("start.png"),
    text: "Smooth transition between the two scenes.",
  },
  providerOptions: {
    klingai: {
      imageTail: fs.readFileSync("end.png"),
      mode: "pro",
    },
  },
})

Motion control (character animation)

A reference video provides motion; a character image is animated to perform those motions.

generateVideo({
  model: "klingai/kling-v2.6-motion-control",
  prompt: { image: fs.readFileSync("character.png") },
  providerOptions: {
    klingai: {
      videoUrl: "https://your-host/dance-reference.mp4",
      characterOrientation: "video",
      mode: "std",
    },
  },
})

Reference-to-video (R2V)

Multiple reference images provide character / style; the model generates a new scene.

generateVideo({
  model: "alibaba/wan-v2.6-r2v",
  prompt: "character1 and character2 have a friendly conversation in a cozy cafe",
  resolution: "1920x1080",
  duration: 4,
  providerOptions: {
    alibaba: {
      referenceUrls: [
        "https://your-host/cat.png",
        "https://your-host/dog.png",
      ],
      shotType: "single",
    },
  },
})

URL inputs vs buffer inputs

Some models accept buffers directly; others require URLs. The model detail page tells you which. For URL-only models, the workflow is:

  1. Upload your input image or video to a publicly readable bucket (Vercel Blob, S3, R2).
  2. Pass the URL into prompt.image or providerOptions.<provider>.<field>.

Example with Vercel Blob:

import { put } from "@vercel/blob"
import { experimental_generateVideo as generateVideo } from "ai"
import fs from "node:fs"

const { url } = await put("input.png", fs.readFileSync("input.png"), {
  access: "public",
  contentType: "image/png",
})

const result = await generateVideo({
  model: "klingai/kling-v2.6-i2v",
  prompt: {
    image: url,
    text: "The scene gently animates.",
  },
})

Models

Filter the catalog by the Video modality on /models. Common picks:

| Model | Mode | Notes |
| --- | --- | --- |
| google/veo-3.1-generate-001 | T2V | Google's flagship; best overall realism |
| klingai/kling-v3.0-t2v | T2V | Cinematic; strong character work |
| klingai/kling-v3.0-i2v | I2V | Animate any still |
| klingai/kling-v2.6-motion-control | Motion control | Transfer dance / movement onto a character |
| bytedance/seedance-2.0 | T2V | High-quality dance / motion specialist |
| bytedance/seedance-2.0-fast | T2V | Cheaper, faster Seedance |
| alibaba/wan-v2.6-i2v | I2V | Open weights; strong realism |
| alibaba/wan-v2.6-r2v | R2V | Reference-driven scene generation |
| xai/grok-imagine-video | T2V | xAI's video model |

Common parameters

generateVideo({
  model: "...",
  prompt: "...",
  duration: 8,           // seconds — typically 4, 5, or 8
  aspectRatio: "16:9",   // "16:9" | "9:16" | "1:1" | model-specific
  resolution: "1920x1080", // alternative; varies per model
  fps: 24,               // some models accept this
  seed: 42,              // some models accept this for reproducibility
})

Per-model knobs go in providerOptions.<provider>.<field>.

Polling vs streaming

The AI SDK's experimental_generateVideo resolves when the video is ready (polling internally). For very long-running jobs, run it in a worker:

// background-job.ts
import { Queue } from "bullmq"
const queue = new Queue("video-generation")

await queue.add("generate", { prompt, model, userId })

// worker.ts
import { Worker } from "bullmq"
import { experimental_generateVideo as generateVideo } from "ai"

new Worker("video-generation", async (job) => {
  const { prompt, model, userId } = job.data
  const result = await generateVideo({ model, prompt, duration: 8 })
  await uploadAndNotify(userId, result.videos[0].uint8Array) // your own storage + notification helper
})

If you need finer control, drop down to the OpenResponses-style HTTP API and poll the job ID yourself.
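
A minimal polling loop for that path follows. The endpoint paths (`POST /videos`, `GET /videos/{id}`), status values, and response fields here are hypothetical placeholders; take the real shapes from the HTTP API reference:

const BASE = "https://synapse.garden/api/v1"
const headers = { Authorization: `Bearer ${process.env.MG_KEY}` }

// 1. Create the job (hypothetical endpoint).
const { id } = await fetch(`${BASE}/videos`, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "google/veo-3.1-generate-001",
    prompt: "A serene mountain landscape at sunset.",
    duration: 8,
  }),
}).then((r) => r.json())

// 2. Poll by job ID until it settles (hypothetical status values).
let job: { status: string; url?: string } = { status: "pending" }
while (job.status === "pending" || job.status === "processing") {
  await new Promise((r) => setTimeout(r, 10_000)) // 10 s between polls
  job = await fetch(`${BASE}/videos/${id}`, { headers }).then((r) => r.json())
}

// 3. Download the finished clip.
if (job.status === "completed" && job.url) {
  const bytes = new Uint8Array(await fetch(job.url).then((r) => r.arrayBuffer()))
  // ...store the bytes and notify the user
}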

Pricing

Video generation is billed per second of output. Rough ranges:

| Tier | $/second |
| --- | --- |
| Cheap (Wan, Seedance fast) | $0.10–$0.30 |
| Mid (Veo 2, Kling Standard) | $0.30–$0.50 |
| Premium (Veo 3, Kling Pro, Seedance Pro) | $0.50–$1.50 |

A typical 8-second clip therefore runs from about $0.80 (8 × $0.10) to $12 (8 × $1.50). Live rates on /models.
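
Since billing is per second of output, a back-of-envelope estimate is just rate × duration. The rates below are illustrative midpoints, not live pricing:

// Illustrative mid-range $/second figures per tier; see /models for live rates.
const ratePerSecond = { cheap: 0.2, mid: 0.4, premium: 1.0 }

function estimateCost(seconds: number, tier: keyof typeof ratePerSecond): number {
  return seconds * ratePerSecond[tier]
}

estimateCost(8, "premium") // 8.0 — about $8 for an 8-second premium clip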

Audio?

Video models are silent by default. A few (Veo 3.1 with audio, KlingAI Pro audio variants) generate synchronized soundtracks; look for the audio capability flag on the model detail page.
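
Where a model does support sound, the switch lives in providerOptions like any other per-model knob. A sketch, assuming a `generateAudio` flag under `providerOptions.google` (the flag name is an assumption; confirm it on the model detail page):

const result = await generateVideo({
  model: "google/veo-3.1-generate-001",
  prompt: "Waves crashing on a rocky shore, seagulls calling.",
  duration: 8,
  providerOptions: {
    google: {
      generateAudio: true, // assumed flag name; confirm on the model detail page
    },
  },
})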

Caveats

  • Latency is unpredictable. The same prompt can take 60 seconds or 4 minutes depending on queue depth at the upstream provider. Always run it async.
  • Content policies are stricter than text. Faces of public figures, copyrighted characters, and explicit content are uniformly refused.
  • Watermarks — Veo and Imagen embed C2PA provenance metadata. Most other models add a small visible watermark unless you're on a provider's higher tier.
  • Audio quality — for models that generate sound, expect "good enough for prototype" not "broadcast-ready." Bring your own audio post-production for final cuts.
  • Reference video length — motion control inputs typically cap at 10–15 seconds. Longer references are truncated.