OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Sixteen provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.
The video_generate tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure agents.defaults.videoGenerationModel.
The tool resolves one of three modes from the request:
- generate - text-to-video requests with no reference media.
- imageToVideo - request includes one or more reference images.
- videoToVideo - request includes one or more reference videos.
The resolved mode, along with available providers and models, can be inspected with action=list.
Quick start
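A minimal sketch of a text-to-video call, written as the JSON arguments the agent passes to video_generate. The prompt parameter name is an assumption for illustration; run video_generate action=list to see the exact schema and which providers are active.

```jsonc
{
  // No reference media, so the resolved mode is "generate".
  "action": "generate",
  "prompt": "A lobster conducting a tiny orchestra on a beach at sunset",
  "durationSeconds": 5,     // rounded to the nearest provider-supported value
  "aspectRatio": "16:9"
}
```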
How async generation works
Video generation is asynchronous. When the agent calls video_generate in a session:
- OpenClaw submits the request to the provider and immediately returns a task id.
- The provider processes the job in the background (typically 30 seconds to several minutes depending on the provider and resolution; slow queue-backed providers can run up to the configured timeout).
- When the video is ready, OpenClaw wakes the same session with an internal completion event.
- The agent tells the user and attaches the finished video. In group/channel chats that use message-tool-only visible delivery, the agent relays the result through the message tool instead of OpenClaw posting it directly.
While a task is in flight, video_generate calls in the same session return the current task status instead of starting another generation. Use openclaw tasks list or openclaw tasks show <taskId> to check progress from the CLI.
Outside of session-backed agent runs (for example, direct tool invocations),
the tool falls back to inline generation and returns the final media path
in the same turn.
Generated video files are saved under OpenClaw-managed media storage when
the provider returns bytes. The default generated-video save cap follows
the video media limit, and agents.defaults.mediaMaxMb raises it for
larger renders. When a provider also returns a hosted output URL, OpenClaw
can deliver that URL instead of failing the task if local persistence
rejects an oversized file.
Task lifecycle
| State | Meaning |
|---|---|
| queued | Task created, waiting for the provider to accept it. |
| running | Provider is processing (typically 30 seconds to several minutes depending on provider and resolution). |
| succeeded | Video ready; the agent wakes and posts it to the conversation. |
| failed | Provider error or timeout; the agent wakes with error details. |
If a task is already queued or running for the current session, video_generate returns the existing task status instead of starting a new one. Use action: "status" to check explicitly without triggering a new generation.
Supported providers
| Provider | Default model | Text | Image ref | Video ref | Auth |
|---|---|---|---|---|---|
| Alibaba | wan2.6-t2v | ✓ | Yes (remote URL) | Yes (remote URL) | MODELSTUDIO_API_KEY |
| BytePlus (1.0) | seedance-1-0-pro-250528 | ✓ | Up to 2 images (I2V models only; first + last frame) | - | BYTEPLUS_API_KEY |
| BytePlus Seedance 1.5 | seedance-1-5-pro-251215 | ✓ | Up to 2 images (first + last frame via role) | - | BYTEPLUS_API_KEY |
| BytePlus Seedance 2.0 | dreamina-seedance-2-0-260128 | ✓ | Up to 9 reference images | Up to 3 videos | BYTEPLUS_API_KEY |
| ComfyUI | workflow | ✓ | 1 image | - | COMFY_API_KEY or COMFY_CLOUD_API_KEY |
| DeepInfra | Pixverse/Pixverse-T2V | ✓ | - | - | DEEPINFRA_API_KEY |
| fal | fal-ai/minimax/video-01-live | ✓ | 1 image; up to 9 with Seedance reference-to-video | Up to 3 videos with Seedance reference-to-video | FAL_KEY |
| Google | veo-3.1-fast-generate-preview | ✓ | 1 image | 1 video | GEMINI_API_KEY |
| MiniMax | MiniMax-Hailuo-2.3 | ✓ | 1 image | - | MINIMAX_API_KEY or MiniMax OAuth |
| OpenAI | sora-2 | ✓ | 1 image | 1 video | OPENAI_API_KEY |
| OpenRouter | google/veo-3.1-fast | ✓ | Up to 4 images (first/last frame or references) | - | OPENROUTER_API_KEY |
| Qwen | wan2.6-t2v | ✓ | Yes (remote URL) | Yes (remote URL) | QWEN_API_KEY |
| Runway | gen4.5 | ✓ | 1 image | 1 video | RUNWAYML_API_SECRET |
| Together | Wan-AI/Wan2.2-T2V-A14B | ✓ | 1 image | - | TOGETHER_API_KEY |
| Vydra | veo3 | ✓ | 1 image (kling) | - | VYDRA_API_KEY |
| xAI | grok-imagine-video | ✓ | 1 first-frame image or up to 7 reference_images | 1 video | XAI_API_KEY |
Run video_generate action=list to inspect available providers, models, and modes at runtime.
Capability matrix
The explicit mode contract used by video_generate, contract tests, and the shared live sweep:
| Provider | generate | imageToVideo | videoToVideo | Shared live lanes today |
|---|---|---|---|---|
| Alibaba | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| BytePlus | ✓ | ✓ | - | generate, imageToVideo |
| ComfyUI | ✓ | ✓ | - | Not in the shared sweep; workflow-specific coverage lives with Comfy tests |
| DeepInfra | ✓ | - | - | generate; native DeepInfra video schemas are text-to-video in the bundled contract |
| fal | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo only when using Seedance reference-to-video |
| Google | ✓ | ✓ | ✓ | generate, imageToVideo; shared videoToVideo skipped because the current buffer-backed Gemini/Veo sweep does not accept that input |
| MiniMax | ✓ | ✓ | - | generate, imageToVideo |
| OpenAI | ✓ | ✓ | ✓ | generate, imageToVideo; shared videoToVideo skipped because this org/input path currently needs provider-side inpaint/remix access |
| OpenRouter | ✓ | ✓ | - | generate, imageToVideo |
| Qwen | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider needs remote http(s) video URLs |
| Runway | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo runs only when the selected model is runway/gen4_aleph |
| Together | ✓ | ✓ | - | generate, imageToVideo |
| Vydra | ✓ | ✓ | - | generate; shared imageToVideo skipped because bundled veo3 is text-only and bundled kling requires a remote image URL |
| xAI | ✓ | ✓ | ✓ | generate, imageToVideo; videoToVideo skipped because this provider currently needs a remote MP4 URL |
Tool parameters
Required
Text description of the video to generate. Required for action: "generate".
Content inputs
- Single reference image (path or URL).
- Multiple reference images (up to 9).
- Optional per-position role hints parallel to the combined image list. Canonical values: first_frame, last_frame, reference_image.
- Single reference video (path or URL).
- Multiple reference videos (up to 4).
- Optional per-position role hints parallel to the combined video list. Canonical value: reference_video.
- Single reference audio (path or URL). Used for background music or voice reference when the provider supports audio inputs.
- Multiple reference audios (up to 3).
- Optional per-position role hints parallel to the combined audio list. Canonical value: reference_audio.
Role hints are forwarded to the provider as-is. Canonical values come from the VideoGenerationAssetRole union, but providers may accept additional role strings. Roles arrays must not have more entries than the corresponding reference list; off-by-one mistakes fail with a clear error. Use an empty string to leave a slot unset. For xAI, set every image role to reference_image to use its reference_images generation mode; omit the role or use first_frame for single-image image-to-video.
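For example, a sketch of driving xAI's reference_images mode, where every image role is set to reference_image. The array parameter names (imageRefs, imageRoles) and the xai/ model prefix are assumptions for illustration; the roles mechanics (parallel arrays, empty string to leave a slot unset) follow the contract above.

```jsonc
{
  "action": "generate",
  "model": "xai/grok-imagine-video",   // provider/model override, as with runway/gen4.5
  "prompt": "The three characters meet in a rainy alley",
  "imageRefs": ["hero.png", "villain.png", "sidekick.png"],
  // Parallel to imageRefs; setting every slot to reference_image
  // selects xAI's reference_images generation mode.
  "imageRoles": ["reference_image", "reference_image", "reference_image"]
}
```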
Style controls
- Aspect-ratio hint such as 1:1, 16:9, 9:16, adaptive, or a provider-specific value. OpenClaw normalizes or ignores unsupported values per provider.
- Resolution hint such as 480P, 720P, 768P, 1080P, 4K, or a provider-specific value. OpenClaw normalizes or ignores unsupported values per provider.
- Target duration in seconds (rounded to the nearest provider-supported value).
- Size hint when the provider supports it.
- Enable generated audio in the output when supported. Distinct from audioRef* (inputs).
- Toggle provider watermarking when supported.
adaptive is a provider-specific sentinel: it is forwarded as-is to providers that declare adaptive in their capabilities (e.g. BytePlus Seedance uses it to auto-detect the ratio from the input image dimensions). Providers that do not declare it surface the value via details.ignoredOverrides in the tool result so the drop is visible.
Advanced
"status" returns the current session task; "list" inspects providers.Provider/model override (e.g.
runway/gen4.5).Output filename hint.
Optional provider operation timeout in milliseconds. When omitted, OpenClaw uses
agents.defaults.videoGenerationModel.timeoutMs if configured.Provider-specific options as a JSON object (e.g.
{"seed": 42, "draft": true}).
Providers that declare a typed schema validate the keys and types; unknown
keys or mismatches skip the candidate during fallback. Providers without a
declared schema receive the options as-is. Run video_generate action=list
to see what each provider accepts.Not all providers support all parameters. OpenClaw normalizes duration to
the closest provider-supported value, and remaps translated geometry hints
such as size-to-aspect-ratio when a fallback provider exposes a different
control surface. Truly unsupported overrides are ignored on a best-effort
basis and reported as warnings in the tool result. Hard capability limits
(such as too many reference inputs) fail before submission. Tool results
report applied settings;
details.normalization captures any
requested-to-applied translation.- No reference media →
generate - Any image reference →
imageToVideo - Any video reference →
videoToVideo - Reference audio inputs do not change the resolved mode; they apply on
top of whatever mode the image/video references select, and only work
with providers that declare
maxInputAudios.
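A sketch of a request mixing style overrides with provider-specific options, under the same naming assumptions as the earlier examples. Overrides the active provider cannot honor are dropped with a warning and surfaced in details.ignoredOverrides rather than failing the task.

```jsonc
{
  "action": "generate",
  "prompt": "A slow dolly shot through a neon-lit arcade",
  "resolution": "1080P",    // normalized or ignored per provider
  "durationSeconds": 7,     // rounded to the closest supported duration
  "providerOptions": { "seed": 42, "draft": true }  // validated if the provider declares a typed schema
}
```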
Fallback and typed options
Some capability checks are applied at the fallback layer rather than the tool boundary, so a request that exceeds the primary provider's limits can still run on a capable fallback:
- An active candidate that declares no maxInputAudios (or 0) is skipped when the request contains audio references; the next candidate is tried.
- An active candidate whose maxDurationSeconds is below the requested durationSeconds, with no declared supportedDurationSeconds list, is skipped.
- If the request contains providerOptions and the active candidate explicitly declares a typed providerOptions schema, the candidate is skipped when supplied keys are not in the schema or value types do not match. Providers without a declared schema receive options as-is (backward-compatible pass-through). A provider can opt out of all provider options by declaring an empty schema (capabilities.providerOptions: {}), which causes the same skip as a type mismatch.
The first skip logs at warn so operators see when their primary provider was passed over; subsequent skips log at debug to keep long fallback chains quiet. If every candidate is skipped, the aggregated error includes the skip reason for each.
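A sketch of the two schema postures a provider can take; the key-to-type map shape is an assumption, since this page does not show the exact contract format.

```jsonc
// Typed schema: only "seed" (number) and "draft" (boolean) pass validation;
// any other key, or a wrong value type, skips this candidate during fallback.
{ "capabilities": { "providerOptions": { "seed": "number", "draft": "boolean" } } }

// Empty schema: opts out of all provider options entirely.
{ "capabilities": { "providerOptions": {} } }
```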
Actions
| Action | What it does |
|---|---|
| generate | Default. Create a video from the given prompt and optional reference inputs. |
| status | Check the state of the in-flight video task for the current session without starting another generation. |
| list | Show available providers, models, and their capabilities. |
Model selection
OpenClaw resolves the model in this order:
1. model tool parameter - if the agent specifies one in the call.
2. videoGenerationModel.primary from config.
3. videoGenerationModel.fallbacks in order.
4. Auto-detection - providers that have valid auth, starting with the current default provider, then remaining providers in alphabetical order.
Set agents.defaults.mediaGenerationAutoProviderFallback: false to use only the explicit model, primary, and fallbacks entries.
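A hedged config sketch that pins the resolution order and disables auto-detection. The nesting follows the agents.defaults.* key names used on this page; the fallback model ids are illustrative.

```jsonc
{
  "agents": {
    "defaults": {
      "videoGenerationModel": {
        "primary": "runway/gen4.5",
        "fallbacks": ["minimax/MiniMax-Hailuo-2.3", "openai/sora-2"],
        "timeoutMs": 600000   // optional per-operation cap in milliseconds
      },
      // Skip step 4 (auto-detection); only the entries above are tried.
      "mediaGenerationAutoProviderFallback": false
    }
  }
}
```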
Provider notes
Alibaba
Uses the DashScope / Model Studio async endpoint. Reference images and videos must be remote http(s) URLs.
BytePlus (1.0)
Provider id: byteplus. Models: seedance-1-0-pro-250528 (default), seedance-1-0-pro-t2v-250528, seedance-1-0-pro-fast-251015, seedance-1-0-lite-t2v-250428, seedance-1-0-lite-i2v-250428.
T2V models (*-t2v-*) do not accept image inputs; I2V models and general *-pro-* models support a single reference image (first frame). Pass the image positionally or set role: "first_frame". T2V model IDs are automatically switched to the corresponding I2V variant when an image is provided.
Supported providerOptions keys: seed (number), draft (boolean - forces 480p), camera_fixed (boolean).
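A sketch of a draft-quality BytePlus render using the documented option keys; the model override format and prompt parameter name are assumptions carried over from the earlier examples.

```jsonc
{
  "action": "generate",
  "model": "byteplus/seedance-1-0-pro-250528",
  "prompt": "A paper boat drifting down a rain gutter",
  // All three keys come from the supported providerOptions list above.
  "providerOptions": { "seed": 7, "draft": true, "camera_fixed": true }
}
```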
BytePlus Seedance 1.5
Requires the @openclaw/byteplus-modelark plugin. Provider id: byteplus-seedance15. Model: seedance-1-5-pro-251215.
Uses the unified content[] API. Supports at most 2 input images (first_frame + last_frame). All inputs must be remote https:// URLs. Set role: "first_frame" / "last_frame" on each image, or pass images positionally.
aspectRatio: "adaptive" auto-detects the ratio from the input image. audio: true maps to generate_audio. providerOptions.seed (number) is forwarded.
BytePlus Seedance 2.0
Requires the @openclaw/byteplus-modelark plugin. Provider id: byteplus-seedance2. Models: dreamina-seedance-2-0-260128, dreamina-seedance-2-0-fast-260128.
Uses the unified content[] API. Supports up to 9 reference images, 3 reference videos, and 3 reference audios. All inputs must be remote https:// URLs. Set role on each asset - supported values: "first_frame", "last_frame", "reference_image", "reference_video", "reference_audio".
aspectRatio: "adaptive" auto-detects the ratio from the input image. audio: true maps to generate_audio. providerOptions.seed (number) is forwarded.
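A multi-reference Seedance 2.0 sketch under the same naming assumptions as the earlier examples; note that every asset is a remote https:// URL and carries an explicit role.

```jsonc
{
  "action": "generate",
  "model": "byteplus-seedance2/dreamina-seedance-2-0-260128",
  "prompt": "Open on the first frame, then cut between the reference looks",
  "imageRefs": [
    "https://example.com/start.png",     // first frame
    "https://example.com/style-a.png",   // style reference
    "https://example.com/style-b.png"
  ],
  "imageRoles": ["first_frame", "reference_image", "reference_image"],
  "audio": true   // maps to generate_audio
}
```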
ComfyUI
Workflow-driven local or cloud execution. Supports text-to-video and
image-to-video through the configured graph.
fal
Uses a queue-backed flow for long-running jobs. OpenClaw waits up to 20
minutes by default before treating an in-progress fal queue job as timed
out. Most fal video models
accept a single image reference. Seedance 2.0 reference-to-video
models accept up to 9 images, 3 videos, and 3 audio references, with
at most 12 total reference files.
Google (Gemini / Veo)
Supports one image or one video reference. Generated-audio requests are ignored with a warning on the Gemini API path because that API rejects the generateAudio parameter for current Veo video generation.
MiniMax
Single image reference only. MiniMax accepts 768P and 1080P resolutions; requests such as 720P are normalized to the closest supported value before submission.
OpenAI
Only the size override is forwarded. Other style overrides (aspectRatio, resolution, audio, watermark) are ignored with a warning.
OpenRouter
Uses OpenRouter's asynchronous /videos API. OpenClaw submits the job, polls polling_url, and downloads either unsigned_urls or the documented job content endpoint. The bundled google/veo-3.1-fast default advertises 4/6/8 second durations, 720P/1080P resolutions, and 16:9/9:16 aspect ratios.
Qwen
Same DashScope backend as Alibaba. Reference inputs must be remote http(s) URLs; local files are rejected upfront.
Runway
Supports local files via data URIs. Video-to-video requires runway/gen4_aleph. Text-only runs expose 16:9 and 9:16 aspect ratios.
Together
Single image reference only.
Vydra
Uses https://www.vydra.ai/api/v1 directly to avoid auth-dropping redirects. veo3 is bundled as text-to-video only; kling requires a remote image URL.
xAI
Supports text-to-video, single first-frame image-to-video, up to 7 reference_image inputs through xAI reference_images, and remote video edit/extend flows.
Provider capability modes
The shared video-generation contract supports mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks: maxInputImages and maxInputVideos alone are not enough to advertise transform-mode support. Providers should declare generate, imageToVideo, and videoToVideo explicitly so live tests, contract tests, and the shared video_generate tool can validate mode support deterministically.
When one model in a provider has wider reference-input support than the
rest, use maxInputImagesByModel, maxInputVideosByModel, or
maxInputAudiosByModel instead of raising the mode-wide limit.
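A sketch of explicit mode blocks with a per-model override, assuming a capabilities object keyed by mode. The exact contract shape is not reproduced on this page, and field names beyond those mentioned above (enabled, the mode keys, the ByModel maps) are assumptions.

```jsonc
{
  "capabilities": {
    "generate": { "enabled": true },
    "imageToVideo": { "enabled": true, "maxInputImages": 1 },
    "videoToVideo": { "enabled": false },
    // One hypothetical model accepts more references than the mode-wide limit.
    "maxInputImagesByModel": { "my-reference-to-video-model": 9 }
  }
}
```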
Live tests
Opt-in live coverage exists for the shared bundled providers. The live sweep loads ~/.profile, prefers live/env API keys ahead of stored auth profiles by default, and runs a release-safe smoke by default:
- generate for every non-FAL provider in the sweep.
- One-second lobster prompt.
- Per-provider operation cap from OPENCLAW_LIVE_VIDEO_GENERATION_TIMEOUT_MS (180000 by default).
Set OPENCLAW_LIVE_VIDEO_GENERATION_FULL_MODES=1 to also run declared transform modes the shared sweep can exercise safely with local media:
- imageToVideo when capabilities.imageToVideo.enabled.
- videoToVideo when capabilities.videoToVideo.enabled and the provider/model accepts buffer-backed local video input in the shared sweep.
The videoToVideo live lane covers runway only when you select runway/gen4_aleph.
Configuration
Set the default video-generation model in your OpenClaw config:
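A minimal sketch, assuming the agents.defaults.videoGenerationModel key described under Model selection; the exact config file location depends on your OpenClaw setup, and the model id is illustrative.

```jsonc
{
  "agents": {
    "defaults": {
      // provider/model id, in the same format as the examples above
      "videoGenerationModel": { "primary": "minimax/MiniMax-Hailuo-2.3" }
    }
  }
}
```
Related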
- Alibaba Model Studio
- Background tasks - task tracking for async video generation
- BytePlus
- ComfyUI
- Configuration reference
- fal
- Google (Gemini)
- MiniMax
- Models
- OpenAI
- Qwen
- Runway
- Together AI
- Tools overview
- Vydra
- xAI