Kling v3 Video Generation - ToAPIs Documentation

Asynchronous task API: submitting a request returns a task ID
Supports text-to-video, image-to-video, explicit first/last frame control, and video with audio
mode=std maps to 720P, mode=pro maps to 1080P
audio=true generates a video with audio and is billed as Sound
Text-to-video supports up to 15 seconds; image-to-video supports up to 10 seconds

Use publicly accessible image URLs. Do not pass base64 image data. Upload local images with the Upload Image API first.
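A quick client-side check can catch base64 data before a request is sent. The sketch below is only an illustration: the helper name `is_acceptable_image_url` is hypothetical, and it checks the URL shape only. It cannot confirm that the URL is actually reachable from the public internet.

```python
from urllib.parse import urlparse


def is_acceptable_image_url(value: str) -> bool:
    """Return True for http(s) URLs with a host.

    Rejects data: URIs and bare base64 strings. This is a shape check
    only; it does not verify that the image is publicly reachable.
    """
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

For example, `is_acceptable_image_url("https://example.com/reference.jpg")` passes, while a `data:image/png;base64,...` string does not.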

Authorization

string

required

All endpoints require Bearer Token authentication.

Authorization: Bearer YOUR_API_KEY
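As a minimal sketch, the header above could be assembled like this in Python. The helper name `build_headers` is hypothetical, and the `Content-Type` value is an assumption based on the JSON request bodies shown in the examples; this page does not state it.

```python
def build_headers(api_key: str) -> dict[str, str]:
    """Build request headers carrying the Bearer token."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "Authorization": f"Bearer {api_key}",
        # Assumption: request bodies are JSON, as in the examples on this page.
        "Content-Type": "application/json",
    }
```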

Request Parameters

model

string

required

Video generation model name, fixed as kling-v3.

prompt

string

required

Text prompt. Describe the subject, action, scene, camera movement, and style.

mode

string

default:"std"

Generation mode.

std - standard mode, 720P
pro - professional mode, 1080P

duration

integer

default:"5"

Video duration in seconds. Options: 5, 10, 15

15 seconds is text-to-video only. Requests with input images support up to 10 seconds.
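The duration rule can be checked before submission. The following is a sketch under the rules stated above, not the service's own validation; `validate_duration` and its `has_input_images` flag are hypothetical names.

```python
ALLOWED_DURATIONS = (5, 10, 15)


def validate_duration(duration: int, has_input_images: bool) -> None:
    """Raise ValueError if the duration breaks the documented limits."""
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {ALLOWED_DURATIONS}")
    # 15 seconds is text-to-video only; image inputs cap the duration at 10.
    if has_input_images and duration > 10:
        raise ValueError("requests with input images support up to 10 seconds")
```

Here `has_input_images` would be true whenever `reference_images` or `image_with_roles` is non-empty.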

aspect_ratio

string

default:"16:9"

Video aspect ratio. Common values: 16:9, 9:16, 1:1

reference_images

string[]

Normal reference images.

These images are treated as references only
They are not automatically converted into first/last frames
Use image_with_roles for explicit frame control

image_with_roles

object[]

Explicit image-role array for frame control and mixed inputs.

image_with_roles object fields

url

string

required

Publicly accessible image URL.

role

string

required

Image role. Supported values:

first_frame
last_frame
reference
reference_image

last_frame is only sent when explicitly declared in image_with_roles. The system no longer infers the last frame from reference_images[1].
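Since the last frame is never inferred, each entry has to carry its role explicitly. A small builder, sketched here with the hypothetical name `image_with_role`, can reject unsupported roles early; the set of roles is taken from the list above.

```python
SUPPORTED_ROLES = {"first_frame", "last_frame", "reference", "reference_image"}


def image_with_role(url: str, role: str) -> dict[str, str]:
    """Build one image_with_roles entry, rejecting unknown roles."""
    if role not in SUPPORTED_ROLES:
        raise ValueError(f"unsupported role: {role!r}")
    return {"url": url, "role": role}


# Explicit first and last frames, declared one entry at a time.
frames = [
    image_with_role("https://example.com/day.jpg", "first_frame"),
    image_with_role("https://example.com/night.jpg", "last_frame"),
]
```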

audio

boolean

default:"false"

Whether to generate the video with audio.

metadata

object

Extended parameters.

metadata fields

negative_prompt

string

Negative prompt describing content to avoid.

watermark

boolean

Whether to add a watermark.

Input Rules

Input shape	Behavior
`reference_images` only	Normal references
`image_with_roles` with only `first_frame` / `last_frame`	Frame control
Both fields used together, or roles include both frame and reference semantics	Mixed mode
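The table above can be read as a small decision procedure. The sketch below is one interpretation of it, not the service's implementation; `classify_input` is a hypothetical name. The table does not say how `image_with_roles` containing only reference roles is treated, so that case is returned as "unspecified" rather than guessed.

```python
FRAME_ROLES = {"first_frame", "last_frame"}


def classify_input(payload: dict) -> str:
    """Map a request payload to the behavior named in the Input Rules table."""
    references = payload.get("reference_images") or []
    roled = payload.get("image_with_roles") or []
    roles = {item["role"] for item in roled}

    if not references and not roled:
        return "text-to-video"
    if references and roled:
        return "mixed"  # both fields used together
    if references:
        return "normal references"
    if roles <= FRAME_ROLES:
        return "frame control"  # only first_frame / last_frame
    if roles & FRAME_ROLES:
        return "mixed"  # frame and reference semantics combined
    return "unspecified"  # reference roles only: not covered by the table
```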

Examples

Text-to-Video

{
  "model": "kling-v3",
  "prompt": "A golden cat running on a sunlit meadow, slow motion, cinematic quality",
  "mode": "std",
  "duration": 5,
  "aspect_ratio": "16:9"
}

Image Reference

{
  "model": "kling-v3",
  "prompt": "Use the reference character and animate a subtle smile",
  "reference_images": ["https://example.com/reference.jpg"],
  "mode": "std",
  "duration": 5
}

First and Last Frame Control

{
  "model": "kling-v3",
  "prompt": "The city naturally transitions from day to night",
  "image_with_roles": [
    { "url": "https://example.com/day.jpg", "role": "first_frame" },
    { "url": "https://example.com/night.jpg", "role": "last_frame" }
  ],
  "mode": "pro",
  "duration": 5
}

Mixed Reference and Frame Input

{
  "model": "kling-v3",
  "prompt": "Keep the character identity consistent while transitioning scenes",
  "reference_images": ["https://example.com/character-reference.jpg"],
  "image_with_roles": [
    { "url": "https://example.com/start-scene.jpg", "role": "first_frame" },
    { "url": "https://example.com/end-scene.jpg", "role": "last_frame" }
  ],
  "mode": "pro",
  "duration": 5
}

Audio Video

{
  "model": "kling-v3",
  "prompt": "A singer performing on stage, crowd cheering, flashing lights",
  "mode": "std",
  "duration": 5,
  "audio": true
}

Video generation is asynchronous. Use the Get Video Task Status endpoint to query progress and results.
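A polling loop for the asynchronous flow might look like the sketch below. Everything about the status response here is an assumption: this page does not describe the Get Video Task Status endpoint, so the `status` field and the `"completed"` / `"failed"` values are hypothetical placeholders, and the actual HTTP call is left to a caller-supplied `fetch_status` function.

```python
import time
from typing import Callable

# Assumption: placeholder terminal states; check the status endpoint's own docs.
TERMINAL_STATES = ("completed", "failed")


def wait_for_task(
    task_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status(task_id) until a terminal state or the attempt limit."""
    for _ in range(max_attempts):
        task = fetch_status(task_id)
        if task.get("status") in TERMINAL_STATES:
            return task
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Passing `fetch_status` and `sleep` as parameters keeps the loop independent of any HTTP client and makes it easy to exercise with a stub.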

