
Lance 3B: one model for image & video, on a single RTX 4090

ByteDance's Lance is a 3B native unified multimodal model — image and video understanding, generation and editing in a single transformer. I ran all of its modes on one 24 GB RTX 4090: here are the results, the official benchmarks, the two code fixes it took to fit, and the real performance + VRAM numbers — including the max video duration and where it breaks.

Montage of Lance outputs on a 4090: a subject-driven astronaut portrait, a rainbow fox, an edited Santa-hat portrait, a sunset coastline frame and a red-panda surfer frame

TL;DR

Lance packs image/video understanding, generation and editing into one 3B model and — despite its size — matches or beats much larger unified models: 0.90 GenEval, 84.67 DPG, 7.30 GEdit-Bench, 85.11 VBench.

On a single 24 GB 4090 it needs two small patches to fit (a bf16-before-GPU load order and a flash-attn rotary fix). After that: text-to-image ~10 s/image, and full 10.08 s text-to-video at native 480p — t2v memory is flat at ~24 GB regardless of length. Understanding (VQA / OCR / captioning) is excellent and cheap.

The one real wall: multi-reference composition (combine two people, person + object, or a style-reference image) collapses to reproducing one input. Style transfer still works beautifully — just describe the style in text, not with a reference image.

What Lance is, in one paragraph

Lance ("Unified Multimodal Modeling by Multi-Task Synergy", ByteDance) is a 3 billion active-parameter model that does six things most stacks need six models for: text→image, text→video, image editing, video editing, and image / video understanding (visual question answering and captioning). The transformer backbone is a Qwen2-derived Mixture-of-Transformer with separate expert weights for "understanding" and "generation" tokens; it is trained from scratch (only the ViT and VAE encoders are pretrained) on a 128×A100 budget. The two visual front-ends are a Qwen2.5-VL ViT (for understanding/conditioning) and a Wan 2.2 video VAE (for generation). Two checkpoints ship: Lance_3B for images and Lance_3B_Video for video.

What makes it interesting for a home lab is the size. A 3B unified model that posts numbers next to 7B–20B systems is exactly the kind of thing that should run on a single consumer GPU — so I downloaded the weights to a data disk and pointed it at my RTX 4090. The rest of this article is what came out.

Setup. Weights (≈57 GB across both checkpoints + ViT + VAE) on a data disk; a uv venv on Python 3.11; torch 2.5.1+cu124, flash-attn 2.8.3, transformers 4.49, diffusers 0.29.1. All runs are bf16 with flash-attention and the KV-cache path enabled.

Seven modes, one set of weights

The CLI dispatches on a --task flag. Six modes are documented; a seventh — image_idip, subject-driven generation — lives in the code but isn't wired into the launcher, so I registered it to test reference-conditioned generation properly. Here's the map before we look at outputs.

| Mode | Task flag | Input → Output | Checkpoint |
| --- | --- | --- | --- |
| Text → Image | t2i | prompt → image | Lance_3B |
| Image editing | image_edit | image + instruction → image | Lance_3B |
| Subject-driven | image_idip | reference image + prompt → image | Lance_3B |
| Image understanding | x2t_image | image + question → text | Lance_3B |
| Text → Video | t2v | prompt → video | Lance_3B_Video |
| Video editing | video_edit | video + instruction → video | Lance_3B_Video |
| Video understanding | x2t_video | video + question → text | Lance_3B_Video |
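
When scripting batches, the table above can be kept as a small lookup so that each task loads the right checkpoint and jobs sharing a checkpoint run together. This is only a sketch of the mapping: `TASKS`, `checkpoint_for` and `group_by_checkpoint` are hypothetical names for illustration, not part of the Lance code.

```python
# Task flag -> (input -> output, checkpoint), copied from the table above.
# TASKS, checkpoint_for and group_by_checkpoint are illustrative names only.
TASKS = {
    "t2i": ("prompt -> image", "Lance_3B"),
    "image_edit": ("image + instruction -> image", "Lance_3B"),
    "image_idip": ("reference image + prompt -> image", "Lance_3B"),
    "x2t_image": ("image + question -> text", "Lance_3B"),
    "t2v": ("prompt -> video", "Lance_3B_Video"),
    "video_edit": ("video + instruction -> video", "Lance_3B_Video"),
    "x2t_video": ("video + question -> text", "Lance_3B_Video"),
}


def checkpoint_for(task: str) -> str:
    """Return the checkpoint a task flag needs, or raise on an unknown flag."""
    try:
        return TASKS[task][1]
    except KeyError:
        raise ValueError(f"unknown task flag: {task!r}") from None


def group_by_checkpoint(tasks):
    """Group task flags by checkpoint so each checkpoint is loaded only once."""
    groups = {}
    for task in tasks:
        groups.setdefault(checkpoint_for(task), []).append(task)
    return groups


print(group_by_checkpoint(["t2i", "t2v", "image_edit"]))
```

Grouping by checkpoint matters because, as the timings later show, model load is the largest fixed cost of a short job.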

Image generation & editing

Text-to-image is the most polished mode. The model renders clean in-image text (the cat's "STOP" sign on the project page is real), holds long compositional prompts together, and runs in about 10 seconds per 768² image on the 4090.

Text-to-image: anthropomorphic rainbow fox with stardust fur on a glowing fantasy grassland
t2i — "anthropomorphic rainbow fox, stardust fur, glowing grassland"
Text-to-image: cozy bookstore cafe at golden hour, sunlight through tall windows
t2i — "cozy bookstore café at golden hour, photorealistic"

Instruction-guided editing

image_edit takes a source image and a free-form instruction. It preserves identity and pose while applying local edits (objects, relighting) or a whole-image restyle. Two examples — adding accessories to my own portrait, and a full 3D-cartoon restyle of a friend's photo:

Input portrait, black and white
input
Edited: round black sunglasses and a red Santa hat added, identity preserved
"add round black sunglasses and a red Santa hat"
Input portrait of a man with a turban in a rustic setting
input
Edited: same man rendered as a 3D Pixar-style cartoon, setting preserved
"convert into a 3D Pixar-style cartoon render"

Subject-driven generation

Give Lance a single reference photo of a subject and a prompt, and it generates a new scene that keeps the subject's identity. This is the image_idip path (identity-preserving), and with one reference it works really well. Outputs are saved as a [reference | generated] pair:

Left: reference portrait. Right: the same man generated as an astronaut in space with Earth behind him, identity preserved
idip — "the same man as an astronaut, Earth behind him" · identity preserved [reference | generated]

Style transfer — describe it, don't reference it

Style transfer is just image_edit with the target style described in text. That is the "free-form manipulation" the project page shows, and it is excellent — the subject stays recognizable while the medium changes completely:

Portrait restyled as a flat bold-outlined comic cartoon
"change the style to a flat comic cartoon" (text)
Portrait restyled as a soft watercolor painting
"change the style to a watercolor painting" (text)
What does not work: handing the model a style-reference image (a "make this look like that" Two2One edit). It simply reproduces the reference instead of transferring its style onto the subject — the same failure mode as all multi-reference composition (see Where it breaks).

Image & video understanding

The understanding modes (x2t_image, x2t_video) take a visual plus a question and emit text. They're fast (~3 s per image) and accurate — including OCR and symbol recognition. A few real answers, verbatim:

① Symbol & scene recognition

x2t_image
Iwo-Jima-style photo of soldiers raising an Egyptian flag
input

Q · what are the people doing, and what flag? "The people in the image are soldiers, and they are raising an Egyptian flag atop a destroyed building."

② OCR — reading a poster

x2t_image
Framed poster reading Diagrams as Code
input

Q · read the text and describe the image "…a poster that reads "DIAGRAMS AS CODE" in large blue letters. The design is simple yet striking…"

③ Video captioning — a film scene

x2t_video
input · a peace-negotiation scene (3 s clip)

Q · describe the people, clothing and setting "…an elderly man with a white beard and white headwrap… sitting in a room with a solemn expression… a white tunic with a high collar. The room has a wooden door and a window with white curtains… deep in thought or concerned about something."

Note on video understanding

The descriptions are accurate on people, attire, setting and mood — but the model does not name proper nouns (it describes the scene, not "Omar Mukhtar"). On 24 GB this mode needs short clips at a reduced ViT resolution: a ~10 s clip at 480p tries to allocate a 33 GB attention mask and OOMs, so I cap at ~3 s and video_360p.
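
The OOM is easier to reason about once you note that a dense attention mask grows with the square of the token count, so shortening the clip and lowering the resolution both pay off quadratically. The sketch below is back-of-envelope arithmetic only: the one-byte-per-entry mask and the reduction factor are assumptions for illustration, not values read from the Lance code.

```python
import math


def dense_mask_bytes(n_tokens: int, bytes_per_entry: int = 1) -> int:
    """Bytes needed for a dense n_tokens x n_tokens attention mask."""
    return n_tokens * n_tokens * bytes_per_entry


def tokens_for_mask(mask_bytes: float, bytes_per_entry: int = 1) -> int:
    """Largest token count whose dense mask fits in mask_bytes."""
    return math.isqrt(int(mask_bytes // bytes_per_entry))


# Token count implied by a 33 GB mask, ASSUMING one byte per entry.
n_full = tokens_for_mask(33e9)

# ASSUMED reduction: ~3 s instead of ~10 s (x0.3 frames) and 360p instead of
# 480p (x0.56 pixels), i.e. roughly one sixth of the tokens.
n_small = n_full // 6

print(f"full clip : {n_full} tokens, {dense_mask_bytes(n_full) / 1e9:.1f} GB mask")
print(f"short clip: {n_small} tokens, {dense_mask_bytes(n_small) / 1e9:.2f} GB mask")
```

Under these assumptions, one sixth of the tokens means a mask about 36 times smaller, which is why the short, low-resolution setting fits comfortably.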

Video — generation & editing

Text-to-video runs from the Lance_3B_Video checkpoint at 848×480, 12 fps. Quality is strong and motion is coherent. Two clips — a tropical sunset coastline and a red-panda surfer:

t2v — "tropical coastline at sunset, moving waves, swaying palms"
t2v — "red panda riding a wooden surfboard on a wave"

Video editing

video_edit recolors subjects and replaces backgrounds while keeping motion. It is the most memory-hungry mode (it holds the reference video and the generated target at once), so on 24 GB it runs at video_360p on short clips. Input → edited:

input
"make the car bright red, background a snowy mountain road"
input
"change the background to a fairytale castle by a lake at sunset"

The official benchmarks — punching above 3B

The reason a 3B unified model is worth your disk space is the score-to-size ratio. From the Lance paper, here is where it lands against larger unified models on the four headline suites (Lance highlighted):

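| Model | Params | GenEval ↑ | DPG ↑ | GEdit-Bench ↑ | VBench ↑ |
| --- | --- | --- | --- | --- | --- |
| BAGEL | 7B | 0.88 | 85.07 | 6.52 | - |
| Show-o2 | 7B | 0.76 | 86.14 | - | 81.34 |
| InternVL-U | 1.7B | 0.85 | 85.18 | 6.66 | - |
| TUNA | 7B / 1.5B | 0.90 | 86.76 | - | 84.06 |
| Qwen-Image | 20B | 0.87 | 88.32 | - | - |
| Wan2.1-T2V | 14B | - | - | - | 83.69 |
| 🌟 Lance | 3B | 0.90 | 84.67 | 7.30 | 85.11 |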

The standout is image editing: 7.30 on GEdit-Bench beats BAGEL (6.52) and InternVL-U (6.66) by a clear margin, and video generation at 85.11 VBench edges out the 14B Wan2.1-T2V. On GenEval it ties the 7B TUNA at 0.90 and beats the 20B Qwen-Image. DPG (84.67) is the one suite where Lance trails: every other model with a reported DPG score leads it, by margins of roughly 0.4 to 3.7 points.

Running it on a 24 GB 4090 — two fixes

The README asks for a 40 GB GPU. The 4090 has 24. It fits, but only after two changes — both worth knowing if you try this yourself.

① Load in bf16 before moving to the GPU

The image checkpoint is 6.19 B parameters stored in fp32 (~24 GB). The stock loader moves the fp32 model to the GPU and only then casts to bf16 — which OOMs a 24 GB card at load time. Casting on CPU first, then moving the ~12 GB bf16 model to the GPU, fixes it:

# inference_lance.py — cast on CPU, defer the GPU move
model = model.to(dtype=torch.bfloat16)     # was: model.to(DEVICE)  [fp32 -> OOM]
# ... load checkpoint, resize embeddings ...
model = model.to(device=DEVICE, dtype=torch.bfloat16)   # GPU holds bf16 only
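
The arithmetic behind this fix is worth spelling out. The sketch below only multiplies the parameter count quoted above by the bytes per parameter. The 24 GiB card size is nominal, and the memory actually available to PyTorch is lower once the CUDA context is up.

```python
PARAMS = 6.19e9             # parameter count quoted above
GPU_BYTES = 24 * 1024**3    # nominal 24 GiB card; usable memory is lower in practice


def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


fp32_gb = weights_gb(PARAMS, 4)   # float32: 4 bytes per parameter
bf16_gb = weights_gb(PARAMS, 2)   # bfloat16: 2 bytes per parameter

print(f"fp32 weights: {fp32_gb:.2f} GB")
print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"card total  : {GPU_BYTES / 1e9:.2f} GB")
```

The fp32 weights alone come within about 1 GB of the card's nominal capacity, which leaves no room for the CUDA context or any temporary buffer. The bf16 copy takes half of that.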

② Swap the flash-attn rotary kernel

flash-attn 2.8.3's Triton apply_rotary_emb calls torch.library.wrap_triton, which only exists in torch ≥ 2.6 (this repo pins 2.5.1). Any ViT pass — every edit, every understanding call — crashes with AttributeError. The pure-torch variant has identical math and is a drop-in; the fast attention kernel is untouched:

# modeling/vit/qwen2_5_vl_vit.py
from flash_attn.layers.rotary import apply_rotary_emb_torch as apply_rotary_emb
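
For readers who want to see what "identical math" means here, this is a dependency-free sketch of the rotate-half rotary operation, which is what I understand the non-interleaved `apply_rotary_emb_torch` to compute per head vector. It is a generic RoPE illustration: `apply_rotary_half` and `rotary_angles` are hypothetical helper names, and the frequency layout is the textbook one, not necessarily the one the Lance ViT uses.

```python
import math


def apply_rotary_half(x, cos, sin):
    """Rotate one head vector x (even length) using the rotate-half layout.

    cos and sin hold one value per rotated pair, so len(cos) == len(x) // 2.
    Element i of the first half is paired with element i of the second half.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


def rotary_angles(position, half_dim, base=10000.0):
    """Textbook RoPE angles for one position: position * base^(-i / half_dim)."""
    return [position * base ** (-i / half_dim) for i in range(half_dim)]


angles = rotary_angles(position=3, half_dim=2)
cos = [math.cos(t) for t in angles]
sin = [math.sin(t) for t in angles]
rotated = apply_rotary_half([1.0, 2.0, 3.0, 4.0], cos, sin)
print(rotated)
```

Each pair is rotated by a plain 2D rotation, so the vector's length is unchanged. Nothing in the operation depends on Triton, so the pure-torch path can stand in for the kernel at the cost of some speed.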

With those in place, plus PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to keep video generation from fragmenting, everything runs.
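
If you launch from a Python wrapper rather than a shell, the allocator option can be set in code, as long as that happens before the first CUDA allocation. A minimal sketch: `enable_expandable_segments` is a hypothetical helper, and it assumes the usual comma-separated `option:value` format of `PYTORCH_CUDA_ALLOC_CONF`.

```python
import os


def enable_expandable_segments(env=os.environ):
    """Add expandable_segments:True to PYTORCH_CUDA_ALLOC_CONF, keeping other options."""
    current = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    options = [
        opt
        for opt in current.split(",")
        if opt and not opt.startswith("expandable_segments:")
    ]
    options.append("expandable_segments:True")
    env["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(options)
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# Call this before importing torch so the allocator sees the setting.
enable_expandable_segments()
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Merging with any existing value, rather than overwriting it, keeps other allocator options you may already rely on.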

Performance — every mode, measured

All seven modes, run back-to-back on the 4090. "Load" is checkpoint load + bf16 cast + GPU move; "Gen" is the denoise/decode (or autoregressive decode for understanding). Peak VRAM is the max sampled during the run.

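| Mode | Checkpoint | Load (s) | Gen (s) | Per example | Peak VRAM |
| --- | --- | --- | --- | --- | --- |
| t2i | Lance_3B | 42 | 20 | 10.0 s | 16.3 GB |
| image_edit | Lance_3B | 67 | 25 | 12.5 s | 16.6 GB |
| image_idip | Lance_3B | 43 | 57 | 14.2 s | 20.0 GB |
| x2t_image | Lance_3B | 41 | 13 | 3.2 s | 16.2 GB |
| t2v (33f) | Lance_3B_Video | 114 | 155 | 77.5 s | 24.0 GB |
| x2t_video | Lance_3B_Video | 102 | 77 | 38.5 s | 21.7 GB |
| video_edit | Lance_3B_Video | 102 | 11 | 5.5 s | 17.4 GB |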

Reading the table

Image modes are comfortable (16–20 GB) and fast (3–14 s each). The video model loads slower (the checkpoint is larger) and t2v is the only mode that pushes the 24 GB ceiling. Model load dominates short jobs — keep the process warm if you're generating in bulk.
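
To put a number on "load dominates short jobs", the sketch below amortizes the one-off load time over a batch, using the load and per-example figures from the table. `amortized_seconds` and `load_share` are illustrative helpers, and the model assumes per-example time stays constant across a batch, which I did not measure.

```python
# (load seconds, seconds per example) from the performance table.
TIMINGS = {
    "t2i": (42, 10.0),
    "image_edit": (67, 12.5),
    "image_idip": (43, 14.2),
    "x2t_image": (41, 3.2),
    "t2v": (114, 77.5),
    "x2t_video": (102, 38.5),
    "video_edit": (102, 5.5),
}


def amortized_seconds(mode: str, n_examples: int) -> float:
    """Average wall-clock seconds per example when one load serves n_examples."""
    load_s, per_example_s = TIMINGS[mode]
    return load_s / n_examples + per_example_s


def load_share(mode: str, n_examples: int) -> float:
    """Fraction of total wall-clock time spent loading the model."""
    load_s, per_example_s = TIMINGS[mode]
    return load_s / (load_s + n_examples * per_example_s)


for n in (1, 10, 100):
    print(n, round(amortized_seconds("t2i", n), 2), round(load_share("t2i", n), 2))
```

For a single t2i image the load is about four fifths of the wall-clock time. Over a hundred images it shrinks to a few percent, which is the case for keeping the process warm.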

[Figure: resource footprint per mode, VRAM (red) vs time per example (blue), each normalized to its own ceiling. Axes: peak VRAM (% of 24.6 GB) and time per example (% of slowest, 77.5 s).]

What the shape says

t2v is the lone outlier — it is the only mode that reaches both rims at once (~98 % of VRAM and 100 % of the time axis). The four image modes collapse into a small inner cluster (66–81 % VRAM, under 19 % time): cheap and interactive. x2t_image is the fastest point on the whole chart at 3.2 s/image (4 % of the time axis), while x2t_video is the second memory peak (88 %) at mid-speed. The geometry makes the operational rule visual: everything on the image checkpoint is real-time-ish; only video generation is a "launch it and walk away" job.
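
The percentages quoted above are simple normalizations of the table values, and they can be reproduced as follows. The 24.6 GB ceiling and the 77.5 s slowest time are the chart's own reference values; `normalized` is an illustrative helper name.

```python
# Peak VRAM (GB) and time per example (s) from the performance table.
PEAK_VRAM_GB = {
    "t2i": 16.3, "image_edit": 16.6, "image_idip": 20.0, "x2t_image": 16.2,
    "t2v": 24.0, "x2t_video": 21.7, "video_edit": 17.4,
}
PER_EXAMPLE_S = {
    "t2i": 10.0, "image_edit": 12.5, "image_idip": 14.2, "x2t_image": 3.2,
    "t2v": 77.5, "x2t_video": 38.5, "video_edit": 5.5,
}
VRAM_CEILING_GB = 24.6                       # VRAM axis ceiling used by the chart
SLOWEST_S = max(PER_EXAMPLE_S.values())      # time axis ceiling (t2v, 77.5 s)


def normalized(mode: str):
    """Return (VRAM %, time %) for a mode, each relative to its own ceiling."""
    vram_pct = 100 * PEAK_VRAM_GB[mode] / VRAM_CEILING_GB
    time_pct = 100 * PER_EXAMPLE_S[mode] / SLOWEST_S
    return vram_pct, time_pct


for mode in PEAK_VRAM_GB:
    vram_pct, time_pct = normalized(mode)
    print(f"{mode:<11} VRAM {vram_pct:5.1f} %  time {time_pct:5.1f} %")
```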

Maximum video duration on 24 GB

Output is hard-coded to 12 fps and the model caps at 121 frames, so the absolute ceiling is 10.08 s. The surprise: t2v memory is flat at ~24 GB regardless of frame count — flash-attention keeps the long-sequence denoise cheap and the Wan VAE decodes frame-by-frame — so the full 121 frames fit even at native 480p. I swept it to confirm:

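| Resolution | Frames | Duration | Peak VRAM (MiB) | Result |
| --- | --- | --- | --- | --- |
| 848×480 | 57 | 4.75 s | 23960 | OK |
| 848×480 | 73 | 6.08 s | 24035 | OK |
| 848×480 | 97 | 8.08 s | 23991 | OK |
| 848×480 | 121 | 10.08 s | 24043 | OK · max |
| 640×384 | 121 | 10.08 s | 21793 | OK |
| 512×288 | 121 | 10.08 s | 19445 | OK |
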
Image-to-video is not exposed in the released inference code — the frame-conditioning hooks (frame_condition_idx, an ff2v "first-frame→video" prompt) exist but aren't wired to any task. The closest routable capability is video_idip (image-reference → video), which shares the same 10.08 s ceiling. True first-frame i2v would need the conditioning path wired up.
[Figure: peak VRAM (% of 24.6 GB) vs duration reached (% of the 10.08 s cap) for the six sweep configurations. VRAM stays pinned near the rim while duration grows; the resolution cuts on the last two axes only buy headroom.]

Reading the two polygons

The red (VRAM) polygon is almost a flat arc across the four 480p axes — 97.5 %, 97.8 %, 97.7 %, 97.9 % at 57→121 frames — so memory barely moves as the clip lengthens. The green (duration) polygon expands from 47 % to 100 % over those same axes: more seconds for the same memory. Only the last two axes (384p, 288p) pull the red polygon inward, and the green stays pinned at 100 % — proof that dropping resolution buys headroom you don't need, since native 480p already hits the 10.08 s frame cap.
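
Both sets of percentages follow directly from the sweep. Duration is the frame count divided by 12 fps, and the VRAM share is the peak divided by the card total. In the sketch below the 24564 MiB total is an assumption (the figure typically reported for a 24 GB RTX 4090); it does reproduce the percentages quoted above.

```python
FPS = 12            # output frame rate, hard-coded in the model's pipeline
MAX_FRAMES = 121    # frame cap
TOTAL_MIB = 24564   # ASSUMED total VRAM reported for a 24 GB RTX 4090

# (frames, peak VRAM in MiB) for the four 848x480 rows of the sweep.
SWEEP_480P = [(57, 23960), (73, 24035), (97, 23991), (121, 24043)]


def duration_s(frames: int) -> float:
    """Clip length in seconds at the fixed output frame rate."""
    return frames / FPS


for frames, peak_mib in SWEEP_480P:
    vram_pct = 100 * peak_mib / TOTAL_MIB
    duration_pct = 100 * duration_s(frames) / duration_s(MAX_FRAMES)
    print(
        f"{frames:>3} frames: {duration_s(frames):5.2f} s "
        f"({duration_pct:5.1f} % of cap), VRAM {vram_pct:.1f} %"
    )
```

Across the four rows the duration share more than doubles while the VRAM share moves by less than half a percentage point.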

Where it breaks: multi-reference composition

The one capability that consistently fails is composing two reference images into one output — whether that's two people, a person plus an object, or a content image plus a style image. The model anchors on one reference and reproduces or blends toward it. Three attempts, all on the correct subject-driven path:

Two reference portraits and a generated image where both output figures take on one reference's look
multi-person (me + a friend) — both output faces drift toward one reference [ref · ref · generated]
Person plus shoe references; output just re-renders the portrait
person + object — ignores "wearing the shoes"
Content portrait plus comic style image; output copies the style image
content + style image — copies the style reference

The cause is mechanical: the position-id logic shifts all reference tokens to the same coordinate offset, so multiple references overlap in position space and the model can't keep them distinct. It is not a prompt problem and not a bug I introduced — multi-reference composition simply isn't a trained, exposed capability of the released 3B weights (the docs advertise only single-input tasks). The practical rule: one image reference for subject-driven generation; text instructions for style and edits.
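
A toy model makes the overlap concrete. The sketch below is not Lance's position-id code: it only mimics the behaviour described above, with hypothetical 4×4 token grids and made-up offsets, to show that a shared offset gives two references identical position ids while per-reference offsets keep them apart. The second variant is an illustration, not a tested fix.

```python
def reference_positions(ref_sizes, offset_for):
    """Assign (row, col) position ids to each reference's token grid.

    ref_sizes is a list of (height, width) grids in tokens; offset_for(i)
    returns the coordinate offset applied to reference i.
    """
    positions = []
    for i, (height, width) in enumerate(ref_sizes):
        off = offset_for(i)
        grid = {(r + off, c + off) for r in range(height) for c in range(width)}
        positions.append(grid)
    return positions


def overlap(positions):
    """Number of position ids shared by the first two references."""
    return len(positions[0] & positions[1])


refs = [(4, 4), (4, 4)]  # two toy 4x4 reference grids (hypothetical sizes)

# Every reference shifted by the same offset: the behaviour described above.
shared = reference_positions(refs, offset_for=lambda i: 100)
# One offset per reference: an illustrative alternative, not a tested fix.
distinct = reference_positions(refs, offset_for=lambda i: 100 * (i + 1))

print("shared offset   ->", overlap(shared), "colliding positions")
print("distinct offsets ->", overlap(distinct), "colliding positions")
```

With a shared offset every token of the second reference lands on a position already used by the first, so position alone cannot tell the two images apart.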
