Lens on a single RTX 4090: quantizing the DiT to run 1440 without offload

Microsoft's Lens is a 3.8B text-to-image model that pairs a 4-bit GPT-OSS-20B encoder with the FLUX.2 VAE. Getting it onto one 24 GB RTX 4090 was easy; getting 1440 to run without CPU offload was not. Here is the honest story — the OOM wall, two dead-ends, the FP8 fix that worked — with a gallery, exact configs and prompts, and a candid quality verdict on every image.

A lone astronaut on a red desert dune at dusk under distant moons, generated by Lens at 1440 on an RTX 4090
Lens · 1440 · 2:1 · 28 steps · cfg 5.0 · FP8 DiT, no offload — "a lone astronaut on a windswept red desert dune at dusk, two distant moons, teal-and-orange grade"

TL;DR

Lens (Microsoft) is a 3.8B-parameter text-to-image DiT that conditions on multi-layer features from a GPT-OSS-20B encoder and decodes through the FLUX.2 VAE. The encoder ships in MXFP4 (4-bit), which is the whole reason a 20B model fits at all (~11 GB).

On a 24 GB RTX 4090 it runs out of the box with CPU offload at any resolution. Running it without offload works up to 1024 — but only barely, peaking at ~24 GB. Every 1440 image OOMs in bf16, because the 4-bit encoder stays resident and the bf16 DiT eats the rest.

The fix is to quantize the one component that is still full precision: the DiT. A weight-only FP8 pass (torchao) halves it from ~7.6 GB to ~3.8 GB, dropping the 1440 peak from ~27 GB (OOM) to ~23 GB (fits) — with no visible quality loss. After that, Lens does 1440 fully on the GPU, no offload.

What Lens is, in one paragraph

Lens is a 3.8B-parameter foundational text-to-image model from Microsoft, built for training efficiency and high-resolution generation. Architecturally it is unusual: instead of a CLIP/T5 text encoder, it conditions a 48-block double-stream MMDiT on multi-layer hidden states pulled from a GPT-OSS-20B language model (layers 5/11/17/23, concatenated), and it decodes latents through the FLUX.2 VAE. Three checkpoints ship: the RL-tuned microsoft/Lens (20 steps, cfg 5.0), a distilled 4-step Lens-Turbo, and the supervised Lens-Base.

The interesting consequence for a local run is the memory shape. That GPT-OSS-20B encoder is enormous for a text encoder — but it is shipped in MXFP4 (4-bit), so it lands at roughly 11 GB instead of ~40 GB. The denoiser and VAE are comparatively small. So the question on a 24 GB card is never "does it load" — it is "how much room is left for the diffusion activations once that 4-bit encoder is sitting on the GPU."
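The sizes quoted here follow from parameter count times bits per weight. A back-of-the-envelope sketch (the function name is mine, and the estimate ignores quantization scales, any layers kept at higher precision, and runtime buffers, which is presumably part of why the real 4-bit encoder lands nearer 11 GB than 10 GB):

```python
def weight_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of stored weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


# Parameter counts are the article's round numbers, not measured values.
sizes = {
    "encoder, dequantized to bf16": weight_gb(20, 16),
    "encoder, 4-bit": weight_gb(20, 4),
    "DiT, bf16": weight_gb(3.8, 16),
    "DiT, FP8": weight_gb(3.8, 8),
}

for name, gb in sizes.items():
    print(f"{name:30s} ~{gb:.1f} GB")
```

The same arithmetic explains why dequantizing the encoder is a non-starter on a 24 GB card, and why the DiT is the only component with a precision step left to take.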

Setup. Pop!_OS 22.04, one 24 GB RTX 4090 (Ada, sm_89), CUDA 12.9 driver. A uv venv on Python 3.12 with torch 2.11+cu126, diffusers 0.38, transformers 5.8. All weights cached on a data disk (/d/hugging_face_cache) — the full microsoft/Lens repo is ~29 GB.

Running it on a 4090 — the easy part

The reference path just works. Load the MXFP4 encoder, assemble the pipeline, move it to CUDA, and generate. On a 4090 a 1024² image at 20 steps takes ~9 s once the model is warm. There are two non-obvious setup details worth flagging up front, because both cost time before the first image appears:

  • A pre-release dependency. diffusers 0.38 pins a pre-release safetensors (0.8.0rc), so the install needs uv pip install --prerelease=allow — a plain install fails to resolve.
  • MXFP4 on Ada. MXFP4 matmul kernels officially target Hopper (sm_90); the 4090 is Ada (sm_89). In practice the transformers + kernels path still runs the 4-bit encoder fine here (you get a harmless kernels deprecation warning). The naive "dequantize to bf16" escape hatch is a trap — a dequantized 20B encoder is ~40 GB and will not fit in 24 GB at all.
# install (note the pre-release flag)
uv pip install torch==2.11.0+cu126 torchvision==0.26.0+cu126 \
    --index-url https://download.pytorch.org/whl/cu126
uv pip install --prerelease=allow -r requirements.txt

# generate (with CPU offload — the out-of-the-box path, any resolution)
python inference.py --prompt "a cinematic mountain lake at sunrise" \
    --base_resolution 1440 --aspect_ratio 16:9 --steps 20 --cfg 5.0 --offload

The challenge: 1440 without offload OOMs

CPU offload works, but it is not free — it shuttles whole modules between CPU and GPU, and I wanted the encoder and denoiser to simply stay resident. Dropping --offload and calling pipe.to("cuda") exposes the real budget. The 4-bit encoder (~11 GB) plus the bf16 DiT (~7.6 GB) is ~19 GB before any image, leaving ~5 GB for activations. That is enough for 1024 (it peaks right at the 24 GB ceiling) but not for 1440: every 1440 image hit a hard CUDA out of memory, the bf16 path needing ~27 GB.

Before the fix that worked, two reasonable-sounding ideas failed, and they are worth recording because both look correct on paper:

Dead-end #1 — "evict the encoder after encoding." The encoder is only needed at the start, so encode every prompt, then pipe.text_encoder.to("cpu") to free ~11 GB before denoising. It frees almost nothing: MXFP4 quantized weights don't release on .to("cpu"), and moving the module flips pipe._execution_device to CPU (latents then try to allocate on CPU → a generator/device mismatch).
Dead-end #2 — "catch the OOM and retry smaller." Wrapping generation in try/except OutOfMemoryError and retrying at 1024 still OOMs on the retry. The caught exception's traceback keeps the failed forward's tensors alive, so empty_cache() frees nothing. The retry only works if you leave the except block first, then gc.collect() + empty_cache() before trying again.
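
For the record, the retry pattern that does work in dead-end #2 looks roughly like this. It is a minimal sketch, not the code from inference.py: generate_with_fallback, generate and release_cache are hypothetical names, and the defaults use Python's built-in MemoryError so the snippet runs without a GPU.

```python
import gc


def generate_with_fallback(generate, resolutions, oom_error=MemoryError, release_cache=None):
    """Call generate(res) for each resolution in order until one succeeds.

    Cleanup runs only after the except block has exited, so the caught
    exception (and the frames its traceback references) is released first.
    """
    resolutions = list(resolutions)
    for res in resolutions:
        try:
            return res, generate(res)
        except oom_error:
            # Do not retry in here: the live exception still references
            # the failed call's frames and everything they hold.
            pass
        # Reached only after an OOM, with the exception already cleared.
        gc.collect()
        if release_cache is not None:
            release_cache()
    raise RuntimeError(f"out of memory at every resolution in {resolutions}")
```

With PyTorch, the natural arguments would be oom_error=torch.cuda.OutOfMemoryError and release_cache=torch.cuda.empty_cache. Note that this only makes the smaller retry succeed; it does nothing to make 1440 itself fit.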

Both dead-ends point at the same conclusion: you cannot claw the 11 GB encoder back at runtime. If 1440 is going to fit, the savings have to come from the only component still at full precision — the DiT.

The fix: weight-only FP8 on the DiT

The text encoder is already 4-bit; there is nothing left to take there. The DiT, however, was running in bf16 (~7.6 GB). Quantizing it weight-only to FP8 via torchao (diffusers exposes this through TorchAoConfig) roughly halves it to ~3.8 GB. The 4090 has FP8 tensor cores, and weight-only FP8 uses native PyTorch ops — so it runs even though two of torchao's Hopper-only prebuilt kernels fail to load (they aren't needed here).

That ~3.8 GB saving covers the ~3 GB by which the ~27 GB bf16 peak overshot the 24 GB card. The 1440 peak drops from ~27 GB (OOM) to ~23 GB, and Lens generates 1440 fully on the GPU, no offload. I wired it into a --quant_transformer flag; the model's own attention already uses memory-efficient SDPA, so the only change needed was the weight precision.

uv pip install torchao

# 1440, no offload, on a 24 GB 4090 — FP8-quantized DiT + MXFP4 encoder
python inference.py --prompt "..." \
    --base_resolution 1440 --aspect_ratio 16:9 --steps 30 --cfg 4.5 \
    --quant_transformer float8_weight_only        # NOTE: no --offload
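
As a sanity check on those numbers, here is a rough sketch that shifts the bf16 peak by the change in DiT weight size. It assumes activations and every other allocation stay the same, which is what weight-only quantization should roughly imply; the constants are the approximate figures from this article and predicted_peak is my own helper name.

```python
CARD_GB = 24.0             # RTX 4090
BF16_PEAK_1440_GB = 27.0   # approximate bf16 peak at 1440 (OOM)
DIT_BF16_GB = 7.6          # DiT weights in bf16
DIT_FP8_GB = 3.8           # DiT weights after weight-only FP8


def predicted_peak(peak_gb: float, weights_before_gb: float, weights_after_gb: float) -> float:
    """Shift a peak by the change in resident weight size.

    Assumes nothing else (activations, encoder, VAE) changes.
    """
    return peak_gb - (weights_before_gb - weights_after_gb)


fp8_peak = predicted_peak(BF16_PEAK_1440_GB, DIT_BF16_GB, DIT_FP8_GB)
print(f"bf16 peak: {BF16_PEAK_1440_GB:.1f} GB, fits: {BF16_PEAK_1440_GB <= CARD_GB}")
print(f"FP8 peak:  {fp8_peak:.1f} GB, fits: {fp8_peak <= CARD_GB}")
```

The estimate comes out at about 23.2 GB, in line with the ~23 GB peak I measured at 1440.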

The config matrix I settled on

Resolution / goal      | Recommended config (RTX 4090, 24 GB)   | Peak VRAM
≤ 1024, fastest        | no offload, bf16 DiT                   | ~23.7–24.0 GB
1440, no offload       | --quant_transformer float8_weight_only | ~22.9–23.4 GB
1440, bf16 fidelity    | bf16 DiT + --offload                   | fits (slower)
1440, bf16, no offload | not possible                           | ~27 GB → OOM

Why this generalizes. Any model that hangs a big quantized LLM off a smaller bf16 denoiser has the same shape: the encoder is already as small as it gets, so the headroom you need at high resolution has to come from quantizing the denoiser. FP8 weight-only is the gentlest lever; int8_weight_only saves about the same with similar quality, and int4_weight_only saves more but starts to cost output quality.

What it actually produces — eight categories

Every image below was generated without offload at 1024 (the peak-VRAM edge case), one per category, to see where Lens is strong and where it isn't. Each entry gives the prompt and, in the caption, the exact configuration: base resolution, aspect ratio, steps, cfg and seed. The same prompt, seed and config should reproduce the same image.

Photorealistic — portrait

Strong

"Photorealistic studio portrait of an elderly fisherman with a deeply weathered face, white stubble, bright yellow rain hat, dramatic Rembrandt lighting, 85mm f/1.4, hyper-detailed skin pores, sharp catchlights"

Photorealistic studio portrait of an elderly fisherman in a yellow rain hat
1024 · 3:4 · 30 steps · cfg 5.0 · seed 12 · no offload

Honest take

This is Lens at its best. Skin texture, the catchlights, the wet sheen on the hat — it reads as a real photograph, with no tell-tale "AI plastic" skin. Portraits and faces are clearly a strength.

Cinematic — sci-fi still

Good, minor prompt drift

"Cinematic film still, wide anamorphic shot of a lone astronaut on a windswept red desert dune at dusk, two distant moons, teal-and-orange color grade, volumetric dust, lens flare, shot on ARRI Alexa"

Cinematic astronaut on a red desert dune at dusk with distant moons
1024 · 2:1 · 28 steps · cfg 5.0 · seed 11 · no offload

Honest take

The mood, grade and lens flare are spot-on and genuinely cinematic. One honest miss: the prompt said two moons and the model rendered three. Exact counts are not Lens's strong suit — a known failure mode of most T2I models.

Landscape / nature

Looks painterly

"A sweeping landscape photograph of an alpine valley at golden hour, turquoise river, snow-capped peaks, pine forest, grazing deer, soft mist, ultra-wide vista, National Geographic style"

An alpine valley at golden hour with a river, peaks and deer
1024 · 16:9 · 30 steps · cfg 5.0 · seed 13 · no offload

Honest take

Beautiful, but it reads as a digital painting, not a photo — soft, slightly illustrative lighting. The phrase "National Geographic style" nudged it toward render territory. This one bugged me enough that I reshot it properly later in the article (see the prompt lesson).

Text / typography

Strong

'A vintage hand-lettered coffee-shop chalkboard reading "THE DAILY GRIND" in ornate white serif, with "Fresh Roast - Open 7am" beneath, coffee-bean illustrations, realistic slate texture'

A chalkboard sign reading THE DAILY GRIND, Fresh Roast Open 7am
1024 · 1:1 · 26 steps · cfg 5.0 · seed 14 · no offload

Honest take

Text rendering is a real strength here — both lines are spelled correctly and the styling matches the brief. This is where the GPT-OSS text encoder seems to earn its keep; many T2I models would garble at least one word.

Comics

Strong

'A vibrant comic-book panel, masked superhero in a blue-and-red suit leaping between rooftops at night, bold black outlines, halftone shading, a yellow speech bubble saying "Not on my watch!"'

Comic-book panel of a superhero leaping with a Not on my watch speech bubble
1024 · 4:3 · 26 steps · cfg 5.0 · seed 15 · no offload

Honest take

Convincing comic styling — inks, halftones, dynamic pose — and the speech bubble text is correct. Short in-image text plus a defined art style is clearly in Lens's comfort zone.

Caricature

On-style

"A humorous exaggerated caricature of a cheerful Italian chef, enormous bushy mustache, huge round cheeks, tiny body, oversized chef hat, holding steaming spaghetti, bold linework with watercolor shading"

Caricature of a cheerful Italian chef holding spaghetti
1024 · 3:4 · 24 steps · cfg 5.0 · seed 16 · no offload

Honest take

The exaggeration reads correctly as caricature rather than cartoon — proportions pushed, features inflated, line and watercolor handling on point. A solid result with no special prompting tricks.

Diagram / infographic

Surprisingly strong

'A clean educational infographic of the water cycle, labelled "EVAPORATION", "CONDENSATION", "PRECIPITATION", "COLLECTION" with arrows in a circular flow, flat vector style, pastel palette'

A water-cycle infographic with four correctly spelled labels
1024 · 3:2 · 26 steps · cfg 5.0 · seed 17 · no offload

Honest take

All four labels are spelled correctly and placed on a coherent circular flow — unusual for a general T2I model, which usually turns diagram labels into gibberish. Not a substitute for a design tool, but genuinely useful for rough explainer visuals.

Illustration

Strong

"A whimsical children's-book watercolor of a small red fox reading a book under a glowing paper lantern in an autumn forest, soft brush strokes, warm orange-and-teal palette, fireflies, cozy storybook mood"

Watercolor illustration of a red fox reading under a lantern
1024 · 1:1 · 22 steps · cfg 5.0 · seed 18 · no offload

Honest take

Exactly the requested storybook watercolor — soft edges, warm palette, cohesive mood. Illustrative styles come out reliably and need the fewest steps (22 was plenty).

1440 — fully on the GPU, no offload

With the FP8 DiT, the three categories that benefit most from resolution run at 1440 with the encoder and denoiser both resident. Captions show the measured peak VRAM where I recorded it: 22.9 to 23.4 GB, all under the 24 GB ceiling that bf16 blew past. The extra resolution shows: the portrait in particular gains real micro-detail in the skin and beard versus its 1024 version above.

Fisherman portrait at 1440 with FP8 DiT, no offload
1440 · 3:4 · 30 steps · cfg 5.0 · FP8 DiT · peak 22.9 GB
Alpine valley at 1440 with FP8 DiT, no offload
1440 · 16:9 · 30 steps · cfg 5.0 · FP8 DiT · peak 23.2 GB
Cinematic astronaut at 1440 with FP8 DiT, no offload
1440 · 2:1 · 28 steps · cfg 5.0 · FP8 DiT · peak 23.4 GB
A glowing glass greenhouse in a snowy forest at night, 1440 FP8
1440 · 16:9 · 20 steps · cfg 5.0 · FP8 DiT · "a glass greenhouse glowing at night in a snowy forest"

Quantization quality check

Side by side with the bf16 1024 versions, I could not pick out an "FP8 look": no banding, no color shift, no loss of fine texture. For weight-only FP8 on a diffusion transformer that matches expectations, because the activations stay bf16 and only the stored weights are 8-bit. The win is pure headroom, paid for with essentially no visible quality loss.

A prompt lesson: "photo" vs "painting"

The alpine landscape in the gallery came out painterly, and the culprit was the prompt. "National Geographic style" reads to the model as an aesthetic, not a camera. Swapping the art language for concrete photographic cues — a body, a lens, an aperture, "RAW photo", "high dynamic range", "no haze", "extreme detail" — and nudging cfg down to 4.5 pushes Lens firmly into photoreal territory.

Painterly alpine valley landscape
BEFORE · "...National Geographic style" · cfg 5.0 — reads as a painting
Photorealistic mountain range with a mirror-still glacial lake
AFTER · "RAW photo, Nikon D850, 24-70mm f/11, HDR, no haze" · cfg 4.5 — reads as a photo

"A photorealistic landscape photograph of a rugged mountain range at sunrise, sharp granite peaks dusted with fresh snow, a still glacial lake mirroring the peaks, pine forest along the shore, crisp clear air with no haze, soft warm side light, extreme detail, high dynamic range, RAW photo shot on a Nikon D850 with a 24-70mm lens at f/11, professional nature photography"

The rule of thumb

For photorealism with Lens: name a camera + lens + aperture, say "RAW photo / photograph", add "high dynamic range / extreme detail / no haze", and avoid the words "style", "art" and "painting". A slightly lower cfg (~4.0–4.5) tends to look more natural for photos than the default 5.0+.
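
If you generate prompts in bulk, the "avoid these words" half of the rule is easy to check mechanically. A small sketch (painterly_words is a hypothetical helper of mine, not part of Lens or its inference script) that flags the three words as whole words only, so "party" does not trip the "art" check:

```python
import re

# Words the rule of thumb says to avoid when aiming for photorealism.
PAINTERLY_WORDS = ("style", "art", "painting")


def painterly_words(prompt: str) -> list[str]:
    """Return the avoided words that appear as whole words in the prompt."""
    found = []
    for word in PAINTERLY_WORDS:
        if re.search(rf"\b{word}\b", prompt, flags=re.IGNORECASE):
            found.append(word)
    return found


before = "alpine valley at golden hour, soft mist, National Geographic style"
after = "RAW photo shot on a Nikon D850 with a 24-70mm lens at f/11, no haze"
print(painterly_words(before))
print(painterly_words(after))
```

It is only a lint: it catches the vocabulary, not the missing camera, lens and aperture cues, which still have to be written in by hand.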

Honest verdict

Lens is a genuinely strong text-to-image model that punches above its 3.8B size, and the GPT-OSS encoder clearly pays off where it matters most: text rendering, labelled diagrams, and prompt following are all better than I expected from a model this size. Portraits, illustration and comic styling are reliably good. Where it stumbles is the usual T2I weak spot — exact object counts ("two moons" became three) — and, like any model, it does what the prompt literally says, so vague aesthetic words can quietly steer you away from photorealism.

On hardware, the headline is the memory story. The 4-bit encoder is a clever way to ship a 20B conditioner, but it also fixes ~11 GB of your budget in place. On a 24 GB 4090 that makes 1024 the practical no-offload ceiling out of the box, and a one-line FP8 quantization of the DiT is what turns "1440 needs offload" into "1440 runs fully on the GPU," with no quality loss I could see. That is the most reusable takeaway here: when the encoder is already quantized, quantize the denoiser.

Practical recipe. ≤1024 → plain no offload. 1440 → add --quant_transformer float8_weight_only (no offload). Want bf16 fidelity at 1440 → use --offload instead. For photos, prompt a camera and lens, not a "style".
