
Ideogram 4: structured JSON prompting, hands-on, on a single RTX 4090

Ideogram's first open-weight model is a 9.3B design-focused text-to-image foundation model with a trick nobody else has: it speaks JSON. I ran it locally in nf4 on one 24 GB RTX 4090 — this is the full tour: the architecture, both quantization formats, all three quality modes, the caption schema with bounding-box layout control, LLM-powered magic prompts, a 14-image style gallery with honest quality scores (Arabic included — artifacts and all), and a 21-run speed benchmark.

Collage of Ideogram 4 sample outputs spanning photorealism, illustration, typography and poster design
Sample collage — Ideogram 4 repository (ideogram-oss/ideogram4)

TL;DR

Ideogram 4 is a 9.3B from-scratch foundation model — a single-stream DiT with a Qwen3-VL text encoder — and the top open-weight image model on Design Arena. Its superpower is structured JSON prompting: exact text strings, hex palettes, and per-element bounding boxes that the model actually respects.

On a 24 GB 4090 in nf4: ~21 s per 1024² draft (Turbo), ~32 s default, ~72 s top quality — and native 2048² fits at 23.3 GB peak with a text-encoder offload. Time scales linearly with steps, super-linearly with pixels (4× area ≈ 6.3× time).

Quality across 14 styles averaged 8.6 / 10: photorealism and bbox-driven diagrams are superb; small-text labels wobble; Arabic body text garbles — keep Arabic short and display-sized.

🎬 Prefer video? This article has a 22-minute narrated tutorial twin on YouTube — same gallery, benchmarks and analysis: youtu.be/jblBfcYn6s0

What Ideogram 4 is

Ideogram has been the "text-in-images" lab since its first model, and Ideogram 4 is its first open-weight release. It is not a fine-tune of an existing checkpoint — it's a 9.3B-parameter foundation model trained from scratch, aimed squarely at design work: typography, layout, posters, diagrams, brand color.

The receipts are public. On Design Arena's image Elo board it is the top-ranked open-weight model, trailing only proprietary GPT and Gemini image models; filtered to open weights it leads by a commanding margin. In ContraLabs' blind typography evaluation, ten professional designers picked it first 47.9% of the time — ahead of Nano Banana 2 (30.0%), FLUX.2 [max] (15.5%) and Grok Imagine (15.0%).

Design Arena overall image Elo leaderboard with Ideogram 4 as the top open-weight model
Design Arena leaderboard — chart from the Ideogram 4 repository

Parameter efficiency is the other headline: at 9.3B it beats much larger open models on text rendering — Qwen-Image (20B), FLUX.2 [dev] (32B), even HunyuanImage 3.0's 80B MoE.

Scatter plot of text-rendering score vs parameter count, Ideogram 4 leading all open-weight models at 9.3B
Text rendering vs. parameters — chart from the Ideogram 4 repository benchmarks

Architecture: one stream, a VLM encoder, flow matching

Three design decisions define the model. First, it's a fully single-stream Diffusion Transformer: text tokens and image latent tokens are concatenated into one sequence and processed by the same 34 layers — no separate branches, so text and pixels interact at every depth. Positioning uses 3D multimodal RoPE so both modalities share one coordinate space.

Second, the text encoder is not CLIP or T5 — it's Qwen3-VL-8B-Instruct, a full vision-language model run in text-only mode. Hidden states are extracted from 13 layers (0, 3, 6 … 35) and concatenated, handing the DiT a multi-scale representation from surface tokens to deep semantics. I suspect this is a big part of why JSON captions work so well: a VLM has seen structured descriptions of images.

Third, generation is flow matching: the network predicts a velocity field and an Euler sampler integrates from noise to image. Classifier-free guidance is dual-branch with two separate transformers — and the negative branch is asymmetric, processing image tokens only, which makes the unconditional pass cheaper than the conditional one.
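To make the sampling loop concrete, here is a toy sketch of Euler integration with classifier-free guidance. It is not the repository's sampler: the time convention (t runs from 1 at pure noise to 0 at the image), the function names, and the scalar "image" are all assumptions chosen so the example runs without a model.

```python
def euler_sample(v_cond, v_uncond, x_start, guidance_per_step):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (image) with Euler steps.

    guidance_per_step lists one guidance scale per step, in execution order.
    """
    num_steps = len(guidance_per_step)
    x = x_start
    for i, g in enumerate(guidance_per_step):
        t = 1.0 - i / num_steps
        t_next = 1.0 - (i + 1) / num_steps
        vc = v_cond(x, t)      # conditional branch (sees the prompt)
        vu = v_uncond(x, t)    # unconditional branch (no prompt)
        v = vu + g * (vc - vu)  # classifier-free guidance mix
        x = x + (t_next - t) * v
    return x


# Toy velocity fields on a single number: the conditional field pulls
# toward 2.0, the unconditional field pulls toward 0.0.
target = 2.0
cond = lambda x, t: (x - target) / t
uncond = lambda x, t: x / t

result = euler_sample(cond, uncond, x_start=5.0, guidance_per_step=[1.0] * 20)
print(result)  # very close to 2.0: guidance 1.0 follows the conditional field
```

With guidance 1.0 the mix reduces to the conditional velocity, and with 0.0 it reduces to the unconditional one; values above 1.0 extrapolate past the conditional prediction, which is what a guidance of 7 does in the real model.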

Pipeline diagram: Qwen3-VL text encoder with 13 stacked layers feeding a 34-layer single-stream DiT with a mixed text and image token row, then a KL VAE decoder producing the image
The pipeline at a glance — illustration: GPT Image 2 (OpenAI)

nf4 vs fp8 — pick your format

The weights ship in two quantizations on Hugging Face (gated — accept the license and set HF_TOKEN). Everything in this article ran nf4.

Property | nf4 (what I use) | fp8
Encoding | bitsandbytes NormalFloat-4 | float8 e4m3, weight-only
Hardware | CUDA only | Any device; no FP8 hardware needed (activations stay bf16)
Resident VRAM | ~16 GB total (2 DiT branches + Qwen3-VL + VAE) | larger
Diffusers | Yes | No
Best for | 24 GB NVIDIA cards (the 4090 sweet spot) | Mac / CPU fallback
Memory note. ~16 GB resident leaves only ~7 GB of activation headroom on a 4090. For native 2048² I offload the 5.5 GB text encoder to CPU during sampling (it runs exactly once, before the loop) and pre-cast the text features to bf16 — peak lands at 23.3 GB, just under the ceiling.

Three quality modes

Sampling is controlled by named presets — same model, different step budgets. All three share one clever recipe: most steps run at guidance 7 for adherence, then the last few drop to guidance 3 — gentle "polish" passes that clean artifacts right before the image resolves.

Preset | Steps | 1024² on the 4090 | Use for
V4_TURBO_12 | 12 | 20.9 s | Drafts, exploration
V4_DEFAULT_20 | 20 | 31.8 s | Daily work (this article's gallery)
V4_QUALITY_48 | 48 | 72.0 s | Finals, small typography

The entire registry is ~25 lines of source — with one gotcha worth knowing:

# src/ideogram4/sampler_configs.py
# guidance_schedule is in loop-INDEX order:
# index 0 is the LAST (polish) step.
PRESETS = {
  "V4_QUALITY_48": SamplerParameters(
    num_steps=48,
    guidance_schedule=(3.0,) * 3 + (7.0,) * 45,
    mu=0.0, std=1.5,
  ),
  "V4_DEFAULT_20": SamplerParameters(
    num_steps=20,
    guidance_schedule=(3.0,) * 2 + (7.0,) * 18,
    mu=0.0, std=1.75,
  ),
  "V4_TURBO_12": SamplerParameters(
    num_steps=12,
    guidance_schedule=(3.0,) * 1 + (7.0,) * 11,
    mu=0.5, std=1.75,
  ),
}
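The index-order gotcha is easy to misread, so here is a small sketch that turns a schedule into the order the steps would actually run. It assumes, as the source comment implies, that the sampler loop counts down so that index 0 executes last; `execution_order` is a hypothetical helper, not a function from the repository.

```python
def execution_order(guidance_schedule):
    """Return guidance values in the order they would be applied,
    assuming the loop counts down and index 0 is the final step."""
    return list(reversed(guidance_schedule))


default_schedule = (3.0,) * 2 + (7.0,) * 18  # same tuple as V4_DEFAULT_20
order = execution_order(default_schedule)
print(order[:3], order[-2:])  # [7.0, 7.0, 7.0] [3.0, 3.0]
```

Read this way, the tuple matches the recipe described earlier: eighteen steps at guidance 7, then two polish steps at guidance 3.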

The JSON caption schema — the model's native language

Ideogram 4 was trained on structured JSON captions, not free text. Plain sentences work, but a schema-compliant JSON object unlocks the controllability this model is famous for. Three top-level keys:

  • high_level_description — one summary sentence.
  • style_description — aesthetics, lighting, medium, an optional hex color_palette, and exactly one of photo (photographic) or art_style (everything else).
  • compositional_deconstruction — a background plus an elements list. Each element is obj or text, with a description, an exact text string when applicable, and an optional bbox as [ymin, xmin, ymax, xmax] in a 0–1000 space independent of output resolution.
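The three rules above are simple enough to check before spending GPU time on a bad caption. The following is a minimal sketch of such a check, based only on the schema as described in this article; `validate_caption` is a hypothetical helper, and the official schema may enforce more (or different) constraints.

```python
REQUIRED_KEYS = {
    "high_level_description",
    "style_description",
    "compositional_deconstruction",
}


def validate_caption(caption):
    """Return a list of problems; an empty list means the caption looks well-formed."""
    problems = []
    missing = REQUIRED_KEYS - set(caption)
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
        return problems

    style = caption["style_description"]
    # Exactly one of 'photo' / 'art_style' must be present.
    if ("photo" in style) == ("art_style" in style):
        problems.append("style_description needs exactly one of 'photo' or 'art_style'")

    elements = caption["compositional_deconstruction"].get("elements", [])
    for i, el in enumerate(elements):
        if el.get("type") not in ("obj", "text"):
            problems.append(f"element {i}: type must be 'obj' or 'text'")
        if el.get("type") == "text" and not el.get("text"):
            problems.append(f"element {i}: text element has no 'text' string")
        bbox = el.get("bbox")
        if bbox is not None:
            ok = (
                len(bbox) == 4
                and all(0 <= v <= 1000 for v in bbox)
                and bbox[0] < bbox[2]   # ymin < ymax
                and bbox[1] < bbox[3]   # xmin < xmax
            )
            if not ok:
                problems.append(f"element {i}: bbox must be [ymin, xmin, ymax, xmax] within 0-1000")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-written caption in one pass.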

Here's a real caption from my library, and the image it produced — every service icon landed inside its declared box:

Prompt → output: AWS serverless architecture

caption.json — the exact prompt sent to the model
{
  "high_level_description": "An AWS cloud architecture diagram of a serverless web application: users flow through CloudFront and API Gateway to Lambda functions and a DynamoDB table, with S3 for static assets.",
  "style_description": {
    "aesthetics": "official cloud documentation style, orderly, easy to scan",
    "lighting": "flat diagram lighting with no shadows",
    "medium": "vector architecture diagram",
    "art_style": "AWS-style architecture diagram with orange and dark-blue service icons in rounded squares, a dashed cloud boundary box, thin grey connector arrows with small labels",
    "color_palette": [
      "#FFFFFF",
      "#FF9900",
      "#232F3E",
      "#E8EAED",
      "#527FFF"
    ]
  },
  "compositional_deconstruction": {
    "background": "A white canvas containing one large dashed rounded rectangle labeled AWS Cloud occupying the right four-fifths of the frame, straight grey arrows connecting the services left to right, each arrow carrying a tiny request label.",
    "elements": [
      {
        "type": "text",
        "bbox": [
          50,
          300,
          140,
          700
        ],
        "text": "Serverless Web Application",
        "desc": "Dark slate sans-serif diagram title centered along the top edge."
      },
      {
        "type": "obj",
        "bbox": [
          420,
          30,
          620,
          150
        ],
        "desc": "A user icon: a simple dark circle-and-shoulders pictogram above the label Users, placed outside the cloud boundary at the far left."
      },
      {
        "type": "obj",
        "bbox": [
          400,
          220,
          640,
          380
        ],
        "desc": "An orange rounded-square AWS service icon with a globe-and-waves glyph labeled Amazon CloudFront."
      },
      {
        "type": "obj",
        "bbox": [
          150,
          430,
          390,
          590
        ],
        "desc": "An orange rounded-square AWS service icon with a bucket glyph labeled Amazon S3 static assets, connected upward to CloudFront by a grey arrow."
      },
      {
        "type": "obj",
        "bbox": [
          400,
          450,
          640,
          610
        ],
        "desc": "An orange rounded-square AWS service icon with a gateway glyph labeled Amazon API Gateway."
      },
      {
        "type": "obj",
        "bbox": [
          400,
          660,
          640,
          820
        ],
        "desc": "An orange rounded-square AWS service icon with the lambda glyph labeled AWS Lambda, two small stacked copies suggesting multiple functions."
      },
      {
        "type": "obj",
        "bbox": [
          650,
          700,
          890,
          860
        ],
        "desc": "A dark-blue rounded-square AWS service icon with a database table glyph labeled Amazon DynamoDB, connected from Lambda by a grey arrow."
      }
    ]
  }
}
Generated AWS architecture diagram: users flowing through CloudFront and API Gateway to Lambda and DynamoDB, with S3 for static assets, inside a dashed AWS Cloud boundary
Output, 1536×1024, V4_DEFAULT_20, seed 11. Score: 9.5 / 10, every icon on its bbox.
Why this matters. Bounding-box layout control from a prompt is something even most closed-source models can't do. Across every layout-critical generation in this article, the topology was never scrambled — boxes land where you put them.
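To check a result against its caption, it helps to convert a bbox into pixel coordinates for the chosen output size. This sketch assumes the 0 to 1000 space maps linearly onto the full width and height, which is the natural reading of the schema but is not confirmed here; `bbox_to_pixels` is a hypothetical helper.

```python
def bbox_to_pixels(bbox, width, height):
    """Map [ymin, xmin, ymax, xmax] in 0-1000 space to (left, top, right, bottom) pixels."""
    ymin, xmin, ymax, xmax = bbox
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


# Title box from the AWS caption, at the 1536×1024 output size.
print(bbox_to_pixels([50, 300, 140, 700], 1536, 1024))  # (461, 51, 1075, 143)
```

Note the axis order: the schema lists y before x, while most drawing libraries expect x first, so the helper reorders the values on the way out.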

Magic prompt — let an LLM write the JSON

Hand-writing captions is powerful but slow, so the repo ships magic prompt: an LLM expands your one-liner into a full schema-compliant caption, aspect-ratio aware. Backends: Ideogram's hosted endpoint (free with an IDEOGRAM_API_KEY), Claude Opus/Sonnet via OpenRouter using the open-source system prompt that ships in the repo, and a Gemini Flash-Lite backend I added — fractions of a cent per expansion.
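Whichever backend writes the caption, the reply still has to be parsed into a JSON object before it reaches the model. LLMs often wrap JSON in prose or code fences, so a tolerant parser is useful. This is a hypothetical helper for illustration only; how the repository itself parses replies is not shown in this article.

```python
import json


def extract_caption(reply):
    """Pull the outermost JSON object out of an LLM reply that may
    surround it with prose or code fences."""
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])


reply = 'Here is your caption:\n```json\n{"high_level_description": "a ginger cat"}\n```'
caption = extract_caption(reply)
print(caption["high_level_description"])  # a ginger cat
```

`json.loads` still raises on malformed JSON, which is the right moment to retry the expansion rather than send a broken caption to the sampler.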

Pipeline diagram: plain prompt to LLM to JSON caption to Ideogram 4, with Ideogram, Claude and Gemini as selectable LLM chips
The magic-prompt flow — illustration: GPT Image 2 (OpenAI)

Before & after: eleven words in

magic prompt

Input: "a ginger cat wearing a tiny wizard hat reading a spellbook"

Output (excerpt of ~3,000 chars):

{
  "high_level_description": "A painterly children's-book illustration
   of a ginger tabby cat in a pointed wizard hat studying an open
   spellbook glowing with a cyan sigil by candlelight...",
  "compositional_deconstruction": {
    "background": "A cozy wizard's study at twilight... warm amber
     light pools from the left where the candle burns, cool cyan
     glow spills from the book toward the right...",
    "elements": [
      {"type": "obj", "desc": "Crystal ball on a carved wooden
        pedestal, faint violet mist swirling inside..."},
      {"type": "obj", "desc": "Ginger tabby cat on a crimson velvet
        cushion... left side of face lit warm amber, right side
        washed cool cyan from the sigil..."},
      {"type": "text", "text": "INCANTATIO · III",
       "desc": "Hand-lettered header on the open left page..."}
    ]
  }
}
Generated image: ginger tabby cat in a wizard hat reading a glowing spellbook by candlelight, crystal ball beside it
The result — candle left, sigil right, split lighting on the cat's face, crystal ball, and the invented chapter heading. Every detail the LLM art-directed is in the image. 1024², V4_TURBO_12.

Gallery — photorealistic

I wrote a library of 14 hand-crafted JSON captions across every style I could think of and generated them all locally (V4_DEFAULT_20, seed 11). Each result below shows the source, the output, and an honest verdict. We start with photorealism.

Photorealistic studio portrait of an elderly fisherman in a navy cap and cream wool sweater, Rembrandt lighting
Studio portrait — 9.5 / 10
Photorealistic snow leopard resting on a rocky Himalayan cliff at golden hour, native 2048 by 2048
Native 2048² wildlife — Gemini magic prompt, 23.3 GB peak
fisherman caption.json — hand-written JSON prompt
{
  "high_level_description": "A tightly framed studio portrait photograph of an elderly Portuguese fisherman with deeply weathered skin, looking directly into the camera against a dark backdrop.",
  "style_description": {
    "aesthetics": "intimate, dignified, painterly realism",
    "lighting": "single Rembrandt key light from the upper left, soft falloff, dark moody background",
    "photo": "85mm portrait lens, f/2.0, shallow depth of field, eye-level, ultra-sharp focus on the eyes",
    "medium": "photograph",
    "color_palette": [
      "#2B2420",
      "#C8A172",
      "#704C2E",
      "#E8DCC8",
      "#1A1714"
    ]
  },
  "compositional_deconstruction": {
    "background": "A seamless charcoal-grey studio backdrop falling into near-black at the edges, with a faint warm gradient behind the subject's head.",
    "elements": [
      {
        "type": "obj",
        "desc": "An elderly man in his late seventies filling the frame from chest up. Deep wrinkles across his forehead and around pale blue eyes, white stubble on a sun-leathered face, a faded navy fisherman's cap pushed back on his head. He wears a coarse cream wool sweater with a rolled collar. His expression is calm and direct, the hint of a smile at the corner of his mouth."
      }
    ]
  }
}

The portrait asked for a single Rembrandt key light, an 85mm f/2 look, the navy cap and cream wool sweater — and delivered pore-level skin detail, white stubble, and catchlights in pale blue eyes with zero uncanny artifacts. Only miss: the requested "hint of a smile" came out neutral. The snow leopard ran the full pipeline — one plain sentence → my Gemini backend → JSON → native 2048², no upscaler.

Painting, cartoon, doodle, vector, storybook

Renaissance oil portrait of a young woman in an emerald velvet gown with gold-embroidered sleeves, window landscape upper right
Renaissance oil, 9.5 / 10: sfumato, glazing, craquelure
Cartoon red fox inventor with brass goggles and tool belt presenting a small hovering robot in an attic workshop
Cartoon fox, 8.5 / 10: style drifted painterly vs. cel-shaded
Hand-drawn notebook doodle page about coffee with the hand-lettered words BUT FIRST, COFFEE
Doodle page, 9 / 10: hand-lettering rendered perfectly
Minimal flat vector desert dunes at dawn with a striped hot air balloon and a small palm oasis
Flat vector dunes, 9 / 10: balloon exactly on its bbox
Storybook illustration: girl in a yellow raincoat with a lantern and her dog climbing toward a striped lighthouse in a storm
Storybook lighthouse, 9.5 / 10: the best prompt adherence of the whole set (girl, lantern, matching-raincoat terrier, glowing windows, beam, pencil-stroke rain); every clause delivered

Switching photo to art_style turns the model into a painter with deep art-history vocabulary: "sfumato blending, fine glazing layers, visible craquelure" produced exactly those. The one interesting miss of the whole experiment is the fox: every prop arrived (goggles, tool belt, one-eyed robot) but the requested flat TV-cartoon cel shading drifted toward painterly storybook. Style adherence is strong but not absolute: register can drift toward the model's design-forward house taste.

Software diagrams & cloud architecture

This is where bounding boxes earn their keep — and where the model's one systematic weakness shows. Structure: flawless. Small labels: occasionally wobbly.

Generated CI/CD flowchart with five labeled stages: commit, build, test, scan, deploy
CI/CD flowchart, 8.5 / 10: "SCAN" → "SCAH", one stray caption
Generated Google Cloud streaming pipeline diagram: Pub/Sub, Dataflow, BigQuery, Looker
GCP pipeline, 8.5 / 10: "Pub/Suk", "analytics wanshouse"
Generated microservices architecture: API gateway over four services with database cylinders and an event bus
Microservices system, 8.5 / 10: perfect topology, "AFI Gateway" typo
Rule of thumb. Trust it with topology — across all five layout-heavy generations, nothing was ever misplaced. Double-check the small print: medium-size labels are ~95% clean, tiny sub-captions garble. Thirty seconds of retouching makes any of these production-ready.

Infographics & magazine covers — English

Generated coffee infographic with title, two statistic callouts, an ascending bar chart and a source line
Coffee infographic, 8.5 / 10: words clean, chart numerals garbled
Generated CIRCUIT tech magazine cover with a humanoid robot overlapping the masthead
CIRCUIT cover, 9 / 10: masthead overlap exactly as designed

The infographic delivered its headline, both stat callouts (2.25 billion cups / 25 million growers), the ascending bars, and the source credit — but the numbers inside the chart corrupted ("652,00"). Lesson: words render better than digits. The magazine cover is near print-ready; I asked for the robot's head to partially overlap the masthead for depth and got exactly that layering. One small teaser line wobbled.

Arabic — the honest part

I saved Arabic for last deliberately, because the story is mixed and worth telling straight. Structurally these are impressive: the infographic flows right-to-left as Arabic must, text blocks are right-aligned, the three tip cards sit on their bounding boxes, and it even used proper Arabic-Indic numerals (١ ٢ ٣). The magazine's Andalusian-gate photography is stunning and the gold-on-green palette followed my hex codes exactly.

Generated Arabic water-conservation infographic, right-to-left layout with three tip cards — body text shows glyph artifacts
الحفاظ على الماء infographic — 6.5 / 10
Generated Arabic culture magazine cover over an Andalusian palace gate — masthead shows phantom glyphs
مجلة أفق cover, 7 / 10: carried by the photography
arabic infographic caption.json — RTL prompt with Arabic text elements
{
  "high_level_description": "انفوجرافيك عربي عمودي بعنوان الحفاظ على الماء يعرض ثلاث نصائح مرقمة مع أيقونات، بتخطيط من اليمين إلى اليسار وخط عربي حديث.",
  "style_description": {
    "aesthetics": "نظيف وهادئ بأسلوب تحريري عربي معاصر",
    "lighting": "إضاءة مسطحة مناسبة للطباعة",
    "medium": "vector infographic",
    "art_style": "تصميم مسطح حديث بأيقونات هندسية بسيطة وخط نسخ عربي واضح مع محاذاة يمينية كاملة",
    "color_palette": [
      "#EAF6F6",
      "#0E7490",
      "#22D3EE",
      "#155E75",
      "#FFFFFF"
    ]
  },
  "compositional_deconstruction": {
    "background": "خلفية فيروزية فاتحة عمودية مقسومة إلى شريط عنوان علوي وثلاث بطاقات بيضاء مستديرة الزوايا مرتبة عموديًا، مع موجات مائية رقيقة في أسفل التصميم.",
    "elements": [
      {
        "type": "text",
        "bbox": [
          50,
          150,
          170,
          850
        ],
        "text": "الحفاظ على الماء",
        "desc": "عنوان رئيسي كبير بخط عربي عريض باللون الأزرق الداكن في وسط الشريط العلوي، تحته قطرة ماء مرسومة ببساطة."
      },
      {
        "type": "text",
        "bbox": [
          240,
          100,
          400,
          900
        ],
        "text": "١. أغلق الصنبور أثناء تنظيف الأسنان",
        "desc": "البطاقة الأولى: نص النصيحة بمحاذاة اليمين وبجانبه من جهة اليمين أيقونة صنبور مغلق داخل دائرة فيروزية."
      },
      {
        "type": "text",
        "bbox": [
          440,
          100,
          600,
          900
        ],
        "text": "٢. استخدم الغسالة بحمولة كاملة",
        "desc": "البطاقة الثانية: نص النصيحة بمحاذاة اليمين وبجانبه أيقونة غسالة ملابس داخل دائرة فيروزية."
      },
      {
        "type": "text",
        "bbox": [
          640,
          100,
          800,
          900
        ],
        "text": "٣. اجمع ماء المطر لري النباتات",
        "desc": "البطاقة الثالثة: نص النصيحة بمحاذاة اليمين وبجانبه أيقونة سحابة مطر فوق نبتة داخل دائرة فيروزية."
      },
      {
        "type": "text",
        "bbox": [
          870,
          250,
          950,
          750
        ],
        "text": "كل قطرة تصنع فرقًا",
        "desc": "شعار ختامي صغير بخط رفيع في أسفل التصميم فوق الموجات المائية."
      }
    ]
  }
}
Where it falls short. Read the actual text and quality drops sharply: the infographic title picked up inserted phantom letters, body copy is heavily garbled with duplicated fragments, and the masthead أفق ("horizon" — three letters) grew extra glyphs. Arabic's connected, contextual letterforms are clearly much harder than Latin. Practical rule from these runs: keep Arabic short and display-sized — large single words fare far better than paragraphs — and budget for manual cleanup.

Quality scorecard — all 14, scored honestly

Each generation scored on four axes: adherence (did everything in the caption appear?), layout (were bboxes respected?), text (fidelity of rendered strings) and craft (quality of the medium itself). Set average: 8.6 / 10.

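Example | Adherence | Layout | Text | Craft | Overall
Fisherman portrait | 9.5 | 9 | n/a | 10 | 9.5
Renaissance oil | 9.5 | 9.5 | n/a | 10 | 9.5
Storybook lighthouse | 10 | 9.5 | n/a | 10 | 9.5
AWS architecture | 9.5 | 9.5 | 9 | 9.5 | 9.5
Coffee doodles | 9.5 | 9 | 9.5 | 9 | 9
Vector dunes | 9.5 | 8.5 | n/a | 9.5 | 9
Tech magazine (EN) | 9.5 | 9.5 | 8.5 | 9.5 | 9
Cartoon fox | 9 | 8.5 | n/a | 9 | 8.5
CI/CD flowchart | 9 | 9.5 | 8 | 9 | 8.5
GCP pipeline | 9 | 9.5 | 7 | 9 | 8.5
Microservices | 9 | 9.5 | 7.5 | 9 | 8.5
Coffee infographic (EN) | 9 | 9 | 7.5 | 9 | 8.5
Culture magazine (AR) | 8.5 | 9 | 5 | 9.5 | 7
Water infographic (AR) | 7.5 | 8.5 | 4.5 | 7.5 | 6.5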
The pattern. Text fidelity degrades in a clean order: large EN > small EN > large AR > small AR. Put exact strings in "text" fields, keep elements ≤ 5 per image, prefer round numbers in charts — and step up to V4_QUALITY_48 for typography-critical finals.
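As a quick arithmetic check, the set average can be recomputed from the Overall column of the scorecard. The dictionary below simply transcribes those fourteen overall scores.

```python
overall = {
    "Fisherman portrait": 9.5,
    "Renaissance oil": 9.5,
    "Storybook lighthouse": 9.5,
    "AWS architecture": 9.5,
    "Coffee doodles": 9.0,
    "Vector dunes": 9.0,
    "Tech magazine (EN)": 9.0,
    "Cartoon fox": 8.5,
    "CI/CD flowchart": 8.5,
    "GCP pipeline": 8.5,
    "Microservices": 8.5,
    "Coffee infographic (EN)": 8.5,
    "Culture magazine (AR)": 7.0,
    "Water infographic (AR)": 6.5,
}

mean = sum(overall.values()) / len(overall)
print(len(overall), round(mean, 1))  # 14 8.6
```

The two Arabic entries pull the mean down: without them the remaining twelve average just under 9.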

RTX 4090 benchmarks — 21 runs, real numbers

Three presets × seven aspect ratios, fixed JSON caption and seed, one warmup excluded, torch.cuda.synchronize() around every run. Each timing is the complete pipeline: text encode, all sampling steps, VAE decode.

Aspect ratio | Resolution | Turbo (12) | Default (20) | Quality (48) | Peak VRAM
1:1 | 1024×1024 | 20.9 s | 31.8 s | 72.0 s | 17.9 GB
1:1 hi-res | 2048×2048 | 121.6 s | 199.7 s | 462.0 s | 23.3 GB
3:2 | 1536×1024 | 31.7 s | 51.2 s | 120.2 s | 18.8 GB
2:3 | 1024×1536 | 32.2 s | 51.1 s | 119.4 s | 18.8 GB
16:9 | 1920×1088 | 45.3 s | 73.4 s | 172.6 s | 19.6 GB
9:16 | 1088×1920 | 45.3 s | 73.5 s | 173.3 s | 19.6 GB
4:1 banner | 1600×400 | 12.2 s | 18.7 s | 42.3 s | 17.2 GB
2x2 benchmark dashboard: grouped bars by preset, log-log scaling vs megapixels, per-step cost by resolution, peak VRAM vs megapixels with the 24 GB ceiling marked
The full dashboard — chart: author's benchmark, ideogram4 / benchmark / results.csv
Turbo preset bar chart: seconds per aspect ratio with per-step and VRAM annotations
Turbo (12 steps) — chart: author's benchmark
Default preset bar chart: seconds per aspect ratio with per-step and VRAM annotations
Default (20 steps) — chart: author's benchmark
Quality preset bar chart: 462 seconds at 2048 squared towering over the other aspect ratios
Quality (48 steps) — 2048² takes nearly 8 minutes. Chart: author's benchmark
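The measurement protocol described at the top of this section (one warmup excluded, a synchronize call around every run) can be sketched as a small harness. This is an illustration of the protocol, not the benchmark script itself; `benchmark` and its parameters are hypothetical names. On a GPU you would pass `torch.cuda.synchronize` as `sync` so that queued kernels are finished before the clock is read.

```python
import time


def benchmark(run, sync=lambda: None, warmup=1, repeats=1):
    """Time `run()` end to end, excluding warmup calls.

    `sync` is called before starting and after finishing each timed run,
    so asynchronous work cannot leak across the timing boundaries.
    """
    for _ in range(warmup):
        run()
    timings = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        run()
        sync()
        timings.append(time.perf_counter() - start)
    return timings


# CPU-only usage example with a stand-in workload.
calls = []
timings = benchmark(lambda: calls.append(sum(range(10_000))), warmup=1, repeats=3)
print(len(calls), len(timings))  # 4 3
```

Without the synchronize calls, a GPU timing would mostly measure how long it takes to enqueue work rather than how long the work takes.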

The three scaling laws

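  • Linear in steps. Per-step cost is nearly constant per resolution across presets (1024²: 1.74 → 1.59 → 1.50 s/step); the slight decline is consistent with the fixed text-encode and VAE-decode cost being amortized over more steps. Quality ≈ 2.3× Default; Turbo saves ~35%.
  • Super-linear in pixels. 4× the area costs ~6.3× the time at Default (5.8× at Turbo, 6.4× at Quality), consistent with quadratic attention over 16k image tokens at 2048². The log-log curve sits between linear and quadratic.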
  • VRAM follows resolution only, never steps: 17.2 → 23.3 GB, kissing the 4090's ceiling at 2048². Orientation is free — portrait ≡ landscape to the decimal.
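The first two laws can be reproduced from the benchmark table with a few lines of arithmetic. The sketch below uses only the 1024² and 2048² timings reported above; the scaling exponent is the slope between those two points on a log-log plot, so treat it as a rough estimate rather than a fitted curve.

```python
import math

# 1024² timings in seconds, keyed by step count (Turbo, Default, Quality).
times_1024 = {12: 20.9, 20: 31.8, 48: 72.0}
per_step = {steps: round(t / steps, 2) for steps, t in times_1024.items()}
print(per_step)  # {12: 1.74, 20: 1.59, 48: 1.5}

# Default preset: 1024² vs 2048² is a 4× increase in pixel area.
t_1024, t_2048 = 31.8, 199.7
ratio = t_2048 / t_1024
exponent = math.log(ratio) / math.log(4)
print(round(ratio, 1), round(exponent, 1))  # 6.3 1.3
```

An exponent of about 1.3 sits between 1 (purely linear in pixels) and 2 (purely quadratic), which matches the shape of the log-log curve in the dashboard.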

The practical recipe: draft in Turbo at 1024², work in Default at ≤ 2 MP, and reserve hi-res Quality renders for finals (and a coffee break).

References & further reading