Glossary

Plain-language definitions of the terms used across Modelglass — model architectures, training methods, the capability axes we score, and the core concepts behind image generation. 36 terms.

Adversarial distillation Training

Distillation that adds a GAN-style discriminator so the few-step student is pushed to produce sharp, realistic outputs rather than the blurry averages few-step students otherwise tend toward. Used to keep quality high at very low step counts.

Sharper few-step output than plain distillation
More complex training; can introduce artifacts if mistuned

Examples: Adversarial Diffusion Distillation (SDXL-Turbo), Latent Adversarial Diffusion Distillation (FLUX.1 [schnell])

See also: Distillation, GAN (Generative Adversarial Network)

Artistic style range Capability

The breadth of styles a model can produce (illustration, painting, 3D, anime, graphic design) and how well it follows style instructions. A wide ecosystem of fine-tunes/LoRAs effectively extends this axis.

See also: Fine-tuning

Autoregressive Architecture

Generates an image as a sequence of tokens (like a language model emits text), one chunk at a time. Conceptually unifies image and text generation and can give strong prompt adherence, but sequential decoding can be slow.

Unifies with LLM tooling; strong instruction following
Sequential token decoding can be slow at high resolutions

Examples: Parti, token-based image models

Base (foundation) model Training

A model trained from scratch on a large image-text corpus. It defines the ceiling of what downstream fine-tunes and distillations can do. When people compare "the model," they usually mean the base checkpoint.

Most general and capable starting point
Largest and most expensive to train and to run at full step count

CFG scale (classifier-free guidance) Concept

A knob controlling how strongly generation is pulled toward the prompt. At each step the model makes both a prompt-conditioned and an unconditioned prediction, then extrapolates from the unconditioned one through the conditioned one; the CFG scale sets how far. Low values drift off-prompt but look natural; high values follow the prompt hard but can oversaturate colours and introduce artifacts. Typical values sit around 5-9 for many diffusion models.
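A minimal sketch of that extrapolation, assuming the two predictions are available as plain lists of floats. The function name `apply_cfg` and the toy values are illustrative only, not any library's API.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance on two predictions of equal length.

    Starts at the unconditioned prediction and moves `scale` times
    along the direction (cond - uncond).
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Made-up two-element "noise predictions", purely for illustration.
uncond_pred = [0.0, 1.0]
cond_pred = [1.0, 3.0]

print(apply_cfg(uncond_pred, cond_pred, 0.0))  # scale 0: the unconditioned prediction
print(apply_cfg(uncond_pred, cond_pred, 1.0))  # scale 1: the conditioned prediction
print(apply_cfg(uncond_pred, cond_pred, 7.0))  # scale 7: pushed well past the conditioned one
```

Any scale above 1 lands beyond the conditioned prediction, which is one way to see why very high values can overshoot into oversaturated results.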

Also: guidance scale, classifier-free guidance, cfg

Higher scale = stronger prompt adherence
Too high oversaturates, burns contrast, and can deep-fry the image
Too low ignores parts of the prompt

See also: Negative prompt, Latent diffusion

Compositional accuracy Capability

Getting multi-object scenes right: correct counts, spatial relationships ("A on top of B"), and binding the right attribute to the right object (the "red cube, blue sphere" problem). Measured by GenEval and T2I-CompBench.

Examples: T2I-CompBench, GenEval counting/position

See also: Prompt adherence

ControlNet Concept

An add-on network that conditions generation on a structural input — an edge map, depth map, pose skeleton, or scribble — so the output follows a specified composition or layout while the prompt controls content and style. The standard way to get precise spatial control out of diffusion models.

Precise control over pose, layout, and structure
Requires preparing a control image and adds inference cost

See also: Latent diffusion

DDIM (Denoising Diffusion Implicit Models) Architecture

A deterministic, non-Markovian sampler that produces good images in far fewer steps than DDPM (e.g. 20-50). It is a sampling method layered on a trained diffusion model, not a different model, and many of the deterministic schedulers offered in tools build on the same idea.
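One deterministic DDIM update can be sketched as follows, assuming the noise-prediction form of the model and the zero-randomness setting (often written eta = 0). Here `alpha_bar` stands for the cumulative signal fraction from the noise schedule; the function name and toy numbers are illustrative.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a list of floats.

    1. Estimate the clean sample x0 from the current sample and predicted noise.
    2. Re-noise that estimate to the (lower) noise level of the previous timestep,
       reusing the same predicted noise instead of drawing fresh randomness.
    """
    out = []
    for x, e in zip(x_t, eps_pred):
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * e)
    return out


# Toy check: build a noisy sample from a known x0 and known noise, then
# pretend the model predicted that noise perfectly.
x0 = [0.5, -2.0]
eps = [2.0, 0.25]
alpha_bar_t = 0.25
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
       for x, e in zip(x0, eps)]

# Jumping straight to alpha_bar_prev = 1.0 (no noise left) recovers x0.
recovered = ddim_step(x_t, eps, alpha_bar_t, 1.0)
```

Because no new noise is drawn inside the step, repeating the call with the same inputs gives the same result, which is where the reproducibility comes from.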

Much faster sampling than DDPM at similar quality
Deterministic given a seed, which aids reproducibility

See also: DDPM (Denoising Diffusion Probabilistic Models), LCM (Latent Consistency Models)

DDPM (Denoising Diffusion Probabilistic Models) Architecture

The original diffusion formulation: a fixed forward process gradually adds Gaussian noise, and the model learns to reverse it one small step at a time. High quality but needs many sequential steps (typically hundreds, up to 1000 in the original formulation), so it is slow.
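The forward (noising) half of that process can be sketched in a few lines. This assumes a linear noise schedule from 1e-4 to 0.02 over 1000 steps, a commonly cited DDPM setting; the function names and the three-value "image" are illustrative.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for all s up to t."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t directly from x0: sqrt(a) * x0 + sqrt(1 - a) * gaussian noise."""
    return [math.sqrt(alpha_bar_t) * v
            + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)
            for v in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]                                   # stand-in for image data
slightly_noisy = add_noise(x0, alpha_bars[10], rng)     # early step: mostly signal
almost_pure_noise = add_noise(x0, alpha_bars[-1], rng)  # final step: mostly noise
```

The model is trained to undo this corruption; sampling then walks the schedule backwards one step at a time, which is why the step count is so high.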

Stable training, high sample quality
Slow: many sequential denoising steps

See also: DDIM (Denoising Diffusion Implicit Models), Latent diffusion

Diffusion transformer (DiT) Architecture

A diffusion/flow model whose backbone is a transformer operating on image latent patches, replacing the convolutional U-Net. Transformers scale more predictably with data and parameters, which is why most frontier image models have moved to DiT-style backbones.

Predictable scaling, strong text-image alignment at scale
Compute-hungry; benefits most at large parameter counts

Examples: FLUX.1, Stable Diffusion 3, PixArt

See also: Flow matching / rectified flow, Latent diffusion

Distillation Training

Training a smaller or fewer-step "student" to reproduce the behaviour of a slower "teacher" model. The headline payoff is dramatically cheaper, faster inference (e.g. 1-4 steps instead of 30+), usually with a small quality cost.

Large inference speed/cost reduction
Some loss of fine detail, prompt nuance, or diversity vs the teacher

Examples: FLUX.1 [schnell] (a few-step distilled variant of FLUX.1), LCM

See also: Adversarial distillation, LCM (Latent Consistency Models), Base (foundation) model

DPO (Direct Preference Optimization) Training

A simpler alternative to RLHF that optimises directly on preference pairs without training a separate reward model. Increasingly applied to image models to nudge them toward preferred aesthetics and adherence.
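The core of the DPO objective for one preference pair can be sketched as below. It assumes log-probabilities for the preferred ("win") and rejected ("lose") samples are available from both the model being trained and a frozen reference copy; image variants typically substitute denoising-error terms for these, which this sketch does not cover. The function name and `beta` default are illustrative.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for a single preference pair.

    The margin measures how much more the trained model favours the
    preferred sample over the rejected one, relative to the reference.
    Loss is -log(sigmoid(beta * margin)), written as log(1 + exp(-z)).
    """
    margin = ((policy_logp_win - ref_logp_win)
              - (policy_logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-beta * margin))


# Identical to the reference model: no preference learned yet.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)

# Model now favours the preferred sample more than the reference does.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere in the calculation; the preference pair and the frozen reference are the only training signal.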

Lighter-weight than full RLHF pipelines
Still inherits the biases of the preference data

See also: RLHF / RLAIF

Fine-tuning Training

Continued training of a base model on a narrower dataset to specialise it (a style, a domain, a refiner stage). LoRA is the common lightweight form. Output characteristics shift toward the fine-tuning data.

Cheap way to specialise; rich ecosystem (LoRAs, refiners)
Can narrow diversity or overfit a look if overdone

Examples: SDXL refiner, community LoRAs

See also: Base (foundation) model

Flow matching / rectified flow Architecture

A newer generative formulation that learns a "velocity field" transporting noise to data along near-straight paths, instead of a stepwise denoising chain. The straighter paths mean high quality in fewer steps and cleaner scaling. FLUX and Stable Diffusion 3 use rectified-flow transformers.
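A toy sketch of why straight paths allow few steps. It assumes the convention that t = 0 is pure noise and t = 1 is data (some implementations use the opposite), and it replaces the trained network with an idealised velocity function that already knows the answer; all names are illustrative.

```python
def interpolate(noise, data, t):
    """Point on the straight line between noise (t = 0) and data (t = 1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Training target: the constant direction from noise to data."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(start, velocity_fn, num_steps):
    """Integrate the velocity field from t = 0 to t = 1 with Euler steps."""
    x = list(start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2]
data = [1.0, 2.0]


def ideal_velocity(x, t):
    # Stand-in for a trained network (assumption: a perfect prediction).
    return target_velocity(noise, data)


one_step = euler_sample(noise, ideal_velocity, 1)
four_steps = euler_sample(noise, ideal_velocity, 4)
```

With a perfectly straight path, one Euler step and four Euler steps land on the same point. Real learned fields are only approximately straight, so a handful of steps is still usually needed.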

Strong quality with fewer steps; scales well
Newer, with a smaller tooling/fine-tuning ecosystem than classic diffusion

Examples: FLUX.1, Stable Diffusion 3

See also: Diffusion transformer (DiT), Latent diffusion

GAN (Generative Adversarial Network) Architecture

A generator and a discriminator trained against each other; the generator produces an image in a single forward pass. Extremely fast at inference, but historically harder to train and weaker at open-ended prompt following than diffusion. Now mostly used for speed-critical or distillation roles.

Single-pass, very fast generation
Unstable training; weaker prompt diversity/coverage than diffusion

See also: Adversarial distillation

Hero / marketing asset Quality axis

The image is the focal point at large size — landing pages, ads, key art. Quality means high fidelity, correct details, on-brief composition, and usually legible text. Worth paying for a stronger model and more steps.

See also: Photorealism, Text rendering, Resolution ceiling

Inference Concept

Running a trained model to produce an output, as distinct from training (teaching the model in the first place). When you pay per image at an API provider you are paying for inference compute — the model's weights are already fixed; each request just runs them forward over your prompt.

The cost you actually pay per generation at an API provider
Latency and price scale with steps, resolution, and model size

See also: Steps, Sampler

Inference speed Capability

How quickly the model produces an image, driven mainly by step count and backbone size. Directly tied to per-image cost on compute-billed hosts and to user-facing latency.

Few-step/distilled models are fast and cheap but trade some fidelity

See also: Distillation, LCM (Latent Consistency Models)

Inpainting Concept

Regenerating only a masked region of an existing image while keeping the rest fixed — used to remove objects, fix hands, or edit a detail without redrawing the whole picture. Outpainting is the same idea extended beyond the original canvas to expand the scene.
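The "keep the rest fixed" part can be sketched as a per-pixel blend. This is a simplified illustration, assuming flat lists of pixel values and a mask where 1.0 means "regenerate" and 0.0 means "keep"; real pipelines usually apply the mask in latent space during denoising, and the function name is hypothetical.

```python
def composite(original, generated, mask):
    """Blend generated content into the original under a mask.

    mask = 1.0 takes the generated value, mask = 0.0 keeps the original,
    and in-between values feather the boundary.
    """
    return [m * g + (1.0 - m) * o
            for o, g, m in zip(original, generated, mask)]


original = [10.0, 20.0, 30.0]
generated = [1.0, 2.0, 3.0]
mask = [0.0, 1.0, 0.5]  # keep, replace, feathered edge

result = composite(original, generated, mask)
```

Feathered mask values like the 0.5 above are one common way to soften the transition at the edge of the edited region.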

Also: outpainting

Targeted edits without regenerating the entire image
Seams and lighting/style mismatches at the mask boundary if guidance is weak

Latent diffusion Architecture

A diffusion model that denoises in a compressed latent space (via a VAE) rather than at full pixel resolution. Generation starts from noise and is iteratively denoised over many steps. Working in latent space is what made high-resolution generation affordable. Most "Stable Diffusion"-family models are latent diffusion with a U-Net backbone.

Strong quality and a huge fine-tuning/LoRA ecosystem
Many denoising steps means slower, more expensive inference than few-step methods
U-Net backbones scale less cleanly than transformers

Examples: Stable Diffusion 1.5/2/XL, most open-weight image models

See also: DDPM (Denoising Diffusion Probabilistic Models), DDIM (Denoising Diffusion Implicit Models), Diffusion transformer (DiT), Flow matching / rectified flow

LCM (Latent Consistency Models) Architecture

A distillation technique that trains a model to jump most of the way to the final image in a handful of steps (often 1-8) by learning a consistency mapping. Trades a little fidelity for a large speed/cost win.

Very fast, few-step inference
Slightly lower peak fidelity and detail than the full multi-step model

Examples: LCM-LoRA variants of SD/SDXL

See also: Distillation, DDIM (Denoising Diffusion Implicit Models)

LoRA (Low-Rank Adaptation) Concept

A lightweight fine-tuning method that freezes the base model and trains small low-rank weight matrices alongside it. The resulting adapter is typically megabytes to a few hundred megabytes rather than the gigabytes of a full checkpoint, can be trained on modest hardware, and is loaded on top of the base model at inference to add a style, character, or concept. LoRAs can be stacked and weighted, which is why open-weight ecosystems have thousands of them.
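A small sketch of the arithmetic behind that size difference, and of how an adapter is merged into a frozen weight matrix. It assumes the usual formulation W' = W + (alpha / rank) * B A; the layer size, rank, and function names are illustrative choices, not taken from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable values in a full weight matrix vs. its low-rank adapter."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)  # B is d_out x rank, A is rank x d_in
    return full, adapter


def merge_lora(W, B, A, alpha):
    """Return W + (alpha / rank) * (B @ A) using nested lists."""
    rank = len(A)
    scale = alpha / rank
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))]
            for i in range(len(W))]


# Hypothetical 4096 x 4096 layer with a rank-16 adapter.
full, adapter = lora_param_counts(4096, 4096, 16)
reduction = full / adapter

# Tiny merge example: 2 x 2 frozen weights, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[3.0, 4.0]]
merged = merge_lora(W, B, A, alpha=1.0)
```

Stacking adapters amounts to adding several such scaled B A terms to the same frozen W, each with its own weight.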

Also: low-rank adaptation

Cheap to train and tiny to share versus a full fine-tune
Composable: stack/blend multiple adapters at inference
Less expressive than a full fine-tune; quality depends on the base model

See also: Fine-tuning, Distillation, Latent diffusion

Negative prompt Concept

Text describing what you do not want in the image (e.g. "blurry, extra fingers, watermark"). Implemented via classifier-free guidance: instead of guiding away from an empty unconditioned prediction, the model guides away from the negative prompt's prediction, steering the result clear of those concepts. Not all models or APIs expose it.
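The swap described above can be illustrated with toy numbers. This sketch assumes the same guidance formula is used in both cases and that only the baseline prediction changes; the function name and values are made up for illustration.

```python
def guide(baseline, positive, scale):
    """Move from the baseline prediction toward (and past) the positive one."""
    return [b + scale * (p - b) for b, p in zip(baseline, positive)]


positive_pred = [1.0, 3.0]   # prediction for the prompt
empty_pred = [0.0, 1.0]      # prediction for an empty prompt
negative_pred = [2.0, 1.0]   # prediction for the negative prompt

plain = guide(empty_pred, positive_pred, 5.0)
with_negative = guide(negative_pred, positive_pred, 5.0)
```

In the first component the negative prediction sits above the positive one, so guidance pushes the result down and away from it, while the second component is unchanged because the two baselines agree there.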

A direct lever for suppressing common failure modes and unwanted content
Overuse can flatten variety or fight the positive prompt

See also: CFG scale (classifier-free guidance)

Photorealism Capability

How convincingly the model renders real-world scenes, lighting, skin, and materials. Distinct from "looks nice" — a model can be highly aesthetic but stylised rather than photoreal.

Print / high-resolution asset Quality axis

The image must hold up at print resolution and physical size. Demands the highest native resolution (or clean upscaling), no artifacts under scrutiny, and accurate detail. The most demanding axis; favours top-tier models.

See also: Resolution ceiling, Photorealism

Prompt adherence Capability

How faithfully the image reflects everything the prompt asked for: objects, attributes, relationships, and intent. The single most important axis for most production use, and what alignment benchmarks and metrics (GenEval, DPG-Bench, CLIPScore) try to measure.

Examples: GenEval, DPG-Bench

See also: Compositional accuracy

Rapid iteration / draft Quality axis

Quality good enough to explore ideas, storyboard, or generate many options fast and cheap. The operator cares about throughput and cost per image far more than peak fidelity — a fast, distilled model is usually the right call.

Examples: concepting, A/B option generation, moodboards

See also: Inference speed

Resolution ceiling Capability

The largest / highest-quality native output the model produces before needing upscaling, and the set of aspect ratios it supports well. Matters for hero and print assets.

See also: Print / high-resolution asset

RLHF / RLAIF Training

Post-training that optimises the model against a reward signal derived from human (RLHF) or AI (RLAIF) preferences. Pulls outputs toward what people tend to rate highly — aesthetics, prompt adherence, safety — sometimes at the cost of diversity.

Better-aligned, more 'preferred' outputs
Can reduce diversity or over-optimise to the reward model's biases

See also: DPO (Direct Preference Optimization)

Sampler Concept

The algorithm that performs each denoising step — deciding how far to move and in what direction at every iteration. Different samplers (DDIM, DPM++, Euler, UniPC) trade speed against quality and reproducibility. Some providers expose sampler choice; many hosted APIs fix it.

Also: scheduler

Sampler choice affects quality, convergence speed, and determinism
Not all hosted APIs let you pick one

See also: Steps, DDIM (Denoising Diffusion Implicit Models), DDPM (Denoising Diffusion Probabilistic Models)

Seed Concept

The number that initializes the random noise a generation starts from. Fixing the seed (with the same prompt and settings) makes a result reproducible; changing it explores variations. Seeds are how you A/B a single prompt change or recover an image you liked.
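A minimal illustration of the idea using Python's standard random generator. Real pipelines use their own framework's generator, so this only demonstrates the principle; `initial_noise` is a hypothetical helper.

```python
import random


def initial_noise(seed, size):
    """Gaussian starting noise that depends only on the seed and size."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


first = initial_noise(42, 4)
repeat = initial_noise(42, 4)      # same seed: identical starting noise
variation = initial_noise(43, 4)   # different seed: a different starting point
```

Everything downstream of the noise is deterministic only if the sampler and settings are too, which is why the same prompt and settings are part of the recipe.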

Reproducibility: same seed + same inputs = same image
Seed behaviour is implementation-specific and rarely portable across models

Steps Concept

The number of denoising iterations a diffusion model runs to turn noise into an image. More steps generally improve quality up to a point of diminishing returns, at the cost of higher latency and — for per-step or per-second pricing — higher cost. Few-step and distilled models (LCM, Turbo, schnell) deliberately cut this to a handful.
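A back-of-envelope sketch of how step count feeds latency and cost under per-second pricing. Every number here is a made-up placeholder rather than a real provider's rate, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(steps, seconds_per_step, price_per_second,
                  overhead_seconds=0.0):
    """Return (latency in seconds, cost) for one image.

    Assumes time grows linearly with steps plus a fixed overhead
    (model load, VAE decode), which is a simplification.
    """
    latency = overhead_seconds + steps * seconds_per_step
    return latency, latency * price_per_second


# Placeholder figures for illustration only.
draft_latency, draft_cost = estimate_cost(4, 0.25, 0.001)
final_latency, final_cost = estimate_cost(32, 0.25, 0.001)
```

With no fixed overhead, eight times the steps means eight times the latency and cost, which is the trade-off few-step models are designed to avoid.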

More steps = more detail/coherence, up to a plateau
Each step adds latency and, on metered pricing, cost

See also: Inference, Sampler, Latent diffusion

Text rendering Capability

The ability to render legible, correctly-spelled text inside the image (signs, logos, labels). A long-standing weak spot for diffusion models; newer transformer-backbone models are markedly better.

Thumbnail / small surface Quality axis

The image is consumed small (feed thumbnails, avatars, icons, list rows). Fine detail and perfect text are largely invisible at display size, so "quality" means clean composition and strong silhouette, not pixel-level fidelity. Cheap models punch above their weight here.

Upscaling Concept

Increasing an image's resolution after generation. Diffusion-based upscalers add plausible detail rather than just interpolating pixels, which is how models that natively target ~1-2 megapixels reach print sizes. Often paired with a light second diffusion pass ("hi-res fix") to sharpen detail.
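To make the gap between native output and print size concrete, here is a small calculation. It assumes a 300 DPI print target, a common rule of thumb rather than a fixed requirement, and a 1024 x 1024 native image; the helper names are illustrative.

```python
def pixels_for_print(width_in, height_in, dpi=300):
    """Pixel dimensions needed to print at the given size and DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def megapixels(width_px, height_px):
    return width_px * height_px / 1_000_000


def upscale_factor(native_w, native_h, target_w, target_h):
    """Linear scale needed so both dimensions reach the target."""
    return max(target_w / native_w, target_h / native_h)


# An 8 x 10 inch print at the assumed 300 DPI.
target_w, target_h = pixels_for_print(8, 10)
target_mp = megapixels(target_w, target_h)
factor = upscale_factor(1024, 1024, target_w, target_h)
```

Under these assumptions the print needs about 7 megapixels, roughly a 3x linear upscale from a 1-megapixel native image.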

Also: super-resolution, hi-res fix

Reaches resolutions beyond a model's native ceiling
Can hallucinate detail that was not in the original

See also: Resolution ceiling

VAE (Variational Autoencoder) Concept

The encoder/decoder that compresses an image into a small latent representation and reconstructs it back into pixels. Latent diffusion models denoise in this compressed space — far cheaper than working at full resolution — and the VAE decoder turns the final latent into the visible image. A weak or mismatched VAE shows up as washed-out colour, mushy detail, or artifacts even when the diffusion model itself is good.
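The saving can be put in numbers with a short calculation. It assumes an 8x spatial downsampling factor and 4 latent channels, as in the Stable Diffusion 1.x and SDXL autoencoders; other models use different settings, and the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """(channels, height, width) of the latent for a given image size."""
    return latent_channels, height // downsample, width // downsample


def compression_ratio(height, width, downsample=8, latent_channels=4,
                      image_channels=3):
    """How many pixel values there are per latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(1024, 1024)
ratio = compression_ratio(1024, 1024)
```

Under these assumptions a 1024 x 1024 RGB image becomes a 4 x 128 x 128 latent, so the diffusion model works on 48 times fewer values than the pixel grid holds.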

Also: variational autoencoder, autoencoder

Makes high-resolution generation affordable by shrinking what the model operates on
Lossy: fine detail and exact colour can be degraded in the encode/decode round-trip

See also: Latent diffusion