Glossary
Plain-language definitions of the terms used across Modelglass — model architectures, training methods, the capability axes we score, and the core concepts behind image generation. 36 terms.
- Adversarial distillation Training
-
Distillation that adds a GAN-style discriminator so the few-step student is pushed to produce sharp, realistic outputs rather than the blurry averages few-step students otherwise tend toward. Used to keep quality high at very low step counts.
- Sharper few-step output than plain distillation
- More complex training; can introduce artifacts if mistuned
Examples: Adversarial Diffusion Distillation (SDXL-Turbo), Latent Adversarial Diffusion Distillation (FLUX.1 [schnell])
See also: Distillation, GAN (Generative Adversarial Network)
- Artistic style range Capability
-
The breadth of styles a model can produce (illustration, painting, 3D, anime, graphic design) and how well it follows style instructions. A wide ecosystem of fine-tunes/LoRAs effectively extends this axis.
See also: Fine-tuning
- Autoregressive Architecture
-
Generates an image as a sequence of tokens (like a language model emits text), one chunk at a time. Conceptually unifies image and text generation and can give strong prompt adherence, but sequential decoding can be slow.
- Unifies with LLM tooling; strong instruction following
- Sequential token decoding can be slow at high resolutions
Examples: Parti, token-based image models
- Base (foundation) model Training
-
A model trained from scratch on a large image-text corpus. It defines the ceiling of what downstream fine-tunes and distillations can do. When people compare "the model," they usually mean the base checkpoint.
- Most general and capable starting point
- Largest and most expensive to train and to run at full step count
- CFG scale (classifier-free guidance) Concept
-
A knob controlling how strongly generation is pulled toward the prompt. At each step the model makes both a prompt-conditioned and an unconditioned prediction, then extrapolates from the unconditioned one toward (and, at scales above 1, past) the conditioned one; the CFG scale sets how far. Low values drift off-prompt but look natural; high values follow the prompt hard but can oversaturate colours and introduce artifacts. Typical values sit around 5-9 for many diffusion models.
Also: guidance scale, classifier-free guidance, cfg
- Higher scale = stronger prompt adherence
- Too high oversaturates, burns contrast, and can deep-fry the image
- Too low ignores parts of the prompt
See also: Negative prompt, Latent diffusion
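The extrapolation described in this entry can be sketched in a few lines. This is an illustration only: `cfg_combine` and the two-element toy predictions are hypothetical, and real pipelines apply the same arithmetic to full noise-prediction tensors.

```python
def cfg_combine(uncond, cond, scale):
    """Start at the unconditioned prediction and move `scale` times the
    distance toward the prompt-conditioned prediction."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions (hypothetical values, not real model output).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]

print(cfg_combine(uncond, cond, 1.0))  # scale 1 reproduces the conditioned prediction
print(cfg_combine(uncond, cond, 7.0))  # larger scales overshoot past it
```

At a scale of 1 the result is the conditioned prediction itself; the overshoot at higher scales is what strengthens adherence and, when pushed too far, produces the oversaturation noted above.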
- Compositional accuracy Capability
-
Getting multi-object scenes right: correct counts, spatial relationships ("A on top of B"), and binding the right attribute to the right object (the "red cube, blue sphere" problem). Measured by GenEval and T2I-CompBench.
Examples: T2I-CompBench, GenEval counting/position
See also: Prompt adherence
- ControlNet Concept
-
An add-on network that conditions generation on a structural input — an edge map, depth map, pose skeleton, or scribble — so the output follows a specified composition or layout while the prompt controls content and style. The standard way to get precise spatial control out of diffusion models.
- Precise control over pose, layout, and structure
- Requires preparing a control image and adds inference cost
See also: Latent diffusion
- DDIM (Denoising Diffusion Implicit Models) Architecture
-
A deterministic, non-Markovian sampler that produces good images in far fewer steps than DDPM (e.g. 20-50). It is a sampling method layered on an already-trained diffusion model, not a different model; many of the schedulers exposed in tools follow the same deterministic, few-step approach.
- Much faster sampling than DDPM at similar quality
- Deterministic given a seed, which aids reproducibility
See also: DDPM (Denoising Diffusion Probabilistic Models), LCM (Latent Consistency Models)
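A minimal sketch of one deterministic DDIM update may make the "sampler, not model" point concrete. It assumes the common noise-prediction parameterisation with no added randomness (eta = 0); `ddim_step` and the scalar toy values are hypothetical, and a real sampler works on tensors with a trained network supplying `eps_pred`.

```python
import math


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a single scalar value.

    alpha_bar_* are cumulative signal fractions in (0, 1]; eps_pred is the
    model's noise estimate at the current step.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the less noisy target level, reusing eps_pred.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Toy check with a "perfect" noise prediction (hypothetical values).
x0, eps = 2.0, 0.5
alpha_bar_t = 0.25
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
x_clean = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
print(round(x_clean, 6))
```

Because no fresh noise is drawn inside the step, the same starting noise always leads to the same result, and the step can jump across many noise levels at once.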
- DDPM (Denoising Diffusion Probabilistic Models) Architecture
-
The original diffusion formulation: a fixed forward process gradually adds Gaussian noise, and the model learns to reverse it one small step at a time. High quality but needs many steps (often 50-1000), so it is slow.
- Stable training, high sample quality
- Slow: many sequential denoising steps
See also: DDIM (Denoising Diffusion Implicit Models), Latent diffusion
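The fixed forward process can be sketched as follows. The linear schedule values (1000 steps, beta from 1e-4 to 0.02) are an assumption taken from the commonly cited original DDPM setup, and the function names are hypothetical.

```python
import math
import random


def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction after each forward step for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Jump straight to a noise level: sqrt(a) * x0 + sqrt(1 - a) * Gaussian noise."""
    return [
        math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * rng.gauss(0.0, 1.0)
        for v in x0
    ]


alpha_bars = linear_alpha_bars()
rng = random.Random(0)
print(alpha_bars[0], alpha_bars[-1])  # almost all signal at first, almost none at the end
print(add_noise([1.0, -1.0], alpha_bars[499], rng))
```

The model is trained to undo this corruption; sampling then walks back through the noise levels one small step at a time, which is where the step count comes from.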
- Diffusion transformer (DiT) Architecture
-
A diffusion/flow model whose backbone is a transformer operating on image latent patches, replacing the convolutional U-Net. Transformers scale more predictably with data and parameters, which is why most frontier image models have moved to DiT-style backbones.
- Predictable scaling, strong text-image alignment at scale
- Compute-hungry; benefits most at large parameter counts
Examples: FLUX.1, Stable Diffusion 3, PixArt
See also: Flow matching / rectified flow, Latent diffusion
- Distillation Training
-
Training a smaller or fewer-step "student" to reproduce the behaviour of a slower "teacher" model. The headline payoff is dramatically cheaper, faster inference (e.g. 1-4 steps instead of 30+), usually with a small quality cost.
- Large inference speed/cost reduction
- Some loss of fine detail, prompt nuance, or diversity vs the teacher
Examples: FLUX.1 [schnell] (few-step distilled variant of FLUX.1), LCM
See also: Adversarial distillation, LCM (Latent Consistency Models), Base (foundation) model
- DPO (Direct Preference Optimization) Training
-
A simpler alternative to RLHF that optimises directly on preference pairs without training a separate reward model. Increasingly applied to image models to nudge them toward preferred aesthetics and adherence.
- Lighter-weight than full RLHF pipelines
- Still inherits the biases of the preference data
See also: RLHF / RLAIF
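As a rough sketch, the standard DPO objective for a single preference pair looks like this. The function and argument names are hypothetical, and the log-likelihood inputs are an assumption: diffusion models cannot compute them exactly, so image variants typically substitute a proxy based on denoising error.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for one preference pair.

    logp_* are the trained model's log-likelihoods of the preferred ("win")
    and rejected ("lose") samples; ref_logp_* come from a frozen reference copy.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))


print(dpo_loss(-1.0, -2.0, -1.5, -1.5))  # model already favours the preferred sample
print(dpo_loss(-2.0, -1.0, -1.5, -1.5))  # model favours the rejected sample: higher loss
```

No reward model appears anywhere in the calculation; the preference pair and a frozen reference model are the only inputs.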
- Fine-tuning Training
-
Continued training of a base model on a narrower dataset to specialise it (a style, a domain, a refiner stage). LoRA is the common lightweight form. Output characteristics shift toward the fine-tuning data.
- Cheap way to specialise; rich ecosystem (LoRAs, refiners)
- Can narrow diversity or overfit a look if overdone
Examples: SDXL refiner, community LoRAs
See also: Base (foundation) model
- Flow matching / rectified flow Architecture
-
A newer generative formulation that learns a straight-line "velocity field" transporting noise to data, instead of a stepwise denoising chain. The straighter paths mean high quality in fewer steps and cleaner scaling. FLUX and Stable Diffusion 3 use rectified-flow transformers.
- Strong quality with fewer steps; scales well
- Newer, with a smaller tooling/fine-tuning ecosystem than classic diffusion
Examples: FLUX.1, Stable Diffusion 3
See also: Diffusion transformer (DiT), Latent diffusion
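A toy sketch shows why straighter paths need fewer steps. It assumes the convention that t=0 is noise and t=1 is data, and replaces the trained network with a hypothetical constant velocity field, `straight_velocity`, which is what a perfectly straight path would imply.

```python
def euler_sample(x_start, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained network: on a perfectly straight path
# from `noise` to `target`, the velocity is constant and equals target - noise.
noise = [0.0, 0.0]
target = [1.0, -2.0]


def straight_velocity(x, t):
    return [b - a for a, b in zip(noise, target)]


print(euler_sample(noise, straight_velocity, 1))
print(euler_sample(noise, straight_velocity, 4))
```

With a perfectly straight path, one Euler step lands on the same point as four. Learned velocity fields are only approximately straight, so real models still benefit from several steps.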
- GAN (Generative Adversarial Network) Architecture
-
A generator and a discriminator trained against each other; the generator produces an image in a single forward pass. Extremely fast at inference, but historically harder to train and weaker at open-ended prompt following than diffusion. Now mostly used for speed-critical or distillation roles.
- Single-pass, very fast generation
- Unstable training; weaker prompt diversity/coverage than diffusion
See also: Adversarial distillation
- Hero / marketing asset Quality axis
-
The image is the focal point at large size — landing pages, ads, key art. Quality means high fidelity, correct details, on-brief composition, and usually legible text. Worth paying for a stronger model and more steps.
See also: Photorealism, Text rendering, Resolution ceiling
- Inference Concept
-
Running a trained model to produce an output, as distinct from training (teaching the model in the first place). When you pay per image at an API provider you are paying for inference compute — the model's weights are already fixed; each request just runs them forward over your prompt.
- The cost you actually pay per generation at an API provider
- Latency and price scale with steps, resolution, and model size
- Inference speed Capability
-
How quickly the model produces an image, driven mainly by step count and backbone size. Directly tied to per-image cost on compute-billed hosts and to user-facing latency.
- Few-step/distilled models are fast and cheap but trade some fidelity
See also: Distillation, LCM (Latent Consistency Models)
- Inpainting Concept
-
Regenerating only a masked region of an existing image while keeping the rest fixed — used to remove objects, fix hands, or edit a detail without redrawing the whole picture. Outpainting is the same idea extended beyond the original canvas to expand the scene.
Also: outpainting
- Targeted edits without regenerating the entire image
- Seams and lighting/style mismatches at the mask boundary if guidance is weak
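The "keep the rest fixed" part of inpainting comes down to a masked blend. The sketch below uses flat lists of pixel intensities and a hypothetical `composite` helper; real pipelines usually apply the blend in latent space during denoising, or use a model trained specifically for inpainting.

```python
def composite(original, generated, mask):
    """Blend per pixel: mask 1.0 takes the generated value, 0.0 keeps the original."""
    return [m * g + (1.0 - m) * o for o, g, m in zip(original, generated, mask)]


# Toy 4-pixel "images" (hypothetical values); only the middle two pixels are masked.
original = [0.2, 0.4, 0.6, 0.8]
generated = [0.9, 0.9, 0.1, 0.1]
mask = [0.0, 1.0, 1.0, 0.0]

print(composite(original, generated, mask))
```

Fractional mask values blend the two sources, which is one common way to feather the boundary and soften visible seams.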
- Latent diffusion Architecture
-
A diffusion model that denoises in a compressed latent space (via a VAE) rather than at full pixel resolution. Generation starts from noise and is iteratively denoised over many steps. Working in latent space is what made high-resolution generation affordable. Most "Stable Diffusion"-family models are latent diffusion with a U-Net backbone.
- Strong quality and a huge fine-tuning/LoRA ecosystem
- Many denoising steps means slower, more expensive inference than few-step methods
- U-Net backbones scale less cleanly than transformers
Examples: Stable Diffusion 1.5/2/XL, most open-weight image models
See also: DDPM (Denoising Diffusion Probabilistic Models), DDIM (Denoising Diffusion Implicit Models), Diffusion transformer (DiT), Flow matching / rectified flow
- LCM (Latent Consistency Models) Architecture
-
A distillation technique that trains a model to jump most of the way to the final image in a handful of steps (often 1-8) by learning a consistency mapping. Trades a little fidelity for a large speed/cost win.
- Very fast, few-step inference
- Slightly lower peak fidelity and detail than the full multi-step model
Examples: LCM-LoRA variants of SD/SDXL
See also: Distillation, DDIM (Denoising Diffusion Implicit Models)
- LoRA (Low-Rank Adaptation) Concept
-
A lightweight fine-tuning method that freezes the base model and trains small low-rank weight matrices alongside it. The resulting adapter is typically megabytes to a few hundred megabytes instead of gigabytes, can be trained on modest hardware, and is loaded on top of the base model at inference to add a style, character, or concept. LoRAs can be stacked and weighted, which is why open-weight ecosystems have thousands of them.
Also: low-rank adaptation
- Cheap to train and tiny to share versus a full fine-tune
- Composable: stack/blend multiple adapters at inference
- Less expressive than a full fine-tune; quality depends on the base model
See also: Distillation, Latent diffusion
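The size advantage and the "loaded on top" behaviour can both be sketched with plain lists. The layer dimensions, rank, and helper names here are hypothetical; real adapters apply this update to many layers at once.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full weight matrix vs. its low-rank adapter."""
    return d_in * d_out, rank * (d_in + d_out)


def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) using plain nested lists.

    weight is d_out x d_in, lora_b is d_out x rank, lora_a is rank x d_in.
    """
    rank = len(lora_a)
    return [
        [w + scale * sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
         for j, w in enumerate(row)]
        for i, row in enumerate(weight)
    ]


print(lora_param_counts(4096, 4096, 16))  # full matrix vs. rank-16 adapter
weight = [[1.0, 0.0], [0.0, 1.0]]
print(apply_lora(weight, lora_b=[[1.0], [2.0]], lora_a=[[3.0, 4.0]], scale=0.5))
```

The `scale` argument plays the role of the adapter weight users adjust when stacking or blending several LoRAs.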
- Negative prompt Concept
-
Text describing what you do not want in the image (e.g. "blurry, extra fingers, watermark"). Implemented via classifier-free guidance: instead of guiding away from an empty unconditioned prediction, the model guides away from the negative prompt's prediction, steering the result clear of those concepts. Not all models or APIs expose it.
- A direct lever for suppressing common failure modes and unwanted content
- Overuse can flatten variety or fight the positive prompt
See also: CFG scale (classifier-free guidance)
- Photorealism Capability
-
How convincingly the model renders real-world scenes, lighting, skin, and materials. Distinct from "looks nice" — a model can be highly aesthetic but stylised rather than photoreal.
- Print / high-resolution asset Quality axis
-
The image must hold up at print resolution and physical size. Demands the highest native resolution (or clean upscaling), no artifacts under scrutiny, and accurate detail. The most demanding axis; favours top-tier models.
See also: Resolution ceiling, Photorealism
- Prompt adherence Capability
-
How faithfully the image reflects everything the prompt asked for: objects, attributes, relationships, and intent. The single most important axis for most production use, and what alignment benchmarks and metrics (GenEval, DPG-Bench, CLIPScore) try to measure.
Examples: GenEval, DPG-Bench
See also: Compositional accuracy
- Rapid iteration / draft Quality axis
-
Quality good enough to explore ideas, storyboard, or generate many options fast and cheap. The operator cares about throughput and cost per image far more than peak fidelity — a fast, distilled model is usually the right call.
Examples: concepting, A/B option generation, moodboards
See also: Inference speed
- Resolution ceiling Capability
-
The largest / highest-quality native output the model produces before needing upscaling, and the set of aspect ratios it supports well. Matters for hero and print assets.
See also: Print / high-resolution asset
- RLHF / RLAIF Training
-
Post-training that optimises the model against a reward signal derived from human (RLHF) or AI (RLAIF) preferences. Pulls outputs toward what people tend to rate highly — aesthetics, prompt adherence, safety — sometimes at the cost of diversity.
- Better-aligned, more 'preferred' outputs
- Can reduce diversity or over-optimise to the reward model's biases
See also: DPO (Direct Preference Optimization)
- Sampler Concept
-
The algorithm that performs each denoising step — deciding how far to move and in what direction at every iteration. Different samplers (DDIM, DPM++, Euler, UniPC) trade speed against quality and reproducibility. Some providers expose sampler choice; many hosted APIs fix it.
Also: scheduler
- Sampler choice affects quality, convergence speed, and determinism
- Not all hosted APIs let you pick one
See also: Steps, DDIM (Denoising Diffusion Implicit Models), DDPM (Denoising Diffusion Probabilistic Models)
- Seed Concept
-
The number that initializes the random noise a generation starts from. Fixing the seed (with the same prompt and settings) makes a result reproducible; changing it explores variations. Seeds are how you A/B a single prompt change or recover an image you liked.
- Reproducibility: same seed + same inputs = same image
- Seed behaviour is implementation-specific and rarely portable across models
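A short sketch of why a fixed seed reproduces a result: the seed initialises the random generator that draws the starting noise. `initial_noise` is a hypothetical helper built on Python's standard `random` module; real pipelines use their framework's own generator, which is part of why seeds rarely carry over between implementations.

```python
import random


def initial_noise(seed, size):
    """Draw `size` Gaussian values from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = initial_noise(seed=42, size=4)
same_b = initial_noise(seed=42, size=4)
other = initial_noise(seed=43, size=4)

print(same_a == same_b)  # True: identical seed, identical starting noise
print(same_a == other)   # False: a different seed gives different noise
```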
- Steps Concept
-
The number of denoising iterations a diffusion model runs to turn noise into an image. More steps generally improve quality up to a point of diminishing returns, at the cost of higher latency and — for per-step or per-second pricing — higher cost. Few-step and distilled models (LCM, Turbo, schnell) deliberately cut this to a handful.
- More steps = more detail/coherence, up to a plateau
- Each step adds latency and, on metered pricing, cost
See also: Inference, Sampler, Latent diffusion
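The latency and cost relationship can be approximated with simple arithmetic. Every number below is a hypothetical placeholder rather than a measured figure or a real provider's price, and the linear model ignores effects such as batching and fixed per-request overhead.

```python
def estimate_generation(steps, seconds_per_step, price_per_second, overhead_seconds=0.0):
    """Rough latency (seconds) and cost for one image under per-second billing."""
    latency = overhead_seconds + steps * seconds_per_step
    return latency, latency * price_per_second


# Hypothetical figures: 0.125 s per step, 0.001 currency units per second.
full = estimate_generation(steps=32, seconds_per_step=0.125, price_per_second=0.001)
few = estimate_generation(steps=4, seconds_per_step=0.125, price_per_second=0.001)

print(full)
print(few)
print(full[1] / few[1])  # cost ratio between the 32-step and 4-step runs
```

Under these assumptions cost scales linearly with step count, which is the saving few-step and distilled models are designed to capture.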
- Text rendering Capability
-
The ability to render legible, correctly spelled text inside the image (signs, logos, labels). A long-standing weak spot for diffusion models; newer transformer-backbone models are markedly better.
- Thumbnail / small surface Quality axis
-
The image is consumed small (feed thumbnails, avatars, icons, list rows). Fine detail and perfect text are largely invisible at display size, so "quality" means clean composition and strong silhouette, not pixel-level fidelity. Cheap models punch above their weight here.
- Upscaling Concept
-
Increasing an image's resolution after generation. Diffusion-based upscalers add plausible detail rather than just interpolating pixels, which is how models that natively target ~1-2 megapixels reach print sizes. Often paired with a light second diffusion pass ("hi-res fix") to sharpen detail.
Also: super-resolution, hi-res fix
- Reaches resolutions beyond a model's native ceiling
- Can hallucinate detail that was not in the original
See also: Resolution ceiling
- VAE (Variational Autoencoder) Concept
-
The encoder/decoder that compresses an image into a small latent representation and reconstructs it back into pixels. Latent diffusion models denoise in this compressed space — far cheaper than working at full resolution — and the VAE decoder turns the final latent into the visible image. A weak or mismatched VAE shows up as washed-out colour, mushy detail, or artifacts even when the diffusion model itself is good.
Also: variational autoencoder, autoencoder
- Makes high-resolution generation affordable by shrinking what the model operates on
- Lossy: fine detail and exact colour can be degraded in the encode/decode round-trip
See also: Latent diffusion
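A quick calculation shows how much smaller the latent is than the image. The 8x spatial downsampling and 4 latent channels are assumptions matching the VAEs used by the Stable Diffusion 1.x and SDXL families; other models use different factors and channel counts.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent tensor shape (height, width, channels) for a given image size."""
    return height // downsample, width // downsample, latent_channels


def compression_ratio(height, width, downsample=8, latent_channels=4, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, downsample, latent_channels)
    return (height * width * image_channels) / (h * w * c)


print(latent_shape(1024, 1024))
print(compression_ratio(1024, 1024))
```

Under these assumptions a 1024x1024 RGB image is represented by 48 times fewer values. That reduction is what makes denoising cheaper, and the discarded information is why the round-trip is lossy.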