How Does AI Actually Create an Image? Diffusion Models Explained
When you order a personalised children’s book at myownchildbook.com, an AI generates, within seconds, an illustration that has never existed before. But how does it actually work? No pixel is copied, no existing image is reused. Here’s what really happens behind the scenes.

Step 1: understanding the text
The AI doesn’t start with a blank canvas. It starts with your description: “A 4-year-old girl with red hair and a blue dress walks through an enchanted forest.”
That sentence is first processed by a text encoder - a separate neural network that converts the meaning of language into a series of numbers. Every combination of words produces a unique numerical “fingerprint”. “Enchanted forest” is a different set of numbers than “ordinary forest”. Those numbers will guide the entire generation process.
The most widely used system for this is CLIP (Contrastive Language-Image Pre-training), trained on hundreds of millions of text-image pairs. It has learned how words like “picturesque”, “warm light” and “watercolour” relate to visual characteristics.
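To make the idea of a numerical “fingerprint” concrete, here is a toy stand-in for a text encoder. A real encoder like CLIP is a trained transformer; this sketch only uses a hash to show that every phrase maps deterministically to its own unit-length vector of numbers, and that “enchanted forest” and “ordinary forest” land on different vectors.

```python
import hashlib
import math

def toy_text_embedding(prompt: str, dim: int = 8) -> list[float]:
    """Toy stand-in for a CLIP-style text encoder: maps a prompt to a
    deterministic vector of numbers. A real encoder is a trained neural
    network; this hash trick only illustrates the 'fingerprint' idea."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]  # unit length, like real embeddings

enchanted = toy_text_embedding("enchanted forest")
ordinary = toy_text_embedding("ordinary forest")
# Different phrases -> different vectors, same phrase -> same vector.
```

In a real model, nearby meanings produce nearby vectors (which a hash cannot do); that geometric closeness is what lets the model generalise from “picturesque” to related visual styles.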
Step 2: starting from noise

This is where the clever part happens. A diffusion model generates an image by doing the reverse of what you might expect.
Training: the model was shown millions of images that were gradually converted into random noise - like a photo slowly fading to static on a TV screen. At each step, the model learned to predict the noise that had been added.
Generating: now the model reverses this process. It starts with a canvas of pure random noise and removes noise step by step, guided by the text encoder. After tens to hundreds of steps, a recognisable image appears.
This is called denoising diffusion. At each step the model predicts: “what is the most likely image here, given this text and this partially-noisy state?”
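The denoising loop can be sketched in a few lines. In this toy 1-D version the “model prediction” is simply handed the target values, where a real diffusion model would predict them from the text embedding and the current noisy state; the point is only the shape of the process: start from pure noise, take many small corrective steps, end at a clean result.

```python
import random

def generate(target: list[float], steps: int = 50, seed: int = 0) -> list[float]:
    """Toy reverse diffusion: begin with a canvas of pure random noise
    and repeatedly nudge it toward the model's prediction of the clean
    image. Here that prediction is 'target' itself (a cheat for
    illustration); a real model infers it at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure-noise starting canvas
    for t in range(steps):
        frac = 1.0 / (steps - t)               # step size grows toward the end
        x = [xi + frac * (ti - xi) for xi, ti in zip(x, target)]
    return x

target = [0.2, 0.8, 0.5]   # stand-in for "the image the text describes"
image = generate(target)    # after 50 steps, the noise has become the target
```

Real samplers (DDPM, DDIM and friends) use carefully derived noise schedules rather than this simple blend, but the structure - noise in, many small denoising steps, image out - is the same.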
Step 3: the latent space
Modern models don’t work with full images directly - that would require enormous computing power. Instead they work in a compressed latent space. An image of 1024x1024 pixels is first compressed into a much smaller representation - the essence of the image, without every individual pixel.
The diffusion process takes place in that compressed space. Only at the end does a decoder expand the result back into the full image. This makes the process far more efficient, with barely any perceptible loss of quality.
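A quick back-of-the-envelope calculation shows why this matters. The numbers below are Stable-Diffusion-like (8x spatial downsampling, 4 latent channels) and purely illustrative, not the spec of any particular model; the toy encoder/decoder pair shows the compress-then-expand round trip in one dimension.

```python
# How many numbers does the diffusion model have to denoise?
pixel_values = 1024 * 1024 * 3                   # full RGB image: ~3.1M values
latent_values = (1024 // 8) * (1024 // 8) * 4    # 128x128x4 latent: ~65K values
compression = pixel_values / latent_values       # 48x fewer values per step

def encode(row: list[float], f: int = 8) -> list[float]:
    """Toy 1-D 'encoder': average-pool by factor f (keeps the essence)."""
    return [sum(row[i:i + f]) / f for i in range(0, len(row), f)]

def decode(latent: list[float], f: int = 8) -> list[float]:
    """Toy 1-D 'decoder': nearest-neighbour upsample back to full size."""
    return [v for v in latent for _ in range(f)]
```

A real autoencoder is a trained neural network rather than average pooling, so its decoder reconstructs fine detail instead of blocky repeats, but the size arithmetic is the same: every denoising step works on roughly 48x fewer numbers.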
Step 4: style consistency

When you choose a style at myownchildbook.com - for example soft watercolour - that style description is included in every prompt for every page. The text encoder “knows” that watercolour is associated with soft edges, transparent layers and warm tones.
Because every image uses the same text encoder and the same diffusion model with a similar style description, visual coherence emerges - even though each illustration is generated independently.
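Mechanically, this is as simple as it sounds: the same style description is appended to every page's prompt. The page texts and style string below are hypothetical examples, not the actual prompts used by myownchildbook.com.

```python
# Hypothetical page descriptions and style string, for illustration only.
PAGES = [
    "A 4-year-old girl with red hair and a blue dress walks through an enchanted forest",
    "The girl meets a friendly fox beside a stream",
]
STYLE = "soft watercolour, transparent layers, soft edges, warm tones"

# Every page gets the identical style suffix, so the text encoder pulls
# each independent generation toward the same visual characteristics.
prompts = [f"{page}, {STYLE}" for page in PAGES]
```

Because the style words map to the same region of the text encoder's numerical space every time, each independently generated page is guided toward the same palette and texture - which is where the book's visual coherence comes from.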
What makes it unique?
A diffusion model doesn’t copy an existing image. It has learned patterns: how light falls, how hair moves, how a forest can look “enchanted”. It combines those patterns in a new way for your specific prompt.
Every image generated this way exists for the first time at the moment you ask for it.