How Does AI Actually Create an Image? Diffusion Models Explained
When you order a personalised children’s book at myownchildbook.com, an AI generates, within seconds, an illustration that has never existed before. But how does it actually work? No pixel is copied, no existing image is reused. Here’s what really happens behind the scenes.

Step 1: understanding the text
The AI doesn’t start with a blank canvas. It starts with your description: “A 4-year-old girl with red hair and a blue dress walks through an enchanted forest.”
That sentence is first processed by a text encoder - a separate neural network that converts the meaning of language into a series of numbers. Every combination of words produces a unique numerical “fingerprint”. “Enchanted forest” is a different set of numbers than “ordinary forest”. Those numbers will guide the entire generation process.
The most widely used system for this is CLIP (Contrastive Language-Image Pre-training), trained on hundreds of millions of text-image pairs. It has learned how words like “picturesque”, “warm light” and “watercolour” relate to visual characteristics.
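To make the idea of a numerical “fingerprint” concrete, here is a toy stand-in for a text encoder. A real encoder like CLIP is a trained transformer; this sketch only uses a hash to show that every phrase maps deterministically to its own unit-length vector of numbers, and that “enchanted forest” and “ordinary forest” land on different vectors.

```python
import hashlib
import math

def toy_text_embedding(prompt: str, dim: int = 8) -> list[float]:
    """Toy stand-in for a CLIP-style text encoder: maps a prompt to a
    deterministic vector of numbers. A real encoder is a trained neural
    network; this hash trick only illustrates the 'fingerprint' idea."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]  # unit length, like real embeddings

enchanted = toy_text_embedding("enchanted forest")
ordinary = toy_text_embedding("ordinary forest")
# Different phrases -> different vectors, same phrase -> same vector.
```

In a real model, nearby meanings produce nearby vectors (which a hash cannot do); that geometric closeness is what lets the model generalise from “picturesque” to related visual styles.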
Step 2: starting from noise

This is where the clever part happens. A diffusion model generates an image by doing the reverse of what you might expect.
Training: the model was shown millions of images that were gradually converted into random noise - like a photo slowly fading to static on a TV screen. At each step, the model learned to predict the noise that had been added.
Generating: now the model reverses this process. It starts with a canvas of pure random noise and removes noise step by step, guided by the text encoder. After tens to hundreds of steps, a recognisable image appears.
This is called denoising diffusion. At each step the model predicts: “what is the most likely image here, given this text and this partially-noisy state?”
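The denoising loop can be sketched in a few lines. In this toy 1-D version the “model prediction” is simply handed the target values, where a real diffusion model would predict them from the text embedding and the current noisy state; the point is only the shape of the process: start from pure noise, take many small corrective steps, end at a clean result.

```python
import random

def generate(target: list[float], steps: int = 50, seed: int = 0) -> list[float]:
    """Toy reverse diffusion: begin with a canvas of pure random noise
    and repeatedly nudge it toward the model's prediction of the clean
    image. Here that prediction is 'target' itself (a cheat for
    illustration); a real model infers it at every step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure-noise starting canvas
    for t in range(steps):
        frac = 1.0 / (steps - t)               # step size grows toward the end
        x = [xi + frac * (ti - xi) for xi, ti in zip(x, target)]
    return x

target = [0.2, 0.8, 0.5]   # stand-in for "the image the text describes"
image = generate(target)    # after 50 steps, the noise has become the target
```

Real samplers (DDPM, DDIM and friends) use carefully derived noise schedules rather than this simple blend, but the structure - noise in, many small denoising steps, image out - is the same.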
Step 3: the latent space
Modern models don’t work with full images directly - that would require enormous computing power. Instead they work in a compressed latent space. An image of 1024x1024 pixels is first compressed into a much smaller representation - the essence of the image, without every individual pixel.
The diffusion process takes place in that compressed space. Only at the end does a decoder expand the result back into the full image. This makes the process far more efficient, with barely any perceptible loss of quality.
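A quick back-of-the-envelope calculation shows why this matters. The numbers below are Stable-Diffusion-like (8x spatial downsampling, 4 latent channels) and purely illustrative, not the spec of any particular model; the toy encoder/decoder pair shows the compress-then-expand round trip in one dimension.

```python
# How many numbers does the diffusion model have to denoise?
pixel_values = 1024 * 1024 * 3                   # full RGB image: ~3.1M values
latent_values = (1024 // 8) * (1024 // 8) * 4    # 128x128x4 latent: ~65K values
compression = pixel_values / latent_values       # 48x fewer values per step

def encode(row: list[float], f: int = 8) -> list[float]:
    """Toy 1-D 'encoder': average-pool by factor f (keeps the essence)."""
    return [sum(row[i:i + f]) / f for i in range(0, len(row), f)]

def decode(latent: list[float], f: int = 8) -> list[float]:
    """Toy 1-D 'decoder': nearest-neighbour upsample back to full size."""
    return [v for v in latent for _ in range(f)]
```

A real autoencoder is a trained neural network rather than average pooling, so its decoder reconstructs fine detail instead of blocky repeats, but the size arithmetic is the same: every denoising step works on roughly 48x fewer numbers.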
Step 4: style consistency

When you choose a style at myownchildbook.com - for example soft watercolour - that style description is included in every prompt for every page. The text encoder “knows” that watercolour is associated with soft edges, transparent layers and warm tones.
Because every image uses the same text encoder and the same diffusion model with a similar style description, visual coherence emerges - even though each illustration is generated independently.
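Mechanically, this is as simple as it sounds: the same style description is appended to every page's prompt. The page texts and style string below are hypothetical examples, not the actual prompts used by myownchildbook.com.

```python
# Hypothetical page descriptions and style string, for illustration only.
PAGES = [
    "A 4-year-old girl with red hair and a blue dress walks through an enchanted forest",
    "The girl meets a friendly fox beside a stream",
]
STYLE = "soft watercolour, transparent layers, soft edges, warm tones"

# Every page gets the identical style suffix, so the text encoder pulls
# each independent generation toward the same visual characteristics.
prompts = [f"{page}, {STYLE}" for page in PAGES]
```

Because the style words map to the same region of the text encoder's numerical space every time, each independently generated page is guided toward the same palette and texture - which is where the book's visual coherence comes from.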
What makes it unique?
A diffusion model doesn’t copy an existing image. It has learned patterns: how light falls, how hair moves, how a forest can look “enchanted”. It combines those patterns in a new way for your specific prompt.
Every image generated this way exists for the first time at the moment you ask for it.