Commit f2554a3

Update README.md
1 parent 610d132 commit f2554a3

File tree

1 file changed

+26
-0
lines changed

1 file changed

+26
-0
lines changed

README.md

Lines changed: 26 additions & 0 deletions
@@ -21,6 +21,32 @@ Enter a prompt or pick up a picture and press "Generate" (You don't need to prep
![The concept](https://github.com/swiftuiux/coreml-stable-diffusion-swift-example/blob/main/img/img_03.png)

## How it works

### Super short

words → numbers → math → picture.

### In short

text → (TextEncoder) → numbers
numbers + noise → (U-Net) → hidden image
hidden image → (VAE Decoder) → real image
real image → (SafetyChecker) → safe output
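The four stages above can be sketched as plain function composition. This is a toy illustration only, not the actual Core ML models; every function name here (`text_encoder`, `unet`, and so on) is a made-up stand-in for the corresponding `.mlmodelc` file:

```python
# Toy sketch of the four-stage pipeline (illustration only).
def text_encoder(text):
    # words → numbers: fake embedding, each token mapped to its length
    return [float(len(tok)) for tok in text.split()]

def unet(embedding, noise, steps=3):
    # numbers + noise → hidden image: iteratively "denoise" toward the embedding
    latent = noise
    for _ in range(steps):
        latent = [0.5 * (l + e) for l, e in zip(latent, embedding)]
    return latent

def vae_decoder(latent):
    # hidden image → real image: pretend pixels are clamped, scaled latents
    return [round(255 * min(max(v, 0.0), 1.0)) for v in latent]

def safety_checker(image):
    # real image → safe output: this toy version passes everything through
    return image

emb = text_encoder("a red apple")
image = safety_checker(vae_decoder(unet(emb, noise=[0.0] * len(emb))))
```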

### Basically

1. You type "a red apple". vocab.json and merges.txt drive tokenization, breaking the prompt into units like [a] [red] [apple]. TextEncoder.mlmodelc then maps those tokens into numerical vectors (embeddings) that capture their meaning.
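A minimal sketch of that first step, assuming a tiny made-up vocabulary (the real pipeline reads vocab.json and merges.txt and runs TextEncoder.mlmodelc; nothing below is the actual model):

```python
# Hypothetical mini-tokenizer + embedder for illustration only.
vocab = {"a": 0, "red": 1, "apple": 2, "<unk>": 3}

def tokenize(text):
    # break the prompt into known units, unknown words map to <unk>
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

def embed(ids, dim=4):
    # fake embedding table: token id i → a dim-long vector of i's
    return [[float(i)] * dim for i in ids]

ids = tokenize("a red apple")   # three token ids
vectors = embed(ids)            # three vectors of dimension 4
```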

2. The model’s brain (U-Net). It starts with pure random noise (a messy canvas). Then, step by step, it removes noise and adds structure, following the instructions from your text (the numbers from the TextEncoder). After many steps, what was just noise gradually becomes the picture you asked for. At this stage the image is not made of pixels (red, green, blue dots); it lives in a latent space, a compressed mathematical version of the image.
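The step-by-step denoising can be imitated with a toy loop: start from random noise and, at each step, move a little toward a target suggested by the text embedding. This is only a caricature of what the real U-Net does:

```python
import random

# Toy denoising loop (not the real U-Net): each step trades some of the
# remaining noise for structure pulled from the target.
def denoise(target, steps=50, seed=0):
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in target]   # the messy canvas
    for _ in range(steps):
        latent = [l + 0.2 * (t - l) for l, t in zip(latent, target)]
    return latent

target = [1.0, -1.0, 0.5]
result = denoise(target)
# after many steps the remaining "noise" is tiny
error = max(abs(r - t) for r, t in zip(result, target))
```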

3. Hidden space (latent space). This is the hidden mathematical space where the U-Net operates: a compressed version of images where the model does its work. Instead of handling millions of pixels directly (which is heavy), the model uses a much smaller grid of numbers that still captures the essence of shapes, colors, and structure. Think of it as a sketch or blueprint: not the full detailed image, but enough information to reconstruct it later. That is why it is called latent (hidden): the image exists there, but only as math.

   • Latent space = where the work happens (the canvas the painter works on)
   • U-Net = how the work happens (the painter’s hand moving)
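The "smaller grid of numbers" is not hand-waving: in the standard Stable Diffusion configuration, a 512×512 RGB image is encoded into a 64×64 latent with 4 channels, so the U-Net works on roughly 48× fewer values than raw pixels:

```python
# Why latent space is cheaper, in concrete numbers (standard SD sizes).
pixels = 512 * 512 * 3     # values in the full RGB image
latents = 64 * 64 * 4      # values in the compressed latent "blueprint"
ratio = pixels / latents   # how much smaller the latent grid is
```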

4. VAE Decoder. Once the latent image is ready, VAEDecoder.mlmodelc converts it into an actual picture (pixels). Going the other way (picture → latent space) is the job of VAEEncoder.mlmodelc.
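The decode/encode pair can be pictured as a round trip between a small latent row and a larger pixel row. The toy functions below are hypothetical stand-ins for VAEDecoder.mlmodelc and VAEEncoder.mlmodelc, shrunk to one dimension:

```python
# Toy latent ↔ pixel round trip (illustration only).
def decode(latent_row):
    # each latent value expands into two identical "pixels"
    return [v for v in latent_row for _ in range(2)]

def encode(pixel_row):
    # average each pair of pixels back into one latent value
    return [(pixel_row[i] + pixel_row[i + 1]) / 2
            for i in range(0, len(pixel_row), 2)]

latent = [0.1, 0.9]
pixels = decode(latent)        # twice as many values as the latent
roundtrip = encode(pixels)     # back to the original latent
```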

5. Safety check. Finally, SafetyChecker.mlmodelc inspects the image and makes sure it follows safety rules; if not, it may block or adjust the output. It works by running the generated image through a separate classifier (basically another small neural net) that predicts whether the picture falls into any unsafe category. If it does, the checker can blur the image, replace it, or simply stop the output.
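A sketch of that check, assuming a made-up scoring scheme: score the image's features against a list of "unsafe concept" vectors and blank the output if any score crosses a threshold. The real SafetyChecker.mlmodelc is a neural classifier; this thresholding is illustrative only:

```python
# Toy safety check (illustration only, not SafetyChecker.mlmodelc).
def classify(image_features, concept_vectors):
    # dot-product score of the image against each unsafe concept
    return [sum(f * c for f, c in zip(image_features, concept))
            for concept in concept_vectors]

def safety_check(image, image_features, concepts, threshold=1.0):
    scores = classify(image_features, concepts)
    if any(s > threshold for s in scores):
        return [0] * len(image)   # "blur"/blank the output
    return image

passed = safety_check([10, 20], image_features=[0.1, 0.2],
                      concepts=[[0.5, 0.5]])
blocked = safety_check([10, 20], image_features=[2.0, 2.0],
                       concepts=[[1.0, 1.0]])
```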

### Typical set of files for a model and the purpose of each file

| File Name | Description |
