## How it works
### Super short

At first glance it looks like a jungle of files (TextEncoder, U-Net, VAE, SafetyChecker, vocab stuff, etc.), but if you zoom out, the whole pipeline is really just:

**words → numbers → math → picture → check**
Everything else is just supporting that flow.
### So in short:

- text → (TextEncoder) → numbers
- numbers + noise → (U-Net) → hidden image
- hidden image → (VAE Decoder) → real image
- real image → (SafetyChecker) → safe output

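The four stages above can be sketched as plain functions. This is a toy illustration of the data flow only (all function names and shapes here are hypothetical stand-ins, not the real Core ML interfaces):

```python
import random

# Toy stand-ins for the four pipeline stages (hypothetical names,
# not the real Core ML model interfaces).

def text_encoder(text):
    # words → numbers: one fake embedding value per token
    return [float(len(tok)) for tok in text.split()]

def unet_denoise(embedding, steps=4):
    # numbers + noise → hidden image: start from noise, then nudge
    # it toward the text embedding a little on every step
    latent = [random.random() for _ in embedding]
    for _ in range(steps):
        latent = [l + 0.5 * (e - l) for l, e in zip(latent, embedding)]
    return latent

def vae_decode(latent):
    # hidden image → real image: turn latent numbers into "pixels"
    return [min(255, round(x * 32)) for x in latent]

def safety_check(image):
    # real image → safe output: trivially passes everything here
    return image

image = safety_check(vae_decode(unet_denoise(text_encoder("a red apple"))))
```

Each stage only consumes the previous stage's output, which is why the pipeline reads as a single left-to-right flow.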
## Basically
1. **Text Encoding**
   You type `"a red apple"`.
   - `vocab.json` + `merges.txt` handle **tokenization** → break it into units like `[a] [red] [apple]`.
   - `TextEncoder.mlmodelc` maps those tokens into **numerical vectors** (embeddings).

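A rough sketch of what the tokenization step does. This is a toy BPE-style example with made-up merge rules and a three-word vocabulary; the real `merges.txt`/`vocab.json` hold tens of thousands of entries:

```python
# Toy BPE-style tokenizer: merges.txt-style rules fuse character
# pairs into larger units, then a vocab.json-style table maps each
# unit to an integer id. (Toy data, not the real vocabulary files.)

merges = [("a", "p"), ("ap", "p"), ("l", "e"), ("app", "le"),
          ("r", "e"), ("re", "d")]            # ordered merge rules
vocab = {"a": 0, "red": 1, "apple": 2}        # token → id

def bpe(word):
    symbols = list(word)                      # start from single characters
    for left, right in merges:                # apply each rule in order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        tokens.extend(bpe(word))
    return [vocab[t] for t in tokens]
```

So `tokenize("a red apple")` yields one integer id per unit, and those ids are what the TextEncoder turns into vectors.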
2. **The model’s brain (U-Net)**
   - Starts with **random noise** (a messy canvas).
   - Step by step, it **removes noise** and **adds structure**, following the instructions from your text (the vectors from the TextEncoder).
   - After many steps, what was just noise slowly looks like the picture you asked for.
   - At this stage, the image is **not yet pixels** (red/green/blue dots). Instead, it exists in **latent space** — a compressed mathematical version of the image.

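The denoising loop can be sketched like this. It is schematic only: `unet` here is a hypothetical callable that predicts the noise in the current latent, and real schedulers (DDIM, PNDM, etc.) use more careful update rules than this simple subtraction:

```python
import random

def denoise(text_vectors, unet, steps=50, seed=0):
    # Schematic diffusion sampling loop (not a real scheduler).
    rng = random.Random(seed)
    # 1. Start from pure noise (the "messy canvas").
    latent = [rng.gauss(0.0, 1.0) for _ in range(16)]
    # 2. Step by step, remove the noise the U-Net predicts,
    #    guided by the text vectors.
    for t in range(steps, 0, -1):
        predicted = unet(latent, text_vectors, t)
        latent = [x - n / steps for x, n in zip(latent, predicted)]
    # 3. What is left is the "hidden image" in latent space.
    return latent
```

The key point is the shape of the loop: many small corrections to the same latent, each conditioned on the text vectors.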
3. **Hidden space (Latent space)**
   - Latent space = the **hidden mathematical space** where the U-Net operates.
   - Instead of dealing with millions of pixels directly, the model works with a **smaller grid of numbers** that still captures the essence of shapes, colors, and structures.
   - Think of it like a **sketch or blueprint**: not the full detailed image, but enough to reconstruct it later.
   - That’s why it’s called *latent* (hidden): the image exists there only as math.
   - **Latent space = where** → (the canvas the painter is working on).
   - **U-Net = how** → (the painter’s hand shaping the canvas).

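To make "compressed" concrete: with the standard Stable Diffusion 1.x shapes, a 512×512 RGB image is represented in latent space as a 64×64 grid with 4 channels, roughly 48× fewer numbers:

```python
pixels = 512 * 512 * 3    # numbers in the full RGB image
latent = 64 * 64 * 4      # numbers in the corresponding latent
ratio = pixels // latent  # how much smaller the latent is
```

That 48× reduction is exactly why the U-Net can afford to run many denoising steps.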
4. **VAE Decoder**
   - Once the latent image is ready, `VAEDecoder.mlmodelc` converts it into a real picture (**pixels**).
   - The opposite direction (picture → latent space) is done by `VAEEncoder.mlmodelc`.

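A toy sketch of that round trip, using hypothetical 8× block averaging on a flat grayscale array. The real VAE is a learned neural network and the trip is lossy, but the shapes behave the same way:

```python
def vae_encode(image, factor=8):
    # picture → latent: average each block of `factor` values
    # (toy stand-in for VAEEncoder.mlmodelc)
    return [sum(image[i:i + factor]) / factor
            for i in range(0, len(image), factor)]

def vae_decode(latent, factor=8):
    # latent → picture: expand each latent value back into a block
    # (toy stand-in for VAEDecoder.mlmodelc)
    return [v for v in latent for _ in range(factor)]

image = [float(i) for i in range(64)]
latent = vae_encode(image)      # 64 values → 8 values
restored = vae_decode(latent)   # 8 values → 64 values again
```

The restored image has the right size but has lost fine detail, which mirrors why the decoder is only run once, at the very end.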
5. **Safety check**
   - Finally, `SafetyChecker.mlmodelc` looks at the generated image and checks if it follows **safety rules**.
   - It runs the image through a separate classifier (another neural net) to predict if the image belongs to restricted categories (e.g. nudity, gore, etc.).
   - If it does, the checker can:
     - blur the image,
     - block the image, or
     - replace it with a placeholder.

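The check step can be sketched as thresholding classifier scores. This is schematic: `classify` is a hypothetical stand-in for the real classifier, but the control flow (score, compare, replace) is the same idea:

```python
def safety_check(image, classify, threshold=0.5):
    # `classify` is a hypothetical stand-in for SafetyChecker.mlmodelc:
    # it returns a score per restricted category for the image.
    scores = classify(image)
    if any(score > threshold for score in scores.values()):
        # Flagged: replace the output with a black placeholder.
        return [0] * len(image), True
    return image, False

# A harmless image passes through unchanged:
ok_image, flagged = safety_check([10, 20, 30], lambda img: {"nsfw": 0.1})
# A flagged image is replaced:
blocked, was_flagged = safety_check([10, 20, 30], lambda img: {"nsfw": 0.9})
```
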
### Typical set of files for a model and the purpose of each file