
Commit 9cd2ad9

Update README.md
1 parent dac9184 commit 9cd2ad9

1 file changed

README.md

Lines changed: 38 additions & 15 deletions
@@ -24,28 +24,51 @@ Enter a prompt or pick up a picture and press "Generate" (You don't need to prep
## How it works

### Super short

At first glance it looks like a jungle of files (TextEncoder, U-Net, VAE, SafetyChecker, vocabulary files, etc.), but if you zoom out, the whole pipeline is really just:

**words → numbers → math → picture → check**

Everything else just supports that flow.

### So in short:

text → (TextEncoder) → numbers
numbers + noise → (U-Net) → hidden image
hidden image → (VAE Decoder) → real image
real image → (SafetyChecker) → safe output

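
To see how those arrows map onto code, here is a minimal sketch of driving the whole pipeline from Swift. It assumes the models are loaded through Apple's ml-stable-diffusion package; the resource path, parameter names, and exact signatures here are assumptions and may differ between package versions.

```swift
import CoreML
import StableDiffusion // Apple's ml-stable-diffusion Swift package (assumed dependency)

// Load every pipeline part (TextEncoder, U-Net, VAE, SafetyChecker) from one
// folder of compiled .mlmodelc files.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let pipeline = try StableDiffusionPipeline(
    resourcesAt: URL(fileURLWithPath: "path/to/model/resources"), // hypothetical path
    configuration: config
)
try pipeline.loadResources()

// words → numbers → math → picture → check, all inside generateImages.
var params = StableDiffusionPipeline.Configuration(prompt: "a red apple")
params.stepCount = 25 // denoising steps: more is slower, usually cleaner
params.seed = 42      // fixed seed makes the result reproducible

let images = try pipeline.generateImages(configuration: params) { progress in
    print("denoising step \(progress.step)")
    return true // returning false cancels generation
}
// images is [CGImage?]; an entry can be nil if the safety checker blocked it.
```
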
## Basically

1. **Text Encoding**
   You type `"a red apple"`.
   - `vocab.json` + `merges.txt` handle **tokenization** → break it into units like `[a] [red] [apple]`.
   - `TextEncoder.mlmodelc` maps those tokens into **numerical vectors** (embeddings) that guide the rest of the pipeline (a toy tokenization sketch follows this list).
2. **The model’s brain (U-Net)**
   - Starts with **random noise** (a messy canvas).
   - Step by step, it **removes noise** and **adds structure**, following the instructions from your text (the vectors from the TextEncoder).
   - After many steps, what was just noise slowly looks like the picture you asked for.
   - At this stage, the image is **not yet pixels** (red/green/blue dots). Instead, it exists in **latent space**: a compressed mathematical version of the image (see the denoising-loop sketch after this list).
3. **Hidden space (latent space)**
   - Latent space = the **hidden mathematical space** where the U-Net operates.
   - Instead of dealing with millions of pixels directly, the model works with a **smaller grid of numbers** that still captures the essence of shapes, colors, and structures.
   - Think of it like a **sketch or blueprint**: not the full detailed image, but enough to reconstruct it later.
   - That’s why it’s called *latent* (hidden): the image exists there only as math.
   - **Latent space = where** → the canvas the painter is working on.
   - **U-Net = how** → the painter’s hand shaping the canvas.
4. **VAE Decoder**
   - Once the latent image is ready, `VAEDecoder.mlmodelc` converts it into a real picture (**pixels**).
   - The opposite direction (picture → latent space) is done by `VAEEncoder.mlmodelc` (a Core ML decoding sketch follows this list).
5. **Safety check**
   - Finally, `SafetyChecker.mlmodelc` looks at the generated image and checks whether it follows **safety rules**.
   - It runs the image through a separate classifier (another neural net) that predicts whether the image belongs to restricted categories (e.g. nudity, gore, etc.).
   - If it does, the checker can (see the classifier sketch after this list):
     - blur the image,
     - block the image, or
     - replace it with a placeholder.
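
The sketches below walk through the steps one by one. First, tokenization (step 1) as a toy, hypothetical lookup; the real tokenizer applies the byte-pair-encoding merge rules from `merges.txt` against `vocab.json`, but the "words → numbers" idea is the same.

```swift
// Toy sketch of step 1. `toyVocab` and its ids are made up for illustration.
let toyVocab: [String: Int] = ["a": 3, "red": 17, "apple": 42]

func toyTokenize(_ prompt: String) -> [Int] {
    prompt
        .lowercased()
        .split(separator: " ")
        .compactMap { toyVocab[String($0)] } // unknown words are dropped here;
                                             // real BPE would split them further
}

print(toyTokenize("a red apple")) // [3, 17, 42]
// These ids are what TextEncoder.mlmodelc turns into numerical vectors.
```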
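
Next, steps 2 and 3 as a conceptual denoising loop. `ToyUNet`, the latent size, and the update rule are hypothetical stand-ins for the real U-Net and scheduler; they only show the shape of the iteration.

```swift
// Conceptual sketch of the U-Net loop. A real latent is a small tensor
// (e.g. 4 x 64 x 64), far smaller than the millions of pixels it encodes.
struct ToyUNet {
    // Predicts the noise hiding in the current latent; the real network is
    // also guided by the text vectors and the current step number.
    func predictNoise(latent: [Float], textVectors: [Float], step: Int) -> [Float] {
        latent.map { $0 * 0.1 } // placeholder for the real neural network
    }
}

var latent = (0..<16).map { _ in Float.random(in: -1...1) } // random noise canvas
let textVectors: [Float] = [0.2, -0.7, 0.5]                 // from the TextEncoder
let unet = ToyUNet()

for step in 0..<25 {
    let noise = unet.predictNoise(latent: latent, textVectors: textVectors, step: step)
    // Remove a little predicted noise each step; a real scheduler scales
    // this update far more carefully.
    for i in latent.indices { latent[i] -= noise[i] }
}
// `latent` is now the finished hidden image: still math, not pixels.
```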
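
Step 4 can be sketched with plain Core ML. The feature names `"z"` and `"image"` are assumptions; the real input/output names depend on how the model was converted.

```swift
import CoreML

// Sketch of step 4: run the finished latent through VAEDecoder.mlmodelc.
let decoder = try MLModel(contentsOf: URL(fileURLWithPath: "VAEDecoder.mlmodelc"))

let z = try MLMultiArray(shape: [1, 4, 64, 64], dataType: .float32) // the latent
let input = try MLDictionaryFeatureProvider(
    dictionary: ["z": MLFeatureValue(multiArray: z)]) // "z" is a hypothetical name
let output = try decoder.prediction(from: input)
let pixels = output.featureValue(for: "image")?.multiArrayValue
// `pixels` is now a real image tensor; VAEEncoder.mlmodelc does the reverse trip.
```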

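
Finally, step 5 as a conceptual classifier gate. The concept scores and thresholds are made up; the real `SafetyChecker.mlmodelc` exposes its own interface, but the decision logic looks roughly like this.

```swift
// Hypothetical sketch of step 5: scores from a safety classifier decide
// what happens to the finished image.
enum SafetyVerdict { case pass, blur, block }

func checkSafety(conceptScores: [String: Float], threshold: Float = 0.5) -> SafetyVerdict {
    guard let worst = conceptScores.values.max() else { return .pass }
    if worst > threshold * 1.5 { return .block } // clearly restricted content
    if worst > threshold { return .blur }        // borderline: soften instead
    return .pass
}

let verdict = checkSafety(conceptScores: ["restricted_concept_a": 0.10,
                                          "restricted_concept_b": 0.05])
print(verdict) // pass: the image is returned to the user unchanged
```
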
### Typical set of files for a model and the purpose of each file
