BLOCKER — tag characters and bidi controls. These fail CI by default. There is no benign use, so there is no false-positive cost to refusing them outright.
MAJOR — zero-width and format characters (U+200B, U+200C, U+200D, U+2060, U+FEFF) anywhere they don't belong, plus other invisibles like soft hyphen (U+00AD), combining grapheme joiner (U+034F), Hangul fillers, and Khmer zero-width vowels. These can be attacks, but they also show up in legitimate-if-messy authoring. So they warn by default and only fail under --strict.
MINOR — mixed-script identifiers, but scoped hard. The homoglyph pass fires only inside URLs, package-manager install lines, and code-fence language tags. Never on prose.
That scoping is a deliberate design call. The line patterns the homoglyph pass inspects: https?://, npm/pnpm/yarn/bun install, pip/uv install, cargo install/add, brew/gem/composer/go install, and gh repo clone. Those are the lines a reader copies and runs. Prose is left alone because flagging every Greek letter in a math explanation would bury the one finding that matters.
Severity tiers at a glance
| Tier |
Character class |
Examples |
Action |
False-positive cost |
| BLOCKER |
Tag chars, bidi controls |
U+E0000–U+E007F, U+202A–U+202E |
Fail CI immediately |
None — no legitimate use |
| MAJOR |
Zero-width / format chars |
U+200B, U+FEFF (non-BOM), U+00AD |
Warn by default; fail under --strict
|
Possible in legitimate authoring |
| MINOR |
Mixed-script identifiers |
Cyrillic а, Greek α in URLs / install lines |
Warn in narrow contexts only |
Low — scoped to package lines |
The BOM exception
One codepoint sits on a fence. U+FEFF is a byte-order mark when it's the very first byte of a file — legitimate. The same codepoint anywhere else is a zero-width no-break space, which is exactly the kind of invisible an attacker reaches for. So the rule grants a pass to exactly one position and flags every other occurrence:
elif cp in ZERO_WIDTH_MAJOR:
# A single U+FEFF at the very first byte of the file is a
# legitimate BOM and gets a pass.
if cp == 0xFEFF and line_no == 1 and col_idx == 1:
continue
findings.append(Finding(severity="MAJOR", ..., rule="zero-width-or-format", ...))
Position-aware, not codepoint-aware. The byte is fine at offset 0 and suspect everywhere else.
Make the invisible visible in the log
A finding that says "MAJOR at line 14, column 22" is useless if the reviewer opens the file and sees nothing there — because the offending character is, by definition, invisible. Every finding carries file:line:column, the codepoint's unicodedata.name() label, the rule name, and a ~32-character context window with every invisible escaped to <U+XXXX>:
def _escape_context(line, column, width=32):
...
for ch in window:
cp = ord(ch)
if cp in TAG_CHARS or cp in BIDI_CONTROLS or cp in ZERO_WIDTH_MAJOR \
or cp in OTHER_INVISIBLE or cp < 0x20:
out_chars.append(f"<U+{cp:04X}>")
else:
out_chars.append(ch)
return "".join(out_chars)
Now the CI log shows npm install p<U+0430>ckage instead of a line that looks identical to the clean one. The reviewer can actually see the attack.
Ship report-only, then ratchet
The gate has two rollout switches. --warn-only always exits 0 — it reports findings without failing the build, for the window where you're still learning what's in the corpus. --strict flips MAJOR findings into build failures once you've cleaned up the known-benign noise.
This is the same self-expiring report-only pattern we use elsewhere: land a gate in advisory mode, let it observe production traffic, then enforce once you've proven it won't false-positive your own contributors into a wall. BLOCKER fires from day one because it has no false-positive cost; MAJOR waits behind --strict until the corpus is clean.
The result
Wired into .github/workflows/validate-plugins.yml next to the existing schema validator. Scanned 4,776 files.
Zero blockers. Clean main.
Eight MAJOR findings — and the honest detail is the interesting one. All eight traced to a single community-contributed file that intentionally used a zero-width space inside a fenced code block as a rendering workaround. Not an attack. A legitimate-but-messy authoring choice. That is precisely why MAJOR sits behind --strict and isn't flipped on yet: the ratchet waits until that one file is cleaned up, so the first enforced run doesn't punish a contributor for a cosmetic hack.
Tests cover the boundaries with six byte-precise fixtures — blocker-tag-chars, blocker-bidi-override, bom-allowed, clean-skill, major-zero-width, minor-homoglyph-install — driving an 8-test regression suite in tests/test_validate_unicode_hygiene.py. The whole thing shipped as PR #777 closing #776: ~317 lines of validator, one workflow edit, one test file.
Also shipped
Same day, part of a wider CI-hardening campaign:
- An after-action review (PR #775) closed a 2026年05月22日→24 hardening sequence: 11 PRs landing v4.32.0 with 10 blocking required gates and zero report-only, plus cleanup of 974 Python errors, 223 shellcheck warnings, ~60k markdown issues, and 970 MB freed.
- Doc-quality gate "round 2" in two other repos (
intentional-cognition-os, qmd-team-intent-kb): fixed Vale by scoping via per-directory .vale.ini sections instead of the action's broken single-path files: input, and lychee by passing . as the required positional argument. Scoping refinements, not policy loosening — the gates stay BLOCKING.
-
intentional-cognition-os also got an ico audit verify SHA-256 chain verifier. The audit log had carried a tamper-evident hash chain since launch, but nothing actually walked it — so the tamper-evidence was theoretical. Now ico audit verify exits 2 on AUDIT_TAMPERED.
Related posts
{
"@context": "https://schema.org",
"@type": "BlogPosting",
"headline": "The Unicode Layer Your Validator Can't See",
"description": "Schema validation can't see invisible Unicode. A stdlib-only CI gate that catches tag-char injection, Trojan Source bidi overrides, and homoglyph attacks.",
"image": "https://startaitools.com/images/og-image.png",
"datePublished": "2026-05-24",
"author": {
"@type": "Person",
"name": "Jeremy Longshore"
},
"publisher": {
"@type": "Organization",
"name": "Start AI Tools"
},
"url": "https://startaitools.com/posts/unicode-hygiene-gate-same-day-trapdoor-defense/"
}