ziglyph
Unicode text processing for the Zig Programming Language.
In-Depth Articles on Unicode Processing with Zig and Ziglyph
The Unicode Processing with Zig series of articles over on ZigNEWS covers important aspects of Unicode in general and in particular how to use this library to process Unicode text. Note that the examples in that series are pre-Zig v0.11, so changes may be necessary to make them work.
Status
This is pre-1.0 software. Although breaking changes are less frequent with each minor version release, they will still occur until we reach 1.0.
Zig Version
The main branch follows Zig's master branch, which is the latest dev version of Zig. There will also be branches and tags that will work with the previous two (2) stable Zig releases.
Integrating Ziglyph in your Project
Zig Package Manager
In the build.zig.zon file, add the following to the dependencies object.
    .ziglyph = .{
        .url = "https://codeberg.org/dude_the_builder/ziglyph/archive/v0.11.1.tar.gz",
    }

The compiler will then produce a hash mismatch error. Add the .hash field to build.zig.zon
with the hash the compiler tells you it found.
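As a rough sketch of the end result, the dependency entry might look like the following once the hash is filled in. The hash string shown here is a placeholder, not a real value; copy the exact string from your compiler's error output.

```zig
// build.zig.zon (fragment): entry inside the .dependencies object.
.ziglyph = .{
    .url = "https://codeberg.org/dude_the_builder/ziglyph/archive/v0.11.1.tar.gz",
    // Placeholder only: paste the exact hash printed by the compiler.
    .hash = "HASH_REPORTED_BY_THE_COMPILER",
},
```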
Then in your build.zig file add the following to the exe section for the executable where you wish to have Ziglyph available.
    const ziglyph = b.dependency("ziglyph", .{
        .optimize = optimize,
        .target = target,
    });
    // for exe, lib, tests, etc.
    exe.addModule("ziglyph", ziglyph.module("ziglyph"));

Now in the code, you can import components like this:
    const ziglyph = @import("ziglyph");

or

    const letter = ziglyph.letter;
    const number = ziglyph.number;

Using the ziglyph Namespace
The ziglyph namespace provides convenient access to the most frequently used functions related to Unicode
code points and strings.
    const std = @import("std");
    const expect = std.testing.expect;
    const expectEqual = std.testing.expectEqual;
    const ziglyph = @import("ziglyph");

    test "ziglyph namespace" {
        const z = 'z';
        try expect(ziglyph.isLetter(z));
        try expect(ziglyph.isAlphaNum(z));
        try expect(ziglyph.isPrint(z));
        try expect(!ziglyph.isUpper(z));
        const uz = ziglyph.toUpper(z);
        try expect(ziglyph.isUpper(uz));
        try expectEqual(uz, 'Z');

        // String toLower, toTitle, and toUpper.
        // Each result is scoped to its own block so it is freed exactly once,
        // even if a later check fails.
        const allocator = std.testing.allocator;
        {
            const got = try ziglyph.toLowerStr(allocator, "AbC123");
            defer allocator.free(got);
            try expect(std.mem.eql(u8, "abc123", got));
        }
        {
            const got = try ziglyph.toUpperStr(allocator, "aBc123");
            defer allocator.free(got);
            try expect(std.mem.eql(u8, "ABC123", got));
        }
        {
            const got = try ziglyph.toTitleStr(allocator, "thE aBc123 moVie. yes!");
            defer allocator.free(got);
            try expect(std.mem.eql(u8, "The Abc123 Movie. Yes!", got));
        }
    }

Category Namespaces
Namespaces for frequently-used Unicode General Categories are available. See ziglyph.zig for a full list of all components.
    const std = @import("std");
    const expect = std.testing.expect;
    const expectEqual = std.testing.expectEqual;
    const letter = @import("ziglyph").letter;
    const punct = @import("ziglyph").punct;

    test "Category namespaces" {
        const z = 'z';
        try expect(letter.isLetter(z));
        try expect(!letter.isUpper(z));
        try expect(!punct.isPunct(z));
        try expect(punct.isPunct('!'));
        const uz = letter.toUpper(z);
        try expect(letter.isUpper(uz));
        try expectEqual(uz, 'Z');
    }

Normalization
In addition to the basic functions to detect and convert code point case, the Normalizer struct
provides code point and string normalization methods. All normalization forms are supported (NFC,
NFKC, NFD, and NFKD).
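To illustrate why normalization matters before comparing strings, here is a minimal sketch that uses only the Zig standard library and no Ziglyph calls. It assumes Zig 0.11 or later, and bytesEqual is a hypothetical helper introduced just for this illustration.

```zig
const std = @import("std");

/// Plain byte-for-byte comparison, with no Unicode normalization applied.
/// Hypothetical helper for illustration; not part of Ziglyph.
fn bytesEqual(a: []const u8, b: []const u8) bool {
    return std.mem.eql(u8, a, b);
}

test "canonically equivalent strings differ byte for byte" {
    const precomposed = "fo\u{00E9}"; // é as a single code point
    const decomposed = "foe\u{0301}"; // e followed by a combining acute accent

    // Both render as "foé", but their UTF-8 encodings are different.
    try std.testing.expect(!bytesEqual(precomposed, decomposed));
    try std.testing.expectEqual(@as(usize, 4), precomposed.len);
    try std.testing.expectEqual(@as(usize, 5), decomposed.len);
}
```

A normalization-aware comparison treats these two spellings as the same text, which a raw byte comparison cannot do.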
    const std = @import("std");
    const testing = std.testing;
    const Normalizer = @import("ziglyph").Normalizer;

    test "normalizeTo" {
        const allocator = std.testing.allocator;
        var normalizer = try Normalizer.init(allocator);
        defer normalizer.deinit();

        // Canonical Composition (NFC)
        const input_nfc = "Complex char: \u{03D2}\u{0301}";
        const want_nfc = "Complex char: \u{03D3}";
        var got_nfc = try normalizer.nfc(allocator, input_nfc);
        defer got_nfc.deinit();
        try testing.expectEqualSlices(u8, want_nfc, got_nfc.slice);

        // Compatibility Composition (NFKC)
        const input_nfkc = "Complex char: \u{03A5}\u{0301}";
        const want_nfkc = "Complex char: \u{038E}";
        var got_nfkc = try normalizer.nfkc(allocator, input_nfkc);
        defer got_nfkc.deinit();
        try testing.expectEqualSlices(u8, want_nfkc, got_nfkc.slice);

        // Canonical Decomposition (NFD)
        const input_nfd = "Complex char: \u{03D3}";
        const want_nfd = "Complex char: \u{03D2}\u{0301}";
        var got_nfd = try normalizer.nfd(allocator, input_nfd);
        defer got_nfd.deinit();
        try testing.expectEqualSlices(u8, want_nfd, got_nfd.slice);

        // Compatibility Decomposition (NFKD)
        const input_nfkd = "Complex char: \u{03D3}";
        const want_nfkd = "Complex char: \u{03A5}\u{0301}";
        var got_nfkd = try normalizer.nfkd(allocator, input_nfkd);
        defer got_nfkd.deinit();
        try testing.expectEqualSlices(u8, want_nfkd, got_nfkd.slice);

        // String comparisons.
        try testing.expect(try normalizer.eql(allocator, "foé", "foe\u{0301}"));
        try testing.expect(try normalizer.eql(allocator, "foΎ", "fo\u{03D2}\u{0301}"));
        try testing.expect(try normalizer.eqlCaseless(allocator, "FoΎ", "fo\u{03D2}\u{0301}"));
        try testing.expect(try normalizer.eqlCaseless(allocator, "FOÉ", "foe\u{0301}")); // foÉ == foé
        // Note: eqlIdentifiers is not a method, it's just a function in the Normalizer namespace.
        try testing.expect(try Normalizer.eqlIdentifiers(allocator, "Foé", "foé")); // Unicode Identifiers caseless match.
    }

Collation (String Ordering)
One of the most common operations required by string processing is sorting and ordering comparisons.
The Unicode Collation Algorithm was developed to address this area of string processing. The Collator
struct implements the algorithm, allowing for proper sorting and order comparison of Unicode strings.
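For contrast, the following sketch shows what a naive byte-value sort does, using only the Zig standard library (assuming Zig 0.11 or later). The comparison function bytewiseAscending is a hypothetical helper for this illustration and is not part of Ziglyph.

```zig
const std = @import("std");

/// Orders strings by raw byte value, ignoring all Unicode collation rules.
/// Hypothetical helper for illustration; not part of Ziglyph.
fn bytewiseAscending(_: void, a: []const u8, b: []const u8) bool {
    return std.mem.lessThan(u8, a, b);
}

test "bytewise sorting puts uppercase before lowercase" {
    var strings = [_][]const u8{ "apple", "Zebra", "banana" };
    std.mem.sort([]const u8, &strings, {}, bytewiseAscending);

    // 'Z' (0x5A) is smaller than 'a' (0x61), so "Zebra" sorts first.
    const want = [_][]const u8{ "Zebra", "apple", "banana" };
    for (want, strings) |w, s| {
        try std.testing.expectEqualStrings(w, s);
    }
}
```

An ordering like this is rarely what a reader expects from a sorted list, which is the gap a collation algorithm is meant to close.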
    const std = @import("std");
    const Collator = @import("ziglyph").Collator;

    test "Collation" {
        var c = try Collator.init(std.testing.allocator);
        defer c.deinit();

        // Ascending / descending sort
        var strings = [_][]const u8{ "def", "xyz", "abc" };
        var want = [_][]const u8{ "abc", "def", "xyz" };
        std.mem.sort([]const u8, &strings, c, Collator.ascending);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);

        want = [_][]const u8{ "xyz", "def", "abc" };
        std.mem.sort([]const u8, &strings, c, Collator.descending);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);

        // Caseless sorting
        strings = [_][]const u8{ "def", "Abc", "abc" };
        want = [_][]const u8{ "Abc", "abc", "def" };
        std.mem.sort([]const u8, &strings, c, Collator.ascendingCaseless);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);

        want = [_][]const u8{ "def", "Abc", "abc" };
        std.mem.sort([]const u8, &strings, c, Collator.descendingCaseless);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);

        // Caseless / markless sorting
        strings = [_][]const u8{ "ábc", "Abc", "abc" };
        want = [_][]const u8{ "ábc", "Abc", "abc" };
        std.mem.sort([]const u8, &strings, c, Collator.ascendingBase);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);

        std.mem.sort([]const u8, &strings, c, Collator.descendingBase);
        try std.testing.expectEqualSlices([]const u8, &want, &strings);
    }

Tailoring with allkeys.txt
You can tailor the sorting of Unicode text by modifying the sort element weights found in the Unicode data file allkeys.txt. This requires a vendored Ziglyph dependency, since you will be tailoring it to your specific needs. The process is as follows:
$ cd /<path to your project root>/
$ mkdir deps && cd deps
$ git clone https://github.com/jecolon/ziglyph
$ cd ziglyph/
$ zig build fetch
$ cp zig-cache/_ziglyph-data/uca/allkeys.txt src/data/tailor/
$ vim src/data/tailor/allkeys.txt # <- Modify the file
$ zig build akcompress
[...output snipped...]
$ mv src/data/allkeys-diffs.txt.deflate src/data/allkeys-diffs.txt.deflate.bak # <- backup original
$ mv src/data/tailor/allkeys-diffs.txt.deflate src/data/
$ cd /<path to your project root>/
Now to use this vendored Ziglyph dependency, replace the b.dependency call described above in build.zig with:
    const ziglyph = b.anonymousDependency("deps/ziglyph", @import("deps/ziglyph/build.zig"), .{
        .optimize = optimize,
        .target = target,
    });

Text Segmentation (Grapheme Clusters, Words, Sentences)
Ziglyph has iterators to traverse text as Grapheme Clusters (what most people recognize as characters), Words, and Sentences. All of these text segmentation functions adhere to the Unicode Text Segmentation rules, which may surprise you in terms of what's included and excluded at each break point. Test before assuming any results!
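As a small illustration of why code points alone are not enough, this sketch counts code points with the Zig standard library only (assuming Zig 0.11 or later). countCodePoints is a hypothetical helper and not a Ziglyph function.

```zig
const std = @import("std");

/// Counts Unicode code points in a UTF-8 string using only the standard library.
/// Hypothetical helper for illustration; not part of Ziglyph.
fn countCodePoints(s: []const u8) !usize {
    var iter = (try std.unicode.Utf8View.init(s)).iterator();
    var n: usize = 0;
    while (iter.nextCodepoint()) |_| n += 1;
    return n;
}

test "code points are not grapheme clusters" {
    // "Héllo" written as e followed by a combining acute accent (U+0301).
    const input = "H\u{0065}\u{0301}llo";

    // Six code points, although a reader sees five characters.
    try std.testing.expectEqual(@as(usize, 6), try countCodePoints(input));
}
```

Grapheme cluster segmentation groups the base letter and its combining mark into one unit, which is what a reader perceives as a single character.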
    const std = @import("std");
    const testing = std.testing;
    const Grapheme = @import("ziglyph").Grapheme;
    const GraphemeIterator = Grapheme.GraphemeIterator;
    const Sentence = @import("ziglyph").Sentence;
    const SentenceIterator = Sentence.SentenceIterator;
    const ComptimeSentenceIterator = Sentence.ComptimeSentenceIterator;
    const Word = @import("ziglyph").Word;
    const WordIterator = Word.WordIterator;

    test "GraphemeIterator" {
        const input = "H\u{0065}\u{0301}llo";
        var iter = GraphemeIterator.init(input);

        const want = &[_][]const u8{ "H", "\u{0065}\u{0301}", "l", "l", "o" };

        var i: usize = 0;
        while (iter.next()) |grapheme| : (i += 1) {
            try testing.expect(grapheme.eql(input, want[i]));
        }

        // Need your grapheme clusters at compile time?
        comptime {
            var ct_iter = GraphemeIterator.init(input);
            var j = 0;
            while (ct_iter.next()) |grapheme| : (j += 1) {
                try testing.expect(grapheme.eql(input, want[j]));
            }
        }
    }

    test "SentenceIterator" {
        const allocator = std.testing.allocator;
        const input = "(\"Go.\") (\"He said.\")";
        var iter = try SentenceIterator.init(allocator, input);
        defer iter.deinit();

        // Note the space after the closing right parenthesis is included as part
        // of the first sentence.
        const s1 = "(\"Go.\") ";
        const s2 = "(\"He said.\")";
        const want = &[_][]const u8{ s1, s2 };

        var i: usize = 0;
        while (iter.next()) |sentence| : (i += 1) {
            try testing.expectEqualStrings(sentence.bytes, want[i]);
        }

        // Need your sentences at compile time?
        @setEvalBranchQuota(2_000);

        comptime var ct_iter = ComptimeSentenceIterator(input){};
        const n = comptime ct_iter.count();
        var sentences: [n]Sentence = undefined;
        comptime {
            var ct_i: usize = 0;
            while (ct_iter.next()) |sentence| : (ct_i += 1) {
                sentences[ct_i] = sentence;
            }
        }

        for (sentences, 0..) |sentence, j| {
            try testing.expect(sentence.eql(want[j]));
        }
    }

    test "WordIterator" {
        const input = "The (quick) fox. Fast! ";
        var iter = try WordIterator.init(input);

        const want = &[_][]const u8{ "The", " ", "(", "quick", ")", " ", "fox", ".", " ", "Fast", "!", " " };

        var i: usize = 0;
        while (iter.next()) |word| : (i += 1) {
            try testing.expectEqualStrings(word.bytes, want[i]);
        }

        // Need your words at compile time?
        @setEvalBranchQuota(2_000);

        comptime {
            var ct_iter = try WordIterator.init(input);
            var j = 0;
            while (ct_iter.next()) |word| : (j += 1) {
                try testing.expect(word.eql(want[j]));
            }
        }
    }

Code Point and String Display Width
When working with environments in which text is rendered in a fixed-width font, such as terminal
emulators, it's necessary to know how many cells (or columns) a particular code point or string will
occupy. The display_width namespace provides functions to do just that.
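To make the problem concrete, the sketch below shows that neither the byte length nor the code point count of a string matches the number of cells it occupies. It uses only the Zig standard library (assuming Zig 0.11 or later) and assumes the emoji is rendered two cells wide, as in a typical terminal.

```zig
const std = @import("std");

test "byte and code point counts are not display widths" {
    // "w😊w" occupies 4 cells (1 + 2 + 1) when the emoji is double-width.
    const s = "w😊w";

    // The UTF-8 byte length counts the emoji as 4 bytes.
    try std.testing.expectEqual(@as(usize, 6), s.len);

    // The code point count treats every code point as one unit.
    try std.testing.expectEqual(@as(usize, 3), try std.unicode.utf8CountCodepoints(s));
}
```

Neither 6 nor 3 is the number needed for column alignment, so a width-aware function is required.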
    const std = @import("std");
    const expectEqual = std.testing.expectEqual;
    const expectEqualSlices = std.testing.expectEqualSlices;
    const dw = @import("ziglyph").display_width;

    test "Code point / string widths" {
        // The width methods take a second parameter of value .half or .full to determine the width of
        // ambiguous code points as per the Unicode standard. .half is the most common case.

        // Note that codePointWidth returns an i3 because code points like backspace have width -1.
        try expectEqual(dw.codePointWidth('é', .half), 1);
        try expectEqual(dw.codePointWidth('😊', .half), 2);
        try expectEqual(dw.codePointWidth('统', .half), 2);

        const allocator = std.testing.allocator;

        // strWidth returns usize because it can never be negative, regardless of the code points it contains.
        try expectEqual(try dw.strWidth("Hello\r\n", .half), 5);
        try expectEqual(try dw.strWidth("\u{1F476}\u{1F3FF}\u{0308}\u{200D}\u{1F476}\u{1F3FF}", .half), 2);
        try expectEqual(try dw.strWidth("Héllo 🇵🇷", .half), 8);
        try expectEqual(try dw.strWidth("\u{26A1}\u{FE0E}", .half), 1); // Text sequence
        try expectEqual(try dw.strWidth("\u{26A1}\u{FE0F}", .half), 2); // Presentation sequence

        // padLeft, center, padRight
        const right_aligned = try dw.padLeft(allocator, "w😊w", 10, "-");
        defer allocator.free(right_aligned);
        try expectEqualSlices(u8, "------w😊w", right_aligned);

        const centered = try dw.center(allocator, "w😊w", 10, "-");
        defer allocator.free(centered);
        try expectEqualSlices(u8, "---w😊w---", centered);

        const left_aligned = try dw.padRight(allocator, "w😊w", 10, "-");
        defer allocator.free(left_aligned);
        try expectEqualSlices(u8, "w😊w------", left_aligned);
    }

Word Wrap
If you need to wrap a string to a specific number of columns according to Unicode word boundaries and display width,
you can use the wrap function in the display_width namespace. You can also specify a threshold value indicating how close
a word boundary can be to the column limit and still trigger a line break.
    const std = @import("std");
    const testing = std.testing;
    const dw = @import("ziglyph").display_width;

    test "display_width wrap" {
        const allocator = testing.allocator;
        const input = "The quick brown fox\r\njumped over the lazy dog!";
        const got = try dw.wrap(allocator, input, 10, 3);
        defer allocator.free(got);
        const want = "The quick\n brown \nfox jumped\n over the\n lazy dog\n!";
        try testing.expectEqualStrings(want, got);
    }