Unicode Standard Annex 29 titled "Unicode Text Segmentation" defines rules for word boundaries, grapheme boundaries, and sentence boundaries. Perl has supported all three in its regex flavor since Perl 5.22. ICU supports only the word boundaries. Java 9 and later support only the grapheme boundaries.

Unicode Standard Annex 14 titled "Unicode Line Breaking Algorithm" defines an algorithm for finding potential word wrapping positions. Perl 5.24 and later can match these positions as line boundaries.

Unicode Standard Annex 29 Word Boundaries

As discussed earlier in this tutorial, regular expression word boundaries match at a position that is either preceded by a word character or followed by a word character, but not both. With most regex flavors, a "word character" is a character matched by the shorthand \w.
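To make the traditional definition concrete, here is a minimal Java sketch that lists every position at which \b matches. The class and method names are made up for this illustration, and the sample strings are deliberately plain ASCII, since flavors differ in how \b and \w treat non-ASCII characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class TraditionalBoundaries {
    // Returns every index in text at which the traditional \b matches.
    static List<Integer> positions(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        // \b is a zero-length match; find() moves on by one position
        // after each empty match, so every position gets tested once.
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // Boundaries sit at the edges of "Hello" and "world".
        System.out.println(positions("Hello, world!")); // [0, 5, 7, 12]
        // No word characters at all, so no boundaries.
        System.out.println(positions("!!!"));           // []
    }
}
```

Positions are counted as offsets into the string, with 0 being the start of the string.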

UAX #29 defines a more complex kind of word boundary. This word boundary has no relation to the characters matched by \w. It is based on a set of rules that try to more closely match the concept of a position where a word begins or ends. Perl 5.22 and later provide \b{wb} to match at a position that is a word break position according to the rules of UAX #29. \B{wb} matches all other positions. \b remains a traditional word boundary in Perl. ICU has the mode modifier (?w) and the flag UREGEX_UWORD to change \b to match at a UAX #29 word break position and \B to match all other positions.

\b{wb} and (?w)\b can produce significantly different results compared with the traditional \b. A key consideration is that UAX #29 is more geared towards finding potential word break positions than the boundaries of actual words. \b does not match the string !!! at all because it contains no word characters. But \b{wb} matches at all 4 positions in this string because an algorithm that chops a string into words (such as for word wrapping purposes) is free to break before and after an exclamation point.

UAX #29 treats underscores and other connector punctuation as word characters. \b{wb} only matches at the start and the end of the strings one_word and _underscore_.

UAX #29 has no special rule for hyphens, so \b{wb} matches before and after any hyphen. But it will not match before or after an apostrophe if it is between two letters. So \b{wb} matches only at the start and end of John's, but finds 4 matches in 'John's' (start of string, after opening quote, before closing quote, and end of string). Do keep in mind that the meaning of \w hasn’t changed. \b{wb}\w+\b{wb} cannot match any part of the previous two strings because \w never matches the apostrophe.

UAX #29 breaks before and after ideographs. So \b{wb} matches at all 5 positions in 中文单词. Ideographs are usually written without spaces between words. Basic word wrapping algorithms break between any two ideographs.

In Unicode 10.0.0 and prior, UAX #29 ignored horizontal whitespace. This means that in ICU 61 and prior, (?w)\b matches before and after every space. Unicode 11.0.0 added rule WB3d to keep horizontal whitespace together. So in ICU 62 and later, (?w)\b does not match between two spaces. But UAX #29 has always maintained that a word break position exists before and after every line break, except in the middle of a CRLF pair, because word wrapping should preserve explicit line breaks. UAX #29 does not treat a tab as horizontal whitespace and thus will break before and after tabs.

Perl’s developers disagreed. While Perl 5.22—which was based on Unicode 7.0.0—matches \b{wb} before and after every space, tab, and line break, Perl 5.24 and later do not allow \b{wb} to match between two characters that are matched by \s, which includes all spaces, tabs, and line breaks. Even Perl 5.30 and later—which are based on Unicode 11.0.0 or later—have stuck with the rule of not allowing \b{wb} to match between two line breaks.

Unicode Standard Annex 29 Grapheme Boundaries

We have previously discussed the difference between code points and graphemes and how regex engines treat code points as characters but people are more likely to see graphemes as characters. A grapheme boundary is a position between one grapheme and the next, or before the first or after the last grapheme in a string. The string àé encoded as 4 code points U+0061 U+0300 U+0065 U+0301, using 2 ASCII letters and 2 combining diacritics, has 3 grapheme boundary positions: the start of the string, between U+0300 (grave accent) and U+0065 (letter e), and the end of the string.
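The difference between code points and graphemes in this sample string can be checked with a short Java sketch. It assumes Java 9 or later, where java.util.regex documents \X as matching one extended grapheme cluster. The class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class GraphemeCount {
    // Number of Unicode code points in the string.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Number of grapheme clusters, counted as successive matches of \X.
    // Assumes Java 9+, where \X matches an extended grapheme cluster.
    static int graphemes(String text) {
        Matcher m = Pattern.compile("\\X").matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // a + combining grave accent, e + combining acute accent
        String sample = "a\u0300e\u0301";
        System.out.println(codePoints(sample)); // 4
        System.out.println(graphemes(sample));  // 2
    }
}
```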

In Perl 5.22 and later and in Java 9 and later, \b{g} matches at these grapheme boundary positions. Perl also supports the alternative syntax \b{gcb}. The letters gcb stand for "grapheme cluster boundary". In Unicode parlance, the code points U+0061 U+0300 form a grapheme cluster that represents the grapheme à.
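As a quick check of the three positions named above, this Java sketch collects every offset at which \b{g} matches in the sample string. It assumes Java 9 or later. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class GraphemeBoundaries {
    // Returns every offset in text at which \b{g} matches (Java 9+).
    static List<Integer> positions(String text) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\b{g}").matcher(text);
        // Each match is zero-length, so only the start offset matters.
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // a + combining grave accent, e + combining acute accent
        String sample = "a\u0300e\u0301";
        // Start of string, between the two graphemes, end of string.
        System.out.println(positions(sample)); // [0, 2, 4]
    }
}
```

The offsets are positions in Java's UTF-16 string. For this sample they coincide with code point positions because all four code points are in the Basic Multilingual Plane.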

Perl, but not Java, supports \B{g} and \B{gcb} to match any position between two code points that is not a grapheme cluster boundary. In our sample string these find two matches. The first match is the position between U+0061 (letter a) and U+0300 (grave accent). The second match is the position between U+0065 (letter e) and U+0301 (acute accent).
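In Java, one possible workaround for the missing \B{g} is to collect the \b{g} positions and then report every other position that lies between two code points. This is only a sketch of that idea under the assumption of Java 9 or later, not an equivalent of Perl's \B{g} inside a larger regex. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class GraphemeNonBoundaries {
    // Returns the offsets between two code points that are NOT
    // grapheme cluster boundaries, emulating what \B{g} would find.
    static List<Integer> positions(String text) {
        // Step 1: record where \b{g} matches (Java 9+).
        Set<Integer> boundaries = new HashSet<>();
        Matcher m = Pattern.compile("\\b{g}").matcher(text);
        while (m.find()) {
            boundaries.add(m.start());
        }
        // Step 2: walk the string one code point at a time and keep
        // the interior offsets that were not recorded as boundaries.
        List<Integer> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            i = text.offsetByCodePoints(i, 1);
            if (i < text.length() && !boundaries.contains(i)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // a + combining grave accent, e + combining acute accent
        String sample = "a\u0300e\u0301";
        // Between a and its accent, and between e and its accent.
        System.out.println(positions(sample)); // [1, 3]
    }
}
```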

Unicode Standard Annex 29 Sentence Boundaries

Sentence boundaries are intended to allow applications to select entire sentences or to process text sentence by sentence. A sentence boundary matches at any position where the previous sentence ends and the next sentence begins, as well as at the start and end of any string. Perl 5.22 and later match these positions with \b{sb}. All other positions can be matched with \B{sb}.

UAX #29’s sentence boundary rules are a lot smarter than just treating every full stop as the end of a sentence. But they’re not perfect. In the string "Dr. John works at I.B.M., doesn't he?", asked Alice. "Yes," replied Charlie., the regex \b{sb}.+?\b{sb} finds 3 matches: "Dr. , John works at I.B.M., doesn't he?", asked Alice. , and "Yes," replied Charlie.. The full stop after Dr ends a sentence because it is followed by a space and then a capital letter, even though Dr. is only an abbreviation. The full stops in I.B.M. do not end a sentence because each is followed directly by a capital letter or a comma rather than by a space. The question mark does not trigger a sentence break because of the comma that follows, even with the quote in between.
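Java's regex flavor has no \b{sb}. For processing text sentence by sentence in Java, the JDK offers java.text.BreakIterator as a separate facility. The sketch below assumes that its sentence instance is an acceptable stand-in. Its rules are the JDK's own and are not guaranteed to agree with UAX #29 or with Perl's \b{sb}, so no particular split of the sample string is claimed here. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of any library.
class SentenceSplitter {
    // Cuts text at the sentence boundaries reported by the JDK's
    // BreakIterator. These may differ from Perl's \b{sb} positions.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first(); // always the start of the text
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            // Each piece keeps its trailing whitespace, just like the
            // regex matches between two sentence boundaries do.
            result.add(text.substring(start, end));
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "\"Dr. John works at I.B.M., doesn't he?\", asked Alice. "
                + "\"Yes,\" replied Charlie.";
        for (String s : sentences(sample)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Because the boundaries include the start and the end of the text, the pieces always concatenate back to the original string, whichever rules produced them.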

Unicode Standard Annex 14 Line Boundaries

UAX #14 provides a complex set of rules for finding positions in a string or file where it would be appropriate to break it into multiple lines in order to fit the available horizontal space, such as when a word processor should advance to the next line so that the text does not run beyond the right-hand margin of the page. For European languages a simple algorithm that wraps after spaces and hyphens is usually sufficient. But many scripts don’t use spaces to separate words. UAX #14 attempts to handle those appropriately by providing breaking points where it makes sense, such as between ideographs, but not where it would hurt readability, such as after opening quotes or before closing quotes.

Perl 5.24 and later match these positions with \b{lb}. All other positions can be matched with \B{lb}. Of note, all the UAX #29 boundaries match at the start and end of the string. But \b{lb} never matches at the start of the string because word wrapping shouldn’t add a blank line at the start of the text. \b{lb} does always match at the end of the string. The inverse \B{lb} always matches at the start of the string and never at the end of the string.