Unicode Break Algorithms
Racket 8.7 added basic support for working with Unicode grapheme clusters, where multiple codepoints make up an entity that is rendered as a single character. This module expands that functionality and adds the word and sentence break algorithms from Unicode Standard Annex #29, Unicode Text Segmentation. It does not attempt to provide language- or locale-specific algorithms.
The rules used are in accordance with Unicode 15.1, to match Racket 8.13.
1 Grapheme Breaks
Returns a sequence that produces a series of strings, one grapheme of the specified range of str per entry. It is undefined if start is not the initial index of a grapheme sequence.
Returns a list of the graphemes of the specified range of str. It is undefined if start is not the initial index of a grapheme sequence.
Returns a list of the starting indexes of each grapheme in the specified range of str. It is undefined if start is not the initial index of a grapheme sequence.
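As a rough illustration of what the list-returning functions above compute, the splitting can be sketched on top of the string-grapheme-span primitive that Racket 8.7 introduced. The helper names grapheme-indexes-sketch and split-graphemes-sketch are hypothetical and are not part of this module; the module's own implementation may well differ.

```racket
#lang racket/base
;; Sketch only: requires Racket 8.7 or later for string-grapheme-span.
;; Both helpers are hypothetical names, not this module's API.

;; Starting index of every grapheme cluster in [start, end).
(define (grapheme-indexes-sketch str [start 0] [end (string-length str)])
  (let loop ([i start] [acc '()])
    (if (>= i end)
        (reverse acc)
        ;; string-grapheme-span reports how many characters the cluster
        ;; beginning at i occupies (at least 1 when i < end).
        (loop (+ i (string-grapheme-span str i end)) (cons i acc)))))

;; The grapheme clusters themselves, as fresh strings.
(define (split-graphemes-sketch str [start 0] [end (string-length str)])
  (let loop ([i start] [acc '()])
    (if (>= i end)
        (reverse acc)
        (let ([next (+ i (string-grapheme-span str i end))])
          (loop next (cons (substring str i next) acc))))))

;; "e" followed by U+0301 (combining acute accent) forms one cluster.
(grapheme-indexes-sketch "e\u0301x") ; '(0 2)
(split-graphemes-sketch "e\u0301x")  ; two strings: "e\u0301" and "x"
```

Both helpers assume, like the functions documented here, that start falls on a cluster boundary.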
2 Word Breaks
Returns the Unicode word break property of the given character, which is one of the following symbols: 'ALetter, 'CR, 'Double_Quote, 'Extend, 'ExtendNumLet, 'Format, 'Hebrew_Letter, 'Katakana, 'LF, 'MidLetter, 'MidNum, 'MidNumLet, 'Newline, 'Numeric, 'Other, 'Regional_Indicator, 'Single_Quote, 'WSegSpace or 'ZWJ.
Returns #t if a word break exists before the character at index i. There is always a break before start and end.
Returns the number of characters/codepoints in the string before the next Unicode word break starting from start and not going past end.
Returns a sequence that produces a series of strings, one word of the specified range of str per entry. If #:skip-blanks? is true, "words" that consist only of white space are omitted.
[start end #:skip-blanks? skip-blanks?])
skip-blanks? : any/c = #f
Returns a list of the words in the specified range of str. If #:skip-blanks? is true, "words" that consist only of white space are omitted.
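The effect of #:skip-blanks? can be pictured as a filter over the segments. The following is a minimal sketch under stated assumptions: blank-segment? and drop-blank-segments are hypothetical helpers, the input list is written by hand, and char-whitespace? stands in for whatever definition of white space the module actually uses.

```racket
#lang racket/base
;; Sketch only: hypothetical helpers, not this module's API.
;; char-whitespace? is an assumed stand-in for the module's own
;; notion of white space.

;; True when every character of seg is white space.
(define (blank-segment? seg)
  (for/and ([c (in-string seg)])
    (char-whitespace? c)))

;; Drop the segments that consist only of white space.
(define (drop-blank-segments segments)
  (filter (lambda (seg) (not (blank-segment? seg))) segments))

;; Hand-written segments of "Hi there": the space between the two
;; words is a segment of its own under the default word break rules.
(drop-blank-segments (list "Hi" " " "there")) ; '("Hi" "there")
```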
[start end #:skip-blanks? skip-blanks?])
skip-blanks? : any/c = #f
Returns a list of the indexes of each word break in the specified range of str. The implicit breaks at the beginning and end of the string are included.
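Because the implicit breaks at both ends are included, each pair of consecutive indexes delimits one word. A small sketch of that relationship follows; segments-from-breaks is a hypothetical helper, and the break list is written by hand rather than produced by this module.

```racket
#lang racket/base
;; Sketch only: segments-from-breaks is a hypothetical helper.

;; Slice str between each pair of consecutive break indexes.
(define (segments-from-breaks str breaks)
  (if (null? breaks)
      '()
      (for/list ([s (in-list breaks)]
                 [e (in-list (cdr breaks))])
        (substring str s e))))

;; Hand-written breaks for "Hi there": before "H", before the space,
;; before "t", and at the end of the string.
(segments-from-breaks "Hi there" '(0 2 3 8)) ; '("Hi" " " "there")
```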
3 Sentence Breaks
Returns the Unicode sentence break property of the given character, which is one of the following symbols: 'ATerm, 'CR, 'Close, 'Extend, 'Format, 'LF, 'Lower, 'Numeric, 'OLetter, 'Other, 'SContinue, 'STerm, 'Sep, 'Sp or 'Upper.
Returns a sequence that produces a series of strings, one sentence in the specified range of str per entry. It is undefined if start is not the initial index of a sentence.
Returns a list of the sentences of the specified range of str. It is undefined if start is not the initial index of a sentence.
(string-split-sentencess/immutable str [start end])
Same as string-split-sentencess, but returns immutable strings.
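For readers unfamiliar with the distinction, the sketch below shows the difference between a mutable and an immutable string using only core Racket functions. How this module produces its immutable results is not stated here; string->immutable-string is simply one way to obtain such a string.

```racket
#lang racket/base
;; Sketch only: illustrates mutable versus immutable strings with
;; core Racket functions, not this module's internals.

;; substring always returns a fresh, mutable string.
(define piece (substring "One. Two." 0 4))

;; string->immutable-string yields an immutable string with the same
;; contents.
(define frozen (string->immutable-string piece))

(immutable? piece)      ; #f
(immutable? frozen)     ; #t
(string=? piece frozen) ; #t
```

An immutable result can be shared freely, since later calls such as string-set! cannot alter it.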
(string-sentence-indexes str [start end])
Returns a list of the indexes of the start of each sentence in the specified range of str. It is undefined if start is not the initial index of a sentence.