Racket 8.7 added basic support for working with Unicode grapheme clusters, where multiple code points make up an entity that is rendered as a single character. This module expands that functionality and adds word and sentence breaks from Unicode Standard Annex #29, Text Segmentation. It does not attempt to provide language- or locale-specific algorithms.

The rules used are in accordance with Unicode 15.1, to match Racket 8.13.

1 Grapheme Breaks

procedure
(in-graphemes str [start end]) → (sequence/c string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a sequence that produces a series of strings, one grapheme of the specified range of str per entry. The result is undefined if start is not the first index of a grapheme cluster.

procedure
(string-split-graphemes str [start end]) → (listof string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a list of the graphemes of the specified range of str. The result is undefined if start is not the first index of a grapheme cluster.
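To make the idea of splitting by grapheme concrete, here is a minimal sketch that uses only string-grapheme-span from racket/base (the built-in support added in Racket 8.7). The helper name graphemes-of is hypothetical, and this is not this module's implementation; it only illustrates the kind of result a grapheme split produces.

```racket
#lang racket/base
;; Sketch only: graphemes-of is a hypothetical helper, not part of this module.
;; string-grapheme-span returns how many characters form the grapheme
;; cluster that begins at index i (0 only when i equals end).
(define (graphemes-of str [start 0] [end (string-length str)])
  (let loop ([i start] [acc '()])
    (if (>= i end)
        (reverse acc)
        (let ([n (string-grapheme-span str i end)])
          (loop (+ i n) (cons (substring str i (+ i n)) acc))))))

;; "e" followed by U+0301 COMBINING ACUTE ACCENT is one grapheme,
;; so this three-character string splits into two strings.
(graphemes-of "e\u0301a")
```

The combining mark stays attached to its base letter, which is why the result has two elements rather than three.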

procedure
(string-split-graphemes/immutable str [start end])
 → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Same as string-split-graphemes, but returns immutable strings.

procedure
(string-grapheme-indexes str [start end])
 → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a list of the starting indexes of each grapheme in the specified range of str. The result is undefined if start is not the first index of a grapheme cluster.
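As an illustration of what "starting indexes" means here, the following sketch computes them with string-grapheme-span from racket/base. The helper name grapheme-start-indexes is hypothetical and the code is not taken from this module.

```racket
#lang racket/base
;; Sketch only: grapheme-start-indexes is a hypothetical helper.
;; It records the index at which each grapheme cluster begins.
(define (grapheme-start-indexes str [start 0] [end (string-length str)])
  (let loop ([i start] [acc '()])
    (if (>= i end)
        (reverse acc)
        (loop (+ i (string-grapheme-span str i end)) (cons i acc)))))

;; Characters: e(0) U+0301(1) a(2) CR(3) LF(4) b(5).
;; The accent joins "e", and CR LF forms a single cluster.
(grapheme-start-indexes "e\u0301a\r\nb") ; => '(0 2 3 5)
```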

2 Word Breaks

procedure
(char-word-break-property ch) → symbol?
  ch : char?

Returns the Unicode word break property of the given character, which is one of the following symbols: 'ALetter, 'CR, 'Double_Quote, 'Extend, 'ExtendNumLet, 'Format, 'Hebrew_Letter, 'Katakana, 'LF, 'MidLetter, 'MidNum, 'MidNumLet, 'Newline, 'Numeric, 'Other, 'Regional_Indicator, 'Single_Quote, 'WSegSpace or 'ZWJ.

procedure
(string-word-break-at? str i [start end]) → boolean?
  str : string?
  i : exact-nonnegative-integer?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns #t if a word break exists before the character at index i. There is always a break at start and at end.

procedure
(string-word-span str start [end]) → exact-nonnegative-integer?
  str : string?
  start : exact-nonnegative-integer?
  end : exact-nonnegative-integer? = (string-length str)

Returns the number of characters (code points) between start and the next Unicode word break, without going past end.

procedure
(in-words str [start end #:skip-blanks? skip-blanks?])
 → (sequence/c string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f

Returns a sequence that produces a series of strings, one word of the specified range of str per entry. If #:skip-blanks? is true, "words" that consist only of white space are omitted.

procedure
(string-split-words str [start end #:skip-blanks? skip-blanks?])
 → (listof string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f

Returns a list of the words in the specified range of str. If #:skip-blanks? is true, "words" that consist only of white space are omitted.
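A short usage sketch of the word functions follows. The module path unicode-breaks is an assumption, since this section does not name it; adjust the require form to match your installation. Under the default Unicode Standard Annex #29 rules, punctuation and runs of white space are segments of their own, which is what #:skip-blanks? filters.

```racket
#lang racket/base
;; ASSUMPTION: the module path is `unicode-breaks`; it is not stated here.
(require unicode-breaks)

(define text "Hello, world!")

;; Every segment, including punctuation and whitespace-only "words".
(define all-words (string-split-words text))

;; The same split with whitespace-only segments omitted.
(define visible-words (string-split-words text #:skip-blanks? #t))

;; Lazily iterate over a sub-range: index 7 is the "w" of "world".
(for ([w (in-words text 7)])
  (printf "~s\n" w))
```

Prefer in-words when the segments are consumed one at a time, and string-split-words when the whole list is needed.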

procedure
(string-split-words/immutable str [start end #:skip-blanks? skip-blanks?])
 → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)
  skip-blanks? : any/c = #f

Same as string-split-words, but returns immutable strings.

procedure
(string-word-break-indexes str [start end])
 → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a list of the indexes of each word break in the specified range of str. The implicit breaks at the beginning and end of the string are included.

3 Sentence Breaks

procedure
(char-sentence-break-property ch) → symbol?
  ch : char?

Returns the Unicode sentence break property of the given character, which is one of the following symbols: 'ATerm, 'CR, 'Close, 'Extend, 'Format, 'LF, 'Lower, 'Numeric, 'OLetter, 'Other, 'SContinue, 'STerm, 'Sep, 'Sp or 'Upper.

procedure
(in-sentences str [start end]) → (sequence/c string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a sequence that produces a series of strings, one sentence in the specified range of str per entry. The result is undefined if start is not the first index of a sentence.

procedure
(string-split-sentences str [start end]) → (listof string?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a list of the sentences of the specified range of str. The result is undefined if start is not the first index of a sentence.
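The sentence functions can be used in the same way as the grapheme and word functions. This is a minimal sketch that again assumes the module path unicode-breaks, which this section does not state. No expected output is shown because the exact segment boundaries depend on the Unicode Standard Annex #29 sentence rules.

```racket
#lang racket/base
;; ASSUMPTION: the module path is `unicode-breaks`; it is not stated here.
(require unicode-breaks)

(define text "It works. Does it? Yes!")

;; Collect every sentence of the string as a list.
(define sentences (string-split-sentences text))

;; Or walk the sentences one at a time without building the list.
(for ([s (in-sentences text)])
  (printf "~s\n" s))
```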

procedure
(string-split-sentences/immutable str [start end])
 → (listof (and/c string? immutable?))
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Same as string-split-sentences, but returns immutable strings.

procedure
(string-sentence-indexes str [start end])
 → (listof exact-nonnegative-integer?)
  str : string?
  start : exact-nonnegative-integer? = 0
  end : exact-nonnegative-integer? = (string-length str)

Returns a list of the indexes of the start of each sentence in the specified range of str. The result is undefined if start is not the first index of a sentence.

4 Other functions

procedure
(char-east-asian-width-property ch) → (or/c 'N 'Na 'H 'A 'F 'W)
  ch : char?

Returns the Annex #11 East Asian Width property assigned to the given character.
