Message 359409 - Python tracker



Author	Manishearth
Recipients	Bert JW Regeer, Guillaume Sanchez, Manishearth, Socob, _savage, benjamin.peterson, bianjp, ezio.melotti, lemburg, loewis, mcepl, methane, mrabarnett, p-ganssle, r.david.murray, scoder, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, xiang.zhang
Date	2020-01-06.09:13:05
Message-id	<1578301985.47.0.653194398748.issue30717@roundup.psfhosted.org>

Content

> one never needs to look at more than two adjacent code points to tell whether or not a grapheme break will occur between them, so this ought to be pretty efficient.

That note is outdated (and has been outdated since Unicode 9). The regional indicator rules (GB12 and GB13) and the emoji rule (GB11) require arbitrary lookbehind (though thankfully not arbitrary lookahead).
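
To illustrate the lookbehind point, here is a minimal sketch that covers only the regional indicator rules (GB12 and GB13). The helper names `is_regional_indicator` and `ri_break_before` are hypothetical, made up for this example, and the function refuses any input that is not a pair of regional indicators. The same two adjacent code points give different answers depending on how many regional indicators precede them.

```python
def is_regional_indicator(ch: str) -> bool:
    """True for REGIONAL INDICATOR SYMBOL LETTER A..Z (U+1F1E6..U+1F1FF)."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def ri_break_before(text: str, i: int) -> bool:
    """Is there a grapheme break between text[i-1] and text[i]?

    Hypothetical helper: handles only the case where both code points are
    regional indicators (rules GB12/GB13), nothing else.
    """
    if not 0 < i < len(text):
        raise ValueError("i must be an interior index")
    if not (is_regional_indicator(text[i - 1]) and is_regional_indicator(text[i])):
        raise ValueError("this sketch only handles regional indicator pairs")
    # Walk backwards over the whole run of regional indicators ending at i-1.
    # This loop is the unbounded lookbehind.
    run = 0
    j = i - 1
    while j >= 0 and is_regional_indicator(text[j]):
        run += 1
        j -= 1
    # An odd run means text[i-1] is the first half of a flag, so no break.
    return run % 2 == 0


F, R, D = "\U0001F1EB", "\U0001F1F7", "\U0001F1E9"
print(ri_break_before(R + D, 1))      # R and D pair up as one flag
print(ri_break_before(F + R + D, 2))  # F and R pair up, so D starts a new cluster
```

Both calls look at the same adjacent pair (R, D), so a decision based on those two code points alone cannot be right in both cases.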
I think the ideal API surface is an iterator and nothing else. Everything else can be derived from the iterator. It's theoretically possible to expose an is_grapheme_break that's faster than just iterating -- look at the code in unicode-segmentation's _reverse_ iterator to see how -- but it's going to be tricky to get right. Building the iterator on top of is_grapheme_break is not a good idea.
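
As a rough sketch of "everything else can be derived from the iterator": the `iter_graphemes` below is a deliberately incomplete stand-in (an assumption for illustration, not a proposed implementation) that only pairs regional indicators and otherwise yields one code point per cluster. The derived helpers `grapheme_len`, `grapheme_boundaries` and this naive `is_grapheme_break` are likewise hypothetical names; they only show the direction of the dependency.

```python
from itertools import accumulate
from typing import Iterator, List


def _is_ri(ch: str) -> bool:
    """True for regional indicator symbols (U+1F1E6..U+1F1FF)."""
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def iter_graphemes(text: str) -> Iterator[str]:
    """Toy segmenter: pairs regional indicators, else one code point each.

    A real iterator would implement all of the UAX #29 rules; this one
    exists only so the derived helpers below have something to consume.
    """
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        if _is_ri(text[i]) and j < n and _is_ri(text[j]):
            j += 1  # consume the second half of the flag
        yield text[i:j]
        i = j


def grapheme_len(text: str) -> int:
    """Number of clusters, derived purely from the iterator."""
    return sum(1 for _ in iter_graphemes(text))


def grapheme_boundaries(text: str) -> List[int]:
    """Code point offsets at which a cluster starts or the text ends."""
    return [0, *accumulate(len(g) for g in iter_graphemes(text))]


def is_grapheme_break(text: str, index: int) -> bool:
    """Correct by construction, but O(n): it re-iterates from the start."""
    return index in grapheme_boundaries(text)


flags = "a\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EAb"
print(grapheme_len(flags))
print(grapheme_boundaries(flags))
```

Going in this direction the iterator carries the state (here, whether a flag is half-finished) as it advances, whereas an iterator built on top of `is_grapheme_break` would have to rediscover that state at every position.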

History
Date	User	Action	Args
2020-01-06 09:13:05	Manishearth	set	recipients: + Manishearth, lemburg, loewis, terry.reedy, scoder, vstinner, benjamin.peterson, mcepl, ezio.melotti, mrabarnett, steven.daprano, r.david.murray, methane, serhiy.storchaka, _savage, xiang.zhang, p-ganssle, Socob, Guillaume Sanchez, Bert JW Regeer, bianjp
2020-01-06 09:13:05	Manishearth	set	messageid: <1578301985.47.0.653194398748.issue30717@roundup.psfhosted.org>
2020-01-06 09:13:05	Manishearth	link	issue30717 messages
2020-01-06 09:13:05	Manishearth	create
