Message359408
Author: steven.daprano
Recipients: Bert JW Regeer, Guillaume Sanchez, Manishearth, Socob, _savage, benjamin.peterson, bianjp, ezio.melotti, lemburg, loewis, mcepl, methane, mrabarnett, p-ganssle, r.david.murray, scoder, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, xiang.zhang
Date: 2020-01-06.08:44:46
SpamBayes Score: -1.0
Marked as misclassified: Yes
Message-id: <20200106084438.GA839@ando.pearwood.info>
In-reply-to: <1578280890.0.0.889384912462.issue30717@roundup.psfhosted.org>
Content:
> I think it would be a mistake to make the stdlib use this for most
> notions of what a "character" is, as I said this notion is also
> inaccurate. Having an iterator library somewhere that you can use and
> compose is great, changing the internal workings of string operations
> would be a major change, and not entirely productive.

Agreed.

I won't pretend to be able to predict what Python 5.0 will bring *wink*
but there's too much history around the "code point = character" notion
for the language to change now.

If the language can expose a grapheme iterator, then people can
experiment with grapheme-based APIs in libraries.

(By grapheme I mean "extended grapheme cluster", but that's a mouthful.
Sorry linguists.)

What do you think of these as a set of grapheme primitives?

(1) is_grapheme_break(string, i)
    Return True if a grapheme break would occur *before* string[i].

(2) graphemes(string, start=0, end=len(string))
    Iterate over graphemes in string[start:end].

(3) graphemes_reversed(string, start=0, end=len(string))
    Iterate over graphemes in reverse order.

I *think* is_grapheme_break would be enough for people to implement
their own versions of graphemes and graphemes_reversed. Here's an
untested version:

def graphemes(string, start, end):
    cluster = []
    for i in range(start, end):
        c = string[i]
        if is_grapheme_break(string, i):
            if i != start:
                # don't yield the empty cluster at Start Of Text
                yield ''.join(cluster)
            cluster = [c]
        else:
            cluster.append(c)
    if cluster:
        yield ''.join(cluster)
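
The reverse iterator can be built on the same primitive. The following is only a sketch: the is_grapheme_break shown here is a toy stand-in (an assumption for illustration, not the proposed implementation) that keeps CR LF together and attaches combining marks to the preceding code point. The real UAX #29 rules cover much more. The end=None default is also an assumption, since end=len(string) is not a valid default in an actual signature.

```python
import unicodedata


def is_grapheme_break(string, i):
    """Toy stand-in (assumption): True if a break occurs before string[i].

    Handles only CR LF and combining marks; UAX #29 defines many more rules.
    """
    if i <= 0 or i >= len(string):
        return True  # start or end of text
    prev, cur = string[i - 1], string[i]
    if prev == "\r" and cur == "\n":
        return False  # keep CR LF together
    # Marks (categories Mn, Mc, Me) stay attached to the preceding code point.
    return not unicodedata.category(cur).startswith("M")


def graphemes_reversed(string, start=0, end=None):
    """Yield the clusters of string[start:end], last cluster first."""
    if end is None:
        end = len(string)
    stop = end  # exclusive end of the cluster currently being scanned
    for i in range(end - 1, start - 1, -1):
        # Walking backwards, a break before string[i] closes the cluster
        # string[i:stop]; reaching `start` always closes the final one.
        if i == start or is_grapheme_break(string, i):
            yield string[i:stop]
            stop = i
```

With this stand-in, list(graphemes_reversed("e\u0301a")) gives ["a", "e\u0301"]: the combining acute accent stays with the "e".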
Regarding is_grapheme_break, if I understand the note here:
https://www.unicode.org/reports/tr29/#Testing
one never needs to look at more than two adjacent code points to tell
whether or not a grapheme break will occur between them, so this ought
to be pretty efficient. At worst, it needs to look at string[i-1] and
string[i], if they exist.
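
One possible caveat to the two-code-point estimate, going by my reading of the rules in UAX #29: regional indicators (the flag emoji) are grouped in pairs by rules GB12 and GB13, and emoji ZWJ sequences by GB11, and those rules refer to what precedes the adjacent pair. The sketch below covers only the regional indicator case. Both function names are hypothetical helpers, not part of any proposed API.

```python
def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def break_between_regional_indicators(string, i):
    """Hypothetical helper: is there a break before string[i]?

    Assumes string[i - 1] and string[i] are both regional indicators.
    """
    # Count the unbroken run of regional indicators ending at string[i - 1].
    run = 0
    j = i - 1
    while j >= 0 and is_regional_indicator(string[j]):
        run += 1
        j -= 1
    # An odd run means string[i - 1] opens a pair that string[i] completes,
    # so there is no break. An even run means the previous pair is complete.
    return run % 2 == 0


# Two flags back to back: D E F R (four regional indicators).
flags = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
inside_first_flag = break_between_regional_indicators(flags, 1)
between_flags = break_between_regional_indicators(flags, 2)
```

Here inside_first_flag is False and between_flags is True, even though both positions sit between two regional indicators, so the adjacent pair alone does not decide the answer in this case. The backward scan is still cheap in practice, because runs of regional indicators are normally short.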