This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: unicodedata.itergraphemes / str.itergraphemes / str.graphemes
Type: enhancement
Stage: resolved
Components: Unicode
Versions: Python 3.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Add unicode grapheme cluster break algorithm (issue 30717)
Assigned To:
Nosy List: Socob, benjamin.peterson, cvrebert, dpk, ezio.melotti, lemburg, loewis, mrabarnett, serhiy.storchaka
Priority: normal
Keywords:

Created on 2013-07-08 18:25 by dpk, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg192684 - Author: David P. Kendal (dpk) Date: 2013-07-08 18:25
On python-ideas I proposed the addition of a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type. <http://mail.python.org/pipermail/python-ideas/2013-July/021916.html>
I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners". <http://mail.python.org/pipermail/python-ideas/2013-July/021917.html>
M.-A. Lemburg asked me to open this issue. <http://mail.python.org/pipermail/python-ideas/2013-July/021929.html>
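To illustrate why "character followed by combiners" falls short, here is a minimal sketch of that naive definition. It is not the sample implementation posted to python-ideas; the function name `naive_graphemes` is hypothetical, and the only library call used is `unicodedata.combining`.

```python
import unicodedata


def naive_graphemes(string):
    """Yield clusters made of one character plus any following combiners.

    Illustrative only: this is the simplified definition, not the full
    Unicode grapheme cluster rules.
    """
    cluster = ""
    for char in string:
        # A character with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(char) == 0:
            yield cluster
            cluster = ""
        cluster += char
    if cluster:
        yield cluster
```

This groups "e" + U+0301 correctly, but it splits "\r\n" and conjoining Hangul jamo such as U+1100 U+1161 into separate pieces, because those characters have combining class 0. The Unicode rules keep both sequences together.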
msg192724 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2013-07-09 07:42
It may be useful to also add the start position of the grapheme to the iterator output.
Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module:
http://mail.python.org/pipermail/python-dev/2001-July/015938.html 
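One possible shape for an iterator that also reports start positions is sketched below. The names `iter_graphemes_with_start` and `combining_only_boundary` are hypothetical, and the boundary test is passed in as a parameter so that any rule set can be plugged in. The demo predicate is a deliberately simplified stand-in, not the Unicode algorithm.

```python
import unicodedata


def iter_graphemes_with_start(string, is_boundary):
    """Yield (start, cluster) pairs.

    'is_boundary(string, index)' is an assumed callable that returns True
    when a cluster break falls just before string[index].
    """
    start = 0
    for index in range(1, len(string)):
        if is_boundary(string, index):
            yield start, string[start:index]
            start = index
    # Whatever remains after the last break is the final cluster.
    if string:
        yield start, string[start:]


def combining_only_boundary(string, index):
    """Simplified demo predicate: break unless the next char is a combiner."""
    return unicodedata.combining(string[index]) == 0
```

For example, `list(iter_graphemes_with_start("e\u0301ab", combining_only_boundary))` gives `[(0, "e\u0301"), (2, "a"), (3, "b")]`, so a caller can slice or index the original string by cluster.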
msg192769 - Author: Matthew Barnett (mrabarnett) (Python triager) Date: 2013-07-09 17:25
This is basically what the regex module does, written in Python:
    def get_grapheme_cluster_break(codepoint):
        """Gets the "Grapheme Cluster Break" property of a codepoint.

        The properties defined here:

        http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
        """
        # The return value is one of:
        #
        # "Other"
        # "CR"
        # "LF"
        # "Control"
        # "Extend"
        # "Prepend"
        # "Regional_Indicator"
        # "SpacingMark"
        # "L"
        # "V"
        # "T"
        # "LV"
        # "LVT"
        ...

    def at_grapheme_boundary(string, index):
        """Checks whether the codepoint at 'index' is on a grapheme boundary.

        The rules are defined here:

        http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
        """
        # Break at the start and end of the text.
        if index <= 0 or index >= len(string):
            return True

        prop = get_grapheme_cluster_break(string[index])
        prop_m1 = get_grapheme_cluster_break(string[index - 1])

        # Don't break within CRLF.
        if prop_m1 == "CR" and prop == "LF":
            return False

        # Otherwise break before and after controls (including CR and LF).
        if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
            return True

        # Don't break Hangul syllable sequences.
        if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
            return False
        if prop_m1 in ("LV", "V") and prop in ("V", "T"):
            return False
        if prop_m1 in ("LVT", "T") and prop == "T":
            return False

        # Don't break between regional indicator symbols.
        # (Uses the same spelling as the return values listed above.)
        if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":
            return False

        # Don't break just before Extend characters.
        if prop == "Extend":
            return False

        # Don't break before SpacingMarks, or after Prepend characters.
        if prop == "SpacingMark":
            return False

        if prop_m1 == "Prepend":
            return False

        # Otherwise, break everywhere.
        return True
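The property lookup in the code above is left elided. For experimenting with the boundary rules, a rough stand-in can be built from `unicodedata.category` and the fixed Hangul and regional-indicator code point ranges. This is only an approximation under stated assumptions, not the regex module's implementation: the name `approx_grapheme_cluster_break` is hypothetical, the mapping from general categories is inexact (for example, some format characters are Extend in the real data), and "Prepend" is never returned. A faithful version would read GraphemeBreakProperty.txt.

```python
import unicodedata


def approx_grapheme_cluster_break(char):
    """Approximate the Grapheme_Cluster_Break property of one character.

    Assumption: general categories stand in for the real property data,
    so results differ from GraphemeBreakProperty.txt in edge cases.
    """
    cp = ord(char)
    if char == "\r":
        return "CR"
    if char == "\n":
        return "LF"
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "Regional_Indicator"
    # Conjoining Hangul jamo (leading, vowel, trailing).
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    # Precomposed syllables: every 28th one has no trailing consonant.
    if 0xAC00 <= cp <= 0xD7A3:
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    category = unicodedata.category(char)
    if category in ("Mn", "Me"):
        return "Extend"
    if category == "Mc":
        return "SpacingMark"
    if category in ("Cc", "Cf", "Zl", "Zp"):
        return "Control"
    return "Other"
```

The returned strings use the same spellings as the list of return values in the message above, so the function can be substituted for the elided lookup when trying out the boundary rules on simple inputs.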
msg299697 - Author: Serhiy Storchaka (serhiy.storchaka) (Python committer) Date: 2017-08-03 11:07
Issue30717 has a patch.
History
Date                 User              Action  Args
2022-04-11 14:57:47  admin             set     github: 62606
2017-08-03 11:07:26  serhiy.storchaka  set     status: open -> closed; superseder: Add unicode grapheme cluster break algorithm; messages: + msg299697; resolution: duplicate; stage: needs patch -> resolved
2017-07-24 04:21:00  serhiy.storchaka  set     nosy: + serhiy.storchaka; versions: + Python 3.7, - Python 3.4, Python 3.5
2017-07-24 02:20:01  Socob             set     nosy: + Socob
2013-07-09 17:25:36  mrabarnett        set     nosy: + mrabarnett; messages: + msg192769
2013-07-09 07:42:49  lemburg           set     nosy: + lemburg; messages: + msg192724; components: + Unicode
2013-07-08 18:34:13  ezio.melotti      set     nosy: + loewis, benjamin.peterson, ezio.melotti; stage: needs patch
2013-07-08 18:32:32  cvrebert          set     nosy: + cvrebert
2013-07-08 18:25:45  dpk               create