Grapheme clusters, a.k.a. real characters

Rustom Mody rustompmody at gmail.com
Sun Jul 16 00:33:28 EDT 2017


On Sunday, July 16, 2017 at 4:09:16 AM UTC+5:30, Mikhail V wrote:
> On 15 July 2017 05:50 pm, Marko Rauhamaa wrote:
> > Random access to code points is as uninteresting as random access to
> > UTF-8 bytes.
> > I might want random access to the "Grapheme clusters, a.k.a.real
> > characters".
>
> What _real_ characters are you referring to?
> If your data has "á" (U00E1), then it is one real character,
> if you have "a" (U0061) and "ˊ" (U02CA) then it is _two_
> real characters. So in both cases you have access to code points =
> real characters.

Right now in an adjacent mailing list (Debian) I see someone signed off with
grüß
I guess the third character is a u with some ‘dirt’.
What's the fourth?
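
A rough way to put that question to Unicode itself, as a sketch using only Python's standard library (`decompose` is a hypothetical helper name, not something from the thread): ask the canonical decomposition (NFD) what each code point of the word is made of.

```python
import unicodedata

def decompose(ch):
    """Return the names of the code points in the NFD form of ch."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

# "grüß", written with escapes so the code points are unambiguous
word = "gr\u00fc\u00df"

for ch in word:
    # One line per code point: its number and what NFD breaks it into
    print(f"U+{ord(ch):04X}", decompose(ch))
```

Under NFD the third character splits into a base letter plus a combining mark, while the fourth has no decomposition at all, so "letter with dirt" covers one of them but not the other.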
>
> For metaphysical discussion - in _my_ definition there

s/metaphysical/linguistic
> is no such "real" character as "á", since it is the "a" glyph with some dirt,
> so according to my definition, it should be two separate characters,
> both semantically and technically seen.
>
> And, in my definition, the whole Unicode is a huge junkyard, to start with.
>
> But opinions may vary, and in case you prefer or forced to write "á",
> then it can be impractical to store it as two characters, regardless of
> encoding.

Heck, even in the English that I learnt in school we had
ægis, homœopath, etc.
And just now looking up:
https://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
I see economics is œconomics!!
Seriously, the word "ligature", like the word "grapheme", is misleading.
It's not a graphical or typographic notion; it's an atom of the language's lexicon.
No Hindi speaker seeing
क + ई = की
calls the last anything but a letter.
And the vowel sign ी is never a first-class vowel.
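
For readers who want to see how this looks at the code point level, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. It only inspects code points; it does not group them into grapheme clusters, which needs a separate segmentation step.

```python
import unicodedata

def describe(text):
    """List (code point, name, general category) for each code point in text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c), unicodedata.category(c))
            for c in text]

# की = क (letter KA) followed by ी (vowel sign II)
ki = "\u0915\u0940"
print(len(ki))  # number of code points, not number of letters a reader sees
for row in describe(ki):
    print(row)

# There is no precomposed code point for की, so NFC leaves it unchanged
print(unicodedata.normalize("NFC", ki) == ki)

# By contrast, "á" exists both as one code point and as base + combining mark
composed = "\u00e1"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))
```

The vowel sign shows up with a "mark" category rather than a "letter" category, which matches the observation that it is not a first-class vowel, while the combination is still a single letter to a Hindi reader.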

