Message 144723 - Python tracker


Author	tchrist
Recipients	Arfrever, ezio.melotti, gvanrossum, loewis, tchrist, terry.reedy, vstinner
Date	2011-10-01.11:07:48
SpamBayes Score	6.5948184e-06
Marked as misclassified	No
Message-id	<32317.1317467261@chthon>
In-reply-to	<4E86F2A2.9020107@v.loewis.de>

Content

Martin v. Löwis <report@bugs.python.org> wrote
 on 2011-10-01 10:59:48 -0000: 
>> * Word characters are Alphabetic + Mn+Mc+Me + Nd + Pc.
> Where did you get that definition from? UTS#18 defines
> "<word_character>", which is Alphabetic + U+200C + U+200D
> (i.e. not including marks, but including those
From UTS#18 RL1.2A in Annex C, where a \p{word} or \w character 
is defined to be 
 \p{alpha}
 \p{gc=Mark}
 \p{digit}
 \p{gc=Connector_Punctuation}
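
As a rough illustration of that four-part definition, here is a minimal Python sketch using only the standard library. The function name `is_word_char` is hypothetical. One loud assumption: `unicodedata` does not expose the Alphabetic property, so the sketch approximates \p{alpha} with general categories L* and Nl, which misses Other_Alphabetic characters. It is not what the `re` module does.

```python
import unicodedata

def is_word_char(ch: str) -> bool:
    """Approximate the \\w definition quoted above (hypothetical helper)."""
    cat = unicodedata.category(ch)
    return (
        cat.startswith("L") or cat == "Nl"  # ASSUMPTION: rough stand-in for \p{alpha}
        or cat.startswith("M")              # \p{gc=Mark}: Mn, Mc, Me
        or cat == "Nd"                      # \p{digit}
        or cat == "Pc"                      # \p{gc=Connector_Punctuation}
    )

# Letters, combining marks, decimal digits, and the underscore qualify;
# spaces and hyphens do not.
for sample in ["a", "\u0301", "5", "_", " ", "-"]:
    print(repr(sample), is_word_char(sample))
```

A faithful version would need real Alphabetic property data, which the standard library does not provide.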
>> I think what you are looking for here are Word characters without 
>> Nd + Pc, so just Alphabetic + Mn+Mc+Me. 
>> 
>> Is that right?
> 
> With your definition of "Word character" above, yes, that's right.
It's not mine. It's tr18's.
> Marks won't start a word, though.
That's the smarter boundary thing they talk about. 
I'm not myself familiar with \pM
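
To make the "marks won't start a word" point concrete, here is a small sketch, not the boundary algorithm from tr18. It collects runs of Alphabetic + Mn+Mc+Me in which a mark may continue a word but never begin one. The helper names are hypothetical, and Alphabetic is again only approximated by categories L* and Nl, since `unicodedata` lacks that property.

```python
import unicodedata

def is_alpha_approx(ch: str) -> bool:
    # ASSUMPTION: L* and Nl stand in for the Alphabetic property.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat == "Nl"

def is_mark(ch: str) -> bool:
    # General categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")

def words(text: str) -> list:
    """Split text into runs of alphabetics plus marks (hypothetical helper)."""
    result, current = [], ""
    for ch in text:
        # A mark is accepted only when a word is already in progress.
        if is_alpha_approx(ch) or (current and is_mark(ch)):
            current += ch
        else:
            if current:
                result.append(current)
            current = ""
    if current:
        result.append(current)
    return result

# The leading stray mark and the digits are dropped; marks inside words are kept.
print(words("\u0301de\u0301ja\u0300 vu 42"))
```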
> As for terminology: I think the documentation should continue to
> speak about "words" and "letters", and then define what is meant
> in this context. It's not that the Unicode consortium invented
> the term "letter", so we should use it more liberally than just
> referring to the L* categories.
I really don't think it wise to have private definitions of these.
If Letter doesn't mean L?, things get too weird. That's why 
there are separate definitions of alphabetic, word, etc.
--tom

History
Date	User	Action	Args
2011-10-01 11:07:49	tchrist	set	recipients: + tchrist, gvanrossum, loewis, terry.reedy, vstinner, ezio.melotti, Arfrever
2011-10-01 11:07:49	tchrist	link	issue12737 messages
2011-10-01 11:07:48	tchrist	create
