Issue 13828: Further improve casefold documentation

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58036

classification

Title:	Further improve casefold documentation
Type:	Stage:	needs patch
Components:	Documentation	Versions:	Python 3.8, Python 3.7

process

Dependencies:	Superseder:
Status:	open	Resolution:
Assigned To:	Mariatta	Nosy List:	Jim.Jewett, Marc Richter, Mariatta, MrSupertash, benjamin.peterson, cheryl.sabella, docs@python, mark, rhettinger
Priority:	normal	Keywords:

Created on 2012-01-19 17:06 by Jim.Jewett, last changed 2022-04-11 14:57 by admin.

Messages (11)
msg151644 - (view)	Author: Jim Jewett (Jim.Jewett) * (Python triager)	Date: 2012-01-19 17:06
> http://hg.python.org/cpython/rev/0b5ce36a7a24
> changeset: 74515:0b5ce36a7a24
> + Casefolding is similar to lowercasing but more aggressive because it is
> + intended to remove all case distinctions in a string. For example, the German
> + lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already
> + lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold`
> + converts it to ``"ss"``.

Perhaps add the recommendation to canonicalize as well. A complete, but possibly too long, try is below:

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold` converts it to ``"ss"``.

Note that most case-insensitive matches should also match compatibility equivalent characters.

The casefolding algorithm is described in section 3.13 of the Unicode Standard. Per D146, a compatibility caseless match can be achieved by

    from unicodedata import normalize

    def caseless_compat(string):
        nfd_string = normalize("NFD", string)
        nfkd1_string = normalize("NFKD", nfd_string.casefold())
        return normalize("NFKD", nfkd1_string.casefold())
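As a rough illustration of what the extra normalization passes buy, the sketch below reuses the `caseless_compat` function from the message above and compares it with plain `casefold()`. The fullwidth test string and the variable names are assumptions chosen for this example; they do not come from the proposed documentation text.

```python
from unicodedata import normalize


def caseless_compat(string):
    """Build a key for compatibility caseless matching, as proposed above."""
    nfd_string = normalize("NFD", string)
    nfkd1_string = normalize("NFKD", nfd_string.casefold())
    return normalize("NFKD", nfkd1_string.casefold())


# Illustrative input: FULLWIDTH LATIN CAPITAL LETTERS A, B, C.
fullwidth = "\uff21\uff22\uff23"
plain = "abc"

# casefold() alone maps the fullwidth capitals to fullwidth small letters,
# which are still different code points from ASCII "abc".
casefold_matches = fullwidth.casefold() == plain.casefold()

# The NFKD passes replace the fullwidth letters with their ASCII equivalents.
compat_matches = caseless_compat(fullwidth) == caseless_compat(plain)

print(casefold_matches, compat_matches)  # False True
```

The function returns a comparison key rather than a boolean, so two strings match when their keys are equal.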
msg151645 - (view)	Author: Jim Jewett (Jim.Jewett) * (Python triager)	Date: 2012-01-19 17:09
Frankly, I do think that sample code is too long, but correctness matters. Perhaps a better solution would be to add either a method or a unicodedata function that does the work; then the extra note could just say:

    Note that most case-insensitive matches should also match compatibility equivalent characters; see unicodedata.compatibility_casefold
msg151665 - (view)	Author: Benjamin Peterson (benjamin.peterson) * (Python committer)	Date: 2012-01-20 01:12
It's a bit unfriendly to launch into discussion of "compatibility caseless matching" when the new reader probably has no idea what "compatibility-equivalence" is.
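For readers unfamiliar with the term, the following minimal sketch contrasts canonical equivalence with compatibility equivalence using `unicodedata.normalize`. The sample characters are assumptions picked for illustration only.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Canonical equivalence: two encodings of the same character.
# NFC composes the two-code-point form into the single code point.
canonical_same = unicodedata.normalize("NFC", decomposed) == composed

# The ligature has no canonical decomposition, so NFC leaves it alone.
nfc_keeps_ligature = unicodedata.normalize("NFC", ligature) == ligature

# Compatibility equivalence is looser: NFKC replaces the ligature
# with the two plain letters it stands for.
nfkc_expands_ligature = unicodedata.normalize("NFKC", ligature) == "fi"

print(canonical_same, nfc_keeps_ligature, nfkc_expands_ligature)  # True True True
```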
msg253662 - (view)	Author: Mark Summerfield (mark) *	Date: 2015-10-29 07:14
I think the str.casefold() docs are fine as far as they go, rightly covering what it _does_ rather than _how_, yet providing a reference for the details. But what they lack is more complete information. For example, I discovered this:

    >>> x = "ﬁles and shuﬄes"
    >>> x
    'ﬁles and shuﬄes'
    >>> x.casefold()
    'files and shuffles'

In view of this I would add one sentence: In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
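The session in this message can be reproduced with explicit escapes, which makes the ligature code points visible even where a font or a text extraction step renders them as plain letters. This is only a sketch; the variable names are illustrative.

```python
# "\ufb01" is LATIN SMALL LIGATURE FI, "\ufb04" is LATIN SMALL LIGATURE FFL.
text = "\ufb01les and shu\ufb04es"

lowered = text.lower()     # the ligatures are already lowercase
folded = text.casefold()   # full case folding expands them

print(lowered == text)          # True: lower() changes nothing here
print(folded)                   # files and shuffles
print(len(text), len(folded))   # 15 18
```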
msg253797 - (view)	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2015-10-31 15:36
> In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

+1 I would have found that sentence to be helpful.
msg327334 - (view)	Author: Marc Richter (Marc Richter)	Date: 2018-10-08 09:33
+1 as well. To be honest, I still do not understand in detail what this function does. Not too long ago (2017), an uppercase variant of the special letter from this function's example (ß) was added to the official German orthography [1]. Does this function's behavior need to change now, or does it remain the expected behavior? I'm still puzzled, and I think the whole function should get a clearer description.

[1]: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E
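One way to see how current Python treats the capital letter mentioned here is to try the case methods on both forms. This is a sketch of observed behavior, which follows the Unicode data shipped with the interpreter, and not a statement about German orthography.

```python
capital_sharp_s = "\u1e9e"   # LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"     # LATIN SMALL LETTER SHARP S

# lower() maps the capital form to the small form.
lower_of_capital = capital_sharp_s.lower()

# upper() of the small form still yields two letters, not the capital form.
upper_of_small = small_sharp_s.upper()

# casefold() sends both forms to the same two-letter string.
folded_capital = capital_sharp_s.casefold()
folded_small = small_sharp_s.casefold()

print(lower_of_capital == small_sharp_s)         # True
print(upper_of_small)                            # SS
print(folded_capital, folded_small)              # ss ss
```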
msg338689 - (view)	Author: Cheryl Sabella (cheryl.sabella) * (Python committer)	Date: 2019-03-23 16:53
Assigning to @Mariatta for the sprints.
msg375842 - (view)	Author: Thorsten (MrSupertash)	Date: 2020-08-24 13:48
The German example in casefolding is plain incorrect.

#Casefolding is similar to lowercasing but more aggressive because it is
#intended to remove all case distinctions in a string. For example, the
#German lowercase letter 'ß' is equivalent to "ss". Since it is already
#lowercase, lower() would do nothing to 'ß'; casefold() converts it to
#"ss".

It is not true that "ß" is equivalent to "ss", and it has not been since an orthography reform in 1996. They are used in distinct cases: "ß" after a diphthong or a long/open vowel, "ss" after a short/closed vowel.

The documentation correctly describes (in this case) how Python handles .casefold() for this letter, although the behavior itself is incorrect. As mentioned before, in 2017 an official uppercase version of "ß" was introduced into German orthography: "ẞ".

The German example should be described in the documentation as current incorrect behavior.

+1 to adding the previously mentioned sentence: In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
msg375844 - (view)	Author: Benjamin Peterson (benjamin.peterson) * (Python committer)	Date: 2020-08-24 13:52
Correctness of casefolding is defined by the Unicode standard, which currently states that "ß" folds to "ss".
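In practice the folding described here matters mostly for caseless comparison. The sketch below uses a hypothetical helper, `equal_ignoring_case`, which is not part of the standard library, to show where `lower()` and `casefold()` disagree.

```python
def equal_ignoring_case(a, b):
    """Hypothetical helper: compare two strings by their casefolded forms."""
    return a.casefold() == b.casefold()


street_mixed = "Stra\u00dfe"   # "Straße"
street_upper = "STRASSE"

# lower() keeps the sharp s, so the lowercased strings differ.
lower_matches = street_mixed.lower() == street_upper.lower()

# casefold() turns the sharp s into "ss", so the folded strings agree.
fold_matches = equal_ignoring_case(street_mixed, street_upper)

print(lower_matches, fold_matches)  # False True
```

Whether treating the two spellings as equal is desirable depends on the application; the helper only reflects what Unicode full case folding defines.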
msg375847 - (view)	Author: Thorsten (MrSupertash)	Date: 2020-08-24 15:01
I see. I found the documents. That's an issue. That usage is incorrect. It is still valid to uppercase "ß" to "SS" since "ẞ" is fairly new as an official German character, but the other way around is not valid.

As such, the current sentence in the documentation also just does not make sense.

>"Since it is already lowercase, lower() would do nothing to 'ß'"

Exactly. Why would it? It is nonsensical to change an already lowercase character with a lowercase function.

Suggest updating to:

"For example, the Unicode standard for German lower case letter 'ß' prescribes full casefolding to 'ss'. Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to 'ss'.
In addition to full lowercasing, this function also expands ligatures, for example, 'ﬁ' becomes 'fi'."
msg375858 - (view)	Author: Jim Jewett (Jim.Jewett) * (Python triager)	Date: 2020-08-24 17:39
Unicode probably won't make the correction, because of backwards compatibility.

I do support the sentence suggested in Thorsten's most recent reply.

Is expanding ligatures the only other normalization it does?

Ideally, we should also mention that it shifts to the canonical case, which is usually (but not always) lowercase. I think Cherokee is one that folds to the upper case.
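The Cherokee remark can be checked with a short sketch. It assumes an interpreter whose Unicode database already includes the Cherokee small letters (added in Unicode 8.0, so Python 3.5 or later); the variable names are illustrative.

```python
small_a = "\uab70"     # CHEROKEE SMALL LETTER A
capital_a = "\u13a0"   # CHEROKEE LETTER A (the uppercase form)

# For this script, case folding goes to the uppercase letter,
# so casefold() and lower() give different results.
folded_small = small_a.casefold()
folded_capital = capital_a.casefold()
lowered_capital = capital_a.lower()

print(folded_small == capital_a)     # True
print(folded_capital == capital_a)   # True
print(lowered_capital == small_a)    # True
```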

History
Date	User	Action	Args
2022-04-11 14:57:25	admin	set	github: 58036
2020-08-24 17:39:41	Jim.Jewett	set	messages: + msg375858
2020-08-24 15:01:42	MrSupertash	set	messages: + msg375847
2020-08-24 13:52:36	benjamin.peterson	set	messages: + msg375844
2020-08-24 13:48:30	MrSupertash	set	nosy: + MrSupertash messages: + msg375842
2019-03-23 16:53:57	cheryl.sabella	set	versions: + Python 3.7, Python 3.8, - Python 3.3 nosy: + Mariatta, cheryl.sabella messages: + msg338689 assignee: docs@python -> Mariatta stage: needs patch
2018-10-08 09:33:46	Marc Richter	set	nosy: + Marc Richter messages: + msg327334
2015-10-31 15:36:13	rhettinger	set	nosy: + rhettinger messages: + msg253797
2015-10-29 07:14:19	mark	set	nosy: + mark messages: + msg253662
2012-01-20 01:12:41	benjamin.peterson	set	messages: + msg151665
2012-01-19 17:09:52	Jim.Jewett	set	messages: + msg151645
2012-01-19 17:06:02	Jim.Jewett	create
