This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2012年06月11日 13:29 by ncoghlan, last changed 2022年04月11日 14:57 by admin.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| issue15045.patch | ezberch, 2012年12月12日 04:00 | patch | review | |
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 5051 | closed | CuriousLearner, 2017年12月30日 16:08 | |
| Messages (6) | |||
|---|---|---|---|
| msg162617 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2012年06月11日 13:29 | |
In working on #13857, I noticed that the current regex based implementation of textwrap.dedent() is limited specifically to ASCII whitespace (tabs and spaces) with Unix line endings (a line containing solely a Windows "\r\n" line ending will be deemed to contain a non-whitespace character, since "\r" isn't recognised by the regex) The new textwrap.indent() function added in #13857 takes a much simpler approach to whitespace handling: its definition of a "line" is exactly that of "text.splitlines(True)", while its definition of a line that does not consist solely of whitespace is "bool(line.strip())" As a simple example of how that can make a difference, consider: >>> "\N{NO-BREAK SPACE}".strip() '' One way to remedy this would be to replace the regex based implementation of textwrap.dedent with a simpler one written in terms of text.splitlines(True) and line.strip(). |
|||
| msg177363 - (view) | Author: Ezra Berch (ezberch) | Date: 2012年12月12日 04:00 | |
Here's a patch to do this, including some tests for the changed behavior. |
|||
| msg177369 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2012年12月12日 10:41 | |
Ezra's patch is working, but a little non-optimal. However, I'm not sure, that it is right to extend a set of whitespace characters. See comments at the top of the file. Also this approach should be slower, especially for trivial case (no changes). |
|||
| msg307863 - (view) | Author: Sanyam Khurana (CuriousLearner) * (Python triager) | Date: 2017年12月08日 19:05 | |
Ezra, can you please convert your patch to a Pull Request on Github? |
|||
| msg324144 - (view) | Author: Sanyam Khurana (CuriousLearner) * (Python triager) | Date: 2018年08月27日 04:04 | |
Hey Nick, Serhiy The patch is up for your review. Any updates on this? |
|||
| msg334204 - (view) | Author: Alyssa Coghlan (ncoghlan) * (Python committer) | Date: 2019年01月22日 10:53 | |
Putting some of my comments here rather than on the PR, as they're design questions related to "Is the current behaviour actually wrong?" and "Even if the current behaviour is deemed technically incorrect, is it worth the risk of changing it after all these years?".
I've also retitled the issue to describe the desired outcome of increased consistency between textwrap.dedent() and textwrap.indent(), without expressing a preference one way or the other.
1. Correctness & consistency
Quoting the textwrap.py comment about ASCII whitespace that Serhiy mentioned:
```
# Hardcode the recognized whitespace characters to the US-ASCII
# whitespace characters. The main reason for doing this is that
# some Unicode spaces (like \u00a0) are non-breaking whitespaces.
_whitespace = '\t\n\x0b\x0c\r '
```
It's not clear whether or not non-breaking whitespace should actually be counted as an empty line, as it wouldn't be a candidate break point for line wrapping if that space appeared between two words (e.g. "first\N{NO-BREAK SPACE}second"), even though str.split() would split on it, and str.strip() would remove it from the beginning or end of a string (and that discrepancy is by design, since it's what "non-breaking" *means*).
So the interesting test cases from that perspective would be strings like:
"""\
4 space indent
\N{NO-BREAK SPACE} 4 spaces, but first is no-break
4 space indent
"""
-> textwrap.dedent() should do nothing
"""\
4 space indent
\N{NO-BREAK SPACE}
Previous line is just a single no-break space
"""
-> textwrap.dedent() should do nothing
"""\
4 space indent
\N{NO-BREAK SPACE}5 spaces, but last is no-break
4 space indent
"""
-> textwrap.dedent() should strip the common 4-space prefix
"""\
4 space indent
\N{NO-BREAK SPACE}
Previous line is indented with 4 spaces
"""
-> textwrap.dedent() should strip the common 4-space prefix
The potential inconsistency I cite above is with the then-new textwrap.indent() which *does* consider all lines consisting solely of whitespace (whether non-breaking or not) to be blank lines, and hence doesn't indent them. This means that the following test string wouldn't round trip correctly through a textwrap.indent/textwrap.dedent pair:
"""\
4 space indent
\N{NO-BREAK SPACE}
Previous line is indented as usual
"""
indent() would skip adding leading whitespace to the second line, which means dedent() would subsequently fail to detect a common leading prefix to be stripped.
However, that can easily be considered a bug in textwrap.indent() - it's the newer function, so it was a design error to make it inconsistent with the textwrap.dedent() precedent.
2. Performance
Since this issue was opened purely on design consistency grounds, it needs to offer really compelling benefits if we're going to risk a change that might make textwrap.dedent() slower.
I don't think we've reached that bar, as with universal newlines as the default text reading behaviour, it's going to be fairly rare for `textwrap.dedent()` to be applied to strings containing `\r` or `\r\n`, and if it is, it's a pretty straightforward prior normalisation step to convert both of those to `\n` via `text.sub('\r\n', '\n').sub('\r', '\n')` (or a comparable regex based on https://stackoverflow.com/questions/1331815/regular-expression-to-match-cross-platform-newline-characters)
So I think the most that can be argued for when it comes to the newline handling in dedent() is a *documentation* change that notes that textwrap.dedent() splits lines strictly on `\n`, and other line endings need to be normalized to `\n` before it will work correctly.
That leaves indent(), and I think the case can plausible be made there that it should be using _whitespace_only_re for its empty line detection (the same as dedent(), instead of using str.strip().
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:31 | admin | set | github: 59250 |
| 2019年01月22日 10:53:49 | ncoghlan | set | messages:
+ msg334204 title: Make textwrap.dedent() consistent with str.splitlines(True) and str.strip() -> Make textwrap.dedent() and textwrap.indent() handle whitespace consistently |
| 2018年08月27日 04:04:59 | CuriousLearner | set | messages: + msg324144 |
| 2017年12月30日 16:08:18 | CuriousLearner | set | pull_requests: + pull_request4932 |
| 2017年12月08日 19:05:36 | CuriousLearner | set | nosy:
+ CuriousLearner messages: + msg307863 |
| 2014年06月22日 18:41:48 | yjchen | set | nosy:
+ robertjli, yjchen |
| 2014年06月20日 15:43:52 | zach.ware | set | versions: + Python 3.5, - Python 3.4 |
| 2012年12月12日 10:41:10 | serhiy.storchaka | set | nosy:
+ serhiy.storchaka messages: + msg177369 stage: needs patch -> patch review |
| 2012年12月12日 04:00:14 | ezberch | set | files:
+ issue15045.patch nosy: + ezberch messages: + msg177363 keywords: + patch |
| 2012年12月06日 01:44:06 | bruno.dupuis | set | nosy:
+ bruno.dupuis |
| 2012年06月11日 13:29:17 | ncoghlan | create | |