Created on 2011年04月29日 18:27 by mindauga, last changed 2022年04月11日 14:57 by admin.
| Files | |
|---|---|
| File name | Uploaded |
| patch_11957 | umi, 2013年07月06日 15:47 |
| re_keyword_only.patch | vstinner, 2014年10月29日 16:09 |
| re_check_flags_type.patch | serhiy.storchaka, 2016年09月25日 21:14 |
| re_deprecate_positional_count.patch | serhiy.storchaka, 2016年09月25日 21:14 |
| Messages (27) | |||
|---|---|---|---|
| msg134806 - (view) | Author: Mindaugas (mindauga) | Date: 2011年04月29日 18:27 | |
re.sub doesn't substitute non-ASCII characters:
Python 2.7.1 (r271:86832, Apr 15 2011, 12:11:58) Arch Linux
>>> import re
>>> a = u'aaa'
>>> print re.search('(\w+)', a, re.U).groups()
(u'aaa',)
>>> print re.sub('(\w+)', 'x', a, re.U)
x
BUT:
>>> a = u'ąąą'
>>> print re.search('(\w+)', a, re.U).groups()
(u'\u0105\u0105\u0105',)
>>> print re.sub('(\w+)', 'x', a, re.U)
ąąą
|
|||
| msg134820 - (view) | Author: Eric V. Smith (eric.smith) * (Python committer) | Date: 2011年04月29日 22:58 | |
The 4th parameter to re.sub() is a count, not flags. |
|||
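A minimal Python 3 sketch of this point (the report itself uses Python 2.7; the sample strings are illustrative, and the warning filter is only there because newer Python versions may warn about a positional count):

```python
import re
import warnings

# re.sub(pattern, repl, string, count=0, flags=0): the 4th positional
# argument is count, so a flag passed there is read as an integer count.
with warnings.catch_warnings():
    # Newer Python versions may emit a DeprecationWarning for a positional count.
    warnings.simplefilter("ignore")
    misused = re.sub("a", "x", "AaAa", re.I)

# The flag only takes effect when it is passed by keyword.
correct = re.sub("a", "x", "AaAa", flags=re.I)

print(int(re.I))  # the integer value that was silently used as count
print(misused)    # case-sensitive substitution, limited by count
print(correct)    # case-insensitive substitution of every match
```

In the misused call the pattern stays case-sensitive and the flag's integer value merely caps the number of replacements, which is why no error is raised.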
| msg134830 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年04月30日 02:23 | |
Since this has already been reported several times (see e.g. #11947), and it's a fairly common mistake, I think we should do something to avoid it. A few possibilities are: 1) add a warning in the doc; 2) make count and flags keyword-only arguments (raising a deprecation warning in 3.3 and actually changing it later); 3) change the regex flags to some object that can be distinguished from ints and raise an error when a flag is passed to count. |
|||
| msg135371 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2011年05月06日 21:41 | |
I like the idea of an internal REflag class with __new__, __or__, and __repr__==__str__. str(re.A|re.I) might print as "REflag: re.ASCII | re.IGNORE". If it is *not* an int subclass, any attempt to use or mix it with an int would raise. I checked, and the doc only promises that flags can be or'ed. An __and__ method might be added if it were thought that people currently use & to check for flags set, though that is not currently promised. |
|||
| msg135386 - (view) | Author: Matthew Barnett (mrabarnett) * (Python triager) | Date: 2011年05月06日 23:32 | |
Something like "<re.Flag ASCII | IGNORE>" may be more Pythonic. |
|||
| msg135391 - (view) | Author: Terry J. Reedy (terry.reedy) * (Python committer) | Date: 2011年05月07日 00:13 | |
Agreed, if we go that route. |
|||
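The behaviour described above (flags that can be or'ed together but not mixed with ints) can be sketched with the standard library's `enum.Flag`, which was added to Python after these messages were written. The class name `REFlag` and its members are hypothetical and only serve the illustration; this is not the design that `re` adopted:

```python
import enum

class REFlag(enum.Flag):
    """Hypothetical non-int flag type; names and values are illustrative only."""
    IGNORECASE = 2
    ASCII = 256

# Or-ing two flags of the same type works.
combined = REFlag.ASCII | REFlag.IGNORECASE

# Mixing a flag with a plain int is rejected.
try:
    REFlag.ASCII | 2
except TypeError:
    mixing_rejected = True
else:
    mixing_rejected = False

# Because the type is not an int subclass, a flag cannot pass for a count.
is_int = isinstance(REFlag.IGNORECASE, int)

print(combined, mixing_rejected, is_int)
```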
| msg136657 - (view) | Author: Éric Araujo (eric.araujo) * (Python committer) | Date: 2011年05月23日 15:01 | |
I’d favor 1) or 2) over 3). Ints are short and very commonly used for flags. |
|||
| msg143520 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2011年09月05日 14:55 | |
See also #12888 for an error in the stdlib caused by this. |
|||
| msg186784 - (view) | Author: Mike Milkin (mmilkin) * | Date: 2013年04月13日 18:27 | |
I like option #2 and was thinking of working on it today; poke me if anyone has a problem with this. |
|||
| msg186825 - (view) | Author: Mike Milkin (mmilkin) * | Date: 2013年04月13日 20:24 | |
There is no sane way to issue a warning without changing the signature, and we don't want to change the signature without issuing a deprecation warning for the function, so sadly option 3 is the only way for this to work. (I'm not going to touch this until enums are merged in.) |
|||
| msg186832 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2013年04月13日 20:32 | |
Can't you use *args and **kwargs and then raise a deprecation warning if count and/or flags are in args? Even if enums are merged in, there might still be issues depending on their implementation. |
|||
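The `*args`/`**kwargs` idea above could look roughly like the following. This is only a sketch: `sub_with_deprecation` is a hypothetical wrapper around `re.sub()`, not code from any patch attached to this issue, and the warning text is made up.

```python
import re
import warnings

def sub_with_deprecation(pattern, repl, string, *args, **kwargs):
    """Hypothetical wrapper: still accepts positional count/flags, but warns."""
    if len(args) > 2:
        raise TypeError("sub() takes from 3 to 5 positional arguments "
                        "but %d were given" % (3 + len(args)))
    if args:
        warnings.warn("passing count and flags as positional arguments "
                      "is deprecated", DeprecationWarning, stacklevel=2)
        # Map the extra positional values onto their keyword names.
        for name, value in zip(("count", "flags"), args):
            if name in kwargs:
                raise TypeError("sub() got multiple values for argument %r" % name)
            kwargs[name] = value
    return re.sub(pattern, repl, string, **kwargs)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    by_position = sub_with_deprecation("a", "x", "aaa", 1)       # warns
    by_keyword = sub_with_deprecation("a", "x", "aaa", count=1)  # silent

print(by_position, by_keyword, len(caught))
```

Existing positional calls keep working under this approach, at the cost of a signature that introspection tools report differently.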
| msg186844 - (view) | Author: Mike Milkin (mmilkin) * | Date: 2013年04月13日 21:00 | |
We could do that, but we would be changing the signature before adding the warning. |
|||
| msg186856 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2013年04月13日 21:37 | |
The change would still be backwards compatible (even though inspect.signature and similar functions might return something different). Note that I'm not saying that's the best option, but it should be doable. |
|||
| msg192416 - (view) | Author: Valentina Mukhamedzhanova (umi) * | Date: 2013年07月06日 11:36 | |
Please see my patch. I have changed flags to be instances of IntEnum and added a check to re.sub, re.subn and re.split. The patch contains some tests. This solution also allowed me to discover several bugs in the standard library, and I am going to create tickets for them shortly. |
|||
| msg230223 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年10月29日 16:00 | |
New changeset 767fd62b59a9 by Victor Stinner in branch 'default': Issue #11957: Explicit parameter name when calling re.split() and re.sub() https://hg.python.org/cpython/rev/767fd62b59a9 |
|||
| msg230224 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014年10月29日 16:09 | |
I suggest making the last two parameters of re.sub(), re.subn() and re.split() keyword-only. It will break applications that pass count and maxsplit as positional arguments, but it's easy to fix these applications if they also want to support Python 3.5.
I checked Python 2.6: the names of the maxsplit and count parameters haven't changed. So it's possible to write code that works on Python 2.6-3.5 if the parameter name is used explicitly:
* re.sub("a", "a", "a", count=1)
* re.subn("a", "a", "a", count=1)
* re.split("a", "a", maxsplit=1)
The flags parameter was added to re.sub(), re.subn() and re.split() functions in Python 2.7:
* https://docs.python.org/2.7/library/re.html#re.sub
* https://docs.python.org/2.7/library/re.html#re.subn
* https://docs.python.org/2.7/library/re.html#re.split
See my attached re_keyword_only.patch:
* sub(), subn(): count and flags become keyword-only parameters
* split(): maxsplit and flags become keyword-only parameters
|
|||
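A keyword-only signature along the lines proposed above can be sketched as follows. `sub_keyword_only` is a hypothetical stand-in written for illustration, not the content of re_keyword_only.patch:

```python
import re

def sub_keyword_only(pattern, repl, string, *, count=0, flags=0):
    """Hypothetical keyword-only variant of re.sub(), for illustration."""
    return re.sub(pattern, repl, string, count=count, flags=flags)

# Passing the flag by name works as intended.
print(sub_keyword_only("A", "B", "ahah", flags=re.I))

# Passing it positionally now fails loudly instead of being read as count.
try:
    sub_keyword_only("A", "B", "ahah", re.I)
except TypeError as exc:
    print("rejected:", exc)
```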
| msg230226 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2014年10月29日 16:15 | |
Confusion between the count/maxsplit and flags parameters is common. Duplicate issues:
* Issue #22760
* Issue #17663
* Issue #15537
* Issue #12875
* Issue #12078
* Issue #11947
See also issue #13385, which proposed an explicit "re.NOFLAGS flag". |
|||
| msg230236 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年10月29日 19:49 | |
Thank you for your patch, Valentina. But it makes flag combinations not pickleable.
>>> import re, pickle
>>> pickle.dumps(re.I|re.S, 3)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <enum 'SubFlag'>: attribute lookup SubFlag on sre_constants failed
>>> pickle.dumps(re.I|re.S, 4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_pickle.PicklingError: Can't pickle <enum 'SubFlag'>: attribute lookup BaseFlags.__or__.<locals>.SubFlag on sre_constants failed
And I'm afraid that creating a new class in the "|" operator can affect performance. |
|||
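For comparison, the same round trip can be checked against the flag type that ships with the interpreter. This is a quick sketch that assumes Python 3.6 or newer, where re flags are enum-based; it says nothing about the patch discussed above:

```python
import pickle
import re

combined = re.I | re.S

# Round-trip the combined flag through the two protocols used above.
restored = [pickle.loads(pickle.dumps(combined, protocol)) for protocol in (3, 4)]

print(all(value == combined for value in restored))
```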
| msg230237 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年10月29日 19:53 | |
As for 767fd62b59a9, I doubt that changing positional arguments to keyword arguments in tests is justified. This can hide a bug. |
|||
| msg230238 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年10月29日 19:57 | |
And again about patch_11957. I'm afraid that testing isinstance(count, sre_constants.BaseFlags) on every re.sub() call will hurt performance too. |
|||
| msg230358 - (view) | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2014年10月31日 17:31 | |
I agree about 767fd62b59a9, there should be tests for args passed both by position and as keyword args. Serhiy, do you think the enum solution is worth pursuing, or is it better to just turn those args to keyword-only (after a proper deprecation process)? |
|||
| msg230375 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年10月31日 18:44 | |
I think that the enum solution is worth pursuing, and that we need a general class which represents the set of specified named flags. I'm working on an implementation of enum.IntFlags. |
|||
| msg277386 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2016年09月25日 17:39 | |
New changeset 216e8b809e4e by Serhiy Storchaka in branch '3.5': Issue #11957: Restored re tests for passing count and maxsplit as positional https://hg.python.org/cpython/rev/216e8b809e4e
New changeset b39b09290718 by Serhiy Storchaka in branch '3.6': Issue #11957: Restored re tests for passing count and maxsplit as positional https://hg.python.org/cpython/rev/b39b09290718
New changeset da2c96cf2ce6 by Serhiy Storchaka in branch 'default': Issue #11957: Restored re tests for passing count and maxsplit as positional https://hg.python.org/cpython/rev/da2c96cf2ce6 |
|||
| msg277398 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2016年09月25日 21:14 | |
Here are two alternative patches. The first patch checks whether the count or maxsplit argument is a re.RegexFlag and raises TypeError if so. This makes misusing flags fail fast. The second patch deprecates passing the count and maxsplit arguments as positional arguments. This forces you to change your code (even if it is valid now) and makes it hard to misuse flags.
Unfortunately both ways slow down calling functions.
$ ./python -m perf timeit -s "import re" -- 're.split(":", ":a:b::c", 2)'
unpatched: Median +- std dev: 2.73 us +- 0.09 us
check_flags_type: Median +- std dev: 3.74 us +- 0.09 us
deprecate_positional_count: Median +- std dev: 10.6 us +- 0.2 us
$ ./python -m perf timeit -s "import re" -- 're.split(":", ":a:b::c", maxsplit=2)'
unpatched: Median +- std dev: 2.78 us +- 0.07 us
check_flags_type: Median +- std dev: 3.75 us +- 0.10 us
deprecate_positional_count: Median +- std dev: 2.86 us +- 0.08 us
|
|||
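The first approach (a type check that fails fast) can be sketched as below. `checked_split` is a hypothetical wrapper written for illustration, not the code in re_check_flags_type.patch, and it assumes Python 3.6 or newer, where `re.RegexFlag` exists:

```python
import re

def checked_split(pattern, string, maxsplit=0, flags=0):
    """Hypothetical wrapper that fails fast when a flag is passed as maxsplit."""
    if isinstance(maxsplit, re.RegexFlag):
        raise TypeError("maxsplit must be an integer, got %r" % (maxsplit,))
    return re.split(pattern, string, maxsplit=maxsplit, flags=flags)

# A plain integer maxsplit behaves exactly like re.split().
print(checked_split(":", ":a:b::c", 2))

# A flag in the maxsplit position is rejected instead of being used as a count.
try:
    checked_split(":", ":a:b::c", re.I)
except TypeError as exc:
    print("rejected:", exc)
```

The extra `isinstance` call runs on every invocation, which is the kind of per-call overhead the timings above measure.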
| msg291655 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2017年04月14日 12:50 | |
My issue #30072 has been marked as a duplicate of this one. Copy of my msg291650:
The re API seems commonly misused. Example passing a re flag to re.sub():
>>> re.sub("A", "B", "ahah", re.I)
'ahah'
No error, no warning, but it doesn't work. Oh, sub has 5 parameters, not 4...
I suggest converting count and flags to keyword-only parameters. To not break the world, especially legit code passing the count parameter as a positional argument, an option is to have a deprecation period if these two parameters are passed as positional arguments.
--
Another option would be to rely on the fact that re flags are now enums instead of raw integers, and so add a basic type check... Is there a risk of applications using re flags serialized by pickle from Python < 3.6 and so getting integers? Maybe the check should only be done if flags are passed as positional arguments... but the implementation of such a check may be overkill for such a simple and performance-critical function, no?
See issue #30067 for a recent bug in the Python stdlib! |
|||
| msg291659 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017年04月14日 13:30 | |
Victor, I borrowed Guido's time machine and wrote patches implementing both your suggestions half a year ago. |
|||
| msg291677 - (view) | Author: Jakub Wilk (jwilk) | Date: 2017年04月14日 18:55 | |
+ raise TypeError("sub() takes from 2 to 4 positional arguments "
+ "but %d were given" % (4 + len(args)))
It's actually 3 to 5 for sub() and subn().
|
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:16 | admin | set | github: 56166 |
| 2017年04月14日 18:55:29 | jwilk | set | messages: + msg291677 |
| 2017年04月14日 18:43:08 | jwilk | set | nosy: + jwilk |
| 2017年04月14日 13:30:36 | serhiy.storchaka | set | messages: + msg291659 |
| 2017年04月14日 12:50:46 | vstinner | set | messages: + msg291655 |
| 2017年04月14日 12:45:06 | serhiy.storchaka | link | issue30072 superseder |
| 2016年09月25日 21:14:44 | serhiy.storchaka | set | files: + re_deprecate_positional_count.patch |
| 2016年09月25日 21:14:29 | serhiy.storchaka | set | files: + re_check_flags_type.patch stage: patch review messages: + msg277398 versions: + Python 3.6, Python 3.7, - Python 3.5 |
| 2016年09月25日 17:39:52 | python-dev | set | messages: + msg277386 |
| 2016年02月13日 00:44:14 | ezio.melotti | link | issue26354 superseder |
| 2015年03月08日 13:45:43 | serhiy.storchaka | set | dependencies: + enum: Add Flags and IntFlags |
| 2014年10月31日 18:44:27 | serhiy.storchaka | set | messages: + msg230375 |
| 2014年10月31日 17:31:51 | ezio.melotti | set | messages: + msg230358 |
| 2014年10月29日 19:57:12 | serhiy.storchaka | set | messages: + msg230238 |
| 2014年10月29日 19:53:12 | serhiy.storchaka | set | messages: + msg230237 |
| 2014年10月29日 19:49:46 | serhiy.storchaka | set | messages: + msg230236 versions: + Python 3.5, - Python 2.7, Python 3.3, Python 3.4 |
| 2014年10月29日 16:16:33 | vstinner | set | nosy: + serhiy.storchaka |
| 2014年10月29日 16:15:36 | vstinner | set | messages: + msg230226 |
| 2014年10月29日 16:15:11 | vstinner | link | issue11947 superseder |
| 2014年10月29日 16:13:16 | vstinner | link | issue15537 superseder |
| 2014年10月29日 16:12:53 | vstinner | link | issue17663 superseder |
| 2014年10月29日 16:09:48 | vstinner | link | issue22760 superseder |
| 2014年10月29日 16:09:21 | vstinner | set | files: + re_keyword_only.patch nosy: + vstinner messages: + msg230224 keywords: + patch |
| 2014年10月29日 16:00:25 | python-dev | set | nosy: + python-dev messages: + msg230223 |
| 2013年07月06日 15:47:09 | umi | set | files: + patch_11957 |
| 2013年07月06日 13:51:13 | umi | set | files: - patch_11957 |
| 2013年07月06日 12:37:46 | umi | set | files: + patch_11957 |
| 2013年07月06日 12:37:30 | umi | set | files: - patch_11957 |
| 2013年07月06日 11:36:01 | umi | set | files: + patch_11957 nosy: + umi messages: + msg192416 |
| 2013年06月13日 00:16:40 | vstinner | set | components: - Unicode |
| 2013年04月16日 10:28:23 | rhettinger | set | assignee: rhettinger -> ezio.melotti |
| 2013年04月13日 21:37:56 | ezio.melotti | set | messages: + msg186856 |
| 2013年04月13日 21:00:59 | mmilkin | set | messages: + msg186844 |
| 2013年04月13日 20:32:46 | ezio.melotti | set | messages: + msg186832 |
| 2013年04月13日 20:24:01 | mmilkin | set | messages: + msg186825 |
| 2013年04月13日 18:27:21 | mmilkin | set | nosy: + mmilkin messages: + msg186784 |
| 2013年04月10日 16:53:10 | ezio.melotti | set | type: enhancement versions: + Python 3.4, - Python 3.1, Python 3.2 |
| 2012年11月10日 05:25:06 | eric.snow | set | nosy: - eric.snow |
| 2011年12月15日 19:08:01 | eric.snow | set | nosy: + eric.snow |
| 2011年09月05日 14:55:41 | ezio.melotti | set | messages: + msg143520 |
| 2011年05月23日 15:01:30 | eric.araujo | set | nosy: + eric.araujo messages: + msg136657 |
| 2011年05月14日 22:30:36 | rhettinger | set | assignee: rhettinger nosy: + rhettinger |
| 2011年05月14日 21:49:39 | ezio.melotti | link | issue12078 superseder |
| 2011年05月07日 00:13:28 | terry.reedy | set | messages: + msg135391 |
| 2011年05月06日 23:32:44 | mrabarnett | set | messages: + msg135386 |
| 2011年05月06日 21:41:18 | terry.reedy | set | nosy: + terry.reedy messages: + msg135371 |
| 2011年04月30日 02:23:35 | ezio.melotti | set | nosy: + ezio.melotti, mrabarnett title: re.sub problem with unicode string -> re.sub confusion between count and flags args messages: + msg134830 versions: + Python 3.1, Python 3.2, Python 3.3 |
| 2011年04月29日 22:58:40 | eric.smith | set | nosy: + eric.smith messages: + msg134820 |
| 2011年04月29日 18:27:10 | mindauga | create | |