This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2014年09月17日 16:09 by serhiy.storchaka, last changed 2022年04月11日 14:58 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| re_named_consts.patch | serhiy.storchaka, 2014年09月17日 16:09 | review | ||
| re_opcode_enums.patch | serhiy.storchaka, 2014年09月19日 08:00 | |||
| Messages (15) | |||
|---|---|---|---|
| msg227008 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年09月17日 16:09 | |
Regular expression parser parses a pattern to a tree, marking nodes by string identifiers. Regular expression compiler converts this three into plain list of integers. Node's identifiers are transformed to sequential integers. Resulting list is not human readable. Proposed patch converts string constants in the sre_constants module to named integer constants. These constants doesn't need converting to integers, because they are already integers, and when printed they looks human-friendly. Now intermediate result of regular expression compiler looks much more readable.
Example.
>>> import re, sre_compile, sre_parse
>>> sre_compile._code(sre_parse.parse('[a-z_][a-z_0-9]+', re.I), re.I)
Before patch:
[17, 4, 0, 2, 2147483647, 16, 7, 27, 97, 122, 19, 95, 0, 29, 16, 1, 2147483647, 16, 11, 10, 0, 67043328, 2147483648, 134217726, 0, 0, 0, 0, 0, 1, 1]
After patch:
[INFO, 4, 0, 2, MAXREPEAT, IN_IGNORE, 7, RANGE, 97, 122, LITERAL, 95, FAILURE, REPEAT_ONE, 16, 1, MAXREPEAT, IN_IGNORE, 11, CHARSET, 0, 67043328, 2147483648, 134217726, 0, 0, 0, 0, FAILURE, SUCCESS, SUCCESS]
This patch also affects debugging output when regular expression is compiled with re.DEBUG (identifiers are uppercased and MAXREPEAT is displayed instead of 2147483647 in repeat statements).
Besides debugging output these changes are invisible for ordinal user. They are needed only for developing and debugging the re module itself. The patch doesn't affect performance and almost not affects memory consumption.
|
|||
| msg227010 - (view) | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2014年09月17日 16:16 | |
Love it! |
|||
| msg227035 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年09月18日 09:58 | |
If there are no objections I'll commit the patch soon. |
|||
| msg227049 - (view) | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2014年09月18日 15:03 | |
Hm. Could you not use the new Enum class? |
|||
| msg227051 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年09月18日 16:14 | |
Answering Guido's question about the Enum class. No, it is not appropriate here. It has too cumbersome repr (<OPCODES.IN_IGNORE: 16> instead of IN_IGNORE). Enum function syntax can't by used because it enumerates values from 1. We need three Enum subclasses for three groups of constants, and fourth class for MAXREPEAT, and fifth base abstract class. To fit the Enum class to our need we need more boilerplate code than to implement minimal needed functionality from scratch. And I'm not sure that this will not create circular dependency. |
|||
| msg227059 - (view) | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2014年09月18日 18:26 | |
I think you are too casual in rejecting a standard approach over a custom clever hack. Making the values enums gives them a standard interface that goes beyond what you implemented, and just the fact that we can say "these are IntEnum instances" specifies a lot more of the semantics than pointing to your class hack. It's easy to make a subclass of IntEnum that overrides __repr__ and __str__. It's also easy enough to override the constructor to make the values start at 0 (though it would be a nice feature if we could add a keyword arg to the EnumMeta.__call__() definition to override the start position). I don't really care that this is more code. I don't see where there would be a circular dependency; the enum module doesn't import the re module. There is one thing that might be less convenient: defining an enum doesn't automatically make the values globals. But wouldn't the code be better if the values weren't globals? Finally. You objected to adding __all__ in the code review. That too, suggests a somewhat casual attitude. This code may be maintained by your grandchildren. Give them something future-proof, please. On Thu, Sep 18, 2014 at 9:14 AM, Serhiy Storchaka <report@bugs.python.org> wrote: > > Serhiy Storchaka added the comment: > > Answering Guido's question about the Enum class. No, it is not appropriate > here. It has too cumbersome repr (<OPCODES.IN_IGNORE: 16> instead of > IN_IGNORE). Enum function syntax can't by used because it enumerates values > from 1. We need three Enum subclasses for three groups of constants, and > fourth class for MAXREPEAT, and fifth base abstract class. To fit the Enum > class to our need we need more boilerplate code than to implement minimal > needed functionality from scratch. And I'm not sure that this will not > create circular dependency. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue22434> > _______________________________________ > |
|||
| msg227083 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年09月19日 08:00 | |
Here is a patch using enums. I still think enums are superfluous here. Advanced the Enum class features (pickling, access by name, type checking) are not needed - these constants don't leaked in outer word. > I don't see where there would be a circular dependency; the enum module > doesn't import the re module. Right now there is no circular dependency. But the enum module imports collections which imports a lot of other modules, some of which import other modules. In future some of indirectly imported module can import the re module. > There is one thing that might be less convenient: defining an enum doesn't > automatically make the values globals. But wouldn't the code be better if > the values weren't globals? I afraid this will make the code of parser and compiler less readable and slower. In any case the sre_constants module itself is a namespace. |
|||
| msg227214 - (view) | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2014年09月21日 12:12 | |
I think I agree with Serhiy. Augmenting the import dependencies from the re module can have side effects. While the alternative (a custom Int subclass) isn't extremely pretty, it does the job. NamedIntConstant should be private, though. |
|||
| msg227218 - (view) | Author: Guido van Rossum (gvanrossum) * (Python committer) | Date: 2014年09月21日 15:48 | |
OK, then if you want to review Serhiy's original patch that would be great. On Sun, Sep 21, 2014 at 5:12 AM, Antoine Pitrou <report@bugs.python.org> wrote: > > Antoine Pitrou added the comment: > > I think I agree with Serhiy. Augmenting the import dependencies from the > re module can have side effects. While the alternative (a custom Int > subclass) isn't extremely pretty, it does the job. > > NamedIntConstant should be private, though. > > ---------- > > _______________________________________ > Python tracker <report@bugs.python.org> > <http://bugs.python.org/issue22434> > _______________________________________ > |
|||
| msg227220 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年09月21日 16:50 | |
> NamedIntConstant should be private, though. Agree. I'll add underscore to NamedIntConstant and makecode before committing (as in enum patch). |
|||
| msg230865 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年11月08日 15:39 | |
Could you please make a review of any patch Antoine? This would help me to debug re engine. It doesn't matter which patch apply, with good chance all this will be changed before 3.5 release and may be not once. |
|||
| msg230878 - (view) | Author: Raymond Hettinger (rhettinger) * (Python committer) | Date: 2014年11月09日 02:52 | |
I reviewed re_named_consts.patch and it looks great (I especially like the removal of superfluous OPCODES dictionary lookups and improved repr for the integer codes). Since the op codes are singletons, you can use identity tests instead of equality checks in sre_parse.py: - if op == "in": + if op == IN: Also, I'll echo the suggestion to make NamedIntConstant private with a leading underscore. Nice work. |
|||
| msg230896 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年11月09日 18:45 | |
Thank you Raymond for the review. > I especially like the removal of superfluous OPCODES dictionary lookups The disadvantage is that now backporting patches to old branches is harder. > Since the op codes are singletons, you can use identity tests instead of > equality checks in sre_parse.py: This is deliberate. There is small change that the opcodes could became an integers in future (after adding introspection to pattern object, or make it copyable, or allowing serialization of compiled codes). "is" instead of "==" can cause subtle bug. Note that "is" is used during parsing, but getwidth() is called from sre_compile.py. |
|||
| msg230941 - (view) | Author: Roundup Robot (python-dev) (Python triager) | Date: 2014年11月10日 08:23 | |
New changeset fc7dbba57869 by Serhiy Storchaka in branch 'default': Issue #22434: Constants in sre_constants are now named constants (enum-like). https://hg.python.org/cpython/rev/fc7dbba57869 |
|||
| msg231043 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014年11月11日 19:23 | |
> Since the op codes are singletons, you can use identity tests instead of > equality checks in sre_parse.py: Please ignore my reply in previous message. Op codes are always tested for identity in sre_compile.py, so I have applied your suggestion in sre_parse.py (30a6c74ad87f). |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:58:08 | admin | set | github: 66624 |
| 2014年11月11日 19:23:50 | serhiy.storchaka | set | messages: + msg231043 |
| 2014年11月10日 08:23:23 | python-dev | set | nosy:
+ python-dev messages: + msg230941 |
| 2014年11月09日 19:42:48 | serhiy.storchaka | set | status: open -> closed resolution: fixed stage: patch review -> resolved |
| 2014年11月09日 18:45:04 | serhiy.storchaka | set | messages: + msg230896 |
| 2014年11月09日 02:54:03 | rhettinger | set | nosy:
+ effbot |
| 2014年11月09日 02:52:26 | rhettinger | set | nosy:
+ rhettinger messages: + msg230878 |
| 2014年11月08日 15:39:15 | serhiy.storchaka | set | messages: + msg230865 |
| 2014年09月21日 16:50:45 | serhiy.storchaka | set | messages: + msg227220 |
| 2014年09月21日 15:48:37 | gvanrossum | set | messages: + msg227218 |
| 2014年09月21日 12:12:36 | pitrou | set | messages: + msg227214 |
| 2014年09月19日 18:52:16 | terry.reedy | set | nosy:
+ terry.reedy |
| 2014年09月19日 08:00:14 | serhiy.storchaka | set | files:
+ re_opcode_enums.patch messages: + msg227083 |
| 2014年09月18日 18:26:11 | gvanrossum | set | messages: + msg227059 |
| 2014年09月18日 16:14:49 | serhiy.storchaka | set | messages: + msg227051 |
| 2014年09月18日 15:03:26 | gvanrossum | set | messages: + msg227049 |
| 2014年09月18日 09:58:40 | serhiy.storchaka | set | assignee: serhiy.storchaka messages: + msg227035 |
| 2014年09月17日 16:16:55 | gvanrossum | set | nosy:
+ gvanrossum messages: + msg227010 |
| 2014年09月17日 16:09:48 | serhiy.storchaka | create | |