Created on 2014-11-07 21:42 by rexdwyer, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Files
| File name | Uploaded | Description |
|---|---|---|
| re_split_zero_width.patch | serhiy.storchaka, 2014-11-08 09:11 | Backward incompatible! |
| Messages (8) | |||
|---|---|---|---|
| msg230831 | Author: Rex Dwyer (rexdwyer) | Date: 2014-11-07 21:42 | |
I would like to split a DNA sequence with a restriction enzyme. A restriction enzyme can be described as, e.g., r'(?<=CA)(?=GCTG)'. I cannot get re.split to split on this pattern as Perl 5 does. |
|||
| msg230832 | Author: Ezio Melotti (ezio.melotti) * (Python committer) | Date: 2014-11-07 21:47 | |
Can you provide a sample DNA sequence (or part of it), the exact code you used, the output you got, and what you expected? |
|||
| msg230833 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-11-07 21:58 | |
>>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
['CAGCTG']
I think the expected output is ['CA', 'GCTG']. |
|||
| msg230834 | Author: Rex Dwyer (rexdwyer) | Date: 2014-11-07 22:08 | |
Sorry if I wasn't clear.
s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
re.split(r'(?<=CA)(?=GCTG)', s)
The expected output is: acgtCA|GCTGaaacccCA|GCTGacgtacgt -> ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
I would also like to be able to split a text on word boundaries:
re.split(r'\b', "the quick, brown fox") -> ['the', ' ', 'quick', ', ', 'brown', ' ', 'fox']
but that doesn't work either, so maybe it's a problem with all zero-width matches. |
|||
| msg230835 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-11-07 22:11 | |
This looks like one of the existing issues about zero-length matches (issue1647489, issue10328). |
|||
| msg230839 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-11-08 09:11 | |
It is possible to change this behavior (see example patch). With this patch:
>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
But unfortunately this is a backward-incompatible change and will likely break existing code (it also breaks tests). Consider the following example: re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it is ['', '', 'a', '', 'b', '', ''].
The third-party regex module [1] has a V1 flag that switches on this incompatible behavior.
>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
I don't know how to solve this issue without introducing such a flag (or adding a special boolean argument to re.split()).
As a workaround, I suggest using the regex module.
[1] https://pypi.python.org/pypi/regex
|
|||
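A workaround that needs no third-party module and does not depend on how re.split() treats empty matches can be sketched with re.finditer(), which does report zero-width matches. The helper name split_at_matches is hypothetical, and this is only a minimal sketch, not what the attached patch does: it ignores capture groups (re.split() would include them in the result) and was only checked against purely zero-width patterns such as the ones in this issue.

```python
import re


def split_at_matches(pattern, string):
    """Split `string` at every match of `pattern`, including empty matches.

    Hypothetical helper for illustration; capture groups are ignored.
    """
    pieces = []
    last = 0  # index where the current piece starts
    for match in re.finditer(pattern, string):
        # Text between the previous match and this one becomes a piece.
        pieces.append(string[last:match.start()])
        last = match.end()
    # Whatever follows the final match is the last piece.
    pieces.append(string[last:])
    return pieces


print(split_at_matches(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'))
# ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
print(split_at_matches(r'\b', "the quick, brown fox"))
# ['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
```

Like the patched re.split(), the word-boundary case yields empty strings at both ends, because \b also matches at the start and end of the text; filter them out if they are unwanted.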
| msg230841 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2014-11-08 09:39 | |
Previous attempts to solve this issue: issue852532, issue988761, issue3262. |
|||
| msg237034 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2015-03-02 08:59 | |
re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see issue22818). In future releases it could be changed to work with zero-width patterns (such as lookaround assertions). |
|||
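For readers on newer interpreters: the re documentation states that since Python 3.7, re.split() also splits on empty matches. The sketch below guards against older versions, where the same call either returns the string unsplit (before 3.5) or raises ValueError (3.5, per the message above). The helper name try_split is hypothetical and only there to keep the example runnable everywhere.

```python
import re
import sys


def try_split(pattern, string):
    """Return re.split(pattern, string), or None if the pattern is rejected."""
    try:
        return re.split(pattern, string)
    except ValueError:
        # Interpreters that reject patterns matching only the empty string.
        return None


dna = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
fragments = try_split(r'(?<=CA)(?=GCTG)', dna)
words = try_split(r'\b', "the quick, brown fox")

if sys.version_info >= (3, 7):
    print(fragments)  # ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
    print(words)      # ['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']
else:
    print("zero-width splitting not supported here:", fragments, words)
```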
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:58:09 | admin | set | github: 67006 |
| 2015-03-02 08:59:16 | serhiy.storchaka | set | status: open -> closed; resolution: wont fix; messages: + msg237034; stage: resolved |
| 2014-11-08 09:39:13 | serhiy.storchaka | set | messages: + msg230841 |
| 2014-11-08 09:11:19 | serhiy.storchaka | set | files: + re_split_zero_width.patch; keywords: + patch; messages: + msg230839 |
| 2014-11-07 22:11:00 | serhiy.storchaka | set | messages: + msg230835 |
| 2014-11-07 22:08:07 | rexdwyer | set | messages: + msg230834 |
| 2014-11-07 21:58:27 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg230833 |
| 2014-11-07 21:47:45 | ezio.melotti | set | messages: + msg230832 |
| 2014-11-07 21:42:01 | rexdwyer | create | |