This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: ^$ won't split on empty line
Type:
Stage:
Components: Regular Expressions
Versions: Python 2.3
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: skip.montanaro
Nosy List: effbot, fdrake, jburgy, mkc, skip.montanaro, tim.peters
Priority: normal
Keywords:

Created on 2003-12-02 11:01 by jburgy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 4471 merged serhiy.storchaka, 2017-11-19 23:36
PR 4678 closed serhiy.storchaka, 2017-12-02 17:32
Messages (9)
msg19230 - Author: Jan Burgy (jburgy) Date: 2003-12-02 11:01
Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
>>> import re
>>> re.compile('^$', re.MULTILINE).split('foo\n\nbar')
['foo\n\nbar']
I expect ['foo\n', '\nbar'], since, according to the 
documentation $ "in MULTILINE mode also matches 
before a newline".
Thanks, Jan
msg19231 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2003-12-02 15:20
Confirmed on Pythons 2.1.3, 2.2.3, 2.3.2, and current CVS.
More generally, split() doesn't appear to split on any empty 
(0-length) match. For example,
>>> pat = re.compile(r'\b')
>>> pat.split('(a b)')
['(a b)']
>>> pat.findall('(a b)') # but the pattern matches 4 places
['', '', '', '']
>>>
That's probably a design constraint, but isn't documented. 
For example, if you split "abc" by the pattern x*, what do you 
expect? The pattern matches (with length 0) at 4 places, 
but I bet most people would be surprised to get
['', 'a', 'b', 'c', '']
back instead of (as they do get)
['abc']
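The zero-length matches described above can be made visible with finditer, which reports where each match starts and ends. What follows is only a minimal sketch; the helper name empty_match_positions is hypothetical and is not part of the re module.

```python
import re

def empty_match_positions(pattern, text, flags=0):
    """Return the start offsets of all zero-length matches of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)
            if m.start() == m.end()]

# The two patterns from this message, plus the one from the original report.
word_boundaries = empty_match_positions(r'\b', '(a b)')
x_star = empty_match_positions(r'x*', 'abc')
blank_line = empty_match_positions(r'^$', 'foo\n\nbar', re.MULTILINE)
```

Each of the first two patterns matches at four offsets, and '^$' in MULTILINE mode matches once, between the two newlines. So the patterns do match; the question in this report is only whether split() should act on such matches.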
msg19232 - Author: Fredrik Lundh (effbot) * (Python committer) Date: 2003-12-11 13:42
Split never splits on empty substrings; see Tim's answer for a 
brief discussion.
Fred, can you perhaps add something to the documentation?
msg19233 - Author: Mike Coleman (mkc) Date: 2004-01-01 05:28
Hi, I was going to file this bug just now myself, as this
seems like a really useful feature. For example, I've
several times wanted to split on '^' or '^(?=S)' (to split
up a data file into paragraphs that start with an initial
S). Instead I have to do something like '\n(?=S)', which is
rather more hideous.
To answer tim_one's challenge, yes, I *do* expect splitting
by 'x*' to break a string into letters, now that I've
thought about it. To not do so is a bizarre and surprising
behavior, IMO. (Patient: Doctor, when I split on this
nonsense pattern I get nonsense! Doctor: Then don't do that.)
The fix should be near this line in _sre.c, I think.
 if (state.start == state.ptr) {
I could work on a patch if you'll take it...
Mike
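The '\n(?=S)' workaround mentioned in this message can be sketched as below. The function name split_before_marker and the sample text are made up for illustration. Because the pattern consumes the newline, the piece before each split point loses its trailing '\n', which is part of why the workaround is less pleasant than splitting on '^(?=S)' would be.

```python
import re

def split_before_marker(text, marker='S'):
    """Split text before each line that starts with marker.

    The pattern matches the newline that precedes the marker, so that
    newline is dropped from the preceding piece.
    """
    pattern = r'\n(?=' + re.escape(marker) + ')'
    return re.split(pattern, text)

# Hypothetical sample: two "paragraphs", each starting with an initial S.
sample = 'S one\nmore\nS two\n'
paragraphs = split_before_marker(sample)
```

Here paragraphs is ['S one\nmore', 'S two\n']: the first piece no longer ends with a newline, while the last one still does.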
msg19234 - Author: Jan Burgy (jburgy) Date: 2004-01-14 11:07
Since I really needed the functionality described above, I 
came up with a broke-around. It's a sufficient replacement, 
maybe it belongs in some FAQ:
>>> import re
>>> re.sub('(?im)^$', '\f', 'foo\n\nbar').split('\f')
['foo\n', '\nbar']
Another "magic" byte could replace '\f'...
Regards, Jan
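A variant of this workaround that avoids choosing a "magic" byte is to walk the matches with finditer and slice the string at each one. This is only a sketch under that assumption; split_at_matches is a hypothetical helper, not a function of the re module.

```python
import re

def split_at_matches(pattern, text, flags=0):
    """Split text at every match of pattern, including zero-length matches."""
    pieces = []
    last = 0
    for m in re.finditer(pattern, text, flags):
        pieces.append(text[last:m.start()])  # text between the previous match and this one
        last = m.end()
    pieces.append(text[last:])  # remainder after the final match
    return pieces

by_helper = split_at_matches(r'^$', 'foo\n\nbar', re.MULTILINE)
by_sentinel = re.sub('(?im)^$', '\f', 'foo\n\nbar').split('\f')
```

Both by_helper and by_sentinel are ['foo\n', '\nbar']. Unlike the sentinel version, the helper cannot be confused by input that already contains the sentinel character.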
msg19235 - Author: Mike Coleman (mkc) Date: 2004-07-11 03:32
I made a patch that addresses this (#988761).
msg55563 - Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-09-01 17:42
Doc note checked in as r57878. Can we conclude based upon Tim's
and Fredrik's comments that this behavior is to be expected and
won't change? If so, I'll close this item.
msg55625 - Author: Mike Coleman (mkc) Date: 2007-09-03 21:22
Well, I think we can conclude that it's expected by *them*. :-) I
still find it surprising, and it somewhat lessens the utility of
re.split for my use cases. (I think re.finditer may also suffer from
the same problem, but I don't recall.)
If you look at the comments attached to the patch for this bug, it
looks like akuchling and rhettinger more or less saw this as being a bug 
worth fixing, though there were questions about exactly what the
correct fix should be.
http://bugs.python.org/issue988761
One comment about your doc fix: You highlight a fairly useless
zero-character match (e.g., "x*") to demonstrate the behavior, which
might leave the user scratching his head. (I think this case was
originally mentioned as a corner case, not one that would be useful.)
It'd be nice to highlight a more useful case like '^(?=S)' or perhaps,
a little more generically, something like '^(?=HEADER)' or
'^(?=BEGIN)', which is a usage that tripped me up in the first place.
Thanks for working on this!
msg65475 - Author: Mike Coleman (mkc) Date: 2008-04-14 18:48
I'd feel better about this bug being 'wont fix'ed if I had a sense that
several people considered the patch and thought that it sucked. At the
moment, it seems more like it just fell off of the end without ever
being seriously contemplated. :-(
History
Date User Action Args
2022-04-11 14:56:01 admin set github: 39646
2017-12-02 17:32:37 serhiy.storchaka set pull_requests: + pull_request4588
2017-11-19 23:36:58 serhiy.storchaka set pull_requests: + pull_request4405
2008-04-14 18:48:27 mkc set messages: + msg65475
2008-04-13 03:29:45 skip.montanaro set status: pending -> closed
2007-09-03 21:22:29 mkc set messages: + msg55625
2007-09-01 17:42:19 skip.montanaro set status: open -> pending, assignee: fdrake -> skip.montanaro, resolution: postponed -> wont fix, messages: + msg55563, nosy: + skip.montanaro
2003-12-02 11:01:38 jburgy create
