Regex to parse semicolon-delimited fields is too slow

Question 1

I have a file with just 3500 lines like these:

filecontent= "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"

Then I want to grab every line from the filecontent that matches a certain string (with Python 2.7):

import re

this_item = "IBMlalala123"
matchingitems = re.findall(".*?;.*?;.*?;.*?;.*?;.*?;.*?" + this_item, filecontent)

It needs 17 seconds for each findall. I need to search 4000 times in these 3500 lines. It takes forever. Any idea how to speed it up?

Question 2

In Python it's not possible to avoid backtracking in regexps. If you can't fix the problem by modifying your regexp, then try to use the non-regexp str.split on the large string, and run the regexp on the small individual strings.
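A minimal sketch of that idea, assuming the records are newline-separated and the id list sits in the eighth field as in the sample line. The helper name find_lines and the second sample record are made up for illustration:

```python
import re

def find_lines(filecontent, this_item):
    """Return every line whose 8th semicolon-separated field contains this_item."""
    # Assumption: one record per line, id list in the 8th field (index 7).
    pattern = re.compile(re.escape(this_item))
    matches = []
    for line in filecontent.split("\n"):       # cheap, non-regex split
        fields = line.split(";")
        # The regex only ever sees one short field, so backtracking stays small.
        if len(fields) > 7 and pattern.search(fields[7]):
            matches.append(line)
    return matches

sample = "\n".join([
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234",
    "99X001;Other;t;ACME;ACME thing;1 thing;;ACMEthing1",
])
print(find_lines(sample, "IBMlalala123"))
```

For a plain literal like this one, the regex on the small string could just as well be a substring test; the regex is kept here only to mirror the suggestion.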

Question 3

As a general rule, try to avoid non-greedy matches (i.e. the ? in .*?) if you can. Regexes are easier to match if they can simply match whatever they want, without having to find the minimum match that works. In some situations finding a "greedy pattern" becomes really complex, in which case it depends on the speed you need as a requirement and the readability you want to achieve.

Question 4

There are several good answers below, but I'm curious what the impact of compiling the regular expression would be.
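One way to find out would be a quick timeit comparison. This is a rough sketch with made-up sample data, and it uses a delimiter-based pattern rather than the original one so that it finishes quickly. As far as I know the re module caches recently used patterns, so I would expect the gap to be small next to the cost of the backtracking itself:

```python
import re
import timeit

line = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
filecontent = "\n".join([line] * 100)    # hypothetical stand-in for the real file
pattern_text = r"(?:[^;\n]*;){7}[^;\n]*IBMlalala123"
compiled = re.compile(pattern_text)

def uncompiled():
    # Module-level function: looks the pattern up (or compiles it) on each call.
    return re.findall(pattern_text, filecontent)

def precompiled():
    # Pattern object compiled once, reused on each call.
    return compiled.findall(filecontent)

if __name__ == "__main__":
    print("re.findall:      ", timeit.timeit(uncompiled, number=200))
    print("compiled.findall:", timeit.timeit(precompiled, number=200))
```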

Question 5

Looks like you've run into the O(n^2) "edge case" for backtracking regex engines: swtch.com/~rsc/regexp/regexp1.html

Question 6

@kojiro: Not much. Using another regexp engine helps a lot. Tcl matches that regexp in 24.73725 microseconds on my machine.

Question 7

.*?;.*? will cause catastrophic backtracking.

To resolve the performance issues, replace every .*?; with [^;]*;. That should be much faster.
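For illustration, here is roughly what the rewritten call could look like. It is a sketch under a few assumptions that are not in the question: records are separated by newlines, the id list is the eighth field, and the whole line should be returned. Note that [^;] also matches newlines, so the sketch uses [^;\n] to keep a match from running across lines. The second and third sample records are invented.

```python
import re

filecontent = "\n".join([
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234",
    "IBMlalala123;wrong field;t;X;Y;1 thing;;OTHER999",
    "99X001;Other;t;ACME;ACME thing;1 thing;;ACMEthing1",
])
this_item = "IBMlalala123"

# One delimited field: anything except ';' or a newline, then the ';'.
field = r"[^;\n]*;"
pattern = re.compile(
    # Seven complete fields, then the item somewhere in the eighth,
    # then the rest of the line so the full line is returned.
    "^" + field * 7 + r"[^;\n]*" + re.escape(this_item) + ".*$",
    re.MULTILINE,
)
matchingitems = pattern.findall(filecontent)
print(matchingitems)
```

Because each field can no longer swallow a semicolon, the engine has far fewer ways to retry a failed match, which is where the speedup comes from.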

Question 8

Thanks a lot. I didn't get at first that I have to replace each old element with your new element. It worked when I did it like that :D

Question 9

Also, one other thing you should do here: rather than repeating it, as in [^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*, shrink it down to something more concise, for instance [^;]*(?:;[^;]*){6}.
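A quick way to convince yourself that the two spellings are interchangeable is to run both over the same data. A small sketch; the second sample line is invented, and [^;\n] is used instead of [^;] on the assumption that the input is multi-line:

```python
import re

text = "\n".join([
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234",
    "a;b;c",
])

# Seven fields separated by six semicolons, written out in full...
long_form = re.compile(r"[^;\n]*;[^;\n]*;[^;\n]*;[^;\n]*;[^;\n]*;[^;\n]*;[^;\n]*")
# ...and the same thing with a counted, non-capturing group.
short_form = re.compile(r"[^;\n]*(?:;[^;\n]*){6}")

same = long_form.findall(text) == short_form.findall(text)
print(same)
```

The group is non-capturing on purpose: with a capturing group, findall would return the group contents instead of the whole match.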

Question 10

Another option (a fairly big shift) would be to use a regex engine other than the standard library's, one that doesn't backtrack.

Question 11

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski

A few comments:

Regular expressions might not be the right tool for this.
.*?;.*?;.*?;.*?;.*?;.*?;.*? is potentially very slow and might not do what you want: because . also matches ;, it can run across more semicolons than you intended. [^;]*; would most probably do what you want.

Question 12

The actual expression would be: .*?;.*?;.*?;.*?;.*?;.*?;.*?IBMlalala123. Mind being a bit more explicit? I tried some variations to replace my version with yours, but failed... (It should return the whole line; [^;]*;IBMlalala123 just returns the id string.)

Question 13

Use split, like so:

>>> filecontent = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
>>> items = filecontent.split(";")
>>> items
['13P397', 'Fotostuff', 't', 'IBM', 'IBM lalala 123|IBM lalala 1234', '28.000 things', '', 'IBMlalala123|IBMlalala1234']
>>>

I'm a bit unsure as to what you wanted to do in the last step, but perhaps something like this?

>>> [(i, e) for i,e in enumerate(items) if 'IBMlalala123' in e]
[(7, 'IBMlalala123|IBMlalala1234')]
>>>

UPDATE: If I get your requirements right on the second attempt: To find all lines in file having 'IBMlalala123' as any one of the semicolon-separated fields, do the following:

>>> with open('big.file', 'r') as f:
...     # strip the trailing newline so the last field compares cleanly
...     matching_lines = [line for line in f if 'IBMlalala123' in line.rstrip('\n').split(';')]
...
>>>
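One caveat with the sample data: the last field there is 'IBMlalala123|IBMlalala1234', so an exact comparison against whole fields would not find 'IBMlalala123'. If the pipe really separates alternative ids (an assumption on my part, the question doesn't say), a second split handles that. The function name and the in-memory sample below are just for illustration:

```python
def lines_with_id(lines, wanted):
    """Yield lines where any ';'-field, split again on '|', equals wanted."""
    for line in lines:
        fields = line.rstrip("\n").split(";")
        # Assumption: '|' separates alternative ids inside a field.
        if any(wanted in field.split("|") for field in fields):
            yield line

sample_lines = [
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234\n",
    "99X001;Other;t;ACME;ACME thing;1 thing;;IBMlalala1234\n",
]
print(list(lines_with_id(sample_lines, "IBMlalala123")))
```

Since it takes any iterable of lines, the same function can be handed an open file object directly.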

Question 14

+1: split is usually much faster than regex for these kinds of cases, especially if you need to use the captured field values afterward.

Question 15

Yup, +1 for split. Regex doesn't appear to be the best tool for this job.

Question 16

In my case it was about getting every full line (from some thousand lines) that has a certain string in it. Your solution gives back a part of the line, and would need some enhancements to work with some thousand files. I imagine you would suggest splitting by '\n' and then checking each line with 'if string in line' and putting that into a list then? Don't know if this would be faster.

Question 17

@Mike: OK, but in your example I don't see any reference to newlines. Did you mean that a semicolon should signify a newline? Anyway, there is no "fast" way of splitting lines. The OS does not keep track of where newlines are stored, so scanning for newline chars is the way any row-reading lib works AFAIK. But you can of course save a lot of memory by reading line by line.
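To make the memory point concrete: a file object can be iterated directly, which hands back one line at a time instead of loading everything up front. A minimal sketch; the function name is invented and the temporary file only stands in for the real data file:

```python
import os
import tempfile

def matching_lines(path, needle):
    """Return the lines of the file at path that contain needle."""
    found = []
    with open(path, "r") as f:
        for line in f:                # one line at a time, not the whole file
            if needle in line:        # plain substring test, no regex
                found.append(line.rstrip("\n"))
    return found

# Build a tiny throwaway file to demonstrate.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("13P397;Fotostuff;t;IBM;x;28.000 things;;IBMlalala123|IBMlalala1234\n")
    tmp.write("99X001;Other;t;ACME;y;1 thing;;ACMEthing1\n")
    demo_path = tmp.name

result = matching_lines(demo_path, "IBMlalala123")
os.remove(demo_path)
print(result)
```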

Question 18

It wasn't in the example, but in the text before and after it ;) ...file with just 3500 lines... ...want to grab every line from the filecontent that matches a certain string...

Question 19

Some thoughts:

Do you need a regex? You want a line that contains the string so why not use 'in'?

If you are using the regex to validate the line format, you can do that after the less expensive 'in' finds a candidate line reducing the number of times the regex is used.

If you do need a regex then what about replacing '.*?;' with '[^;]*;' ?
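The first two points could be combined along these lines. It is only a sketch: the eight-field layout is read off the sample line in the question, and the names and the extra sample lines are invented for the example.

```python
import re

# Assumed layout: exactly eight ';'-separated fields per line.
LINE_FORMAT = re.compile(r"[^;\n]*(?:;[^;\n]*){7}$")

def find_candidates(lines, needle):
    """Cheap 'in' test first; the regex only validates lines that pass it."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        # 'and' short-circuits, so the regex never runs on non-candidates.
        if needle in line and LINE_FORMAT.match(line):
            hits.append(line)
    return hits

lines = [
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234",
    "IBMlalala123 appears here but the line is malformed",
    "99X001;Other;t;ACME;ACME thing;1 thing;;ACMEthing1",
]
print(find_candidates(lines, "IBMlalala123"))
```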

Accepted answer (Question 7): jessehouwing, answered Oct 9, 2013 at 7:53, edited Oct 12, 2021 at 20:34.

Comments on the accepted answer:
Question 8: Mike, Oct 9, 2013 at 8:35
Question 9: AJMansfield, Oct 9, 2013 at 14:11
Question 10: Zachary Vance, Oct 12, 2021 at 21:38
