I have a file with just 3500 lines like these:
filecontent= "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
Then I want to grab every line from filecontent
that matches a certain string (with Python 2.7):
this_item= "IBMlalala123"
matchingitems = re.findall(".*?;.*?;.*?;.*?;.*?;.*?;.*?"+this_item,filecontent)
Each findall takes 17 seconds. I need to search 4000 times in these 3500 lines, so it takes forever. Any idea how to speed it up?
4 Answers
.*?;.*? will cause catastrophic backtracking.
To resolve the performance issues, replace each .*?; with [^;]*; which should be much faster.
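A minimal sketch of that replacement, assuming filecontent holds all lines joined by newlines and that the search string sits somewhere in the eighth field, as in the sample line. The second sample line and the name find_lines are made up for illustration. Because [^;] also matches a newline, the class below excludes \n so that a match cannot run across lines.

```python
import re

# Two illustrative lines; only the first one comes from the question.
filecontent = (
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234\n"
    "99X000;Other;t;HP;HP foo 1;5 things;;HPfoo1\n"
)

def find_lines(this_item, text):
    # Seven semicolon-terminated fields, then the item anywhere in the
    # eighth field, then the rest of the line.
    # [^;\n] cannot cross a field or line boundary, which keeps
    # backtracking confined to a single field.
    pattern = (
        r"^(?:[^;\n]*;){7}[^;\n]*"
        + re.escape(this_item)  # treat the search string literally
        + r".*$"
    )
    return re.findall(pattern, text, flags=re.MULTILINE)

matchingitems = find_lines("IBMlalala123", filecontent)
print(matchingitems)
```

The group is non-capturing on purpose: with a capturing group, re.findall would return the group's text instead of the whole line.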
Thanks a lot. At first I didn't realize that I had to replace each old element with your new element. It worked once I did it like that :D – Mike, Oct 9, 2013 at 8:35
Also, one other thing you should do here: rather than repeating it, as [^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]* you should shrink it down to something more concise, for instance [^;]*(?:;[^;]*){6} – AJMansfield, Oct 9, 2013 at 14:11
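A small sketch to illustrate that the written-out and the compact spelling behave the same on the sample line from the question. The variable names are made up for illustration.

```python
import re

line = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"

# Seven fields separated by six semicolons, written out in full ...
long_form = r"[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*;[^;]*"
# ... and the same thing with a counted, non-capturing group.
short_form = r"[^;]*(?:;[^;]*){6}"

long_match = re.match(long_form, line).group(0)
short_match = re.match(short_form, line).group(0)

# Both spellings consume the same prefix of the line.
print(long_match == short_match)
print(short_match)
```

The (?:...) form matters when the pattern is passed to re.findall, which returns the contents of capturing groups rather than the full match when any are present.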
Another option (a fairly big shift) would be to use a non-backtracking regex engine instead of the standard library one. – Zachary Vance, Oct 12, 2021 at 21:38
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
A few comments:
Regular expressions might not be the right tool for this.
.*?;.*?;.*?;.*?;.*?;.*?;.*? is potentially very slow and might not do what you want it to do (it could match many more ; than you want).
[^;]*; would most probably do what you want.
The actual expression would be: .*?;.*?;.*?;.*?;.*?;.*?;.*?IBMlalala123 Mind being a bit more explicit? I tried some variations to replace my version with yours, but failed... (it should return the whole line; [^;]*;IBMlalala123 just returns the id string) – Mike, Oct 9, 2013 at 8:14
Use split, like so:
>>> filecontent = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
>>> items = filecontent.split(";")
>>> items
['13P397', 'Fotostuff', 't', 'IBM', 'IBM lalala 123|IBM lalala 1234', '28.000 things', '', 'IBMlalala123|IBMlalala1234']
>>>
I'm a bit unsure as to what you wanted to do in the last step, but perhaps something like this?
>>> [(i, e) for i,e in enumerate(items) if 'IBMlalala123' in e]
[(7, 'IBMlalala123|IBMlalala1234')]
>>>
UPDATE: If I get your requirements right on the second attempt: to find all lines in the file that have 'IBMlalala123' as any one of the semicolon-separated fields, do the following:
>>> with open('big.file', 'r') as f:
...     matching_lines = [line for line in f if 'IBMlalala123' in line.rstrip('\n').split(";")]
...
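If the content is already in memory as one string, as in the question, the same idea could look like the sketch below. It assumes the search string may appear anywhere inside the eighth field (in the sample that field is IBMlalala123|IBMlalala1234), so it uses a substring test on that field rather than comparing whole fields. The names split_rows and lines_with_item, the column default, and the second sample line are hypothetical.

```python
# Two illustrative lines; only the first one comes from the question.
filecontent = (
    "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234\n"
    "99X000;Other;t;HP;HP foo 1;5 things;;HPfoo1\n"
)

def split_rows(text):
    # Split every line once up front so repeated searches reuse the work.
    # splitlines() drops the line endings, so the last field has no "\n".
    return [(line, line.split(";")) for line in text.splitlines()]

def lines_with_item(this_item, rows, column=7):
    # Keep the full line whenever the chosen field contains the item.
    return [
        line
        for line, fields in rows
        if len(fields) > column and this_item in fields[column]
    ]

rows = split_rows(filecontent)
print(lines_with_item("IBMlalala123", rows))
```

Splitting once and searching many times fits the question's workload of several thousand searches over the same few thousand lines.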
+1: split is usually much faster than regex for this kind of case, especially if you need to use the captured field values afterward. – kriss, Oct 9, 2013 at 16:17
Yup, +1 for split. Regex doesn't appear to be the best tool for this job. – Josh Anderson, Oct 9, 2013 at 18:36
In my case it was about getting every full line (from some thousand lines) that has a certain string in it. Your solution gives back a part of the line, and would need some enhancements to work with some thousand files. I imagine you would suggest splitting by '\n', then checking each line with 'if string in line' and putting the matches into a list? I don't know if this would be faster. – Mike, Nov 13, 2013 at 13:44
@Mike: OK, but in your example I don't see any reference to newlines. Did you mean that a semicolon should signify a newline? Anyway, there is no "fast" way of splitting lines. The OS does not keep track of where newlines are stored, so scanning for newline chars is the way any row-reading lib works AFAIK. But you can of course save a lot of mem by reading line by line. – Alexander Torstling, Nov 13, 2013 at 14:46
It wasn't in the example, but in the text before and after it ;) ...file with just 3500 lines... ...want to grab every line from the filecontent that matches a certain string... – Mike, Nov 14, 2013 at 15:17
Some thoughts:
Do you need a regex? You want a line that contains the string so why not use 'in'?
If you are using the regex to validate the line format, you can do that after the less expensive 'in' finds a candidate line, reducing the number of times the regex is used.
If you do need a regex, then what about replacing '.*?;' with '[^;]*;'?
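The second thought could be sketched like this: a plain substring test picks candidate lines, and a regex, compiled once, runs only on those. The format check used here (exactly eight semicolon-separated fields) is an assumption for illustration, as are the names and every sample line except the first.

```python
import re

sample_line = "13P397;Fotostuff;t;IBM;IBM lalala 123|IBM lalala 1234;28.000 things;;IBMlalala123|IBMlalala1234"
filecontent = "\n".join([
    sample_line,
    "99X000;Other;t;HP;HP foo 1;5 things;;HPfoo1",
    "broken line mentioning IBMlalala123",
])

# Assumed line format: exactly eight semicolon-separated fields.
LINE_FORMAT = re.compile(r"[^;]*(?:;[^;]*){7}$")

def matching_lines(this_item, text):
    result = []
    for line in text.splitlines():
        if this_item not in line:      # cheap test, rejects most lines
            continue
        if LINE_FORMAT.match(line):    # costlier check, candidates only
            result.append(line)
    return result

print(matching_lines("IBMlalala123", filecontent))
```

If no format validation is needed, the regex can be dropped entirely and the 'in' test alone decides.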
[...] str.split on the large string, and run the regexp on the small individual strings.
[...] in .*?). The regexes are easier to match if they can simply match whatever they want, without having to find the minimum match that works. In some situations finding a "greedy pattern" becomes really complex, in which case it depends on the speed you need as a requirement and the readability you want to achieve.