I need to parse lines like the following from a log file with 8000 lines, and parsing takes about 5 minutes:
text="2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> with args {} and content {u'inv_initiators': {u'path': u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
text+="2016-10-06 20:58:04,025 - RestLogger - INFO - rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> with args {}\n"
re.findall('([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?\n', text)
Since I would also like to grab lines where the text "and content" is optional, I am adding ?: and ? so that a single regex handles both kinds of line, but the result is really slow.
Any advice on how to make it faster?
- Regex engines can vary. Can you clarify what language/engine you're using? – forsvarir, Jan 23, 2017 at 18:36
- What do you intend to do with the results? I suggest that you show more code so that we understand the context. – 200_success, Jan 23, 2017 at 18:45
1 Answer
Seems like instead of re.findall, you should be using re.match on each line of the input individually. Something like this:
for line in text.splitlines():
    # match against the current line, not the whole text
    m = re.match(r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?', line)
    # and then do something with m (it is None for lines that don't match)
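The main problem with your original regex is that the first instance of .* could reasonably match a whole lot of the string, and the engine then has to backtrack through all of it. If you're not going to split the lines apart first, then at least you'd want to change .* to [^\n]* and so on throughout, so that no wildcard can run past the end of a line. (In Python, . already stops at a newline unless re.DOTALL is set, so this change matters most when that flag is in use.)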
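As a rough sketch of that second option, here is the question's pattern with every . replaced by [^\n], run through re.findall on the two sample lines from the question. The names text, PATTERN and rows are only for illustration, and the braces are escaped for readability. In Python without re.DOTALL this should behave the same as the original pattern; it only makes the line boundary explicit.

```python
import re

# The two sample lines from the question; the second has no "and content" part.
text = (
    "2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - "
    "rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> "
    "with args {} and content {u'inv_initiators': {u'path': "
    "u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> "
    "with args {}\n"
)

# Same structure as the question's regex, but every wildcard is [^\n],
# so no part of the pattern can consume a newline.
PATTERN = (
    r"([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) [^\n]*"
    r"rest_server::log_request:\d+ - REST call: "
    r"<(\w+) ([^\n]*) [^\n]+?> with args (\{[^\n]*?\})"
    r"(?: and content (\{[^\n]*\}))?[^\n]*\n"
)

# One tuple per matching line; a missing "and content" group comes back as ''.
rows = re.findall(PATTERN, text)
for row in rows:
    print(row)
```

With re.findall, the optional group shows up as an empty string for lines that have no "and content" part, so the caller can tell the two kinds of line apart.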
As a micro-optimization, you may find that re.compile-ing your regex first produces faster results; and if you're not compiling it ahead of time, I wouldn't be surprised if
([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3})
is outperformed by a simple
(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)
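Putting the pieces together, a minimal sketch might compile the pattern once, use the plain \d\d date form, and match line by line. The names LOG_LINE, parse_log and sample are made up for this example, and no timings are claimed; measure on your own log. One side effect worth noting, assuming the timestamps are year-month-day: the bracketed pattern has [01]\d in the day position, so as written it would not match days 20 to 31, while the plain form accepts any two digits.

```python
import re

# Compiled once and reused for every line. The date part is the plain
# \d\d\d\d-\d\d-\d\d form; the rest follows the question's regex
# (braces escaped for readability, trailing .*? dropped since match()
# does not need to reach the end of the line).
LOG_LINE = re.compile(
    r"(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d) .*"
    r"rest_server::log_request:\d+ - REST call: "
    r"<(\w+) (.*) .+?> with args (\{.*?\})"
    r"(?: and content (\{.*\}))?"
)


def parse_log(text):
    """Return one tuple of captured groups per matching line (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m:  # lines that are not REST calls are simply skipped
            rows.append(m.groups())
    return rows


# The two sample lines from the question.
sample = (
    "2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - "
    "rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> "
    "with args {} and content {u'inv_initiators': {u'path': "
    "u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> "
    "with args {}\n"
)

for row in parse_log(sample):
    print(row)
```

Unlike re.findall, m.groups() reports a missing "and content" part as None rather than an empty string.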
Basically, don't try to be clever and you'll do fine.