Slow RegEx to parse from log file

Question 1

I need to parse lines like the following from a log file with 8000 lines, and parsing takes about 5 minutes:

text="2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> with args {} and content {u'inv_initiators': {u'path': u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
text+="2016-10-06 20:58:04,025 - RestLogger - INFO - rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> with args {}\n"
re.findall('([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?\n', text)

Since I would like to grab lines with the optional text "and content" as well, I wrap that part in (?: ... )? so that a single regex handles both kinds of line, but the result is really slow.

Any advice on how to make it faster?

Question 2

Regex engines can vary. Can you clarify which language/engine you're using?

Question 3

What do you intend to do with the results? I suggest that you show more code so that we understand the context.

Question 4

Seems like instead of re.findall, you should be using re.match on each line of the input individually. Something like this:

for line in text.splitlines():
    m = re.match(r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?', line)
    # and then do something with m (m is None if the line does not match)
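For reference, here is a minimal runnable sketch of that per-line loop, using the two sample lines from the question. The names LINE_RE and parse_lines are hypothetical, and the trailing `.*?\n` of the original pattern is dropped on the assumption that nothing after the optional content group needs to be captured.

```python
import re

# The pattern from the question, compiled once and reused for every line.
LINE_RE = re.compile(
    r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) '
    r'.*rest_server::log_request:\d+ - REST call: '
    r'<(\w+) (.*) .+?> with args ({.*?})'
    r'(?: and content ({.*}))?'
)


def parse_lines(text):
    """Return one tuple of captured groups per matching line (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)  # match against the single line, not the whole text
        if m is not None:
            rows.append(m.groups())
    return rows


# The two sample lines from the question.
text = (
    "2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - "
    "rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> "
    "with args {} and content {u'inv_initiators': {u'path': "
    "u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> "
    "with args {}\n"
)

rows = parse_lines(text)
for row in rows:
    print(row)
```

Unlike re.findall, which reports a missing optional group as an empty string, m.groups() reports it as None, so the second sample line ends with None in the "content" position.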

The main problem with your original regex is that the first instance of .* could reasonably match a whole lot of the string; if you're not going to split the lines apart first, then at least you'd want to change .* to [^\n]* and so on throughout.
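If you do keep a single pass over the whole text, the [^\n]* substitution could look roughly like the sketch below. Note that in Python's re module a plain . already stops at newlines unless re.DOTALL is set, so the explicit class mostly guards against that flag and documents intent. The ^ anchor with re.MULTILINE is my own addition, not part of the original pattern; it is there so that a match attempt can only begin at the start of a line. WHOLE_TEXT_RE is a hypothetical name.

```python
import re

# Whole-text variant: every "." of the question's pattern becomes "[^\n]".
# Assumption: "^" plus re.MULTILINE is added so matches only start at line starts.
WHOLE_TEXT_RE = re.compile(
    r'^([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) '
    r'[^\n]*rest_server::log_request:\d+ - REST call: '
    r'<(\w+) ([^\n]*) [^\n]+?> with args (\{[^\n]*?\})'
    r'(?: and content (\{[^\n]*\}))?',
    re.MULTILINE,
)

# The two sample lines from the question.
text = (
    "2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - "
    "rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> "
    "with args {} and content {u'inv_initiators': {u'path': "
    "u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> "
    "with args {}\n"
)

# findall returns one tuple per match; a missing optional group is ''.
matches = WHOLE_TEXT_RE.findall(text)
for row in matches:
    print(row)
```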

As a micro-optimization, you may find that compiling your regex first with re.compile produces faster results; and if you're not compiling it ahead of time, I wouldn't be surprised if

([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3})

is outperformed by a simple

(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)

Basically, don't try to be clever and you'll do fine.
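If you want to check the timestamp claim on your own machine, a rough timeit sketch could look like this. The names CLEVER, SIMPLE and time_pattern are hypothetical, and the "late_day" timestamp is made up for illustration. I make no promise about which pattern wins; the numbers depend on your Python version and hardware. One thing the sketch does show is that the "clever" pattern has its month and day classes the wrong way round for a year-month-day timestamp, so it rejects any day from 20 to 31.

```python
import re
import timeit

# The two timestamp patterns from the answer, compiled ahead of time.
CLEVER = re.compile(r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3})')
SIMPLE = re.compile(r'(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)')

sample = "2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO"  # from the question
late_day = "2017-01-24 15:16:42,404 - RestLogger - INFO"  # made-up line, day >= 20


def time_pattern(pattern, number=100_000):
    """Seconds taken to run pattern.match(sample) `number` times."""
    return timeit.timeit(lambda: pattern.match(sample), number=number)


print("clever:", time_pattern(CLEVER))
print("simple:", time_pattern(SIMPLE))

# [01]\d sits in the day position, so days 20 to 31 never match the clever pattern.
print("clever matches day 24:", CLEVER.match(late_day) is not None)
print("simple matches day 24:", SIMPLE.match(late_day) is not None)
```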

Quuxplusone · Answer 1 · 2017-01-24 05:44:13Z
