\$\begingroup\$

I need to parse lines like the following from a log file of about 8000 lines. Parsing the whole file takes about 5 minutes:

text="2017-01-12 15:16:42,404 - RestLogger[10c059f7] - INFO - rest_server::log_request:102 - REST call: <POST /multi/ HTTP/1.1> with args {} and content {u'inv_initiators': {u'path': u'/v2/types/initiators/', u'args': }} from user admin @ 10.106.97.145\n"
text+="2016-10-06 20:58:04,025 - RestLogger - INFO - rest_server::log_request:98 - REST call: <GET /types/volumes/59 HTTP/1.1> with args {}\n"
re.findall('([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?\n', text)

Some lines contain the optional text "and content". To capture both kinds of line with one regex, I wrapped that part in a non-capturing group (?: ... ) followed by ?, and the result is really slow.

Any advice on how to make it faster?

asked Jan 23, 2017 at 16:41
\$\endgroup\$
2
  • 1
    \$\begingroup\$ Regex engines can vary. Can you clarify what language/engine you're using? \$\endgroup\$ Commented Jan 23, 2017 at 18:36
  • 2
    \$\begingroup\$ What do you intend to do with the results? I suggest that you show more code so that we understand the context. \$\endgroup\$ Commented Jan 23, 2017 at 18:45

1 Answer

\$\begingroup\$

Seems like instead of re.findall, you should be using re.match on each line of the input individually. Something like this:

for line in text.splitlines():
    m = re.match(r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) .*rest_server::log_request:\d+ - REST call: <(\w+) (.*) .+?> with args ({.*?})(?: and content ({.*}))?.*?', line)
    # m is None if the line does not match; otherwise do something with m.groups()
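For reference, the per-line approach above can be fleshed out into a small runnable sketch. The sample input reuses the second log line from the question; the non-matching line and the names PATTERN and parse_log are made up for illustration. The literal braces are escaped, and the trailing .*? is dropped because re.match does not have to consume the rest of the line.

```python
import re

# Pattern from the question, split across lines for readability.
# Literal braces are escaped; the trailing `.*?` is omitted because
# re.match only needs to match a prefix of the line.
PATTERN = (
    r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3}) '
    r'.*rest_server::log_request:\d+ - REST call: '
    r'<(\w+) (.*) .+?> with args (\{.*?\})'
    r'(?: and content (\{.*\}))?'
)

# Sample input: one log line from the question plus a hypothetical
# line that should not match.
text = (
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: "
    "<GET /types/volumes/59 HTTP/1.1> with args {}\n"
    "2016-10-06 20:58:05,000 - Other - INFO - unrelated message\n"
)


def parse_log(text):
    """Return one tuple of captured groups per matching line."""
    rows = []
    for line in text.splitlines():
        # re.match anchors at the start of the line, so a line that
        # does not match is rejected after a single attempt.
        m = re.match(PATTERN, line)
        if m is not None:
            rows.append(m.groups())
    return rows


rows = parse_log(text)
```

The optional "and content" group comes back as None when that part is absent, so callers need to handle both cases.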

The main problem with your original regex is that the first instance of .* could reasonably match a whole lot of the string; if you're not going to split the lines apart first, then at least you'd want to change .* to [^\n]* and so on throughout.

As a micro-optimization, you may find that compiling your regex first with re.compile produces faster results; and if you're not compiling it ahead of time, I wouldn't be surprised if

([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3})

is outperformed by a simple

(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)

Basically, don't try to be clever and you'll do fine.
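Whether the simpler timestamp pattern is actually faster is easy to measure rather than guess. The following is a minimal sketch, assuming Python 3; the names ORIGINAL_TS, SIMPLE_TS and time_pattern are hypothetical, and no particular timing result is claimed since it depends on the interpreter and machine.

```python
import re
import timeit

# Both timestamp patterns from above, compiled once up front.
ORIGINAL_TS = re.compile(r'([12]\d{3}-[0-3]\d-[01]\d \d{2}:\d{2}:\d{2},\d{3})')
SIMPLE_TS = re.compile(r'(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d)')

# A log line taken from the question.
line = (
    "2016-10-06 20:58:04,025 - RestLogger - INFO - "
    "rest_server::log_request:98 - REST call: "
    "<GET /types/volumes/59 HTTP/1.1> with args {}"
)


def time_pattern(pattern, line, number=50_000):
    """Return the total seconds taken by `number` calls to pattern.match."""
    return timeit.timeit(lambda: pattern.match(line), number=number)


# Sanity check: both patterns capture the same timestamp on this line.
both_match = ORIGINAL_TS.match(line).group(1) == SIMPLE_TS.match(line).group(1)

if __name__ == "__main__":
    print("original:", time_pattern(ORIGINAL_TS, line))
    print("simple:  ", time_pattern(SIMPLE_TS, line))
```

One side effect worth knowing: assuming the timestamps are year-month-day, the original pattern puts [01]\d in the day position and so rejects days 20 through 31, whereas the simpler pattern accepts any two digits there.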

answered Jan 24, 2017 at 5:44
\$\endgroup\$
