
I wrote a small utility for searching our web app log files for certain strings. The idea is that if the search string is found somewhere in the entry, the entire entry should be printed.

The constraints are:

  • One log entry can span one or more lines (stack traces, for example); a sample entry is shown below this list
  • Each log entry starts with an uppercase string specifying log type, followed by a pipe character ("ERROR|...")
  • Each log entry ends with a pipe character followed by an exclamation mark ("|!")
  • The utility should work much like a regular grep command, meaning that it takes input either from a list of files or from STDIN.
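For illustration, a hypothetical entry in this format might look like the following (the timestamps, messages and stack frames are made up; only the head/tail markers follow the real format):

ERROR|2019-05-09 12:38:01|Request failed
java.lang.NullPointerException
    at com.example.app.Handler.handle(Handler.java:42)
    at com.example.app.Server.run(Server.java:87)
|!
INFO|2019-05-09 12:38:02|Request completed|!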

The script works as expected, but it is quite slow: the same logic written as an AWK script is about twice as fast when processing a large amount of data. Any suggestions on how to improve the performance?

logentrygrep.py

#!/bin/python2.7
from __future__ import print_function
import argparse
import fileinput
import re
import signal
import sys

# Disable printing stacktrace on KeyboardInterrupt
signal.signal(signal.SIGINT, lambda x, y: sys.exit(1))

# Parse arguments
parser = argparse.ArgumentParser(description='Tool for greping log entry contents')
parser.add_argument('pattern', help='Pattern to search for')
parser.add_argument('file', nargs='*', help='Files to search')
parser.add_argument('-i', '--ignore-case', dest="ignorecase", action='store_true', help='Case insensitive greping')
parser.add_argument('-v', '--verbose', dest="verbose", action='store_true', help='Print verbose messages')
args = parser.parse_args()

VERBOSE = args.verbose
if VERBOSE:
    print("Pattern to search for: " + args.pattern)
    print("File(s): " + (str(args.file) if args.file else "STDIN"))


class LogDataProcessor:

    entryHeadPattern = re.compile("^[A-Z]{3,}\|")
    entryTailPattern = re.compile("\|!$")

    totalLinesProcessed = 0
    lineBuffer = []
    matchFound = False

    def __init__(self, regex_pattern, ignorecase):
        if ignorecase:
            self.pattern = re.compile(regex_pattern, re.IGNORECASE)
        else:
            self.pattern = re.compile(regex_pattern)

    def process(self, indata):
        for line in fileinput.input(indata):
            self.processLine(line.rstrip())

    def processLine(self, line):
        self.totalLinesProcessed += 1
        if self.isEntryHead(line):
            # Flush in case previous entry did not terminate correctly
            self.flush()
            self.addToBuffer(line)
            # If it's a one-line entry (head and tail on the same line)
            if self.isEntryTail(line):
                self.flush()
        elif not self.isBufferEmpty():
            self.addToBuffer(line)
            if self.isEntryTail(line):
                self.flush()

    def flush(self):
        if self.matchFound:
            self.printBuffer()
        self.lineBuffer = []
        self.matchFound = False

    def printBuffer(self):
        for line in self.lineBuffer:
            print(line)

    def addToBuffer(self, line):
        if not self.matchFound and self.lineMatches(line):
            self.matchFound = True
        self.lineBuffer.append(line)

    def isBufferEmpty(self):
        return not self.lineBuffer

    def isEntryHead(self, line):
        return self.entryHeadPattern.match(line)

    def isEntryTail(self, line):
        return self.entryTailPattern.search(line)

    def lineMatches(self, line):
        return self.pattern.search(line)


processor = LogDataProcessor(args.pattern, args.ignorecase)
processor.process(args.file)

if VERBOSE:
    print("Lines searched: " + str(processor.totalLinesProcessed))
asked May 9, 2019 at 12:38
  • How many lines would such a log file have? Just to have a rough estimate would be nice. Are we talking about 1,000, 10,000 or 1,000,000 lines? Commented May 9, 2019 at 15:23
  • Why not use awk for what it is good at? If I remember correctly, it is possible to parse command line arguments in an awk script too. Python could still be used; you could even pipe data back and forth between gawk and Python, and with unbuffering tricks it is possible to effectively yield from a sub-process into Python. It isn't easy, but it sounds like you want performance more than convenience. Commented May 9, 2019 at 18:04
  • @AlexV I tested it with a few thousand lines, but if I'd search through all log files of our 50-odd servers it would be more along the lines of a few million, so that's what I'd like to optimise for. A single log entry will not be more than maybe 50 lines though, so the lineBuffer will not grow very large. Commented May 9, 2019 at 19:26
  • @S0AndS0 I already wrote an awk script that does pretty much exactly the above, but I wanted to try writing it in Python for convenience and to get more used to Python, since I don't get to use it much. When I noticed how much slower it was, I got curious whether it could be tweaked to reduce that difference. I realise it will never be faster than awk, but perhaps it could be faster than it is now? Commented May 9, 2019 at 19:28

1 Answer


On my Windows 10 laptop, I ran your program on a dummy log file under the Python profiler using the command line:

python -m cProfile -s cumulative loggrep.py "speech" \data\test.log 

The dummy log file has about 4.3 MB of text spread across roughly 100k lines, comprising 32772 log entries. The search pattern was "speech", which occurred 444 times.

This is the output of the profiler:

         1322924 function calls (1322872 primitive calls) in 2.006 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      7/1    0.000    0.000    2.006    2.006  {built-in method builtins.exec}
        1    0.000    0.000    2.006    2.006  loggrep.py:3(<module>)
        1    0.077    0.077    1.997    1.997  loggrep.py:38(process)
   100003    0.136    0.000    1.788    0.000  loggrep.py:42(processLine)
    32772    0.013    0.000    1.369    0.000  loggrep.py:60(flush)                                <-----
      444    0.014    0.000    1.355    0.003  loggrep.py:67(printBuffer)                          <-----
     9722    1.341    0.000    1.341    0.000  {built-in method builtins.print}                    <-----
   100003    0.061    0.000    0.132    0.000  loggrep.py:71(addToBuffer)
   100004    0.042    0.000    0.116    0.000  fileinput.py:248(__next__)
   100003    0.028    0.000    0.075    0.000  loggrep.py:80(isEntryHead)
   100004    0.060    0.000    0.071    0.000  {method 'readline' of '_io.TextIOWrapper' objects}
   195293    0.068    0.000    0.068    0.000  {method 'search' of '_sre.SRE_Pattern' objects}
   100003    0.027    0.000    0.062    0.000  loggrep.py:83(isEntryTail)
    95290    0.027    0.000    0.060    0.000  loggrep.py:86(lineMatches)
   100010    0.047    0.000    0.047    0.000  {method 'match' of '_sre.SRE_Pattern' objects}

The header shows the program took 2.006 seconds to run. Looking at the lines marked with "<-----", we see that flush() was called 32772 times and took about 1.369 seconds cumulatively. From the code, we know flush() calls printBuffer(), which calls print(). The profile says print() was called 9722 times, taking 1.341 seconds, or about 65% of the total run time.

I changed entryTailPattern to "\|!\s*$", which tolerates trailing whitespace. That allowed me to drop the .rstrip() on each line, so the lines keep the '\n' at the end, which in turn let me replace the print() loop in printBuffer() with a single call to sys.stdout.writelines(). This saved about 0.5 seconds compared to the print() loop; a sketch of the changes follows.
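A minimal sketch of those two changes, assuming the rest of LogDataProcessor stays exactly as in the question (only the affected pieces are shown):

import fileinput
import re
import sys

class LogDataProcessor:

    entryHeadPattern = re.compile(r"^[A-Z]{3,}\|")
    # Tolerate trailing whitespace (including the newline) so process()
    # no longer has to rstrip() every line
    entryTailPattern = re.compile(r"\|!\s*$")

    def process(self, indata):
        for line in fileinput.input(indata):
            self.processLine(line)  # keep the trailing '\n'

    def printBuffer(self):
        # One buffered write instead of one print() call per line;
        # works because the buffered lines still end in '\n'
        sys.stdout.writelines(self.lineBuffer)

    # __init__, processLine, flush, addToBuffer, etc. unchanged from the question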

All the other function calls seem to be in the noise, so I don't see anything else promising to try.

answered Jun 28, 2019 at 6:35
