
Cassandra, when debugging is enabled, logs slow queries to the debug log file. Typical entries look like:

DEBUG [ScheduledTasks:1] 2017-02-16 18:58:44,342 MonitoringTask.java:572 - 4 operations were slow in the last 5010 msecs:
<SELECT FROM foo.bar WHERE token(id) > token(9be90fe7-9a6d-45d5-ad11-e93cfd56def7) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(91faceee-a64b-4fd3-bb93-ef483acade88) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(47250d17-573a-4d76-9039-d2771a19ff10) LIMIT 100>, time 1 msec - slow timeout 1 msec
<SELECT FROM foo.bar WHERE token(id) > token(e04fc6d0-18b8-4ac0-b5f9-df42cd3a03c5) LIMIT 100>, time 1 msec - slow timeout 1 msec

The actual format is only documented in code.

For MySQL, the mysqldumpslow tool parses the logs and prints the queries (and related statistics) in a readable manner. I'm trying to write a similar tool for Cassandra, for the feature request in CASSANDRA-13000.

The goals I set are:

  1. Use similar options to mysqldumpslow, where applicable, so I have to implement these options:

    --help Display help message and exit
    -g Only consider statements that match the pattern
    -r Reverse the sort order
    -s How to sort output
    -t Display only the first num queries
    

    Sorting options:

    • t, at: Sort by query time or average query time
    • c: Sort by count

    Of these, the -g option is yet to be implemented, since there are some problems in how the queries are logged.

    I'm also adding long-form variants of these (--sort, --reverse, etc.) consistently.

  2. Support JSON-encoded input, in a streaming fashion. This is for another related patch I'm submitting, where the queries are dumped with JSON encoding for easier parsing by external tools. The JSON-encoded entry will look like:

    {
        "operation": "SELECT FROM foo.bar WHERE token(id) > token(60bad0b3-551f-46c7-addc-4e3105561a21) LIMIT 100",
        "totalTime": 1,
        "timeout": 1,
        "isCrossNode": false,
        "numTimesReported": 1,
        "minTime": 1,
        "maxTime": 1,
        "keyspace": "foo",
        "table": "bar"
    }
    
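    Python's json.loads cannot consume a stream incrementally, so for now I accumulate lines and retry the parse (see get_json_objects below). An alternative worth noting (a sketch, not what the patch does) is json.JSONDecoder.raw_decode, which parses one value off the front of a buffer and reports where it stopped:

    import json

    def iter_json_objects(infile):
        # Sketch: peel complete JSON values off a growing buffer.
        decoder = json.JSONDecoder()
        buf = ""
        for line in infile:
            buf += line
            while True:
                buf = buf.lstrip()  # raw_decode rejects leading whitespace
                if not buf:
                    break
                try:
                    obj, end = decoder.raw_decode(buf)
                except ValueError:  # value still incomplete; read more input
                    break
                yield obj
                buf = buf[end:]
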
  3. Keep compatibility with Python 2 and 3

The code:

csqldumpslow.py:

#! /usr/bin/env python3
from __future__ import print_function
import re
import sys
import getopt
import json


def usage():
    msg = """Usage:
 {} [OPTION] ... [FILE] ...
Provide a summary of the slow queries listed in Cassandra debug logs.
Multiple log files can be provided, in which case, the logs are combined.
If no file is specified, logs/debug.log is assumed. Use - for stdin.

 -h, --help         Print this message
 -s, --sort=type    Sort the input
                      t  - total time
                      at - average time
                      c  - count
 -r, --reverse      Reverse the sort order
 -t, --top=N        Print only the top N queries (only useful when sorting)
 -j, --json         Assume input consists of slow queries encoded in JSON
 -o, --output=FILE  Save output to FILE
"""
    print(msg.format(sys.argv[0]))


class query_stats:
    def __init__(self, time=0, avg=0, mintime=0, maxtime=0, count=1):
        if count == 1:
            self.time = self.avg = self.mintime = self.maxtime = time
            self.count = 1
        else:
            self.avg = avg
            self.mintime = mintime
            self.maxtime = maxtime
            self.count = count
            self.time = time

    def __str__(self):
        if self.count == 1:
            return "{}ms".format(self.time)
        else:
            return "{}ms ({}ms) Min: {}ms Max: {}ms".format(
                self.avg,
                self.time,
                self.mintime,
                self.maxtime
            )


class slow_query:
    def __init__(self, operation, stats, timeout,
                 keyspace=None, table=None, is_cross_node=False):
        self.operation = operation
        self.stats = stats
        self.timeout = timeout
        self.keyspace = keyspace
        self.table = table
        self.is_cross_node = is_cross_node

    def __str__(self):
        return " Time: {} {} Timeout: {}\n\t{}\n".format(
            self.stats,
            "(cross-node)" if self.is_cross_node else "",
            self.timeout,
            self.operation)


class log_parser:
    regexes = {
        'start': re.compile(r'DEBUG.*- (\d+) operations were slow in the last (\d+) msecs:$'),  # noqa
        'single': re.compile(r'<(.*)>, time (\d+) msec - slow timeout (\d+) msec(/cross-node)?$'),  # noqa
        'multi': re.compile(r'<(.*)>, was slow (\d+) times: avg/min/max (\d+)/(\d+)/(\d+) msec - slow timeout (\d+) msec(/cross-node)?$'),  # noqa
    }

    def __init__(self, sort, key, reverse, top, top_count, json_input):
        self.queries = []
        self.sort = sort
        self.key = key
        self.reverse = reverse
        self.top = top
        self.top_count = top_count
        self.json_input = json_input

    def process_query(self, query):
        # If we're not sorting, we can print the queries directly. If we are
        # sorting, save the query.
        if self.sort:
            self.queries.append(query)
        else:
            # If we have to print only N entries, exit after doing so
            if self.top:
                if self.top_count > 0:
                    self.top_count -= 1
                else:
                    sys.exit()
            print(query)

    def parse_slow_query_stats(self, line):
        match = log_parser.regexes['single'].match(line)
        if match is not None:
            self.process_query(slow_query(
                operation=match.group(1),
                stats=query_stats(int(match.group(2))),
                timeout=int(match.group(3)),
                # group(4) captures the "/cross-node" suffix when present
                is_cross_node=(match.group(4) is not None)
            ))
            return
        match = log_parser.regexes['multi'].match(line)
        if match is not None:
            self.process_query(slow_query(
                operation=match.group(1),
                stats=query_stats(
                    count=int(match.group(2)),
                    avg=int(match.group(3)),
                    time=int(match.group(3)) * int(match.group(2)),
                    mintime=int(match.group(4)),
                    maxtime=int(match.group(5))
                ),
                timeout=int(match.group(6)),
                is_cross_node=(match.group(7) is not None)
            ))
            return
        print("Could not parse: " + line, file=sys.stderr)
        sys.exit(1)

    def get_json_objects(self, infile):
        # Since Python's json doesn't support streaming, try accumulating line
        # by line, and parsing.
        prev = ""
        for line in infile:
            try:
                yield json.loads(prev + line)
                # Parse succeeded, so discard the accumulated lines
                prev = ""
            except ValueError:
                # json.JSONDecodeError is a ValueError subclass; catching
                # ValueError keeps this working on Python 2 as well
                prev += line

    def parse_json(self, infile):
        for obj in self.get_json_objects(infile):
            self.process_query(slow_query(
                operation=obj["operation"],
                stats=query_stats(
                    count=obj["numTimesReported"],
                    time=obj["totalTime"],
                    avg=obj["totalTime"] / obj["numTimesReported"],
                    mintime=obj["minTime"],
                    maxtime=obj["maxTime"]
                ),
                timeout=obj["timeout"],
                is_cross_node=obj["isCrossNode"]
            ))

    def parse_log(self, infile):
        if self.json_input:
            self.parse_json(infile)
        else:
            # How many queries does the current log entry list?
            current_count = 0
            for line in infile:
                line = line.rstrip()
                if current_count > 0:
                    self.parse_slow_query_stats(line)
                    current_count -= 1
                else:
                    match = log_parser.regexes['start'].match(line)
                    if match is None:
                        continue
                    current_count = int(match.group(1))

    def sort_queries(self):
        # Sort by total time, default
        if self.key is None or self.key == 't':
            self.queries.sort(key=lambda x: x.stats.time, reverse=self.reverse)
        # Sort by avergae time
        elif self.key == 'at':
            self.queries.sort(key=lambda x: x.stats.avg, reverse=self.reverse)
        # Sort by count
        elif self.key == 'c':
            self.queries.sort(key=lambda x: x.stats.count, reverse=self.reverse)  # noqa
        return

    def end(self):
        # Sort and print
        if self.sort:
            self.sort_queries()
        if self.top:
            self.queries = self.queries[:self.top_count]
        for q in self.queries:
            print(q)


def main():
    opts, args = getopt.gnu_getopt(
        sys.argv[1:],
        'hs:rt:jo:',
        [
            'help',
            'sort=',
            'reverse',
            'top=',
            'json',
            'output=',
        ]
    )
    # Defaults:
    # Do not sort
    sort = False
    key = None
    # Do not reverse
    reverse = False
    # Print all lines; top_count is ignored if top is unset
    top = False
    top_count = 0
    # Assume debug.log-style input, not JSON
    json_input = False
    for opt, arg in opts:
        if opt in ["-h", "--help"]:
            usage()
            sys.exit()
        elif opt in ["-s", "--sort"]:
            sort = True
            key = arg
        elif opt in ["-r", "--reverse"]:
            reverse = True
        elif opt in ["-t", "--top"]:
            top = True
            top_count = int(arg)
        elif opt in ["-j", "--json"]:
            json_input = True
        elif opt in ["-o", "--output"]:
            sys.stdout = open(arg, "a")
        else:
            print("Not yet implemented: " + opt)
            sys.exit(1)
    if len(args) == 0:
        # Default to reading the debug.log
        args = ['logs/debug.log']
    parser = log_parser(sort, key, reverse, top, top_count, json_input)
    for arg in args:
        if arg == '-':
            print("Reading from standard input")
            parser.parse_log(sys.stdin)
        else:
            with open(arg) as infile:
                print("Reading from " + arg)
                parser.parse_log(infile)
    parser.end()


if __name__ == "__main__":
    main()
asked Feb 17, 2017 at 3:08

1 Answer


Here are some notes about the code (both performance- and code-style-related):

  • since you are creating a lot of slow_query and query_stats class instances on the fly (also see the note about the naming below), use __slots__ to improve memory usage and performance:

    class slow_query:
        __slots__ = ["operation", "stats", "timeout", "keyspace", "table", "is_cross_node"]
        # ...
    
  • switching from json to ujson may dramatically improve the JSON parsing speed

  • or, you can try the PyPy and simplejson combination (ujson won't work on PyPy since it is written in C; simplejson is a fast pure-Python parser)
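
    A drop-in import fallback keeps the script working where ujson isn't installed (a sketch; both modules raise a ValueError on malformed input, so the error handling can stay module-agnostic):

    try:
        import ujson as json  # same loads() interface for this use case
    except ImportError:
        import json

    entry = json.loads('{"totalTime": 1, "timeout": 1}')
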
  • think about the capturing groups in your regular expressions; you can avoid capturing more things than you actually need. For example, in the "start" regular expression you have 2 capturing groups, but you actually use only the first one:

    r'DEBUG.*- (\d+) operations were slow in the last \d+ msecs:$'
                                        no group here ^^^
    
  • the wildcard matches in the regular expressions can be non-greedy - .*? instead of .* (not sure if it will have a measurable impact on performance)

  • class names should use a "CamelCase" convention (PEP8 reference)

  • the .get_json_objects() method can be static
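
    Since it never touches self, a @staticmethod decorator documents that and lets it be called without an instance (sketch):

    class log_parser:
        @staticmethod
        def get_json_objects(infile):
            """Yield JSON objects parsed from infile."""
            # ... body unchanged, minus the `self` parameter ...
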
  • for the CLI parameter parsing I would use argparse module - you would avoid the boilerplate code you have in the main() and usage() functions
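
    A rough argparse equivalent of the current getopt handling (a sketch; the help strings are illustrative):

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(
            description="Summarize slow queries from Cassandra debug logs.")
        parser.add_argument('files', nargs='*', default=['logs/debug.log'],
                            help="log files to read; use - for stdin")
        parser.add_argument('-s', '--sort', choices=['t', 'at', 'c'],
                            help="sort by total time, average time, or count")
        parser.add_argument('-r', '--reverse', action='store_true',
                            help="reverse the sort order")
        parser.add_argument('-t', '--top', type=int, metavar='N',
                            help="print only the top N queries")
        parser.add_argument('-j', '--json', action='store_true',
                            help="input is JSON-encoded slow-query entries")
        parser.add_argument('-o', '--output', metavar='FILE',
                            help="append output to FILE")
        return parser.parse_args()  # -h/--help comes for free
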
  • use 2 spaces before the # for the inline comment (PEP8 reference)
  • fix typo "avergae" -> "average"
  • you can improve the readability of the sort_queries() method by introducing a mapping between the key and the sort attribute name, something along these lines:

    def sort_queries(self):
        """Sorts "queries" in place, default sort is "by time"."""
        sort_attributes = {
            't': 'time',
            'at': 'avg',
            'c': 'count',
        }
        sort_attribute = sort_attributes.get(self.key, 'time')
        self.queries.sort(key=lambda x: getattr(x.stats, sort_attribute),
                          reverse=self.reverse)

    It feels, though, like this mapping should be defined as a constant beforehand, e.g. as a class attribute (see the sketch below).
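
    For example (a sketch; SORT_ATTRIBUTES is an illustrative name, and the existing lowercase class name is kept):

    class log_parser:
        # Maps each -s/--sort key to the query_stats attribute to sort on
        SORT_ATTRIBUTES = {
            't': 'time',
            'at': 'avg',
            'c': 'count',
        }

        def sort_queries(self):
            """Sorts "queries" in place, default sort is "by time"."""
            attr = log_parser.SORT_ATTRIBUTES.get(self.key, 'time')
            self.queries.sort(key=lambda x: getattr(x.stats, attr),
                              reverse=self.reverse)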

  • improve the documentation: add meaningful docstrings to the class methods, and put comments wherever you think the reader may have difficulty understanding the code - remember, code is read much more often than it is written

Note that this is what I can see by looking at the code. Of course, to really identify the bottleneck(s), you should profile the code properly on a large input.
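
A quick way to do that is the built-in cProfile/pstats pair (a sketch; the dump filename is illustrative, and main() is the script's own entry point):

    import cProfile
    import pstats

    # Profile a full run and dump the raw stats to a file
    cProfile.run('main()', 'csqldumpslow.prof')

    # Print the 20 most expensive functions by cumulative time
    pstats.Stats('csqldumpslow.prof').sort_stats('cumulative').print_stats(20)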

answered Feb 17, 2017 at 3:20
  • Thanks, it took me a while to check some of these points. I like the __slots__ idea. Switching JSON modules didn't seem to make much difference (all three took about 3.1 seconds for 6000 JSON objects like in the question). Since switching to non-greedy expressions doesn't seem to matter much, I'll keep the usual ones as they're more familiar to people. The argparse module is great, and I can combine its type checking with the dict in sort_queries to make fixed choices easily. (It did lead me to an oddity in its handling of defaults: stackoverflow.com/q/42297956/2072269) - Commented Feb 18, 2017 at 8:46
  • Writing meaningful docstrings is a weakness; I need to work on that. Profiling seems to indicate that get_json_objects is where most of the action is. I'll need to experiment more with ujson and simplejson. - Commented Feb 18, 2017 at 8:48
