I have written a Python script whose purpose is to read logs from CloudWatch and then post them to ElasticSearch. It is not quite finished but I've progressed far enough that I could benefit from feedback from Python experts, specifically:
- Is the use of Python Generators correct and idiomatic
- Is the use of Class Composition correct and idiomatic
- Anything else style-wise inappropriate for Python 3.
#!/usr/bin/env python3
import json
import time
import uuid
import os
import sys
import boto3
from elasticsearch import Elasticsearch, helpers
client = boto3.client("logs")
def usage() -> None:
print("Usage: GROUP_NAME=cloudwatch_group ES_HOST=es_host {}".format(
os.path.basename(__file__)))
sys.exit(1)
if "GROUP_NAME" not in os.environ:
usage()
if "ES_HOST" not in os.environ:
usage()
class CWLogs:
group_name = os.environ["GROUP_NAME"]
def events(self) -> None:
for event in self.__generate_events():
yield event
def __generate_streams(self) -> None:
kwargs = {
"logGroupName": self.group_name,
}
while True:
stream_batch = client.describe_log_streams(**kwargs)
yield from stream_batch["logStreams"]
try:
kwargs["nextToken"] = stream_batch["nextToken"]
except KeyError:
break
def __generate_events(self) -> None:
stream_names = \
[stream["logStreamName"] for stream in self.__generate_streams()]
for stream_name in stream_names:
kwargs = {
"logGroupName": self.group_name,
"logStreamName": stream_name,
}
while True:
logs_batch = client.get_log_events(**kwargs)
yield from logs_batch["events"]
try:
kwargs["nextToken"] = logs_batch["nextToken"]
except KeyError:
break
class ESWriter:
es_host = os.environ["ES_HOST"]
elastic = Elasticsearch()
def post(self, events: object) -> None:
try:
response = helpers.bulk(
self.elastic, self.__transformer(events))
print("\nRESPONSE:", response)
except Exception as e:
print("\nERROR:", e)
@staticmethod
def __index_name(timestamp: str) -> str:
return "eventbridge-auth0-{}".format(
time.strftime("%Y.%m", time.localtime(timestamp)))
@staticmethod
def __normalize(message: str) -> str:
return message # TODO.
def __transformer(self, events: object) -> None:
for event in events:
yield self.__transform(event)
def __transform(self, event: dict) -> None:
timestamp = event["timestamp"]
index_name = self.__index_name(timestamp)
message = self.__normalize(event["message"])
return "\n".join([
json.dumps({
"index": {
"_id": str(uuid.uuid4()), # TODO. Check
"_index": index_name,
"_type": "_doc"}}),
json.dumps({
"source": {
"@source": "auto-populate script",
"@timestamp": timestamp,
"@message": message}})])
if __name__ == '__main__':
ESWriter().post(CWLogs().events())
2 Answers 2
Code Organization
Your code organization seems non-existent. You have:
- imports
- code
- function definition
- code
- class definitions
- main-guarded code
Code should be organized in a more consistent structure, like:
- imports
- class definitions
- function definitions
- main-guarded code
The point of using a main-guard is to prevent code from running if the file is imported into another file. Here, you have two separate code blocks which are unconditionally executed. This limits code reuse. For example, imagine someone could use CWLogs
for their own task, but doesn't need ESWriter
. They try from your_file import CWLogs
, and find their program exits after displaying a cryptic error message about a how to execute a program they are not actually running, due to a missing environment variable they don't actually use.
sys.exit()
Don't call this. It terminates the Python interpreter.
Any debugging you may have hoped to do when the program finishes will be impossible, because the entire Python environment imploded. It is impossible to safely import your file using try:
import your_file
except ImportError:
because Python execution terminates during the import, meaning the program trying to import it unconditionally terminated. If you try to use unittest
to test your program, or Sphinx to generate documentation for your program, or any number of other common things, you can't, because your file has unconditionally terminated the Python interpreter.
Don't call it.
Instead:
if __name__ == '__main__':
if {'GROUP_NAME', 'ES_HOST'} <= os.environ.keys():
main()
else:
usage()
No need for usage()
to call sys.exit()
. After usage()
is called, and returns normally, execution reaches the end of the file, which if this is the main program file, will naturally end the program. Of course, if this is not the main program file, the main guard would have not run either method, the execution would reach the end of the file completing the importation of the file as a module in another program.
Stop Writing Classes
See "Stop Writing Classes" for a PyCon talk by Jack Diederich.
A class with no instance data members probably shouldn't be a class. Neither ESWriter
nor CWLogs
have any instance data members.
A class with no constructor and only one public method to call shouldn't be a class. Neither ESWriter
nor CWLogs
have a constructor. Both have a single public method, called immediately after constructing a class instance, so the instance is not even saved.
These should not be classes.
Private name mangling
Private name mangling is used to prevent private member name collisions when a class is derived from another class, typically when the base class and the derive class are under control of different entities. For instance, if you derive your own class from a tkinter.Frame
, and you create a _validate
method in your class, you could cause the base class to stop functioning properly if it had its own _validate
method that was just abruptly changed on it. So, the base class would use __validate
, the leading double underscore would trigger name "mangling", and replace the name with _Frame__validate
, so collisions are less likely.
There appears to be no reason for your usage of a double underscore prefix in your method names; a single underscore would be more idiomatic.
Type Hints
Your type hints are wrong.
For instance, the following is clearly returning a str
, not None
:
def __transform(self, event: dict) -> None:
...
return "\n".join( ... )
Since __transformer
is yielding the results of __transform
, it is not returning None
either, but should be declared as:
from typing import Generator
...
def __transformer(self, events: object) -> Generator[str, None, None]:
...
Or simply:
from typing import Iterator
...
def __transformer(self, events: object) -> Iterator[str]:
...
And events: object
is virtually meaningless, since everything in Python is an object. Either use a proper type for it, or don't bother with a type hint at all.
Generator Expressions
As yedpodtrziko noted,
def __generate_events(self) -> None:
stream_names = [stream["logStreamName"] for stream in self.__generate_streams()]
for stream_name in stream_names:
...
builds up a temporary list, only to immediately iterate through it. They made a fairly large change in the code to avoid the temporary list. There is a much smaller change that can be made:
def __generate_events(self) -> None:
stream_names = (stream["logStreamName"] for stream in self.__generate_streams())
for stream_name in stream_names:
...
Because it may be hard to see the change, I'll amplify it: the [...]
got changed to (...)
. This means instead of stream_names
being realized as an in-memory list, it becomes a generator expression, which will produce the values one at a time when asked.
It doesn't make much of a difference here, but if stream_names
was being passed to a function, instead of being used locally, the change proposed by yedpodtrziko would require reworking code much further away to accept the stream_obj
and extracting the stream names inside that function.
Don't hardcode environment variables inside the classes. Instead of this:
class CWLogs:
group_name = os.environ["GROUP_NAME"]
do it like this:
class CWLogs:
group_name = None
def __init__(self, group_name):
self.group_name = group_name
if not GROUP_NAME := getenv('GROUP_NAME'):
usage()
# pass the variable when initializing the class :
CWLogs(GROUP_NAME)
This will make the code more maintainable as it does not tightly bound it to the env. variable, but rather to whatever you'll pass there and it will be easier to write tests for such code. Moreover you dont have to repeat the variable in two places, which will increase chance you'll make a typo in one place or forget to modify it in both places when the functionality will change. Same with the ESWriter
class.
then there's this function:
def __generate_events(self) -> None:
stream_names = [stream["logStreamName"] for stream in self.__generate_streams()]
for stream_name in stream_names:
...
Here you have an unnecessary extra loop and extra list allocated in the memory. First you iterate through data returned from __generate_streams()
, and then you iterate through the same data once again. You can do this instead:
def __generate_events(self) -> None:
for stream_obj in self.__generate_streams():
stream_name = stream_obj['logStreamName']
Explore related questions
See similar questions with these tags.
if
statements tofor n in ['GROUP_NAME, 'ES_HOST'']: if n not in os.environ: usage()
\$\endgroup\$