This question has a follow-up review posted here
This is a project I have been working on. This is one of my first experiences with Python and OOP as a whole. I have written a GUI that handles the inputs for these classes, but I will ask for a separate review for that, since the question would be rather bulky when including both.
The goal of this program is to create standard SQL (SQL server) queries for everyday use. The rationale behind this is that we regularly need similar queries, and would like to prevent common mistakes in them. The focus on this question is on the Python code however.
The information about the tables and their relation to each-other is provided by a JSON file, of which I have attached a mock-up version.
The code consists of three parts:
A universe class which handles the JSON file and creates the context of the tables.
A query class, which handles the specifications of which tables to include, which columns to take, how to join each table and optional where statements.
A PyQT GUI that handles the inputs. This is excluded in this post and will be posted separately for another review.
The JSON:
{
"graph": {
"table1": {
"tag": ["table1"],
"DBHandle": ["tables.table1"],
"Priority": [1],
"Columns": ["a", "b", "c"],
"Joins": {
"table2": ["on table2.a = table1.a", "inner"],
"table3": ["on table1.c = table3.c", "inner"]
}
},
"table2": {
"tag": ["table2"],
"DBHandle": ["tables.table2"],
"Priority": [2],
"Columns": ["a", "d", "e"],
"Joins": {
"table3": ["on table2.d=table3.d and table2.e = table3.e", "inner"]
}
},
"table3": {
"tag": ["table3"],
"DBHandle": ["tables.table3"],
"Priority": [4],
"Columns": ["c", "d", "e"],
"Joins": []
}
},
"presets": {
"non empty b": {
"table": ["table1"],
"where": ["table1.b is not null"]
}
}
}
The Python code:
# -*- coding: utf-8 -*-
"""
Created on Thu Aug 3 14:33:44 2017
@author: jdubbeldam
"""
from json import loads
class Universe:
"""
The Universe is a context for the Query class. It contains the information
of the available Database tables and their relation to eachother. This
information is stored in a JSON file.
"""
def __init__(self, filename):
"""
Reads the JSON and separates the information in a presets dictionary and
a graph dictionary. The latter contains the information of the nodes in
the universe/graph, including relational information.
"""
with open(filename, encoding='utf-8') as file:
self.json = loads(str(file.read()))
self.presets = self.json['presets']
self.json = self.json['graph']
self.tables = self.json.keys()
self.connections = self.getEdges()
def getEdges(self):
"""
Creates a dictionary with for each node a list of nodes that join on
that node.
"""
edges = {}
for i in self.tables:
edges[i] = []
try:
edges[i] += [j for j in self.json[i]['Joins'].keys()]
except AttributeError:
pass
for i in edges.keys():
for j in edges[i]:
if not i in edges[j]:
edges[j].append(i)
return edges
def shortestPath(self, start, end, path = []):
"""
Calculates the shortest path in a graph, using the dictionary created
in getEgdes. Adapted from https://www.python.org/doc/essays/graphs/.
"""
path = path + [start]
if start == end:
return path
if not start in self.connections.keys():
return None
shortest = None
for node in self.connections[start]:
if node not in path:
newpath = self.shortestPath(node, end, path)
if newpath:
if not shortest or len(newpath) < len(shortest):
shortest = newpath
return shortest
def joinPaths(self, nodes):
"""
Extension of shortestPath to work with multiple nodes to be connected.
The nodes are sorted based on the priority, which is taken from the JSON.
shortestPath is called on the first two nodes, then iteratively on each
additional node and one of the existing nodes returned by shortestPath,
selecting the one that takes the fewest steps.
"""
sorted_nodes = sorted([[self.json[i]['Priority'][0], i] for i in nodes])
paths = []
paths.append(self.shortestPath(sorted_nodes[0][1], sorted_nodes[1][1]))
for i in range(len(sorted_nodes) - 2):
shortest = None
flat_paths = [item for sublist in paths for item in sublist]
old_path = len(flat_paths)
for j in flat_paths:
newpath = self.shortestPath(j, sorted_nodes[i+2][1], flat_paths)
if newpath:
if not shortest or len(newpath[old_path:]) < len(shortest):
shortest = newpath[old_path:]
paths.append(shortest)
return paths
class Query:
"""
Query contains the functions that allow us to build an SQL query based on
a universe object. It maintains lists with the names of activated tables
and, if applicable, which of their columns in a dictionary. Implicit tables
are tables that are called, only to bridge joins from one table to another.
Since they are not explicitly called, we don't want their columns in the query.
how_to_join is a dictionary that allows setting joins (left, right, inner, full)
other than the defaults imported from the JSON.
"""
def __init__(self, universum):
self.core = 'select\n\n{columns}\n\nfrom {joins}\n\n where {where}'
self.graph = universum
self.active_tables = []
self.active_columns = {}
self.implicit_tables = []
self.join_strings = {}
for i in self.graph.tables:
self.join_strings[i] = self.graph.json[i]['Joins']
self.how_to_join = {}
def addTables(self, tablename):
"""
Sets given tablename to active. GUI ensures that only valid names
will be given.
"""
if not tablename in self.active_tables:
self.active_tables.append(tablename)
self.active_columns[tablename] = []
def addColumns(self, table, column):
"""
Sets given columnname from table to active. GUI ensures that only valid names
will be given.
"""
if not column in self.active_columns[table]:
self.active_columns[table].append(column)
def addWhere(self, string):
"""
Adds any string to a list to be input as where statement. This could be
vulnerable for SQL injection, but the scope of this project is in-house
usage, and the generated SQL query isn't directly passed to the server.
If no where statements have been given yet, the list is created.
"""
try:
self.where.append(string)
except AttributeError:
self.where = [string]
def findJoins(self):
"""
Calls the joinPaths function from Universe class. Figures out which joins
are needed and which tables need to be implicitly added. Returns a list
of tuples with tablenames to be joined.
"""
tags = [self.graph.json[i]['tag'][0] for i in self.active_tables]
join_paths = self.graph.joinPaths(tags)
join_sets = []
for i in join_paths:
for j, k in zip(i[:-1], i[1:]):
join_sets.append((j, k))
for sublist in join_paths:
for item in sublist:
if not item in self.active_tables:
self.addTables(item)
self.implicit_tables.append(item)
return join_sets
def joinStatement(self, table_tuple):
"""
Creates the join statement for a given tuple of tablenames. The second
entry in the tuple is always the table that is joined. Since the string
is stored in a dictionary with one specific combination of the two table
names, the try statement checks which way around it needs to be. how contains
the default way to join. Unless otherwise specified, this is used to generate
the join string.
"""
added_table = table_tuple[1]
try:
on_string, how = self.graph.json[table_tuple[0]]['Joins'][table_tuple[1]]
except (KeyError, TypeError) as error:
table_tuple = (table_tuple[1], table_tuple[0])
on_string, how = self.graph.json[table_tuple[0]]['Joins'][table_tuple[1]]
if not table_tuple in self.how_to_join.keys():
self.how_to_join[table_tuple] = how
join_string = self.how_to_join[table_tuple] + ' join ' + self.graph.json[added_table]['DBHandle'][0] + ' ' + self.graph.json[added_table]['tag'][0] + '\n'
return join_string + on_string
def selectStatement(self, table):
"""
Creates the column specification. If no columns of an active table are
specified, it assumes all the columns are wanted.
"""
if not self.active_columns[table]:
self.active_columns[table] = ['*']
return ',\n'.join([self.graph.json[table]['tag'][0] + '.' + i for i in self.active_columns[table]])
def compileQuery(self):
"""
Handles compilation of the query. If there are more than one activated
table, joins need to be handled. First the required joins are found, then
the strings that handle this are generated. The column statement is created.
If there is no where statement specified, '1=1' is added. The relevent
statements are added into the core query and returned.
"""
if len(self.active_tables) == 1:
base_table = self.active_tables[0]
self.join_statement = []
else:
joins = self.findJoins()
base_table = joins[0][0]
self.join_statement = [self.joinStatement(i) for i in joins]
self.join_statement = [self.graph.json[base_table]['DBHandle'][0] + ' ' + self.graph.json[base_table]['tag'][0]] + self.join_statement
self.join_statement = '\n\n'.join(self.join_statement)
self.column_statement = []
for i in self.active_tables:
if i not in self.implicit_tables:
self.column_statement.append(self.selectStatement(i))
self.column_statement = ',\n'.join(self.column_statement)
try:
self.where_statement = '\nand '.join(self.where)
except AttributeError:
self.where_statement = '1 = 1'
return self.core.replace('{columns}', self.column_statement).replace('{joins}', self.join_statement).replace('{where}', self.where_statement)
if __name__ == "__main__":
graph = Universe('example.JSON')
query = Query(graph)
query.addTables('table1')
query.addTables('table2')
query.addTables('table3')
print(query.compileQuery())
1 Answer 1
Good job documenting your code very thoroughly!
Here are some of the stylistic, code organizational, security and other notes:
- make sure your indentation is consistently 4 spaces (you have some places with a different indent)
- make sure to follow the
lower_case_with_underscores
naming notation - for example,compileQuery
would becomecompile_query
- improve your variable naming - remember that code is much more often read than written - choose descriptive variable names whenever possible - to pick a sample -
for i in self.tables
can becomefor table in self.tables
- beware of why having a mutable default argument can be dangerous (for instance,
path
inshortestPath()
method) use multi-line strings for your SQL queries -
core
can be defined as:self.core = """ SELECT {columns} FROM {joins} WHERE {where} """
- make sure you actually trust the source of the JSON file - you don't escape or validate query parameters and this makes your code vulnerable to SQL injections
- use
not in
. For example,if not column in self.active_columns[table]
can be replaced with a more readableif column not in self.active_columns[table]
use list (and other) comprehensions. For example, the following snippet:
self.column_statement = [] for i in self.active_tables: if i not in self.implicit_tables: self.column_statement.append(self.selectStatement(i))
can be rewritten as:
self.column_statement = [self.selectStatement(table) for table in self.active_tables if table not in self.implicit_tables]
when you check for presence of a key in a dictionary, don't use
.keys()
- e.g. replace:if not table_tuple in self.how_to_join.keys():
with:
if table_tuple not in self.how_to_join:
Overall, try to follow PEP8
style guide, look into using static code analysis tools like flake8
or pylint
, which may not only catch code style violations but can prevent real bugs and save a lot of time. A good idea would be to configure your IDE of choice to perform static code analysis checks on the fly.
Another high-level idea is to see if you can make use of ORM tools like SQLAlchemy
to help you out in query generation. The point is that, doing it manually is highly error-prone and would require a lot of careful testing to cover possible use cases.
There are other things to note - I think it might be a good idea to get this code through multiple rounds of reviews - depending on activity in this topic, of course.
-
\$\begingroup\$ Thanks for the feedback! I'll make sure to take a better look at it when I am not in the middle of friday beer day ;). make sure you actually trust the source of the JSON file This is for internal use, with database users that don't have much power in the actual database. This is why I didn't make much effort checking the JSON. \$\endgroup\$JAD– JAD2017年08月11日 15:31:34 +00:00Commented Aug 11, 2017 at 15:31
-
\$\begingroup\$ Again thanks for the feedback. I have looked into your points, and I think I fixed most of them. Personally I think mixed case for methods is nice to be able to distinguish between attributes and methods, but I'll comply. I used the static code analysis tool in Spyder, and one of the points it raised was that the Query method had too many attributes (mentioning a supposed maximum of 7). How much should I worry about that? \$\endgroup\$JAD– JAD2017年08月14日 07:23:22 +00:00Commented Aug 14, 2017 at 7:23
-
-
\$\begingroup\$ @JarkoDubbeldam thanks for applying the changes. Yes, it would be a good idea to post a new question - make sure to have the complete working code in the body of the question as well as what the code is meant to do. There is no hard rule on how many arguments a method should have and I've seen methods and functions from Python standard library that accept much more of them. Though, it still can be a sign that a method is doing too much and can be broken down into multiple methods. Thanks. \$\endgroup\$alecxe– alecxe2017年08月14日 12:50:30 +00:00Commented Aug 14, 2017 at 12:50
-
\$\begingroup\$ Woops, I meant class attributes, not method attributes/arguments. \$\endgroup\$JAD– JAD2017年08月14日 12:58:04 +00:00Commented Aug 14, 2017 at 12:58