SQL query generator

Question 1

This question has a follow-up review posted here

This is a project I have been working on. This is one of my first experiences with Python and OOP as a whole. I have written a GUI that handles the inputs for these classes, but I will ask for a separate review for that, since the question would be rather bulky when including both.

The goal of this program is to create standard SQL (SQL server) queries for everyday use. The rationale behind this is that we regularly need similar queries, and would like to prevent common mistakes in them. The focus on this question is on the Python code however.

The information about the tables and their relation to each-other is provided by a JSON file, of which I have attached a mock-up version.

The code consists of three parts:

A universe class which handles the JSON file and creates the context of the tables.
A query class, which handles the specifications of which tables to include, which columns to take, how to join each table and optional where statements.
A PyQT GUI that handles the inputs. This is excluded in this post and will be posted separately for another review.

The JSON:

{
 "graph": {
 "table1": {
 "tag": ["table1"],
 "DBHandle": ["tables.table1"],
 "Priority": [1],
 "Columns": ["a", "b", "c"],
 "Joins": {
 "table2": ["on table2.a = table1.a", "inner"],
 "table3": ["on table1.c = table3.c", "inner"]
 }
 },
 "table2": {
 "tag": ["table2"],
 "DBHandle": ["tables.table2"],
 "Priority": [2],
 "Columns": ["a", "d", "e"],
 "Joins": {
 "table3": ["on table2.d=table3.d and table2.e = table3.e", "inner"]
 }
 },
 "table3": {
 "tag": ["table3"],
 "DBHandle": ["tables.table3"],
 "Priority": [4],
 "Columns": ["c", "d", "e"],
 "Joins": []
 }
 },
 "presets": {
 "non empty b": {
 "table": ["table1"],
 "where": ["table1.b is not null"]
 }
 }
}

The Python code:

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 3 14:33:44 2017
@author: jdubbeldam
"""
from json import loads
class Universe:
 """
 The Universe is a context for the Query class. It contains the information
 of the available Database tables and their relation to eachother. This 
 information is stored in a JSON file.
 """
 def __init__(self, filename):
 """
 Reads the JSON and separates the information in a presets dictionary and
 a graph dictionary. The latter contains the information of the nodes in
 the universe/graph, including relational information.
 """
 with open(filename, encoding='utf-8') as file: 
 self.json = loads(str(file.read()))
 self.presets = self.json['presets']
 self.json = self.json['graph']
 self.tables = self.json.keys()
 self.connections = self.getEdges()
 def getEdges(self):
 """
 Creates a dictionary with for each node a list of nodes that join on
 that node.
 """
 edges = {}
 for i in self.tables:
 edges[i] = []
 try:
 edges[i] += [j for j in self.json[i]['Joins'].keys()]
 except AttributeError:
 pass
 for i in edges.keys():
 for j in edges[i]:
 if not i in edges[j]:
 edges[j].append(i)
 return edges
 def shortestPath(self, start, end, path = []):
 """
 Calculates the shortest path in a graph, using the dictionary created
 in getEgdes. Adapted from https://www.python.org/doc/essays/graphs/.
 """
 path = path + [start]
 if start == end:
 return path
 if not start in self.connections.keys():
 return None
 shortest = None
 for node in self.connections[start]:
 if node not in path:
 newpath = self.shortestPath(node, end, path)
 if newpath:
 if not shortest or len(newpath) < len(shortest):
 shortest = newpath
 return shortest
 def joinPaths(self, nodes):
 """
 Extension of shortestPath to work with multiple nodes to be connected.
 The nodes are sorted based on the priority, which is taken from the JSON.
 shortestPath is called on the first two nodes, then iteratively on each
 additional node and one of the existing nodes returned by shortestPath,
 selecting the one that takes the fewest steps.
 """
 sorted_nodes = sorted([[self.json[i]['Priority'][0], i] for i in nodes])
 paths = []
 paths.append(self.shortestPath(sorted_nodes[0][1], sorted_nodes[1][1]))
 for i in range(len(sorted_nodes) - 2):
 shortest = None
 flat_paths = [item for sublist in paths for item in sublist]
 old_path = len(flat_paths)
 for j in flat_paths:
 newpath = self.shortestPath(j, sorted_nodes[i+2][1], flat_paths)
 if newpath:
 if not shortest or len(newpath[old_path:]) < len(shortest):
 shortest = newpath[old_path:]
 paths.append(shortest)
 return paths 
class Query:
 """
 Query contains the functions that allow us to build an SQL query based on
 a universe object. It maintains lists with the names of activated tables
 and, if applicable, which of their columns in a dictionary. Implicit tables
 are tables that are called, only to bridge joins from one table to another.
 Since they are not explicitly called, we don't want their columns in the query.
 how_to_join is a dictionary that allows setting joins (left, right, inner, full)
 other than the defaults imported from the JSON.
 """
 def __init__(self, universum):
 self.core = 'select\n\n{columns}\n\nfrom {joins}\n\n where {where}'
 self.graph = universum
 self.active_tables = []
 self.active_columns = {}
 self.implicit_tables = []
 self.join_strings = {}
 for i in self.graph.tables:
 self.join_strings[i] = self.graph.json[i]['Joins']
 self.how_to_join = {}
 def addTables(self, tablename):
 """
 Sets given tablename to active. GUI ensures that only valid names
 will be given.
 """
 if not tablename in self.active_tables:
 self.active_tables.append(tablename)
 self.active_columns[tablename] = []
 def addColumns(self, table, column):
 """
 Sets given columnname from table to active. GUI ensures that only valid names
 will be given.
 """
 if not column in self.active_columns[table]:
 self.active_columns[table].append(column)
 def addWhere(self, string):
 """
 Adds any string to a list to be input as where statement. This could be
 vulnerable for SQL injection, but the scope of this project is in-house
 usage, and the generated SQL query isn't directly passed to the server.
 If no where statements have been given yet, the list is created.
 """
 try:
 self.where.append(string)
 except AttributeError:
 self.where = [string]
 def findJoins(self):
 """
 Calls the joinPaths function from Universe class. Figures out which joins
 are needed and which tables need to be implicitly added. Returns a list
 of tuples with tablenames to be joined.
 """
 tags = [self.graph.json[i]['tag'][0] for i in self.active_tables]
 join_paths = self.graph.joinPaths(tags)
 join_sets = []
 for i in join_paths:
 for j, k in zip(i[:-1], i[1:]):
 join_sets.append((j, k))
 for sublist in join_paths:
 for item in sublist:
 if not item in self.active_tables:
 self.addTables(item)
 self.implicit_tables.append(item)
 return join_sets
 def joinStatement(self, table_tuple):
 """
 Creates the join statement for a given tuple of tablenames. The second
 entry in the tuple is always the table that is joined. Since the string
 is stored in a dictionary with one specific combination of the two table
 names, the try statement checks which way around it needs to be. how contains
 the default way to join. Unless otherwise specified, this is used to generate
 the join string.
 """
 added_table = table_tuple[1]
 try:
 on_string, how = self.graph.json[table_tuple[0]]['Joins'][table_tuple[1]]
 except (KeyError, TypeError) as error:
 table_tuple = (table_tuple[1], table_tuple[0])
 on_string, how = self.graph.json[table_tuple[0]]['Joins'][table_tuple[1]]
 if not table_tuple in self.how_to_join.keys():
 self.how_to_join[table_tuple] = how
 join_string = self.how_to_join[table_tuple] + ' join ' + self.graph.json[added_table]['DBHandle'][0] + ' ' + self.graph.json[added_table]['tag'][0] + '\n'
 return join_string + on_string 
 def selectStatement(self, table):
 """
 Creates the column specification. If no columns of an active table are 
 specified, it assumes all the columns are wanted.
 """
 if not self.active_columns[table]:
 self.active_columns[table] = ['*']
 return ',\n'.join([self.graph.json[table]['tag'][0] + '.' + i for i in self.active_columns[table]])
 def compileQuery(self):
 """
 Handles compilation of the query. If there are more than one activated
 table, joins need to be handled. First the required joins are found, then
 the strings that handle this are generated. The column statement is created.
 If there is no where statement specified, '1=1' is added. The relevent
 statements are added into the core query and returned.
 """
 if len(self.active_tables) == 1:
 base_table = self.active_tables[0]
 self.join_statement = []
 else:
 joins = self.findJoins()
 base_table = joins[0][0]
 self.join_statement = [self.joinStatement(i) for i in joins]
 self.join_statement = [self.graph.json[base_table]['DBHandle'][0] + ' ' + self.graph.json[base_table]['tag'][0]] + self.join_statement
 self.join_statement = '\n\n'.join(self.join_statement)
 self.column_statement = []
 for i in self.active_tables:
 if i not in self.implicit_tables:
 self.column_statement.append(self.selectStatement(i))
 self.column_statement = ',\n'.join(self.column_statement)
 try:
 self.where_statement = '\nand '.join(self.where)
 except AttributeError:
 self.where_statement = '1 = 1'
 return self.core.replace('{columns}', self.column_statement).replace('{joins}', self.join_statement).replace('{where}', self.where_statement)
if __name__ == "__main__":
 graph = Universe('example.JSON')
 query = Query(graph)
 query.addTables('table1')
 query.addTables('table2')
 query.addTables('table3')
 print(query.compileQuery())

Question 2

Good job documenting your code very thoroughly!

Here are some of the stylistic, code organizational, security and other notes:

make sure your indentation is consistently 4 spaces (you have some places with a different indent)
make sure to follow the lower_case_with_underscores naming notation - for example, compileQuery would become compile_query
improve your variable naming - remember that code is much more often read than written - choose descriptive variable names whenever possible - to pick a sample - for i in self.tables can become for table in self.tables
beware of why having a mutable default argument can be dangerous (for instance, path in shortestPath() method)

use multi-line strings for your SQL queries - core can be defined as:

self.core = """
 SELECT
 {columns}
 FROM 
 {joins}
 WHERE 
 {where}
"""

make sure you actually trust the source of the JSON file - you don't escape or validate query parameters and this makes your code vulnerable to SQL injections
use not in. For example, if not column in self.active_columns[table] can be replaced with a more readable if column not in self.active_columns[table]

use list (and other) comprehensions. For example, the following snippet:

self.column_statement = []
for i in self.active_tables:
 if i not in self.implicit_tables:
 self.column_statement.append(self.selectStatement(i))

can be rewritten as:

self.column_statement = [self.selectStatement(table) 
 for table in self.active_tables 
 if table not in self.implicit_tables]

when you check for presence of a key in a dictionary, don't use .keys() - e.g. replace:

if not table_tuple in self.how_to_join.keys():

with:

if table_tuple not in self.how_to_join:

Overall, try to follow PEP8 style guide, look into using static code analysis tools like flake8 or pylint, which may not only catch code style violations but can prevent real bugs and save a lot of time. A good idea would be to configure your IDE of choice to perform static code analysis checks on the fly.

Another high-level idea is to see if you can make use of ORM tools like SQLAlchemy to help you out in query generation. The point is that, doing it manually is highly error-prone and would require a lot of careful testing to cover possible use cases.

There are other things to note - I think it might be a good idea to get this code through multiple rounds of reviews - depending on activity in this topic, of course.

Question 3

Thanks for the feedback! I'll make sure to take a better look at it when I am not in the middle of friday beer day ;). make sure you actually trust the source of the JSON file This is for internal use, with database users that don't have much power in the actual database. This is why I didn't make much effort checking the JSON.

Question 4

Again thanks for the feedback. I have looked into your points, and I think I fixed most of them. Personally I think mixed case for methods is nice to be able to distinguish between attributes and methods, but I'll comply. I used the static code analysis tool in Spyder, and one of the points it raised was that the Query method had too many attributes (mentioning a supposed maximum of 7). How much should I worry about that?

Question 5

Regarding your comment about multiple rounds of review, I posted the changes on github, should I turn that into a new question?

Question 6

@JarkoDubbeldam thanks for applying the changes. Yes, it would be a good idea to post a new question - make sure to have the complete working code in the body of the question as well as what the code is meant to do. There is no hard rule on how many arguments a method should have and I've seen methods and functions from Python standard library that accept much more of them. Though, it still can be a sign that a method is doing too much and can be broken down into multiple methods. Thanks.

Question 7

Woops, I meant class attributes, not method attributes/arguments.

alecxe alecxe 17.5k8 gold badges52 silver badges93 bronze badges · Accepted Answer · 2017-08-11 15:00:55Z

Good job documenting your code very thoroughly!

Here are some of the stylistic, code organizational, security and other notes:

make sure your indentation is consistently 4 spaces (you have some places with a different indent)
make sure to follow the lower_case_with_underscores naming notation - for example, compileQuery would become compile_query
improve your variable naming - remember that code is much more often read than written - choose descriptive variable names whenever possible - to pick a sample - for i in self.tables can become for table in self.tables
beware of why having a mutable default argument can be dangerous (for instance, path in shortestPath() method)

use multi-line strings for your SQL queries - core can be defined as:

self.core = """
 SELECT
 {columns}
 FROM 
 {joins}
 WHERE 
 {where}
"""

make sure you actually trust the source of the JSON file - you don't escape or validate query parameters and this makes your code vulnerable to SQL injections
use not in. For example, if not column in self.active_columns[table] can be replaced with a more readable if column not in self.active_columns[table]

use list (and other) comprehensions. For example, the following snippet:

self.column_statement = []
for i in self.active_tables:
 if i not in self.implicit_tables:
 self.column_statement.append(self.selectStatement(i))

can be rewritten as:

self.column_statement = [self.selectStatement(table) 
 for table in self.active_tables 
 if table not in self.implicit_tables]

when you check for presence of a key in a dictionary, don't use .keys() - e.g. replace:

if not table_tuple in self.how_to_join.keys():

with:

if table_tuple not in self.how_to_join:

Overall, try to follow PEP8 style guide, look into using static code analysis tools like flake8 or pylint, which may not only catch code style violations but can prevent real bugs and save a lot of time. A good idea would be to configure your IDE of choice to perform static code analysis checks on the fly.

Another high-level idea is to see if you can make use of ORM tools like SQLAlchemy to help you out in query generation. The point is that, doing it manually is highly error-prone and would require a lot of careful testing to cover possible use cases.

There are other things to note - I think it might be a good idea to get this code through multiple rounds of reviews - depending on activity in this topic, of course.

Thanks for the feedback! I'll make sure to take a better look at it when I am not in the middle of friday beer day ;). make sure you actually trust the source of the JSON file This is for internal use, with database users that don't have much power in the actual database. This is why I didn't make much effort checking the JSON.
Again thanks for the feedback. I have looked into your points, and I think I fixed most of them. Personally I think mixed case for methods is nice to be able to distinguish between attributes and methods, but I'll comply. I used the static code analysis tool in Spyder, and one of the points it raised was that the Query method had too many attributes (mentioning a supposed maximum of 7). How much should I worry about that?
Regarding your comment about multiple rounds of review, I posted the changes on github, should I turn that into a new question?
@JarkoDubbeldam thanks for applying the changes. Yes, it would be a good idea to post a new question - make sure to have the complete working code in the body of the question as well as what the code is meant to do. There is no hard rule on how many arguments a method should have and I've seen methods and functions from Python standard library that accept much more of them. Though, it still can be a sign that a method is doing too much and can be broken down into multiple methods. Thanks.
Woops, I meant class attributes, not method attributes/arguments.

Stack Exchange Network

SQL query generator

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

SQL query generator

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions