I wrote a simple text parser and a converter using Python 3.12.1.
The purpose is to read a text file that contains 20000 words one per line, and create an output file that contains all the queries.
Here is a snippet of the input file with fake values:
aaaaaaaa
bbbbbbbb
cccccccc
dddddddd
I got this file from an external source and I need to save those words in a database. For that reason, I need to read the input file and create the SQL queries. I can directly run the queries on the DB, but for simplicity, here I published a version with a text file for output.
I tried to implement the separation of concerns and the single responsability concept.
Here's my code:
#!/usr/bin/env python3
from abc import ABC, abstractmethod
class ParserInterface(ABC):
@abstractmethod
def parse(self) -> None:
pass
class ConverterInterface(ABC):
@abstractmethod
def convert(self, tablename: str = '', columnname: str = '') -> None:
pass
@abstractmethod
def writeOnFile(self, filename: str = '') -> None:
pass
class Parser(ParserInterface):
def __init__(self, filename: str = '') -> None:
if not filename:
raise Exception('Filename cannot be empty!')
self.filename = filename
self.tokens = set()
def parse(self) -> None:
with open(self.filename, 'r') as file:
while line := file.readline():
self.tokens.add(line.rstrip())
class TextToMySQLConverter(ConverterInterface):
def __init__(self) -> None:
self.tokens = set()
self.queries = set()
def setTokens(self, tokens: set = set()) -> None:
self.tokens = tokens
def convert(self, tablename: str = '', columnname: str = '') -> None:
if not tablename:
raise Exception('Tablename cannot be empty!')
if not columnname:
raise Exception('Columnname cannot be empty!')
for token in self.tokens:
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
self.queries.add(query)
def writeOnFile(self, filename: str = '') -> None:
if not filename:
raise Exception('Filename cannot be empty!')
with open(filename,'w') as file:
for query in self.queries:
file.write('{0}\n'.format(query))
class Manager:
def __init__(self, parser: ParserInterface = None, converter: ConverterInterface = None) -> None:
if not parser:
raise Exception('Parser cannot be None!')
if not converter:
raise Exception('Converter cannot be None!')
self.parser = parser
self.converter = converter
self.outputFilename = ''
self.tableName = ''
self.columnName = ''
def setOutputFilename(self, filename: str = '') -> None:
if not filename:
raise Exception('Filename cannot be empty!')
self.outputFilename = filename
def setTableName(self, tableName: str = '') -> None:
if not tableName:
raise Exception('Filename cannot be empty!')
self.tableName = tableName
def setColumnName(self, columnName: str = '') -> None:
if not columnName:
raise Exception('Filename cannot be empty!')
self.columnName = columnName
def process(self):
self.parser.parse()
tokens = self.parser.tokens
converter.setTokens(tokens)
converter.convert(self.tableName, self.columnName)
converter.writeOnFile(self.outputFilename)
if __name__ == "__main__":
parser = Parser('original.txt')
converter = TextToMySQLConverter() # We can implement a strategy here
manager = Manager(parser, converter)
manager.setOutputFilename('queries.txt')
manager.setTableName("tablename")
manager.setColumnName("columname")
manager.process()
-
\$\begingroup\$ As I wrote in the description, the input file is a file that contains 20000 words one per line. \$\endgroup\$friday-json– friday-json2024年02月13日 15:48:16 +00:00Commented Feb 13, 2024 at 15:48
2 Answers 2
Naming
Method/function names, parameters, and variable names should be snake_case
Parser.parse
This doesn't have to be a while
loop, you can just use set.update
on an iterable:
class Parser(ParserInterface):
~snip~
def parse(self) -> None:
# no need for the 'r' mode, as it's the default
with open(self.filename) as fh:
self.tokens.update((line.rstrip() for line in fh))
TextToMySQLConverter.write_on_file
To write the set
to the file, simply use str.join
:
def write_on_file(self, filename: str = '') -> None:
if not filename:
raise Exception('Filename cannot be empty!')
with open(filename, 'w') as file:
file.write('\n'.join(self.queries))
Bug: converter
I think you meant to have self.converter
in Manager.process
, otherwise it's relying on a global:
def process(self):
self.parser.parse()
tokens = self.parser.tokens
# Here
converter.setTokens(tokens)
converter.convert(self.tableName, self.columnName)
converter.writeOnFile(self.outputFilename)
Getters and Setters
The Manager
class should be able to take output_filename
, table_name
, and column_name
as parameters, since you seem to know these ahead of time. This way you don't need the getter and setter methods:
class Manager:
def __init__(
self,
output_filename: str,
table_name: str,
column_name: str,
parser: ParserInterface, # Make these required
converter: ConverterInterface
) -> None:
self.parser = parser
self.converter = converter
self.output_filename = output_filename
self.table_name = table_name
self.column_name = column_name
def process(self):
self.parser.parse()
self.converter.tokens = self.parser.tokens
self.converter.convert(self.table_name, self.column_name)
self.converter.write_file(self.output_filename)
However, now we have a class with two methods, one of which is __init__
. This means that we should be able to safely convert this into a standalone function:
def parse_and_write(
parser: ParserInterface,
converter: ConverterInterface,
output_filename: str,
table_name: str,
column_name: str
):
parser.parse()
converter.tokens = parser.tokens
converter.convert(table_name, column_name)
converter.write_file(output_filename)
converter.tokens = parser.tokens
IMO, parser.tokens
should just be an argument to converter.convert
:
class TextToMySQLConverter(ConverterInterface):
def __init__(self):
self.tokens = set()
self.queries = set()
def convert(self, tokens: set[str], tablename: str, columnname: str) -> None:
for token in tokens:
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
self.queries.add(query)
Now, this can also be refactored in the same way. Instead of returning None
, return queries
:
def convert_tokens(self, tokens: set[str], tablename: str, columnname: str) -> set[str]:
queries = set()
for token in tokens:
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
queries.add(query)
return queries
What about write_to_file
? That is a function as well, we aren't carrying around state, and it's just as easy to provide queries
and output_filename
as arguments:
def write_queries(file: str, queries: set[str]) -> None:
with open(file, 'w') as fh:
fh.write('\n'.join(queries))
Parser
As above, so below. Parser
is also just a single function:
def parse(filename: str) -> set[str]:
with open(filename) as file:
return set((line.rstrip() for line in file))
main
function
Now, your process
function is really just a main
:
def parse(filename: str) -> set[str]:
with open(filename, 'r') as file:
return set((line.rstrip() for line in file))
def convert_tokens(tokens: set[str], tablename: str, columnname: str) -> set[str]:
queries = set()
for token in tokens:
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
queries.add(query)
return queries
def write_queries(file: str, queries: set[str]) -> None:
with open(file, 'w') as fh:
fh.write('\n'.join(queries))
def main(input_file: str, output_file: str, table_name: str, column_name: str):
tokens = parse(input_file)
queries = convert_tokens(tokens)
write_queries(queries=queries, file=output_file)
return
if __name__ == "__main__":
main(
input_file='original.txt',
output_file='queries.txt',
table_name='tablename',
column_name='columnname'
)
```
pass
class ParserInterface(ABC):
@abstractmethod
def parse(self) -> None:
pass
At the end there we had to write pass
,
something we generally avoid unless it's syntactically necessary.
You twice passed up the opportunity to """describe""" the responsibility of the class or method. Now for the class, no big deal, it seems pretty self explanatory. But the method is a little surprising.
Typically I would expect a "parse" verb to consume
some ugly text and return it as a beautiful data structure.
But here the helpful annotation is explaining that we
must be evaluating for side effects.
Ok, fair enough.
But then explain those side effects in a docstring, please.
And having done that, there's no longer a necessity for that pass
.
Similar remarks for convert()
.
default args
def convert(self, tablename: str = '', columnname: str = '') -> None:
...
def writeOnFile(self, filename: str = '') -> None:
The type annotation is helpful, thank you.
(I am sad the file name is of str
rather than the more natural Path
.)
I can't imagine how those empty-string defaults
would be helpful to a caller.
Seems more likely that someone would use your Public API
incorrectly, then be lulled into a sense of false confidence
in the results due to lack of diagnostic exception being raised.
Consider leaving them as type str
, with no default,
so caller is forced to supply a value.
Similarly with the Parser
ctor.
Testing with if not filename:
and then raising
is just silly, elide that part.
What you might sensibly do is insist on Path
,
and raise
unless filename.exists()
.
Similarly down in TextToMySQLConverter
and Manager
.
And if you're going to raise
,
either define an app-specific exception
or choose an appropriate one such as ValueError
.
Avoid raising Exception
,
as that prevents callers from focusing a try
block
on an expected issue that they know how to recover from.
lint
Pep-8
asks that you name it write_on_file()
.
Similarly for set_tokens()
.
open() defaults
with open(self.filename, 'r') as file:
Using a context handler for auto-close is beautiful, thank you.
nit: The 'r'
mode is default and might be elided.
If you're keen to tack on open() options,
you might consider , encoding='utf8'
.
Personally I wouldn't, but some folks
assert it should always be specified
for fear of the system default being Latin1
or some crazy thing.
The systems I work on have a sensible unicode system default,
so the concern seldom arises.
while line := file.readline():
self.tokens.add(line.rstrip())
Sure, it works, and it's walrussy. But please don't do that. Stick to idiomatic consumption of the iterable:
for line in file:
...
Hmmm, we have a lot of calls to .add()
:
self.tokens.add(line.rstrip())
Consuming the iterable could simply be done by set()
:
with open(self.filename) as file:
self.tokens = set(map(str.rstrip, file))
mutable default parameter
Please don't do this:
def setTokens(self, tokens: set = set()) -> None:
This is valid syntax, but it has surprising results
likely to trip up a future maintenance engineer.
Therefore we don't write code like this, not ever,
without adding a helpful #
comment
that offers a warning to the unwary.
SO supplies answers describing the effect.
The idiomatic way to write this would be:
def set_tokens(self, tokens: set | None = None) -> None:
self.tokens = tokens or set()
So instead of evaluating set()
just once, "at compile time"
when the class is being parsed into bytecode,
for N calls we will evaluate set()
N times, "at run time".
And yes, I know, the Optional aspect is annoying and ugly,
sorry about it, that's life.
little Bobby Tables
query = 'INSERT INTO {0} ({1}) VALUES ("{2}");'.format(tablename, columnname, token)
Ugghh! Say it isn't so!
Maybe you have a Requirements Document that patiently
explains your input file shall only contain lowercase "a" .. "z"
characters and newlines.
I don't care.
Eventually some maintenance engineer will
trip over this, hard.
Don't do it.
Use bind variables, as the DB vendor intended.
Do not attempt to merge this feature branch to main
in its current state.
(BTW, ever wonder why so many URLs include base64 encoded args?
Because someone's requirements document said that inputs
shall never include &
ampersand, ?
question mark, #
hash,
and similar punctuation. Until things changed.
So we defensively build for robustness, anticipating changes.)
nit: An INSERT is a DML "statement" with side effects,
not a side effect free "query".
If you don't like statement
,
then a vague but accurate sql
identifier would suffice.
file.write('{0}\n'.format(query))
nit: That's a curiously long way to write print(query, file=file)
.
And when you feel the need to .format()
,
usually it's better to write an f-string: `f"{query}\n"
setters
def setOutputFilename(self, filename: str = '') -> None: ...
self.outputFilename = filename
I don't understand the rationale, here.
This isn't java, we don't need to define
a set_output_filename()
routine,
when we have a public self.output_filename
attribute.
Just assign to it.
If we change our mind, we can always interpose a setter routine later, perhaps for logging or for rejecting non-existent filenames.
Similarly for those other public attributes.
Aaannnddd, given that we can't really do anything
with attributes missing, wouldn't you like
the __init__
ctor to insist that caller provides
them from the get go?
batch size
It wasn't obvious to me how often you issue a COMMIT to the DB backend, and the Review Context didn't comment on performance.
Sometimes a single BEGIN; INSERT; ... INSERT; END transaction suffices. For a large ETL task we sometimes prefer to send a thousand or ten thousand rows at a time, with a COMMIT between batches.
What you definitely want to avoid is an ETL that does COMMIT after each INSERT. That will slow things down without offering any app-level benefits. An auto-commit setting of True will sometimes arrange for more COMMITs than a code author intended.
SQLAlchemy
There's more than one ORM that could help you with this INSERT task. Consider using SQL Alchemy for it.
summary
- Write injection-free code, always.
- Avoid mutable default parameters.
- Think carefully about whether a class or method is truly so well-named, so self-explanatory, that it needs no docstring. (Sometimes we don't need one, but less often than you think.)
- Gather mandatory parameter values up front.
- Embrace
Path
. - Choose snake_case identifiers.
- Only write {get,set}ters for
_private
attributes. - Raise specific errors rather than generic
Exception
.
This codebase appears to accomplish many of its design goals.
There are enough design issues and code defects in the code that I would not be willing to delegate or accept maintenance tasks for it in its current form.