I have ported Stempel stemmer in Java (Apache Lucene) to Python. I come from Java world, so I'm afraid my translation might not be "pythonic" enough.
I would like to hear your feedback on quite representative part of the code, translation of Diff
class that applies stemming command (diff
) to a string (dest
).
@classmethod
def apply(cls, dest, diff):
"""
Apply the given patch string diff to the given string dest
:param dest: Destination string
:param diff: Patch string
:return:
"""
if diff is None:
return
if not isinstance(dest, MutableString):
raise ValueError
if not dest:
return
pos = len(dest) - 1
try:
for i in range(int(len(diff) / 2)):
cmd = diff[2 * i]
param = diff[2 * i + 1]
par_num = ord(param) - ord('a') + 1
if cmd == '-':
pos -= (par_num - 1)
elif cmd == 'R':
cls.__check_index(dest, pos)
dest[pos] = param
elif cmd == 'D':
o = pos
pos -= (par_num - 1)
cls.__check_index(dest, pos)
dest[pos:o + 1] = ''
elif cmd == 'I':
pos += 1
cls.__check_offset(dest, pos)
dest.insert(pos, param)
pos -= 1
except IndexError:
# swallow, same thing happens in original Java version
pass
@classmethod
def __check_index(cls, s, index):
if index < 0 or index >= len(s):
raise IndexError
@classmethod
def __check_offset(cls, s, offset):
if offset < 0 or offset > len(s):
raise IndexError
Some justifications of decisions I took:
The original implementation uses
StringBuffer
to manipulate characters in a string. Since Pythonstr
type is immutable I used my own classMutableString
that behaves like a Pythonlist
.Also, original logic was based on catching
IndexOutOfBoundsException
. Contrary to Java, Python allows negative indexes in a list and list ranges. Therefore, I've introduces guards like__check_X
.Java implementation uses switch/case/default clause. I translated that to if/elif/else clause in Python.
1 Answer 1
Specific suggestions:
- Python will throw an
IndexError
if you try to access an element past the end of a list, so you don't need to checkindex >= len(s)
oroffset > len(s)
. - After the simplification above I would inline the
__check
methods as assertions. - Throwing an exception and catching it again within the same context is a code smell - it's too much like
goto
. Why not justreturn
instead? Or let the user know the input could not be processed? - You don't actually use any OOP in your code, so
apply
might as well not be within a class - In the same way I think you could just use a
List[str]
fordest
. - Checking the type of an argument is unpythonic. Basically you're responsible for passing some value which can be used by the code. And with type checking (below) you could even enforce this at a test or linting stage.
- Abbreviations make code harder to read. I would expand things like
destination
,position
andcommand
. - Magic values like
R
andD
should be named constants to improve readability.
General suggestions:
black
can automatically format your code to be more idiomatic.isort
can group and sort your imports automatically.flake8
with a strict complexity limit will give you more hints to write idiomatic Python:[flake8] max-complexity = 4 ignore = W503,E203
That limit is not absolute by any means, but it's worth thinking hard whether you can keep it low whenever validation fails. For example, I'm working with a team on an application since a year now, and our complexity limit is up to 7 in only one place. Conversely, on an ugly old piece of code I wrote without static analysis support I recently found the complexity reaches 87!)
I would then recommend adding type hints everywhere and validating them using a strict
mypy
configuration:[mypy] check_untyped_defs = true disallow_untyped_defs = true ignore_missing_imports = true no_implicit_optional = true warn_redundant_casts = true warn_return_any = true warn_unused_ignores = true
As a Java developer I'm sure you'd appreciate the clarity this lends to the code.
-
\$\begingroup\$ Thanks. I've started to introduce type hinting but I've realized in many cases it makes reading the code harder. Right, it helps clarify ambiguous parts but in other parts it's seems overly verbose like in Java. What's your take on using type hinting selectively, for documentation purposes only when the code is ambiguous? \$\endgroup\$dzieciou– dzieciou2019年07月20日 10:16:33 +00:00Commented Jul 20, 2019 at 10:16
-
1\$\begingroup\$ Seems sensible if any of the types get complicated. For this algorithm I expect the finished code would make for relatively simple type hints. \$\endgroup\$l0b0– l0b02019年07月20日 10:59:53 +00:00Commented Jul 20, 2019 at 10:59