First time around Code Review, so please be gentle (and I am happy for comments on which angles of this post are a good fit and which are less so). I'm not very experienced with software engineering in Python, so I figured this might be a good match:
import pickle

class Lexicon:
    'a lexicon holding each term and providing some lexicon services like term frequency, and term ids'

    def __init__(self):
        self.dictt = {}

    def size(self):
        return len(self.dictt)

    def add(self, token):
        if token in self.dictt:
            self.dictt[token] = self.dictt[token] + 1
        else:
            self.dictt[token] = 1
        return token  # for traceability by the caller

    def remove_low_frequency(self, min_frequency):
        'removes low frequency tokens'
        self.dictt = dict(filter(lambda kv: kv[1] >= min_frequency, self.dictt.items()))

    def pickle(self, output_file_name):
        with open(output_file_name, 'wb') as handle:
            pickle.dump(self.dictt, handle, protocol=pickle.HIGHEST_PROTOCOL)

    @staticmethod
    def unpickle(input_file_name):
        with open(input_file_name, 'rb') as handle:
            new_lexicon = Lexicon()
            new_lexicon.dictt = pickle.load(handle)
            return new_lexicon

    def equals(self, other):
        return self.dictt == other.dictt
- Am I being non-idiomatic above, or just cumbersome anywhere?
- Does my unpickle method actually represent potential for a huge memory leak, as I'm reusing the name but putting a different dictionary into it?
- What might be the Pythonic idiom for a correct implementation in these cases?
- And what is the Pythonic way of making this class safe for concurrent usage (thread-safe)?
3 Answers
First of all, the Lexicon class reinvents Counter from collections. Maybe subclassing can help.
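For illustration, a minimal sketch of what a Counter subclass could look like (the class name CounterLexicon and the usage lines at the bottom are mine, not part of the original code):

from collections import Counter

class CounterLexicon(Counter):
    'a sketch of the Lexicon as a Counter subclass'

    def add(self, token):
        self[token] += 1   # Counter returns 0 for missing keys
        return token       # keeps the original's traceability

    def remove_low_frequency(self, min_frequency):
        for token in [t for t, n in self.items() if n < min_frequency]:
            del self[token]

lexicon = CounterLexicon()
lexicon.add('foo'); lexicon.add('foo'); lexicon.add('bar')
print(len(lexicon), lexicon['foo'])   # 2 2
lexicon.remove_low_frequency(2)
print(dict(lexicon))                  # {'foo': 2}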
Second, I'd separate persistence from the class, because it's completely orthogonal to the Counter functionality (see below about shelve).
Pickle/unpickle does not represent any memory leak. As soon as no one uses the old dict, it's gone.
Then, using a file (with pickled data) is not thread-safe. Depending on how you wish to implement storage for the counter, you may be interested in file locking.
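For instance, on POSIX systems a pickling write could be guarded with an advisory lock via fcntl (a sketch only, assuming all processes cooperate by taking the same lock; it does not work on Windows, and real code would usually lock a separate lock file so the data file is not truncated before the lock is acquired):

import fcntl
import pickle

def pickle_with_lock(dictt, output_file_name):
    with open(output_file_name, 'wb') as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)   # block until we hold an exclusive lock
        try:
            pickle.dump(dictt, handle, protocol=pickle.HIGHEST_PROTOCOL)
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)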
There is also the shelve module in Python, which may help with storing lexicons (though it is not thread-safe).
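As a sketch of the shelve idea (the file name and key are arbitrary choices of mine, and shelve still offers no protection against concurrent writers):

import shelve

# store the lexicon's counts under a single key in a shelf file
with shelve.open('lexicon_shelf') as shelf:
    shelf['counts'] = {'foo': 2, 'bar': 1}

# read them back later, possibly in another run of the program
with shelve.open('lexicon_shelf') as shelf:
    counts = shelf['counts']
    print(counts)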
It is hard to say what your case is now or will be in the future, but you may want to use a special-purpose file-based storage designed for performance and concurrency, for example SQLite.
Minor notes:
- In Python, it is possible to make equality comparisons by defining the __eq__ method (and the mentioned Counter class already does this, so no need for any special provisions). The same goes for size: in Python it's usually len(), and in the case of a Counter there is no need for a separate size method (you may need a __len__ method if Lexicon is not inherited from some dict).
- Thread-safety of add (if you still use it) can be fixed by using the setdefault method with a subsequent += 1 (actually, not sure; you may need a thread lock, see https://stackoverflow.com/questions/23547604/python-counter-atomic-increment). The atomicity of collections.Counter's update needs to be checked and wrapped.
- As for performance, you can check with Python's timeit (see the sketch below).
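For example, a quick (and admittedly crude) timeit comparison of the counting approaches mentioned above could look like this; the token list is made up and the numbers will vary by machine and data:

import timeit

setup = "from collections import Counter, defaultdict; tokens = ['a', 'b', 'a'] * 1000"

# Counter.update over the whole iterable
print(timeit.timeit("c = Counter(); c.update(tokens)", setup=setup, number=1000))
# plain dict with get()
print(timeit.timeit("d = {}\nfor t in tokens: d[t] = d.get(t, 0) + 1",
                    setup=setup, number=1000))
# defaultdict(int)
print(timeit.timeit("d = defaultdict(int)\nfor t in tokens: d[t] += 1",
                    setup=setup, number=1000))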
If you insist on preserving your add method, you can consider a rewrite:

self.dictt.setdefault(token, 0)
self.dictt[token] += 1

(Again, check whether this is faster than self.dictt[token] = self.dictt.get(token, 0) + 1.) Also, this discussion may be relevant: it provides a defaultdict implementation, which looks even simpler than the Counter one.
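If plain thread-safety of add is the immediate concern, the straightforward option is a lock around the increment. A sketch (the _lock attribute is my addition, not part of the original class):

import threading

class Lexicon:
    def __init__(self):
        self.dictt = {}
        self._lock = threading.Lock()   # added for illustration

    def add(self, token):
        with self._lock:                # serialise concurrent increments
            self.dictt[token] = self.dictt.get(token, 0) + 1
        return token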
Note on concurrency: it's a bit fragile to assume multi-threading only. For example, some misconfiguration can turn your application into a multiprocess one. In that case you will get arbitrary results in the file, even if you have both thread locks and file locks. Maybe this concern is not a problem in your setting, but if you care about having terms counted right, you may need to have only a single process controlling the file, with all other processes just sending add requests to that process. Actually, a counter is something which, despite the simplicity of the operation (just add one!), can cause concurrency and efficiency problems with high-velocity data. All kinds of optimizations may be needed to implement a high-performance counter correctly (just one of the first examples Google gives has three components in its solution and uses a relational database and memcached). But elaborating on that is out of scope for this answer.
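A minimal sketch of that single-writer idea, assuming a multiprocessing queue feeding one counting process (the names and the tab-separated output format are made up for illustration; a real system would also need error handling and batching):

from collections import Counter
from multiprocessing import Process, Queue

def counter_process(queue, output_file_name):
    'the only process that owns the counts and writes the file'
    counts = Counter()
    while True:
        token = queue.get()
        if token is None:               # sentinel: stop counting and persist
            break
        counts[token] += 1
    with open(output_file_name, 'w') as handle:
        for token, n in counts.items():
            handle.write(f'{token}\t{n}\n')

if __name__ == '__main__':
    queue = Queue()
    writer = Process(target=counter_process, args=(queue, 'counts.tsv'))
    writer.start()
    for token in ['foo', 'bar', 'foo']: # worker processes would call queue.put(token)
        queue.put(token)
    queue.put(None)
    writer.join()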
- Many thanks! I'm kind of surprised that my add method, too, is not unsafe in terms of concurrency. – matanox, Jul 14, 2018 at 17:25
- Nice to know about Counter from collections :-) Does it do anything terribly more efficiently (performance) than my simple implementation? – matanox, Jul 14, 2018 at 17:26
- You might want to mention that equals() should be __eq__. – NoOneIsHere, Jul 14, 2018 at 17:49
- @NoOneIsHere possibly care to elaborate in a short orthogonal answer?! – matanox, Jul 14, 2018 at 17:59
- Updated my answer with suggestions and a note on possible concurrency issues. – Roman Susi, Jul 14, 2018 at 18:25
A small extra note
In Python, the __<function>__ methods are run automatically (if defined) in certain cases. For example, a.__eq__(b) is run for a == b. These methods are sometimes called implicitly, as Daniel points out: a == b calls the default implementation of a.__eq__(b) if it is not defined explicitly, mimicking a is b. Therefore, your equals() method should be __eq__. Similarly, size() should be __len__.
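Applied to your class, that could look roughly like the following sketch (returning NotImplemented for unrelated types is the usual convention and is my addition, not something from the original code):

class Lexicon:
    def __init__(self):
        self.dictt = {}

    def __eq__(self, other):
        if not isinstance(other, Lexicon):
            return NotImplemented      # let Python fall back to its default handling
        return self.dictt == other.dictt

    def __len__(self):
        return len(self.dictt)

# lexicon_a == lexicon_b and len(lexicon_a) now work as expected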
Your add method can be simplified if you use a collections.defaultdict:
from collections import defaultdict

class Lexicon:
    'a lexicon holding each term and providing some lexicon services like term frequency, and term ids'

    def __init__(self):
        self.dictt = defaultdict(int)

    ...  # other methods unchanged

    def add(self, token):
        self.dictt[token] += 1
This would, however, lose the "traceability" (whatever that means). But you can easily provide a method to check whether a token is contained within the dict:
    def __contains__(self, token):
        return token in self.dictt
This is imho a cleaner approach, since it separates your concerns.
You currently use pickle to store the lexicon in a file. Consider using json.loads/json.dumps instead, which can easily be reused within a possible web application context.
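A possible sketch of that, kept separate from the class as suggested in the first answer (the function names are made up, and I use the file-based json.dump/json.load variants of dumps/loads):

import json

def save_counts(dictt, output_file_name):
    # json.dump writes a plain-text, human-readable file
    with open(output_file_name, 'w', encoding='utf-8') as handle:
        json.dump(dictt, handle)

def load_counts(input_file_name):
    with open(input_file_name, encoding='utf-8') as handle:
        return json.load(handle)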