First time around Code Review, so please be gentle (and I am happy for comments on which angles of this post are a good fit and which are less so). I'm not very experienced with software engineering in Python, so I figured this might be a good match:
import pickle

class Lexicon:
    'a lexicon holding each term and providing some lexicon services like term frequency, and term ids'

    def __init__(self):
        self.dictt = {}

    def size(self):
        return len(self.dictt)

    def add(self, token):
        if token in self.dictt:
            self.dictt[token] = self.dictt[token] + 1
        else:
            self.dictt[token] = 1
        return token  # for traceability by the caller

    def remove_low_frequency(self, min_frequency):
        'removes low frequency tokens'
        self.dictt = dict(filter(lambda kv: kv[1] >= min_frequency, self.dictt.items()))

    def pickle(self, output_file_name):
        with open(output_file_name, 'wb') as handle:
            pickle.dump(self.dictt, handle, protocol=pickle.HIGHEST_PROTOCOL)

    @staticmethod
    def unpickle(input_file_name):
        with open(input_file_name, 'rb') as handle:
            new_lexicon = Lexicon()
            new_lexicon.dictt = pickle.load(handle)
            return new_lexicon

    def equals(self, other):
        return self.dictt == other.dictt
- Am I being non-idiomatic above, or just cumbersome anywhere?
- Does my unpickle method actually represent potential for a huge memory leak, as I'm reusing the name but putting a different dictionary into it?
- What might be the Pythonic idiom for a correct implementation in these cases?
- And what is the Pythonic way of making this class safe for concurrent usage (thread-safe)?
3 Answers
First of all, the Lexicon class reinvents Counter from collections. Maybe subclassing can help.
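For illustration, a minimal sketch of what a Counter subclass could look like (the class name CounterLexicon and the usage lines at the bottom are mine, not part of the original code):

from collections import Counter

class CounterLexicon(Counter):
    'a sketch of the Lexicon as a Counter subclass'

    def add(self, token):
        self[token] += 1   # Counter returns 0 for missing keys
        return token       # keeps the original's traceability

    def remove_low_frequency(self, min_frequency):
        for token in [t for t, n in self.items() if n < min_frequency]:
            del self[token]

lexicon = CounterLexicon()
lexicon.add('foo'); lexicon.add('foo'); lexicon.add('bar')
print(len(lexicon), lexicon['foo'])   # 2 2
lexicon.remove_low_frequency(2)
print(dict(lexicon))                  # {'foo': 2}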
Second, I'd separate persistence from the class, because it's completely orthogonal to the Counter functionality (see below about shelve).
Pickle/unpickle does not represent any memory leak. As soon as no one uses the old dict, it's gone.
Then, using a file (with pickled data) is not thread-safe. Depending on how you wish to implement storage for the counter, you may be interested in file locking.
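For instance, on POSIX systems a pickling write could be guarded with an advisory lock via fcntl (a sketch only, assuming all processes cooperate by taking the same lock; it does not work on Windows, and real code would usually lock a separate lock file so the data file is not truncated before the lock is acquired):

import fcntl
import pickle

def pickle_with_lock(dictt, output_file_name):
    with open(output_file_name, 'wb') as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)   # block until we hold an exclusive lock
        try:
            pickle.dump(dictt, handle, protocol=pickle.HIGHEST_PROTOCOL)
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)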
There is also the shelve module in Python, which may help with storing lexicons (though it is not thread-safe).
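As a sketch of the shelve idea (the file name and key are arbitrary choices of mine, and shelve still offers no protection against concurrent writers):

import shelve

# store the lexicon's counts under a single key in a shelf file
with shelve.open('lexicon_shelf') as shelf:
    shelf['counts'] = {'foo': 2, 'bar': 1}

# read them back later, possibly in another run of the program
with shelve.open('lexicon_shelf') as shelf:
    counts = shelf['counts']
    print(counts)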
It is hard to say what your case is now or will be in the future, but you may want to use a special-purpose file-based storage designed for performance and concurrency, for example SQLite.
Minor notes:
- In Python, it is possible to make equality comparisons by defining the __eq__ method (and the mentioned Counter class already does this, so no need for any special provisions). The same goes for size: in Python it's usually len(), and in the case of a Counter there is no need for a separate size method (you may need a __len__ method if Lexicon is not inherited from some dict).
- Thread-safety of add (if you still use it) can be fixed by using the setdefault method with a subsequent += 1 (actually, not sure; you may need a thread lock, see https://stackoverflow.com/questions/23547604/python-counter-atomic-increment). The atomicity of collections.Counter's update needs to be checked and wrapped.
- As for performance, you can check with Python's timeit (see the sketch below).
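For example, a quick (and admittedly crude) timeit comparison of the counting approaches mentioned above could look like this; the token list is made up and the numbers will vary by machine and data:

import timeit

setup = "from collections import Counter, defaultdict; tokens = ['a', 'b', 'a'] * 1000"

# Counter.update over the whole iterable
print(timeit.timeit("c = Counter(); c.update(tokens)", setup=setup, number=1000))
# plain dict with get()
print(timeit.timeit("d = {}\nfor t in tokens: d[t] = d.get(t, 0) + 1",
                    setup=setup, number=1000))
# defaultdict(int)
print(timeit.timeit("d = defaultdict(int)\nfor t in tokens: d[t] += 1",
                    setup=setup, number=1000))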
If you insist on preserving your add method, you can consider a rewrite:

self.dictt.setdefault(token, 0)
self.dictt[token] += 1

(Again, check whether this is faster than self.dictt[token] = self.dictt.get(token, 0) + 1.) Also, this discussion may be relevant: it provides a defaultdict implementation, which looks even simpler than the Counter one.
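If plain thread-safety of add is the immediate concern, the straightforward option is a lock around the increment. A sketch (the _lock attribute is my addition, not part of the original class):

import threading

class Lexicon:
    def __init__(self):
        self.dictt = {}
        self._lock = threading.Lock()   # added for illustration

    def add(self, token):
        with self._lock:                # serialise concurrent increments
            self.dictt[token] = self.dictt.get(token, 0) + 1
        return token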
Note on concurrency: it's a bit fragile to assume multi-threading only. For example, some misconfiguration can turn your application into a multiprocess one. In that case you will get arbitrary results in the file, even if you have both thread locks and file locks. Maybe this concern is not a problem in your setting, but if you care about having terms counted right, you may need to have only a single process controlling the file, with all other processes just sending add requests to that process. Actually, a counter is something which, despite the simplicity of the operation (just add one!), can cause concurrency and efficiency problems with high-velocity data. All kinds of optimizations may be needed to implement a high-performance counter correctly (just one of the first examples Google gives has three components in its solution and uses a relational database and memcached). But elaborating on that is out of scope for this answer.
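A minimal sketch of that single-writer idea, assuming a multiprocessing queue feeding one counting process (the names and the tab-separated output format are made up for illustration; a real system would also need error handling and batching):

from collections import Counter
from multiprocessing import Process, Queue

def counter_process(queue, output_file_name):
    'the only process that owns the counts and writes the file'
    counts = Counter()
    while True:
        token = queue.get()
        if token is None:               # sentinel: stop counting and persist
            break
        counts[token] += 1
    with open(output_file_name, 'w') as handle:
        for token, n in counts.items():
            handle.write(f'{token}\t{n}\n')

if __name__ == '__main__':
    queue = Queue()
    writer = Process(target=counter_process, args=(queue, 'counts.tsv'))
    writer.start()
    for token in ['foo', 'bar', 'foo']: # worker processes would call queue.put(token)
        queue.put(token)
    queue.put(None)
    writer.join()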
- Many thanks! I'm kind of surprised that my add method, too, is not unsafe in terms of concurrency. – matanox, Jul 14, 2018 at 17:25
- Nice to know about Counter from collections :-) Does it do anything terribly more efficiently (performance) than my simple implementation? – matanox, Jul 14, 2018 at 17:26
- You might want to mention that equals() should be __eq__. – NoOneIsHere, Jul 14, 2018 at 17:49
- @NoOneIsHere possibly care to elaborate in a short orthogonal answer?! – matanox, Jul 14, 2018 at 17:59
- Updated my answer with suggestions and a note on possible concurrency issues. – Roman Susi, Jul 14, 2018 at 18:25
A small extra note
In Python, the __<function>__ methods are run automatically (if defined) in certain cases. For example, a.__eq__(b) is run for a == b. These methods are sometimes called implicitly, as Daniel points out: a == b calls the default implementation of a.__eq__(b) if it is not defined explicitly, mimicking a is b. Therefore, your equals() method should be __eq__. Similarly, size() should be __len__.
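Applied to your class, that could look roughly like the following sketch (returning NotImplemented for unrelated types is the usual convention and is my addition, not something from the original code):

class Lexicon:
    def __init__(self):
        self.dictt = {}

    def __eq__(self, other):
        if not isinstance(other, Lexicon):
            return NotImplemented      # let Python fall back to its default handling
        return self.dictt == other.dictt

    def __len__(self):
        return len(self.dictt)

# lexicon_a == lexicon_b and len(lexicon_a) now work as expected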
Your add method can be simplified if you use a collections.defaultdict:
from collections import defaultdict

class Lexicon:
    'a lexicon holding each term and providing some lexicon services like term frequency, and term ids'

    def __init__(self):
        self.dictt = defaultdict(int)

    ...  # other methods unchanged

    def add(self, token):
        self.dictt[token] += 1
This would, however, lose the "traceability" (whatever that means). But you can easily provide a method to check whether a token is contained within the dict:
    def __contains__(self, token):
        return token in self.dictt
This is imho a cleaner approach, since it separates your concerns.
You currently use pickle to store the lexicon in a file. Consider using json.loads/json.dumps instead, which can easily be reused within a possible web application context.
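A possible sketch of that, kept separate from the class as suggested in the first answer (the function names are made up, and I use the file-based json.dump/json.load variants of dumps/loads):

import json

def save_counts(dictt, output_file_name):
    # json.dump writes a plain-text, human-readable file
    with open(output_file_name, 'w', encoding='utf-8') as handle:
        json.dump(dictt, handle)

def load_counts(input_file_name):
    with open(input_file_name, encoding='utf-8') as handle:
        return json.load(handle)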