I'm parsing (specifically tokenizing) a file, line by line. I have a method `tokenize` that takes a string (one line of code; it can't take the whole file at once), breaks it into parts, and returns a generator that yields those parts until it reaches the end of the line. This is all given. I implemented some methods that give me all of the file's parts (i.e. feeding each line to `tokenize` and yielding each line's parts). Here's what I did; I feel it's very cumbersome and ugly.
    class tokenizer:
        def __init__(self, read_file):
            self.file = read_file
            self.line = self.file.readline()
            self.gen = tokenize(self.line)
            self.token = Token("", "", "", "")

        def advance(self):
            return self.token

        def hasMoreTokens(self):
            try:
                self.token = self.gen.__next__()
            except StopIteration:
                self.line = self.file.readline()
                if self.line == '':
                    return False
                self.gen = tokenize(self.line)
                self.token = self.gen.__next__()
            return True
    with open(sys.argv[1], 'r') as file:
        tknzr = tokenizer(file)
        while tknzr.hasMoreTokens():
            print(tknzr.advance())
Could you advise me on how to make a more logical and sensible implementation?
1 Answer
Python file objects are already iterators: looping over one with a for loop yields it line by line. So you could simplify your code to:
    with open(sys.argv[1], 'r') as f:
        for line in f:
            for token in tokenize(line):
                print(token)
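If you prefer a single expression, the same flattening can be written with `itertools.chain.from_iterable`. A sketch, using a stand-in `tokenize` that splits on whitespace and an in-memory file:

```python
import io
from itertools import chain

def tokenize(line):
    # stand-in for the real tokenize: yield whitespace-separated parts
    yield from line.split()

# chain.from_iterable lazily flattens the per-line token streams
f = io.StringIO("push constant 7\nadd\n")
tokens = list(chain.from_iterable(tokenize(line) for line in f))
print(tokens)  # ['push', 'constant', '7', 'add']
```

This stays lazy end to end: nothing is read or tokenized until the result is consumed.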
Note that I renamed the `file` variable to `f`, to avoid shadowing the built-in `file`.
If you really need to implement a class, you should implement the iterator protocol, allowing Python to iterate over your object. First, let me define a dummy `tokenize` function for testing purposes:
    def tokenize(line):
        return line.split()
Then, let's define the class. Note that I renamed it in PascalCase, to adhere to Python's official style guide, PEP 8.

It has two important methods. First, the `__iter__` method, which just returns `self`. This tells Python that this class is itself the iterator it can iterate over. It matters when you nest `iter` calls, since it guarantees `iter(iter(tokenizer)) == iter(tokenizer)`.
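That identity is easy to check in a REPL. A small sketch, contrasting a plain iterable (a list) with an iterator:

```python
# A list is an iterable but not an iterator: iter() builds a fresh
# iterator on each call. An iterator returns itself from __iter__.
items = [1, 2, 3]
it = iter(items)

assert iter(items) is not iter(items)  # new iterator per call on the list
assert iter(it) is it                  # iterators are idempotent under iter()

# Calling iter() on an iterator does not reset it; consuming through
# the "nested" iterator advances the original:
assert next(iter(it)) == 1
assert next(it) == 2
```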
The second important method is the `__next__` method, which, just as the name suggests, tells Python how to get the next element from the iterator. It is similar to your implementation, except that I use the iterator interface of the file. The method calls itself a second time whenever it has to move on to a new line. It stops at the end of the file, because then the unguarded `next(self.file_it)` raises `StopIteration`, which the `for` loop catches to stop iterating.
Note that since we call `iter` on the output of the `tokenize` function, it is enough for `tokenize` to return an iterable (this can be a `list`, like here, but it can also be an iterator itself).
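To illustrate why either return type works, here is a sketch with two interchangeable versions: one returning a list, one a generator. `iter` wraps the list in a fresh iterator and passes the generator through unchanged:

```python
def tokenize_list(line):
    return line.split()          # returns a list (an iterable)

def tokenize_gen(line):
    yield from line.split()      # returns a generator (already an iterator)

# iter() accepts both, so the Tokenizer class works with either version
for fn in (tokenize_list, tokenize_gen):
    it = iter(fn("a b c"))
    assert list(it) == ["a", "b", "c"]
```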
    class Tokenizer:
        def __init__(self, f):
            self.file_it = iter(f)
            self.token_it = None

        def __next__(self):
            if self.token_it is None:
                self.token_it = iter(tokenize(next(self.file_it)))
            try:
                return next(self.token_it)
            except StopIteration:
                self.token_it = None
                return next(self)

        def __iter__(self):
            return self
    import sys

    if __name__ == "__main__":
        with open(sys.argv[1], 'r') as f:
            for token in Tokenizer(f):
                print(token)
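You can check the class without a file on disk by feeding it an `io.StringIO`, which iterates line by line just like a real file object. A self-contained sketch reusing the class and the dummy `tokenize` from above:

```python
import io

def tokenize(line):
    return line.split()

class Tokenizer:
    def __init__(self, f):
        self.file_it = iter(f)
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            self.token_it = iter(tokenize(next(self.file_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

# A blank line tokenizes to [], which the recursive call skips over
source = io.StringIO("let x = 1\n\nreturn x\n")
print(list(Tokenizer(source)))  # ['let', 'x', '=', '1', 'return', 'x']
```

Note that blank lines exercise the recursive `return next(self)` branch: the empty token list is exhausted immediately and the next line is fetched.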
I also added an `if __name__ == "__main__":` guard around the code running the tokenizer, to allow importing this class from other scripts.
Normally I would expect the `tokenize` function to be a method of the `Tokenizer`. Either defined directly:
    class Tokenizer:
        ...
        def tokenize(self, line):
            return line.split()
        ...
Or, using the strategy pattern, plugged in at creation time:
    def tokenize(line):
        return line.split()

    class Tokenizer:
        def __init__(self, f, tokenize):
            self.f_it = iter(f)
            self.tokenize = tokenize
            self.token_it = None
        ...
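A runnable sketch of the strategy variant, completing the elided methods along the lines of the earlier class (the `csv_tokenize` strategy is a hypothetical example of mine, not from the answer):

```python
import io

def csv_tokenize(line):
    # hypothetical alternative strategy: comma-separated fields
    return line.strip().split(",")

class Tokenizer:
    def __init__(self, f, tokenize):
        self.f_it = iter(f)
        self.tokenize = tokenize   # tokenization strategy, plugged in here
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            self.token_it = iter(self.tokenize(next(self.f_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

# The iteration machinery is unchanged; only the strategy varies:
assert list(Tokenizer(io.StringIO("1,2\n3\n"), csv_tokenize)) == ["1", "2", "3"]
assert list(Tokenizer(io.StringIO("a b\n"), str.split)) == ["a", "b"]
```

Since the strategy is stored as a plain attribute in `__init__`, even an unbound method like `str.split` can be passed in directly.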
Or, using inheritance:
    class Tokenizer:
        ...
        def tokenize(self, line):
            raise NotImplementedError
        ...

    class SplitTokenizer(Tokenizer):
        def tokenize(self, line):
            return line.split()
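One possible completion of the inheritance variant. Note that `__next__` in the base class must now call `self.tokenize`, so each subclass's override is picked up; the `CommaTokenizer` subclass is a hypothetical second example of mine for contrast:

```python
import io

class Tokenizer:
    def __init__(self, f):
        self.file_it = iter(f)
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            # dispatches to the subclass's tokenize override
            self.token_it = iter(self.tokenize(next(self.file_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

    def tokenize(self, line):
        raise NotImplementedError

class SplitTokenizer(Tokenizer):
    def tokenize(self, line):
        return line.split()

class CommaTokenizer(Tokenizer):   # hypothetical subclass for contrast
    def tokenize(self, line):
        return line.strip().split(",")

assert list(SplitTokenizer(io.StringIO("a b\nc d\n"))) == ["a", "b", "c", "d"]
assert list(CommaTokenizer(io.StringIO("a,b\nc\n"))) == ["a", "b", "c"]
```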
Comments:

- Is there any documentation for `file`? I've been looking for it for quite some time now. – Daniel, Oct 30, 2017 at 12:32
- You can type `help(file)` in an interactive Python session. But I just discovered that `file` does not exist as a built-in anymore in Python 3, so you need to open a Python 2 console for that. – Graipher, Oct 30, 2017 at 12:35