I'm parsing (specifically tokenizing) a file, line by line. I have a method `tokenize` that takes a string (one line of code; it can't take the whole file at once), breaks it into parts, and returns a generator that yields those parts until it reaches the end of the line. This is all given. I implemented some methods that give me all of the file's parts (i.e. feeding each line to `tokenize` and yielding each line's parts). Here's what I did; I feel it's very cumbersome and ugly.
    class tokenizer:
        def __init__(self, read_file):
            self.file = read_file
            self.line = self.file.readline()
            self.gen = tokenize(self.line)
            self.token = Token("", "", "", "")

        def advance(self):
            return self.token

        def hasMoreTokens(self):
            try:
                self.token = self.gen.__next__()
            except StopIteration:
                self.line = self.file.readline()
                if self.line == '':
                    return False
                self.gen = tokenize(self.line)
                self.token = self.gen.__next__()
            return True
    with open(sys.argv[1], 'r') as file:
        tknzr = tokenizer(file)
        while tknzr.hasMoreTokens():
            print(tknzr.advance())
Could you advise me on how to make a more logical and sensible implementation?
1 Answer
Python file objects are already iterators: looping over one with a for loop yields it line by line. So you could simplify your code to:
    with open(sys.argv[1], 'r') as f:
        for line in f:
            for token in tokenize(line):
                print(token)
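If you prefer a single expression, the same flattening can be written with `itertools.chain.from_iterable`. A sketch, using a stand-in `tokenize` that splits on whitespace and an in-memory file:

```python
import io
from itertools import chain

def tokenize(line):
    # stand-in for the real tokenize: yield whitespace-separated parts
    yield from line.split()

# chain.from_iterable lazily flattens the per-line token streams
f = io.StringIO("push constant 7\nadd\n")
tokens = list(chain.from_iterable(tokenize(line) for line in f))
print(tokens)  # ['push', 'constant', '7', 'add']
```

This stays lazy end to end: nothing is read or tokenized until the result is consumed.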
Note that I renamed the `file` variable to `f`, to avoid shadowing the built-in `file`.
If you really need to implement a class, you should implement the iterator protocol, allowing Python to iterate over your object. First, let me define a dummy `tokenize` function for testing purposes:
    def tokenize(line):
        return line.split()
Then, let's define the class. Note that I renamed it in PascalCase, to adhere to Python's official style guide, PEP 8.

It has two important methods. First, the `__iter__` method, which just returns `self`. This tells Python that this class is itself the iterator it can iterate over. It matters when you nest `iter` calls, since it guarantees `iter(iter(tokenizer)) == iter(tokenizer)`.
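That identity is easy to check in a REPL. A small sketch, contrasting a plain iterable (a list) with an iterator:

```python
# A list is an iterable but not an iterator: iter() builds a fresh
# iterator on each call. An iterator returns itself from __iter__.
items = [1, 2, 3]
it = iter(items)

assert iter(items) is not iter(items)  # new iterator per call on the list
assert iter(it) is it                  # iterators are idempotent under iter()

# Calling iter() on an iterator does not reset it; consuming through
# the "nested" iterator advances the original:
assert next(iter(it)) == 1
assert next(it) == 2
```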
The second important method is the `__next__` method, which, just as the name suggests, tells Python how to get the next element from the iterator. It is similar to your implementation, except that I use the iterator interface of the file. The method calls itself a second time whenever it has to move on to a new line. It stops at the end of the file, because then the unguarded `next(self.file_it)` raises `StopIteration`, which the `for` loop catches to stop iterating.
Note that since we call `iter` on the output of the `tokenize` function, it is enough for `tokenize` to return an iterable (this can be a `list`, like here, but it can also be an iterator itself).
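To illustrate why either return type works, here is a sketch with two interchangeable versions: one returning a list, one a generator. `iter` wraps the list in a fresh iterator and passes the generator through unchanged:

```python
def tokenize_list(line):
    return line.split()          # returns a list (an iterable)

def tokenize_gen(line):
    yield from line.split()      # returns a generator (already an iterator)

# iter() accepts both, so the Tokenizer class works with either version
for fn in (tokenize_list, tokenize_gen):
    it = iter(fn("a b c"))
    assert list(it) == ["a", "b", "c"]
```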
    class Tokenizer:
        def __init__(self, f):
            self.file_it = iter(f)
            self.token_it = None

        def __next__(self):
            if self.token_it is None:
                self.token_it = iter(tokenize(next(self.file_it)))
            try:
                return next(self.token_it)
            except StopIteration:
                self.token_it = None
                return next(self)

        def __iter__(self):
            return self
    import sys

    if __name__ == "__main__":
        with open(sys.argv[1], 'r') as f:
            for token in Tokenizer(f):
                print(token)
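You can check the class without a file on disk by feeding it an `io.StringIO`, which iterates line by line just like a real file object. A self-contained sketch reusing the class and the dummy `tokenize` from above:

```python
import io

def tokenize(line):
    return line.split()

class Tokenizer:
    def __init__(self, f):
        self.file_it = iter(f)
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            self.token_it = iter(tokenize(next(self.file_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

# A blank line tokenizes to [], which the recursive call skips over
source = io.StringIO("let x = 1\n\nreturn x\n")
print(list(Tokenizer(source)))  # ['let', 'x', '=', '1', 'return', 'x']
```

Note that blank lines exercise the recursive `return next(self)` branch: the empty token list is exhausted immediately and the next line is fetched.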
I also added an `if __name__ == "__main__":` guard around the code running the tokenizer, to allow importing this class from other scripts.
Normally I would expect the `tokenize` function to be a method of the `Tokenizer`. Either defined directly:
    class Tokenizer:
        ...
        def tokenize(self, line):
            return line.split()
        ...
Or, using the strategy pattern, plugged in at creation time:
    def tokenize(line):
        return line.split()

    class Tokenizer:
        def __init__(self, f, tokenize):
            self.f_it = iter(f)
            self.tokenize = tokenize
            self.token_it = None
        ...
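A runnable sketch of the strategy variant, completing the elided methods along the lines of the earlier class (the `csv_tokenize` strategy is a hypothetical example of mine, not from the answer):

```python
import io

def csv_tokenize(line):
    # hypothetical alternative strategy: comma-separated fields
    return line.strip().split(",")

class Tokenizer:
    def __init__(self, f, tokenize):
        self.f_it = iter(f)
        self.tokenize = tokenize   # tokenization strategy, plugged in here
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            self.token_it = iter(self.tokenize(next(self.f_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

# The iteration machinery is unchanged; only the strategy varies:
assert list(Tokenizer(io.StringIO("1,2\n3\n"), csv_tokenize)) == ["1", "2", "3"]
assert list(Tokenizer(io.StringIO("a b\n"), str.split)) == ["a", "b"]
```

Since the strategy is stored as a plain attribute in `__init__`, even an unbound method like `str.split` can be passed in directly.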
Or, using inheritance:
    class Tokenizer:
        ...
        def tokenize(self, line):
            raise NotImplementedError
        ...

    class SplitTokenizer(Tokenizer):
        def tokenize(self, line):
            return line.split()
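One possible completion of the inheritance variant. Note that `__next__` in the base class must now call `self.tokenize`, so each subclass's override is picked up; the `CommaTokenizer` subclass is a hypothetical second example of mine for contrast:

```python
import io

class Tokenizer:
    def __init__(self, f):
        self.file_it = iter(f)
        self.token_it = None

    def __next__(self):
        if self.token_it is None:
            # dispatches to the subclass's tokenize override
            self.token_it = iter(self.tokenize(next(self.file_it)))
        try:
            return next(self.token_it)
        except StopIteration:
            self.token_it = None
            return next(self)

    def __iter__(self):
        return self

    def tokenize(self, line):
        raise NotImplementedError

class SplitTokenizer(Tokenizer):
    def tokenize(self, line):
        return line.split()

class CommaTokenizer(Tokenizer):   # hypothetical subclass for contrast
    def tokenize(self, line):
        return line.strip().split(",")

assert list(SplitTokenizer(io.StringIO("a b\nc d\n"))) == ["a", "b", "c", "d"]
assert list(CommaTokenizer(io.StringIO("a,b\nc\n"))) == ["a", "b", "c"]
```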
Comments:

- Is there any documentation for `file`? I've been looking for it for quite some time now. – Daniel, Oct 30, 2017 at 12:32
- You can type `help(file)` in an interactive Python session. But I just discovered that `file` does not exist as a built-in anymore in Python 3, so you need to open a Python 2 console for that. – Graipher, Oct 30, 2017 at 12:35