Introduction for future readers
I posted the code below in the hope that it would be a good example to discuss the topic "Alternative to nonlocal for local variables in a function with nested functions in Python".
As it turns out, the answer to that question depends a lot on whether or not the hypothetical function with nested functions would return a value/object.
For the case when it would not, you can find a concise and clear discussion there. For people who don't know about "nonlocal", that discussion also concisely clarifies the topic, and what I generally mean by alternatives to nonlocal.
For the case with a return value/object, the next question is whether or not that return value/object needs to provide a class interface, typically methods to access its contents. If it does, which is likely if the problem at hand is complex enough to justify a function with nested functions, the hypothetical function should not be implemented as a function but as a class, and the accessor methods to the hypothetical return value would simply be methods in that class. I.e. the problem becomes a classical class design problem.
This is the case for the code below, as shown in 200_success' and my answer.
The example itself is a tokenizer for a simple arithmetic expression. For the sake of simplicity, the code includes no error handling.
Please note that the alternatives provided in this question are not good solutions. 200_success' and my answer below provide good solutions, although error handling would need to be added.
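For readers who have not met nonlocal yet, here is a minimal sketch of what the keyword does. It is not part of the tokenizer code, and make_counter is a hypothetical name chosen for illustration: a nested function can read variables of the enclosing function freely, but needs nonlocal to rebind them.

```python
def make_counter():
    count = 0  # local variable of the enclosing function

    def increment():
        # Without this declaration, "count += 1" would try to create a new
        # local variable and fail with UnboundLocalError.
        nonlocal count
        count += 1
        return count

    return increment


counter = make_counter()
print(counter())  # 1
print(counter())  # 2
```

The alternatives discussed in this question are ways of keeping that kind of shared state without writing one nonlocal declaration per variable in every nested function.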
Alternatives
Alternative one: class self
OPERATORS = '+', '-', '*', '/'

def tokenize(expression):
    def state_none(c):
        if c.isdecimal():
            self.token = c
            self.state = state_number
        elif c in OPERATORS:
            self.token = 'operator', c
            self.token_ready = True

    def state_number(c):
        if c.isdecimal():
            self.token += c
        else:
            self.char_consumed = False
            self.token = 'number', self.token
            self.token_ready = True
            self.state = state_none

    def interpret_character(c):
        self.token_ready = False
        self.char_consumed = True
        self.state(c)

    class self:
        token_ready = False
        token = None
        char_consumed = True
        state = state_none

    for c in expression:
        self.char_consumed = False
        while not self.char_consumed:
            interpret_character(c)
            if self.token_ready:
                yield self.token
    if self.state == state_number:
        yield 'number', self.token

def main():
    for x in tokenize('15+ 2 * 378 / 5'):
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')

if __name__ == '__main__':
    main()
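The "class self" trick in alternative one works because assigning to an attribute of an object never rebinds a name in the enclosing scope, so no nonlocal declaration is required. The sketch below isolates that mechanism. It swaps in types.SimpleNamespace from the standard library in place of the ad-hoc class, and count_vowels is a hypothetical function unrelated to the tokenizer.

```python
from types import SimpleNamespace


def count_vowels(text):
    # All shared state lives in attributes of a single object, so the nested
    # function only mutates that object and never rebinds an outer name.
    state = SimpleNamespace(vowels=0)

    def visit(c):
        if c.lower() in 'aeiou':
            state.vowels += 1  # attribute assignment, no nonlocal needed

    for c in text:
        visit(c)
    return state.vowels


print(count_vowels('Alternative'))  # 5
```

This keeps the state explicit, but it is still the same workaround: the nested functions share one mutable namespace, which is what a regular class instance provides anyway.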
Alternative two: callable class
I do not like it because one has to first instantiate the class in order to call the object, but it is a clear inspiration for alternative one.
OPERATORS = '+', '-', '*', '/'

class Tokenizer:
    def __init__(self, expression):
        self.expression = expression
        self.token_ready = False
        self.token = None
        self.char_consumed = True
        self.state = self.state_none

    def state_none(self, c):
        if c.isdecimal():
            self.token = c
            self.state = self.state_number
        elif c in OPERATORS:
            self.token = 'operator', c
            self.token_ready = True

    def state_number(self, c):
        if c.isdecimal():
            self.token += c
        else:
            self.char_consumed = False
            self.token = 'number', self.token
            self.token_ready = True
            self.state = self.state_none

    def interpret_character(self, c):
        self.token_ready = False
        self.char_consumed = True
        self.state(c)

    def __call__(self):
        for c in self.expression:
            self.char_consumed = False
            while not self.char_consumed:
                self.interpret_character(c)
                if self.token_ready:
                    yield self.token
        if self.state == self.state_number:
            yield 'number', self.token

def main():
    for x in Tokenizer('15+ 2 * 378 / 5')():
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')

if __name__ == '__main__':
    main()
2 Answers
Congratulations on recognizing that it is sometimes undesirable to write classes in Python. A rule of thumb is, if a class has two methods, one of which is the constructor, then consider writing a function instead. Such is the case with tokenize().
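As a small illustration of that rule of thumb (Greeter and greet are hypothetical names, not taken from the question), a class whose only methods are the constructor and one other method usually carries no more information than a function call:

```python
class Greeter:
    """Hypothetical class with only a constructor and one method."""

    def __init__(self, name):
        self.name = name

    def greet(self):
        return f'Hello, {self.name}!'


def greet(name):
    """The same behaviour written as a plain function."""
    return f'Hello, {name}!'


print(Greeter('Ada').greet())  # Hello, Ada!
print(greet('Ada'))            # Hello, Ada!
```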
However, your first implementation, which uses self as a hack to mutate non-local variables, feels icky. Your object-oriented implementation is easier to understand, though more awkward to call. So, we just need to fix its call interface.
Next, you should recognize that "Generators are a simple and powerful tool for creating iterators."
What you want is an iterator. What you do not want is to write it as a generator. (Generators often offer an advantage in simplicity because they implicitly retain information in the suspended state of execution. But that doesn't help you if you want to write an explicit state machine.) Instead, you should write it as a class that implements the __next__() and __iter__() methods.
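For reference, the iterator protocol needs nothing more than those two methods. A minimal sketch, using a hypothetical Countdown class rather than the tokenizer:

```python
class Countdown:
    """Hypothetical iterator that counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration  # tells the for loop that we are done
        value = self.current
        self.current -= 1
        return value


print(list(Countdown(3)))  # [3, 2, 1]
```

All state is stored explicitly on the instance, which is exactly what an explicit state machine wants.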
There is a minor issue, that classes are typically named using UpperCase, whereas functions are typically named using lower_case. Here, I would recommend breaking that convention, and naming the class as if it were a function. One of the underappreciated features of Python is that there is no new keyword; the distinction between functions and classes can be blurred (such as in the case of namedtuple).
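A short sketch of that blurring, with hypothetical names Token and squares: namedtuple is a lower_case function that returns a class, and a lower_case class is called exactly like a function.

```python
from collections import namedtuple

# namedtuple is a function whose return value is a new class.
Token = namedtuple('Token', ['kind', 'text'])


class squares:
    """Hypothetical lower_case class that is used like a function."""

    def __init__(self, limit):
        self.n = 0
        self.limit = limit

    def __iter__(self):
        return self

    def __next__(self):
        if self.n >= self.limit:
            raise StopIteration
        self.n += 1
        return self.n ** 2


tok = Token('number', '15')  # instantiating a class...
result = list(squares(3))    # ...reads the same as calling a function
print(tok)     # Token(kind='number', text='15')
print(result)  # [1, 4, 9]
```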
I would change the way your state machine is driven.
Awkwardness arises in the way the final state is reached. Actually, you don't have a formal final state. What you have instead is this hack at the end of your __call__(), which should ideally be in the state_number() method instead:
if self.state == self.state_number:
    yield 'number', self.token
Rather than having the main loop be responsible for iterating through the characters and feeding them to the state handlers, I would have each state handler voluntarily consume a character from the stream.
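The idea of a handler pulling its own input can be sketched in isolation. The function read_digits below is hypothetical and much simpler than a real state handler; it only shows a consumer calling next() on a shared iterator and treating StopIteration as the end of the input.

```python
def read_digits(chars):
    """Pull characters from the iterator until a non-digit or the end."""
    digits = ''
    while True:
        try:
            c = next(chars)      # the handler consumes from the shared stream
        except StopIteration:
            return digits, None  # end of input reached
        if not c.isdecimal():
            return digits, c     # hand the lookahead character back
        digits += c


stream = iter('378/5')
print(read_digits(stream))  # ('378', '/')
print(read_digits(stream))  # ('5', None)
```

Because the iterator is shared, the character that ends a number has already been taken from the stream, so it must be kept somewhere until the next handler runs.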
In the implementation below, I set the state to None to indicate that the final state has been reached. To avoid confusion, I have renamed state_none() to _state_neutral(). Also, instead of token and token_ready, I use partial_token to store digits of a possibly incomplete number. Complete tokens are returned immediately, rather than being temporarily stored.
OPERATORS = ('+', '-', '*', '/')

class tokenize:
    def __init__(self, expression):
        self.expr_iter = iter(expression)
        self.state = tokenize._state_neutral
        self.partial_token = ''

    def _state_neutral(self):
        c = self.partial_token or next(self.expr_iter)
        if c.isdecimal():
            self.state = tokenize._state_number
            self.partial_token = c
        elif c in OPERATORS:
            self.partial_token = ''
            return ('operator', c)
        else:
            # TODO: Error handling for illegal characters
            self.partial_token = ''

    def _state_number(self):
        try:
            c = next(self.expr_iter)
        except StopIteration:
            self.state = None
            return ('number', self.partial_token)
        if c.isdecimal():
            self.partial_token += c
        else:
            n = self.partial_token
            self.state = tokenize._state_neutral
            self.partial_token = c
            return ('number', n)

    def __iter__(self):
        return self

    def __next__(self):
        while self.state:
            token = self.state(self)
            if token:
                return token
        raise StopIteration

def main():
    for token in tokenize('15+ 2 * 378 / 5'):
        print(token)

if __name__ == "__main__":
    main()
To summarize 200_success' answer, I should implement the iterator interface in a class, instead of trying to return a generator.
My code ported to that pattern follows. That version of the code also includes a proper handling of expression termination, which was missing in my initial code.
Good alternative: a class that implements the expected interface
OPERATORS = '+', '-', '*', '/'

class Tokenizer:
    def __init__(self, expression):
        self.token = None
        self.char = None  # character currently being interpreted
        self.char_consumed = True
        self.state = Tokenizer._state_none
        self.expression = iter(expression)

    def __iter__(self):
        return self

    def _state_none(self, c):
        if c.isdecimal():
            self.token = c
            self.state = Tokenizer._state_number
        elif c in OPERATORS:
            return 'operator', c

    def _state_number(self, c):
        if c.isdecimal():
            self.token += c
        else:
            self.char_consumed = False
            self.state = Tokenizer._state_none
            return 'number', self.token

    def _interpret_character(self, c):
        self.char_consumed = True
        return self.state(self, c)

    def __next__(self):
        while True:
            if self.char_consumed:
                # Fetch a new character only once the previous one has been
                # consumed, so that a character left over by _state_number
                # survives across calls; '' is the termination event.
                self.char = next(self.expression, '')
            token = self._interpret_character(self.char)
            if token:
                return token
            if self.char == '':
                raise StopIteration

def main():
    for x in Tokenizer('15+ 2 * 378 / 5'):
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')

if __name__ == '__main__':
    main()
The main difference with 200_success' solution is that my states do not take care of the fetching of events. The code driving the FSM does that instead. This means that my __next__() method is heavier, but my states are lighter. I also do not need a specific terminating state. I made that design choice out of habit (I like to keep my states as free from non-FSM logic as possible), but it has the additional benefit of making the state machine cheaper to scale.
Also, since the solution is a regular class, I see no reason for not using the traditional class naming conventions.