Alternative to nonlocal for local variables in a function with nested functions in Python

Question 1

Introduction for future readers

I posted the code below in the hope that it would be a good example to discuss the topic "Alternative to nonlocal for local variables in a function with nested functions in Python".

As it turns out, the answer to that question depends a lot on whether the hypothetical function with nested functions returns a value/object.

For the case when it would not, you can find a concise and clear discussion there. For people who don't know about "nonlocal", that discussion also concisely clarifies the topic, and what I generally mean by alternatives to nonlocal.
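
For readers unfamiliar with it, here is a minimal sketch of what nonlocal does and of the kind of workaround this question calls an "alternative". The names make_counter_nonlocal and make_counter_container are made up for this illustration and do not come from the discussion mentioned above.

```python
def make_counter_nonlocal():
    count = 0

    def increment():
        # Without this declaration, `count += 1` would raise UnboundLocalError,
        # because assignment makes `count` local to increment().
        nonlocal count
        count += 1
        return count

    return increment


def make_counter_container():
    # Alternative: mutate an object instead of rebinding a name.
    state = {'count': 0}

    def increment():
        state['count'] += 1  # `state` is never rebound, so no nonlocal needed
        return state['count']

    return increment


counter_a = make_counter_nonlocal()
counter_b = make_counter_container()
print(counter_a(), counter_a())  # 1 2
print(counter_b(), counter_b())  # 1 2
```

Both versions keep state between calls; the second one only avoids the nonlocal keyword by hiding the state inside a mutable container.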

For the case with a return value/object, the next question is whether that return value/object needs to provide a class interface, typically methods to access its contents. If it does, which is likely if the problem at hand is complex enough to justify a function with nested functions, the hypothetical function should be implemented not as a function but as a class, and the accessor methods of the hypothetical return value would simply be methods of that class. In other words, the problem becomes a classic class design problem.
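
As a rough sketch of that last case, the state that nested functions would otherwise share becomes instance attributes, and the accessors become ordinary methods. The RunningMean class below is a hypothetical example, unrelated to the tokenizer.

```python
class RunningMean:
    """Hypothetical example: state that nested functions would otherwise share."""

    def __init__(self):
        self._total = 0.0
        self._count = 0

    def add(self, value):
        # Plain attribute updates; no nonlocal declaration is involved.
        self._total += value
        self._count += 1

    def mean(self):
        # Accessor method: part of the interface the "return value" needs.
        return self._total / self._count if self._count else 0.0


stats = RunningMean()
for value in (2, 4, 9):
    stats.add(value)
print(stats.mean())  # 5.0
```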

This is the case for the code below, as shown in 200_success's answer and in mine.

The example itself is a tokenizer for a simple arithmetic expression. For the sake of simplicity, the code includes no error handling.

Please note that the alternatives provided in this question are not good solutions. The answers below, by 200_success and by me, provide good solutions, although error handling would still need to be added.

Alternatives

Alternative one: class self

OPERATORS = '+', '-', '*', '/'
def tokenize(expression):
    def state_none(c):
        if c.isdecimal():
            self.token = c
            self.state = state_number
        elif c in OPERATORS:
            self.token = 'operator', c
            self.token_ready = True
    def state_number(c):
        if c.isdecimal():
            self.token += c
        else:
            self.char_consumed = False
            self.token = 'number', self.token
            self.token_ready = True
            self.state = state_none
    def interpret_character(c):
        self.token_ready = False
        self.char_consumed = True
        self.state(c)
    class self:
        token_ready = False
        token = None
        char_consumed = True
        state = state_none
    for c in expression:
        self.char_consumed = False
        while not self.char_consumed:
            interpret_character(c)
            if self.token_ready:
                yield self.token
    if self.state == state_number:
        yield 'number', self.token
def main():
    for x in tokenize('15+ 2 * 378 / 5'):
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')
if __name__ == '__main__':
    main()
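
The trick in alternative one works because assigning to an attribute (self.token = ...) mutates an existing object and never rebinds the name self, so the nested functions need no nonlocal declaration. Below is a minimal sketch of the same idea using types.SimpleNamespace from the standard library instead of a class statement; the function count_vowels is invented for this illustration.

```python
from types import SimpleNamespace


def count_vowels(text):
    # One mutable namespace object replaces several nonlocal variables.
    state = SimpleNamespace(vowels=0, others=0)

    def visit(c):
        # Attribute assignment mutates `state`; the name itself is only read.
        if c.lower() in 'aeiou':
            state.vowels += 1
        else:
            state.others += 1

    for c in text:
        visit(c)
    return state.vowels, state.others


print(count_vowels('tokenizer'))  # (4, 5)
```

This is arguably a little less surprising than a class named self, but it is the same workaround and shares its drawbacks.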

Alternative two: callable class

I do not like it because one has to first instantiate the class in order to call the object, but it is a clear inspiration for alternative one.

OPERATORS = '+', '-', '*', '/'
class Tokenizer:
    def __init__(self, expression):
        self.expression = expression
        self.token_ready = False
        self.token = None
        self.char_consumed = True
        self.state = self.state_none
    def state_none(self, c):
        if c.isdecimal():
            self.token = c
            self.state = self.state_number
        elif c in OPERATORS:
            self.token = 'operator', c
            self.token_ready = True
    def state_number(self, c):
        if c.isdecimal():
            self.token += c
        else:
            self.char_consumed = False
            self.token = 'number', self.token
            self.token_ready = True
            self.state = self.state_none
    def interpret_character(self, c):
        self.token_ready = False
        self.char_consumed = True
        self.state(c)
    def __call__(self):
        for c in self.expression:
            self.char_consumed = False
            while not self.char_consumed:
                self.interpret_character(c)
                if self.token_ready:
                    yield self.token
        if self.state == self.state_number:
            yield 'number', self.token
def main():
    for x in Tokenizer('15+ 2 * 378 / 5')():
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')
if __name__ == '__main__':
    main()

Answer 1 · 200_success · 2017-10-30 22:20:39Z

Congratulations on recognizing that it is sometimes undesirable to write classes in Python. A rule of thumb is, if a class has two methods, one of which is the constructor, then consider writing a function instead. Such is the case with tokenize().

However, your first implementation, which uses self as a hack to mutate non-local variables, feels icky. Your object-oriented implementation is easier to understand, though more awkward to call. So, we just need to fix its call interface.

Next, you should recognize that, to quote the Python tutorial, "Generators are a simple and powerful tool for creating iterators."

What you want is an iterator. What you do not want is to write it as a generator. (Generators often offer an advantage in simplicity because they implicitly retain information in the suspended state of execution. But that doesn't help you if you want to write an explicit state machine.) Instead, you should write it as a class that implements the __next__() and __iter__() methods.
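
To make the distinction concrete, here is a small sketch, not taken from the code under review, of the same sequence written both ways. The names countdown_generator and countdown_iterator are hypothetical.

```python
def countdown_generator(start):
    # The value of `start` lives in the suspended frame between next() calls.
    while start > 0:
        yield start
        start -= 1


class countdown_iterator:
    """Same sequence, with the state stored explicitly on the instance."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__().
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


print(list(countdown_generator(3)))  # [3, 2, 1]
print(list(countdown_iterator(3)))   # [3, 2, 1]
```

The class version is longer, but its state is an ordinary attribute that other methods, such as state handlers, can read and change.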

There is a minor issue, that classes are typically named using UpperCase, whereas functions are typically named using lower_case. Here, I would recommend breaking that convention, and naming the class as if it were a function. One of the underappreciated features of Python is that there is no new keyword; the distinction between functions and classes can be blurred (such as in the case of namedtuple).
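
As a quick illustration of how blurred that line already is in the standard library, the following sketch checks a few familiar lowercase names. It only uses built-ins and collections.namedtuple.

```python
from collections import namedtuple

# Several lowercase built-ins that are called like functions are in fact classes.
for obj in (range, enumerate, zip, map):
    print(obj.__name__, isinstance(obj, type))  # each line ends with True

# namedtuple itself is a function, but what it returns is a class.
Point = namedtuple('Point', ['x', 'y'])
print(isinstance(namedtuple, type), isinstance(Point, type))  # False True
```

Callers write range(10) or Point(1, 2) without caring which kind of callable they are using.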

I would change the way your state machine is driven.

Awkwardness arises in the way the final state is reached. Actually, you don't have a formal final state. What you have instead is this hack at the end of your __call__(), which should ideally be in the state_number() method instead:

    if self.state == self.state_number:
        yield 'number', self.token

Rather than having the main loop be responsible for iterating through the characters and feeding them to the state handlers, I would have each state handler voluntarily consume a character from the stream.

In the implementation below, I set the state to None to indicate that the final state has been reached. To avoid confusion, I have renamed state_none() to _state_neutral(). Also, instead of token and token_ready, I use partial_token to store digits of a possibly incomplete number. Complete tokens are returned immediately, rather than being temporarily stored.

OPERATORS = ('+', '-', '*', '/')
class tokenize:
    def __init__(self, expression):
        self.expr_iter = iter(expression)
        self.state = tokenize._state_neutral
        self.partial_token = ''
    def _state_neutral(self):
        c = self.partial_token or next(self.expr_iter)
        if c.isdecimal():
            self.state = tokenize._state_number
            self.partial_token = c
        elif c in OPERATORS:
            self.partial_token = ''
            return ('operator', c)
        else:
            # TODO: Error handling for illegal characters
            self.partial_token = ''
    def _state_number(self):
        try:
            c = next(self.expr_iter)
        except StopIteration:
            self.state = None
            return ('number', self.partial_token)
        if c.isdecimal():
            self.partial_token += c
        else:
            n = self.partial_token
            self.state = tokenize._state_neutral
            self.partial_token = c
            return ('number', n)
    def __iter__(self):
        return self
    def __next__(self):
        while self.state:
            token = self.state(self)
            if token:
                return token
        raise StopIteration
def main():
    for token in tokenize('15+ 2 * 378 / 5'):
        print(token)
if __name__ == "__main__":
    main()

Answer 2 · nilo · 2017-10-31 19:53:52Z

To summarize 200_success' answer, I should implement the iterator interface in a class, instead of trying to return a generator.

My code ported to that pattern follows. That version of the code also includes a proper handling of expression termination, which was missing in my initial code.

Good alternative: a class that implements the expected interface

OPERATORS = '+', '-', '*', '/'
class Tokenizer:
    def __init__(self, expression):
        self.token = None
        self.char = None  # character currently being interpreted
        self.char_consumed = True
        self.state = Tokenizer._state_none
        self.expression = iter(expression)
    def __iter__(self):
        return self
    def _state_none(self, c):
        if c.isdecimal():
            self.token = c
            self.state = Tokenizer._state_number
        elif c in OPERATORS:
            return 'operator', c
    def _state_number(self, c):
        if c.isdecimal():
            self.token += c
        else:
            # Leave c unconsumed so that _state_none interprets it next.
            self.char_consumed = False
            self.state = Tokenizer._state_none
            return 'number', self.token
    def _interpret_character(self, c):
        self.char_consumed = True
        return self.state(self, c)
    def __next__(self):
        while True:
            if self.char_consumed:
                if self.char == '':  # termination event already handled
                    raise StopIteration
                # Fetch the next character; '' is the termination event.
                self.char = next(self.expression, '')
            # An unconsumed character stays in self.char across calls,
            # so returning a token here does not drop it.
            token = self._interpret_character(self.char)
            if token:
                return token
def main():
    for x in Tokenizer('15+ 2 * 378 / 5'):
        print(x)
    # ('number', '15')
    # ('operator', '+')
    # ('number', '2')
    # ('operator', '*')
    # ('number', '378')
    # ('operator', '/')
    # ('number', '5')
if __name__ == '__main__':
    main()

The main difference from 200_success' solution is that my states do not take care of fetching events. The code driving the FSM does that instead. This means that my __next__() method is heavier, but my states are lighter. I also do not need a specific terminating state. I made that design choice out of habit (I like to keep my states as free from non-FSM logic as possible), but it has the additional benefit of making the state machine cheaper to scale, since a new state only has to describe its transitions and not how characters are fetched.
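
To illustrate the scaling argument, here is a hedged sketch that is not part of either answer's code: the driver owns all event fetching, including the termination event, so each state is only a small transition function. The names run_fsm, in_word and between_words are hypothetical, and the sketch omits the re-interpretation of unconsumed characters that the tokenizer needs.

```python
END = ''  # termination event, the same convention as in the Tokenizer above


def between_words(word, c):
    # Each state returns (next_state, new_word, finished_word_or_None).
    if c.isalpha():
        return in_word, c, None
    return between_words, '', None


def in_word(word, c):
    if c.isalpha():
        return in_word, word + c, None
    # Any other event, including END, closes the current word.
    return between_words, '', word


def run_fsm(text):
    # The driver alone fetches events and appends the termination event.
    state, word = between_words, ''
    for c in list(text) + [END]:
        state, word, finished = state(word, c)
        if finished:
            yield finished


print(list(run_fsm('keep states  light')))  # ['keep', 'states', 'light']
```

Adding a third state here would mean writing one more transition function; the driver loop would stay untouched.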

Also, since the solution is a regular class, I see no reason for not using the traditional class naming conventions.
