Instead of returning None I'd raise StopIterationError . It is because it(削除) Instead of returning None I'd raise StopIterationError . It is because it is the usual thing to do when you want to stop a generator. (削除ここまで)
As @Maarten Fabré pointed out, this is the usual thing to do when you want to stop a generatorno longer valid. Instead, just use return
(see Explanation of generators, iterators, and StopIteration )
Instead of returning None I'd raise StopIterationError . It is because it is the usual thing to do when you want to stop a generator.
(削除) Instead of returning None I'd raise StopIterationError . It is because it is the usual thing to do when you want to stop a generator. (削除ここまで)
As @Maarten Fabré pointed out, this is no longer valid. Instead, just use return
(see Explanation of generators, iterators, and StopIteration )
Note: I'm assuming that this code works and you seek only 'code quality' improvements. Also, you seem like experienced programmer so I will try not to explain trivial things but please, do ask if you think I didn't state my opinion clearly enough.
Code notes:
General
No docs on any method or class. While methods are descriptive one might have trouble with for example, what is being updated in Location.update
, it might be a name update as well.
In my opinion both readchar
and readtoken
should be wrapped into some class - a Reader
maybe? Advantages of that are (at least) two: you do not clutter global namespace with functions and you could store both inputchannel, state and location as a class variables.
Imports
import collections
import io
import sys
I wouldn't import whole packages. Try to using from X import Y
pattern, it is both more descriptive (reader can see why are you using those packages) and also you don't have to repeat yourself (as you do with sys.stdout
). However, this is just a matter of preference.
Global (file) variables
I tend to move them into separate file and import them. It is mainly because of python's zen:
Namespaces are one honking great idea -- let's do more of those!
Moving your constant's away from the usage helps in the future if you would need to make them configurable - it separates 'implementation' from the usage.
Also, (if you happen to use python 3.4+) it might be beneficial to use enums instead. In addition to creating new 'namespace' it also provides a very important feature- your globals will be constant.
Location's init
I must say that I love a fact that you used name or "<input>"
, people tend to check for None with if
too much.
Location's update
The c
variable is not very descriptive, what you probably meant was 'character'/'char' ?
Location's repr and str
I personally don't like when people use str.format
method it has no advantage over "Location({}, {}, {}, {})".format
but is longer. However, if you target your classes for python 3.6+ why not using something even more readable: string interpolation.
Location's readchar
Instead of returning None I'd raise StopIterationError. It is because it is the usual thing to do when you want to stop a generator.
readtoken
- In my opinion, this method is too long, I'd split it based on state, after each
state is ...
you'd call different function. - Depending on your needs it might be useful to use linesp instead of "\n".
- Do not raise generic Exception, more on that later.
- As previously, do not return none - raise an exception instead.
Lexer's init
In this case a documentation would be lovely, particularly to understand what
inputchannel
is and what API must it comply to.In
Location("<input>", 0, 1, 0)
it would be good to name the parameters that you pass so reader wouldn't need to search for signature:Location(name="<input>", pos=0, line=1, col=0)
.What does
for token in Lexer(sys.stdin):
mean? Will the reader understand what does it mean to 'iterate over lexer' ? My preference would be to rearrange your class to something likefor token in lexer.Tokenize(sys.stdin)
.
Questions
Q1
You are right, recursion is generally discouraged in python. In general, what I saw in similar use cases people do tend to scan char by char as you did. I don't have much experience in writing lexers so I am not sure if my advice will be at all helpful but let's try:- Small functions are your friend - the more logic you abstract the more complexity you can put in. For example, instead of doing
x == '['
tryisLeftBracket(x)
, in the future you might introduceisBrachet
which will callisLeftBracet or isRightBracket
. This is just a silly example but I hope you get the point. - Instead of recursion how about using a stack of states? It is a general practice to change recursive code to iterative one.
- You could abstract your class to a state machine (basically wrapping everything as I mentioned at the beginning), this will help in managing the state and include many events as functions.
- First consume and then validate multiple characters given a state. For example, if you know that if current state is A and you encounter character
[
, next few characters are going to be some specific data
you could do this:
if state == A and current_character == '[':
consume_dataX()
def consume_dataX():
a = ''
while condition(current_character):
a += readchar()
current_character = readchar()
validate(a)
Real life example would be parsing a function (foo(int x, int y)
):
if state == ParsingFunction and current_character == "(":
argument_vector = consume_until(')')
Q2
It is completely fine to think of characters as a length 1 strings. You can think of it like this: for a given string `a` operator `[]` is a slicing operator which means that `a[1]` slices the string and is short for `a[1:2]`. Also, keep in mind that in reality `char` type is just a `byte` (or something similar) therefore, there is no need for a special treatment.Q3
I cannot stress this enough, in my opinion: **DO NOT RAISE GENERIC EXCEPTIONS** unless you have completely no idea what happened.Person using your class will have to do the following:
try:
# some code
except Exception as ex:
# handle
which is inconvenient for many reason. If you want to know more see this SO question.