I recently learned Python using the book by Mark Summerfield, and I now work through a few katas of mine whenever I learn a new language. This helps me get a feel for how the language works, so I wrote a simple lexer for Apache logs.
Example log records are:
```
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
```
and the program I wrote reads all values as tokens and outputs them using `|` as a separator, as in:
```
64.242.88.10|-|-|07/Mar/2004:16:05:49 -0800|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1|401|12846
64.242.88.10|-|-|07/Mar/2004:16:06:51 -0800|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1|200|4523
64.242.88.10|-|-|07/Mar/2004:16:10:02 -0800|GET /mailman/listinfo/hsdivision HTTP/1.1|200|6291
```
The lexer could be put to more general uses, though.
Here is my code and I would love to read your comments and suggestions about how to improve it. I have also raised a few specific questions on certain topics.
```python
import collections
import io
import sys

LEXER_SKIP_WHITESPACE = 0
LEXER_READ_QUOTED_STRING = 1
LEXER_READ_BRACKETED_STRING = 2
LEXER_READ_WORD = 3


class Location:
    def __init__(self, name=None, pos=0, line=1, col=0):
        self.name = name or "<input>"
        self.line = line
        self.col = col
        self.pos = pos

    def update(self, c):
        self.pos += 1
        if c == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1

    def __repr__(self):
        return str.format("Location({}, {}, {}, {})", repr(self.name), repr(self.pos), repr(self.line), repr(self.col))

    def __str__(self):
        return str.format("{}: {}: line {}, column {}", self.name, self.pos, self.line, self.col)


def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            return None
        else:
            location.update(maybechar)
            yield maybechar


def readtoken(inputchannel, location):
    state = LEXER_SKIP_WHITESPACE
    token = ''
    for nextchar in readchar(inputchannel, location):
        if state is LEXER_SKIP_WHITESPACE:
            if nextchar == "\n":
                yield "\n"
                continue
            elif nextchar.isspace():
                continue
            elif nextchar == '"':
                state = LEXER_READ_QUOTED_STRING
                continue
            elif nextchar == '[':
                state = LEXER_READ_BRACKETED_STRING
                continue
            else:
                state = LEXER_READ_WORD
                token += nextchar
                continue
        elif state is LEXER_READ_QUOTED_STRING:
            if nextchar == '"':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_BRACKETED_STRING:
            if nextchar == ']':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_WORD:
            if nextchar == "\n":
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                yield "\n"
                continue
            elif nextchar.isspace():
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        else:
            raise Error("Impossible lexer state.")
    if state is LEXER_SKIP_WHITESPACE:
        return None
    elif state is LEXER_READ_QUOTED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_BRACKETED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_WORD:
        yield token
        return None
    else:
        raise Error("Impossible lexer state.")


class Lexer:
    def __init__(self, inputchannel, _location=None):
        self.location = _location or Location("<input>", 0, 1, 0)
        self.inputchannel = inputchannel
        self.buf = ''
        self.state = LEXER_SKIP_WHITESPACE

    def __iter__(self):
        return readtoken(self.inputchannel, self.location)


if __name__ == "__main__":
    sep = ''
    for token in Lexer(sys.stdin):
        if token == '\n':
            sys.stdout.write(token)
            sep = ''
        else:
            sys.stdout.write(sep)
            sys.stdout.write(token)
            sep = '|'
```
Question 1 – How to write lexers in Python?
I am well aware that there are tools similar to lex and yacc for Python that could have been used here, but that would defeat the purpose of learning how to write such a program in Python. I found it surprisingly hard.
My first misfortune is that Python does not do tail-call elimination, so it is essentially forbidden to write a lexer as a set of mutually recursive functions. Mutually recursive functions are one of my favourite tools for this, because the style clearly singles out each specific state of the lexer (the recursive function it is in) and the transitions from that state to the other states, and makes it easy to write test cases for each transition individually.
Since maybe not everybody is familiar with lexers based on mutually recursive functions, here is the equivalent of the `readtoken` generator in OCaml. The beginning of `read_token` reads like "if you read a double quote, discard it and do `read_quotedstring`", and that function itself is defined later and does what one expects: it aggregates characters in a buffer until the next double quote and blesses the result as a token.
```ocaml
let rec read_token f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') -> Stream.junk s; read_quotedstring f state s
  | Some('[') -> Stream.junk s; read_timestamp f state s
  | Some(' ')
  | Some('\t') -> Stream.junk s; read_token f state s
  | Some(c) -> read_simpleword f state s
  | None -> ax
and read_simpleword f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"')
  | Some('[')
  | Some(' ')
  | Some('\t') ->
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax current_token, buffer) s
  | Some(c) ->
      Buffer.add_char buffer c;
      Stream.junk s;
      read_simpleword f state s
  | None ->
      ax
and read_quotedstring f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') ->
      Stream.junk s;
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax current_token, buffer) s
  | Some(c) ->
      Stream.junk s;
      Buffer.add_char buffer c;
      read_quotedstring f state s
  | None ->
      failwith "End of stream within a quoted string"
and read_timestamp f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some(']') ->
      Stream.junk s;
      let current_token = Buffer.contents buffer in
      Buffer.clear buffer;
      read_token f (f ax (timestamp_to_iso current_token), buffer) s
  | Some(c) ->
      Stream.junk s;
      Buffer.add_char buffer c;
      read_timestamp f state s
  | None ->
      failwith "End of stream within a bracketed string"
```
My Python version defines a few states as symbolic constants, as I would have done in C, and then builds a huge loop keeping track of the transitions. I am not pleased with that implementation, because it does not give me any tool to manage the complexity of the lexer. Working with functions is very unpleasant because of the details of variable scope in Python. So what would be the idiomatic way to break the lexer down into small testable pieces, which would be important if I decided to write a more complicated parser? Could the idea of representing lexer states by objects be interesting?
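To make the question concrete, here is roughly what I have in mind: a simplified sketch handling only quoted strings and plain words, where each state is an object and every transition is a small method that can be tested on its own (the class names are just for illustration):

```python
class State:
    """Each concrete state consumes one character and returns
    (next_state, completed_token_or_None)."""
    def feed(self, char):
        raise NotImplementedError

class SkipWhitespace(State):
    def feed(self, char):
        if char == '"':
            return ReadQuoted(''), None
        if char.isspace():
            return self, None
        return ReadWord(char), None

class ReadQuoted(State):
    def __init__(self, buf):
        self.buf = buf
    def feed(self, char):
        if char == '"':
            return SkipWhitespace(), self.buf  # closing quote emits the token
        return ReadQuoted(self.buf + char), None

class ReadWord(State):
    def __init__(self, buf):
        self.buf = buf
    def feed(self, char):
        if char.isspace():
            return SkipWhitespace(), self.buf  # whitespace ends the word
        return ReadWord(self.buf + char), None

def tokens(chars):
    # End-of-stream handling (flushing a pending word, rejecting an
    # unterminated quote) is omitted for brevity.
    state = SkipWhitespace()
    for c in chars:
        state, token = state.feed(c)
        if token is not None:
            yield token

# list(tokens('GET "a b" 200 ')) -> ['GET', 'a b', '200']
```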
Question 2. Is it right to treat stdin as a character stream?
Obviously I somehow did that, but Python has no real character type and represents characters as length-1 strings. It feels to me that reading my input in chunks into "circular expandable buffers" and making copies of substrings of these chunks to generate my tokens would fit much more nicely into the language. Am I right?
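To illustrate the chunked style I have in mind (just a sketch, not wired into the code above):

```python
def chunks(channel, size=4096):
    # Read fixed-size blocks until the channel reports end of stream.
    while True:
        block = channel.read(size)
        if not block:
            return
        yield block

def characters(channel):
    # A tokenizer can still consume characters one at a time on top of
    # this, but the underlying I/O happens one chunk at a time.
    for block in chunks(channel):
        yield from block
```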
Question 3. What is the all-purpose exception in Python?
This seems like a rather basic question, but I am really failing to find a suitable answer in the documentation. The `Exception` class seems a poor choice, since it is very general and makes error identification and error handling rather complicated.
Question 4. Do generators return None?
Is it good style to `return None` in generators? Would `pass` also do?
2 Answers
In addition to the other answers, I wanted to add my own thoughts on Q1 and Q4.
Question 1
Break up `readtoken` into multiple functions.

Right now, you are using a state machine in a single function to try to read different types of tokens. As mentioned in another answer, recursion is not very "pythonic". Additionally, it is hard to follow the logic in this function, since the state changes on every pass through the loop.
This logic would be easier to understand if you break the different states up into separate functions, as it appears you have done in your OCaml example. You can even use similar high-level function names. We can then change our definition of `readtoken` to route to other helper functions based on the first character we read from the stream:
```python
def readtoken(inputchannel, location):
    for nextchar in readchar(inputchannel, location):
        if nextchar == '\n':
            yield '\n'
        elif nextchar.isspace():
            continue
        elif nextchar == '"':
            yield read_quoted_string(inputchannel, location)
        elif nextchar == '[':
            yield read_bracketed_string(inputchannel, location)
        else:
            yield read_token(nextchar, inputchannel, location)
```
Note how the function immediately becomes easier to read. We are delegating our responsibilities out to other named functions, just like in your OCaml example. The purpose of `readtoken` is now simply that of a router, where we delegate work to other functions based on the first character we read from the stream.

Each of these helper functions now has a single responsibility: read their given token type from the stream, and return the result. Note that these functions should actually `return` a result, and not `yield` it; these functions are not generators, the only generator here is the `readtoken` function.
Helper function implementations:
```python
def read_quoted_string(inputchannel, location):
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == '"':
            return token
        else:
            token += nextchar
    raise Exception("End of character stream in quoted string.")

def read_bracketed_string(inputchannel, location):
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == ']':
            return token
        else:
            token += nextchar
    raise Exception("End of character stream in bracketed string.")

def read_token(token, inputchannel, location):
    for nextchar in readchar(inputchannel, location):
        if nextchar.isspace():
            return token
        else:
            token += nextchar
    return token  # if we reach the end of the stream
```
Take note of a few things with this implementation:

- Each function becomes much more readable. We no longer have to trace through a single long function and keep track of the state it is in.
- We can now see that `read_quoted_string` and `read_bracketed_string` have a lot of common logic. Perhaps we should combine these into a single function `read_string` with a `string_delimiter` as a parameter (see the sketch after this list).
- `read_token` takes an initial token as its first parameter; this is the character that we read from the top-level `readtoken` function. If we don't like this implementation, we could try to solve it by using `peek` at the top level instead.
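That combined helper might look something like this; just a sketch, reusing `readchar` and the same end-of-stream behaviour as the helpers above:

```python
def read_string(string_delimiter, inputchannel, location):
    # Shared body for quoted and bracketed strings: accumulate characters
    # until the closing delimiter appears.
    token = ''
    for nextchar in readchar(inputchannel, location):
        if nextchar == string_delimiter:
            return token
        token += nextchar
    raise Exception("End of character stream in delimited string.")

# The two original helpers then become thin wrappers:
# read_quoted_string(...)    -> read_string('"', inputchannel, location)
# read_bracketed_string(...) -> read_string(']', inputchannel, location)
```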
Don't make `readtoken` a generator function.

In your example, the `Lexer` class has implemented `__iter__`, which treats it as an iterable. This means that your `readtoken` does not need to be a generator function (i.e. you can replace all `yield` statements with `return`), because the `Lexer` class is already a generator that wraps around it.
Question 4
When a generator function exits, it automatically raises a `StopIteration`. `for` loops and `while` loops will automatically handle this `StopIteration` by ending their loops and continuing on.

Therefore, the "normal" way to end a generator is to simply end the function. `return None` does accomplish this, but a more traditional method would be to simply reach the end of the function and return naturally. You can accomplish this by `break`ing out of the `while` loop in your `readchar` function.
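Concretely, `readchar` could end like this; a sketch of the question's function with only the exit path changed:

```python
def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            break  # falling off the end of a generator raises StopIteration for the caller
        location.update(maybechar)
        yield maybechar
```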
Note: I'm assuming that this code works and you seek only 'code quality' improvements. Also, you seem like an experienced programmer, so I will try not to explain trivial things; but please do ask if you think I didn't state my opinion clearly enough.
Code notes:
General
There are no docs on any method or class. While the method names are descriptive, one might still have trouble with, for example, what is being updated in `Location.update`; it might be a name update as well.
In my opinion both `readchar` and `readtoken` should be wrapped into some class - a `Reader` maybe? The advantages of that are (at least) two: you do not clutter the global namespace with functions, and you can store inputchannel, state, and location as instance attributes.
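A rough sketch of that wrapping, reusing `Location` and the state constants from the question (the `Reader` name is only the suggestion above, not an established convention):

```python
class Reader:
    """Bundles the input channel, the lexer state, and the location,
    instead of threading them through free functions."""

    def __init__(self, inputchannel, location=None):
        self.inputchannel = inputchannel
        self.location = location or Location()
        self.state = LEXER_SKIP_WHITESPACE

    def readchar(self):
        # Same logic as the free readchar, but the collaborators are
        # attributes rather than parameters.
        while True:
            maybechar = self.inputchannel.read(1)
            if maybechar == '':
                return
            self.location.update(maybechar)
            yield maybechar
```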
Imports
```python
import collections
import io
import sys
```
I wouldn't import whole packages. Try using the `from X import Y` pattern; it is both more descriptive (the reader can see why you are using those packages) and it saves you from repeating yourself (as you do with `sys.stdout`). However, this is just a matter of preference.
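For instance, a sketch assuming the question's `Lexer` is in scope:

```python
from sys import stdin, stdout

# Call sites no longer repeat the module name:
for token in Lexer(stdin):
    stdout.write(token)
```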
Global (file) variables
I tend to move them into a separate file and import them. It is mainly because of Python's Zen:

> Namespaces are one honking great idea -- let's do more of those!

Moving your constants away from their usage helps in the future if you need to make them configurable; it separates the 'implementation' from the usage.

Also (if you happen to use Python 3.4+), it might be beneficial to use enums instead. In addition to creating a new 'namespace', this also provides a very important feature: your globals will be constant.
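A minimal sketch; the `LexerState` name is mine:

```python
from enum import Enum

class LexerState(Enum):
    SKIP_WHITESPACE = 0
    READ_QUOTED_STRING = 1
    READ_BRACKETED_STRING = 2
    READ_WORD = 3

# Enum members are singletons, so identity checks such as
# `state is LexerState.READ_WORD` keep working, and the members
# cannot be accidentally reassigned elsewhere in the module.
```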
Location's init
I must say that I love the fact that you used `name or "<input>"`; people tend to check for None with `if` too much.
Location's update
The `c` variable is not very descriptive; what you probably meant was 'character'/'char'?
Location's repr and str
I personally don't like it when people use the `str.format` method; it has no advantage over `"Location({}, {}, {}, {})".format` but is longer. However, if you target your classes at Python 3.6+, why not use something even more readable: string interpolation.
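For instance, `Location`'s two methods could become (a sketch, Python 3.6+ only):

```python
def __repr__(self):
    return f"Location({self.name!r}, {self.pos!r}, {self.line!r}, {self.col!r})"

def __str__(self):
    return f"{self.name}: {self.pos}: line {self.line}, column {self.col}"
```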
readchar

~~Instead of returning None I'd raise `StopIteration`. It is because it is the usual thing to do when you want to stop a generator.~~

As @Maarten Fabré pointed out, this is no longer valid. Instead, just use `return` (see Explanation of generators, iterators, and StopIteration).
readtoken
- In my opinion, this method is too long; I'd split it based on state, so that after each `state is ...` branch you'd call a different function.
- Depending on your needs, it might be useful to use `os.linesep` instead of `"\n"`.
- Do not raise a generic Exception; more on that later.
- As previously, do not return None; raise an exception instead.
Lexer's init
- In this case documentation would be lovely, particularly to understand what `inputchannel` is and what API it must comply to.
- In `Location("<input>", 0, 1, 0)` it would be good to name the parameters that you pass, so the reader doesn't need to search for the signature: `Location(name="<input>", pos=0, line=1, col=0)`.
- What does `for token in Lexer(sys.stdin):` mean? Will the reader understand what it means to 'iterate over a lexer'? My preference would be to rearrange your class into something like `for token in lexer.Tokenize(sys.stdin)`.
Questions
Q1
You are right, recursion is generally discouraged in Python. In similar use cases I have seen, people do tend to scan char by char as you did. I don't have much experience in writing lexers, so I am not sure if my advice will be helpful at all, but let's try:
- Small functions are your friend: the more logic you abstract away, the more complexity you can handle. For example, instead of doing `x == '['` try `isLeftBracket(x)`; in the future you might introduce `isBracket`, which will call `isLeftBracket or isRightBracket`. This is just a silly example, but I hope you get the point.
- Instead of recursion, how about using a stack of states? It is a general practice to change recursive code into iterative code.
- You could abstract your class into a state machine (basically wrapping everything, as I mentioned at the beginning); this will help in managing the state and lets you include many events as functions.
- First consume and then validate multiple characters given a state. For example, if you know that when the current state is A and you encounter the character `[`, the next few characters are going to be some specific data, you could do this:
```python
if state == A and current_character == '[':
    consume_dataX()

def consume_dataX():
    a = ''
    current_character = readchar()
    while condition(current_character):
        a += current_character
        current_character = readchar()
    validate(a)
```
A real-life example would be parsing a function (`foo(int x, int y)`):
```python
if state == ParsingFunction and current_character == "(":
    argument_vector = consume_until(')')
```
Q2
It is completely fine to think of characters as length-1 strings. You can think of it like this: for a given string `a`, the operator `[]` is a slicing operator, which means that `a[1]` slices the string and is short for `a[1:2]`. Also, keep in mind that in reality a `char` type is just a `byte` (or something similar); therefore, there is no need for special treatment.
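A quick demonstration of that equivalence:

```python
a = "GET"
assert a[1] == a[1:2] == "E"  # indexing produces a length-1 string
assert type(a[1]) is str      # there is no separate character type
```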
Q3
I cannot stress this enough, in my opinion: DO NOT RAISE GENERIC EXCEPTIONS, unless you have completely no idea what happened. A person using your class will have to do the following:

```python
try:
    ...  # some code
except Exception as ex:
    ...  # handle the error
```

which is inconvenient for many reasons. If you want to know more, see this SO question.
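A sketch of the alternative: define a narrow exception type (the name `LexerError` is only illustrative) and raise that instead of `Exception` or the undefined `Error`:

```python
class LexerError(Exception):
    """Raised when the input cannot be tokenized."""

# Callers can now catch exactly this failure and nothing else:
try:
    tokens = list(Lexer(sys.stdin))
except LexerError as error:
    sys.exit("lexing failed: {}".format(error))
```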
Q4
Already answered above.
Comments:

- Maarten Fabré (Sep 18, 2018 at 8:36): About the `StopIteration`: PEP 479: "This PEP proposes a change to generators: when StopIteration is raised inside a generator, it is replaced with RuntimeError. (More precisely, this happens when the exception is about to bubble out of the generator's stack frame.)"
- MaLiN2223 (Sep 18, 2018 at 10:17): @MaartenFabré thank you for that, I guess we learn something new every day :)