Lexer for Apache logs in Python

I recently learned Python using the book by Mark Summerfield, and I am now doing a few katas of my own, as I do whenever I learn a new language. This helps me get a feel for how the language works, and so I wrote a simple lexer for Apache logs.

Some example log records are

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

and the program I wrote reads all values as tokens and outputs them using | as a separator, as in

64.242.88.10|-|-|07/Mar/2004:16:05:49 -0800|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1|401|12846
64.242.88.10|-|-|07/Mar/2004:16:06:51 -0800|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1|200|4523
64.242.88.10|-|-|07/Mar/2004:16:10:02 -0800|GET /mailman/listinfo/hsdivision HTTP/1.1|200|6291

The lexer can be used for more general input, though.

Here is my code, and I would love to read your comments and suggestions about how to improve it. I have also raised a few specific questions on certain topics. The code:

import collections
import io
import sys

LEXER_SKIP_WHITESPACE=0
LEXER_READ_QUOTED_STRING=1
LEXER_READ_BRACKETED_STRING=2
LEXER_READ_WORD=3

class Location:
    def __init__(self, name=None, pos=0, line=1, col=0):
        self.name = name or "<input>"
        self.line = line
        self.col = col
        self.pos = pos
    def update(self, c):
        self.pos += 1
        if c == "\n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1
    def __repr__(self):
        return str.format("Location({}, {}, {}, {})", repr(self.name), repr(self.pos), repr(self.line), repr(self.col))
    def __str__(self):
        return str.format("{}: {}: line {}, column {}", self.name, self.pos, self.line, self.col)

def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            return None
        else:
            location.update(maybechar)
            yield maybechar

def readtoken(inputchannel, location):
    state = LEXER_SKIP_WHITESPACE
    token = ''
    for nextchar in readchar(inputchannel, location):
        if state is LEXER_SKIP_WHITESPACE:
            if nextchar == "\n":
                yield "\n"
                continue
            elif nextchar.isspace():
                continue
            elif nextchar == '"':
                state = LEXER_READ_QUOTED_STRING
                continue
            elif nextchar == '[':
                state = LEXER_READ_BRACKETED_STRING
                continue
            else:
                state = LEXER_READ_WORD
                token += nextchar
                continue
        elif state is LEXER_READ_QUOTED_STRING:
            if nextchar == '"':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_BRACKETED_STRING:
            if nextchar == ']':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state is LEXER_READ_WORD:
            if nextchar == "\n":
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                yield "\n"
                continue
            elif nextchar.isspace():
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        else:
            raise Error("Impossible lexer state.")
    if state is LEXER_SKIP_WHITESPACE:
        return None
    elif state is LEXER_READ_QUOTED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_BRACKETED_STRING:
        raise Error("End of character stream in quoted string.")
    elif state is LEXER_READ_WORD:
        yield token
        return None
    else:
        raise Error("Impossible lexer state.")

class Lexer:
    def __init__(self, inputchannel, _location=None):
        self.location = _location or Location("<input>", 0, 1, 0)
        self.inputchannel = inputchannel
        self.buf = ''
        self.state = LEXER_SKIP_WHITESPACE
    def __iter__(self):
        return readtoken(self.inputchannel, self.location)

if __name__ == "__main__":
    sep = ''
    for token in Lexer(sys.stdin):
        if token == '\n':
            sys.stdout.write(token)
            sep = ''
        else:
            sys.stdout.write(sep)
            sys.stdout.write(token)
            sep = '|'

Question 1 – How to write lexers in Python?

I am well aware that there are tools similar to lex and yacc for Python that could have been used here, but that would defeat the purpose of learning how to write such a program in Python. I found it surprisingly hard.

My first misfortune is that Python does not do tail-call elimination, so it is essentially forbidden to write a lexer as a set of mutually recursive functions. Mutually recursive functions are one of my favourite tools for this, because each function clearly singles out a specific state of the lexer (the recursive function it is currently in) and the transitions from this state to the other states, and this makes it easy to write test cases for each transition individually.
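To make that concrete, here is a toy pair of mutually recursive functions, unrelated to the lexer, that already overflows the stack on long input, because every call consumes a frame even when it is in tail position:

# Toy illustration: without tail-call elimination, recursion depth is
# bounded by the interpreter's recursion limit (about 1000 by default).
def even(n):
    return True if n == 0 else odd(n - 1)

def odd(n):
    return False if n == 0 else even(n - 1)

even(10)        # fine
even(100000)    # RecursionError: maximum recursion depth exceeded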

Since maybe not everybody is familiar with lexers based on mutually recursive functions, here is the equivalent of the readtoken generator in OCaml. The beginning of read_token reads like "If you read a double quote, discard it and call read_quotedstring"; that function itself is defined later and does what one expects: it aggregates characters in a buffer until the next double quote and blesses the result as a token.

let rec read_token f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') -> Stream.junk s; read_quotedstring f state s
  | Some('[') -> Stream.junk s; read_timestamp f state s
  | Some(' ')
  | Some('\t') -> Stream.junk s; read_token f state s
  | Some(c) -> read_simpleword f state s
  | None -> ax
and read_simpleword f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"')
  | Some('[')
  | Some(' ')
  | Some('\t') ->
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax current_token, buffer) s
  | Some(c) ->
    Buffer.add_char buffer c;
    Stream.junk s;
    read_simpleword f state s
  | None ->
    ax
and read_quotedstring f ((ax,buffer) as state) s =
  match Stream.peek(s) with
  | Some('"') ->
    Stream.junk s;
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax current_token, buffer) s
  | Some(c) ->
    Stream.junk s;
    Buffer.add_char buffer c;
    read_quotedstring f state s
  | None ->
    failwith "End of stream within a quoted string"
and read_timestamp f ((ax,buffer) as state) s =
  match Stream.peek(s) with
  | Some(']') ->
    Stream.junk s;
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax (timestamp_to_iso current_token), buffer) s
  | Some(c) ->
    Stream.junk s;
    Buffer.add_char buffer c;
    read_timestamp f state s
  | None ->
    failwith "End of stream within a bracketed string"

My Python version defines a few states as symbolic constants, as I would have done in C, and then builds a huge loop keeping track of the transitions. I am not pleased with that implementation, because it does not give me any tool to manage the complexity of the lexer. Working with functions is very unpleasant because of the details of variable scoping in Python. So what would be the idiomatic way to break the lexer down into small, testable pieces, which would be important if I decided to write a more complicated parser? Could the idea of representing lexer states by objects be interesting?
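To sketch what I mean by the last idea: each state could be a small object whose feed method consumes one character and returns the next state together with any completed token. The names below (SkipWhitespace, ReadQuoted, ReadWord, tokens) are invented for the sketch, and the bracketed state is left out, so this only shows the shape I have in mind, not a finished lexer:

class SkipWhitespace:
    # Discard whitespace; a double quote switches to ReadQuoted,
    # anything else starts a word.
    def feed(self, c, buf):
        if c == '"':
            return ReadQuoted(), None
        if c.isspace():
            return self, None
        buf.append(c)
        return ReadWord(), None

class ReadQuoted:
    # Accumulate characters until the closing double quote.
    def feed(self, c, buf):
        if c == '"':
            token = ''.join(buf)
            buf.clear()
            return SkipWhitespace(), token
        buf.append(c)
        return self, None

class ReadWord:
    # Accumulate characters until the next whitespace.
    def feed(self, c, buf):
        if c.isspace():
            token = ''.join(buf)
            buf.clear()
            return SkipWhitespace(), token
        buf.append(c)
        return self, None

def tokens(chars):
    state, buf = SkipWhitespace(), []
    for c in chars:
        state, token = state.feed(c, buf)
        if token is not None:
            yield token
    if buf:
        yield ''.join(buf)

# list(tokens('a "b c" d')) == ['a', 'b c', 'd']

Each state class could then be unit-tested on its own, which is what I am after.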

Question 2. Is it right to treat stdin as a character stream?

Obviously I somehow did that, but Python has no real character type and makes characters look like length-1 strings. It feels to me that reading my input in chunks into "circular expandable buffers" and making copies of substrings of these chunks to generate my tokens would fit much more nicely in the language. Am I right?
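Roughly, what I imagine is something like the following sketch, where the stream is read in sizeable blocks but characters are still handed out one at a time; the name chars and the block size are arbitrary:

def chars(stream, blocksize=65536):
    # Read the underlying stream block by block, but still yield one
    # character at a time to the rest of the lexer.
    while True:
        block = stream.read(blocksize)
        if not block:
            return
        yield from block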

Question 3. What is the all-purpose exception in Python?

This seems like a rather basic question, but I am really failing to find a suitable answer in the documentation. Exception seems a poor choice, since it is very general and makes error identification and error handling rather complicated.
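What I would probably do, absent a better suggestion, is define a module-level exception of my own and raise it instead of the non-existent Error my code uses above; the name LexerError is my own invention:

class LexerError(Exception):
    """Raised on malformed input or an impossible lexer state."""

# e.g. raise LexerError("End of character stream in quoted string.")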

Question 4. Do generators return None?

Is it good style to return None in generators? Would pass also do?
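To make the question precise, these are the variants I am hesitating between; as far as I can tell, in Python 3 they all end the generator the same way, with the caller seeing StopIteration:

def g1():
    yield 1
    return None   # explicit return None

def g2():
    yield 1
    return        # bare return

def g3():
    yield 1
    pass          # pass, then fall off the end

def g4():
    yield 1       # simply fall off the end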


All your comments and suggestions are welcome!
