Recently I've been interested in language design and text processing, and a common program needed in both of those fields is a lexer. Since I found myself writing a new, slightly different lexer for each project I was working on, I decided I'd make a simple regex-based lexer instead.
I designed the code to be easy to use and extensible so I could get up and running quickly. Then, after I figured out the other parts, I could come back and change the lexer to something more solid if I needed to.
Actually, I made the lexer into a bit of a package I dubbed "plex", short for "Python lexer". It's pretty trivial, but I figured it would benefit other people trying to bootstrap their own projects.
The documentation and the README in the link above both explain my code pretty thoroughly, so I'll avoid repeating myself and just let the code do the talking:
"""Lighweight regex based lexer.
This module provides a lighweight, regex based lexer. It can
be used for a varity of task and is very extensible.
The entry point of the module is lexer.Lexer(), which creates a new lexer
instance. Token regexes are specfied using decortators defined in the Lexer
class. The lexer also allows you to override its default error method using
a decorator. The functions that are decoratored can be used to take certain actions
when a token is found.
tokens returned by the lexer are Token objects. Each token object has
a value, a type, and a two-tuple position variable, representing the
tokens line and column number.
Whenever the lexer cannot match any rules specfied, a LexerError
is raised.
Here is a minmal example demonstrating the general structure of the
module's parts.
from plex.lexer import Lexer
lexer = Lexer()
lexer.setup('1 + 2')
@lexer.on_match('\d+')
def INTEGER(self, token):
return token
@lexer.on_match('(\+|\-|\*|/)')
def OPERATOR(self, token):
return token
for token in lexer:
print(token)
"""
__author__ = 'Christian Dean <[email protected]>'
__all__ = ['Lexer', 'LexerError', 'Token']

import re

WHITESPACE = r'\s+'

LEXER_ERR_MSG = """
lexing error at line {}, column {}:
{}
{}^
invalid character in source
"""
class PatternAction:
    __slots__ = ('pattern', 'action')

    def __init__(self, pattern, action):
        self.pattern = pattern
        self.action = action

    def __repr__(self):
        return 'PatternAction(pattern={}, action={})'.format(
            self.pattern, self.action.__name__
        )
class Token:
    """Object to hold token data.

    Each Token object has the following public fields:

    value : str
        The textual value of the token.
    type : str
        The type of the token.
    pos : tuple
        A two-tuple variable, where the first element is the
        line number of the token, and the second is the column.
    """

    __slots__ = ('value', 'type', 'pos')

    def __init__(self, value, type, pos):
        self.value = value
        self.type = type
        self.pos = pos

    def __repr__(self):
        return 'Token(value={}, type={}, pos={})'.format(
            self.value, self.type, self.pos
        )
class LexerError(Exception):
    """Exception raised when the lexer encounters an error."""
    pass
class Lexer:
    """Lexer object.

    Public Methods
    --------------
    setup : method
        Feed the lexer an input buffer.
    get_pos : method
        Get the current position of the lexer.
    on_match : method
        The decorator used for specifying token regexes. Takes
        in a single pattern for matching a token; the decorated
        function itself is the action to be "done" when said token
        is found.
    on_error : method
        The decorator used to override the default error method of the
        Lexer class.

    Public Attributes
    -----------------
    buffer : str
        The buffer of text the lexer will lex.
    pos : int
        The current position of the lexer in the source.
    col_no : int
        The current column number the lexer is at.
    """

    def __init__(self):
        self.buffer = ''
        self.pos = 0
        self.col_no = 0
        self._line_start = 0
        self._error_func = None
        self._ignore_ws = True
        self._ws_pattern = re.compile(WHITESPACE)
        self._rules = {}
    def setup(self, buffer, ignore_ws=True):
        """Feed the lexer an input buffer.

        Parameters
        ----------
        buffer : str
            The string for the lexer to tokenize.
        ignore_ws : bool
            Determines whether or not the lexer skips whitespace.
            The default is True.
        """
        self.buffer = buffer
        self.pos = 0
        self.col_no = 0
        self._ignore_ws = ignore_ws
    def _token(self):
        _buffer = self.buffer
        _ignore_ws = self._ignore_ws
        _rules = self._rules
        if _ignore_ws:
            # Skip over any whitespace before trying to match a token.
            match = self._ws_pattern.match(_buffer, self.pos)
            if match:
                if '\n' in match.group(0):
                    self._line_start = match.end()
                self.pos = match.end()
        if self.pos >= len(_buffer):
            return None
        else:
            # Try each token rule in turn at the current position.
            for token_name, pattern in _rules.items():
                match = pattern.pattern.match(_buffer, self.pos)
                if match:
                    token = Token(match.group(0), token_name, self.get_pos())
                    modified_token = pattern.action(self, token)
                    self.pos = match.end()
                    return modified_token
            self._error()
    def get_pos(self):
        """Return the current line and column number of the lexer."""
        line_no = self.buffer.count('\n', 0, self.pos)
        return line_no, self.pos - self._line_start

    def _error(self):
        if self._error_func is None:
            line, column = self.get_pos()
            source_line = self.buffer.split('\n')[line]
            error_msg = LEXER_ERR_MSG.format(
                line, column, source_line, ' ' * column
            )
            raise LexerError(error_msg)
        else:
            self._error_func(self, self.buffer[self.pos])
    def __iter__(self):
        return self

    def __next__(self):
        token = self._token()
        if token is not None:
            return token
        raise StopIteration()

    # Python 2 support
    next = __next__
    def on_match(self, pattern):
        """Decorator for specifying token rules.

        Rules given should be a valid regex. The name
        of the decorated function will be used as the
        token type's name.

        Decorated functions should accept two arguments.
        The first is an instance of the lexer class, and
        the second is the token object created if the token
        pattern is matched.

        Parameters
        ----------
        pattern : str
            The pattern which defines the current
            token rule.

        Example usage (assuming you've already made an instance of Lexer):

            @lexer.on_match(r'\d+')
            def DIGITS(self, token):
                print('found: %s' % token.value)
                return token

        The token rule above will match any whole number. If the token
        pattern does match, the function will be called, the lexer instance
        and the token object will be passed in, and the token will be returned.
        The decorated function is allowed to modify the token object in any
        way, but the function *MUST* return a Token object.
        """
        def decorator(func):
            compiled_pattern = re.compile(pattern)
            self._rules[func.__name__] = PatternAction(
                compiled_pattern, func
            )
            return func
        return decorator
    def on_error(self, func):
        """Decorator for overriding the default error function.

        Parameters
        ----------
        func : function
            The decorated error function. The function is allowed
            to do whatever it sees fit when an error is encountered,
            including skipping the character to ignore unrecognized
            characters.

        Example usage (assuming you've already made an instance of Lexer):

            @lexer.on_error
            def error(self, value):
                raise Exception('My custom error!!')

        The decorated function should accept two arguments. The first is
        a lexer instance, the second is the value which caused the lexer
        to raise an error.
        """
        self._error_func = func
        return func
Here are some specific questions to keep in mind when you're reviewing the code:
- How well would you say I documented my code? If you were completely new and just happened to stumble upon my project, would you be able to understand and use it?
- When looking over my method for using regex to tokenize strings, how well would you say it works? Are there any inefficiencies that could be improved?
- From an object-oriented standpoint, how usable would you say my code is? Would it be awkward to use?
1 Answer
- The decorators tie the decorated function to an instance of Lexer, preventing reuse of the function.
- It is often necessary to try the different patterns in a particular order. The order is now lost as self._rules is a plain dict (the sketch after this list shows one way to keep it).
- Instead of
lexer.setup('1 + 2')
for token in lexer:
I would find it more natural and convenient to use something like
for token in lexer.lex('1 + 2'):
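Something along these lines, purely as a sketch: it keeps the rules in a list so the registration order is preserved and exposes lex() as a generator. The class name, the namedtuple stand-in for your Token class, and the simplified position (a plain offset instead of a line/column pair) are all illustrative, not a drop-in replacement for your code:

import collections
import re

# Stand-in for the module's Token class; pos is simplified to a plain offset.
Token = collections.namedtuple('Token', 'value type pos')

class SketchLexer:
    def __init__(self):
        # A list (rather than a dict) preserves the order the rules
        # were registered in, so they are tried in a predictable order.
        self._rules = []
        self._ws_pattern = re.compile(r'\s+')

    def on_match(self, pattern):
        def decorator(func):
            self._rules.append((func.__name__, re.compile(pattern), func))
            return func
        return decorator

    def lex(self, buffer, ignore_ws=True):
        """Generate tokens from buffer; no separate setup() call needed."""
        pos = 0
        while pos < len(buffer):
            if ignore_ws:
                ws = self._ws_pattern.match(buffer, pos)
                if ws:
                    pos = ws.end()
                    if pos >= len(buffer):
                        return
            for name, regex, action in self._rules:
                match = regex.match(buffer, pos)
                if match:
                    yield action(self, Token(match.group(0), name, pos))
                    pos = match.end()
                    break
            else:
                raise ValueError('no rule matched at position {}'.format(pos))

Usage then looks like the line above:

lexer = SketchLexer()

@lexer.on_match(r'\d+')
def INTEGER(self, token):
    return token

@lexer.on_match(r'(\+|\-|\*|/)')
def OPERATOR(self, token):
    return token

for token in lexer.lex('1 + 2'):
    print(token)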
- Thanks for the review Janne, you made some good points. However, I'm a little confused about your first point. You said when someone uses the decorators on a function, they "tie" that function to a lexer instance, preventing reuse. However, when someone decorates a specific function, I expect that function to only be used on that specific lexer instance. Do you mind elaborating a bit? – Chris, Jun 4, 2017 at 20:15
- @ChristianDean Suppose I want multiple instances of the same lexer -- I would have to def the same functions for each instance. – Janne Karila, Jun 5, 2017 at 7:34
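A hypothetical snippet spelling out the duplication described in the comment above, using only the posted on_match API (the function names are made up):

lexer_a = Lexer()
lexer_b = Lexer()

@lexer_a.on_match(r'\d+')
def INTEGER_A(self, token):
    return token

# The same rule has to be written out again, because on_match
# registered INTEGER_A only on lexer_a:
@lexer_b.on_match(r'\d+')
def INTEGER_B(self, token):
    return token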