I finally got this lexer working; the full implementation is here. How can the lex function be simplified? The goal is to take a tree of indentation-based text and convert it into parenthesis-based opening/closing text, for easier parsing down the road. But the logic is pretty messy, and I'm not sure how to make it elegant.
const patterns = [
[/^[a-z][a-z0-9\-\/\[\]]*/, 'nest', true],
[/^\(/, 'open-paren'],
[/^ /, 'open-paren', null, true],
[/^\)/, 'close-paren'],
[/^</, 'open-text'],
[/^>/, 'close-text'],
[/^\{/, 'open-term'],
[/^\}/, 'close-term'],
[/^\[/, 'open-nest'],
[/^\]/, 'close-nest'],
[/^[@\/\.][^\s]*/, 'text', true],
[/^, */, 'slot'],
[/^-?\d+\.\d+/, 'line', true],
[/^-?\d+/, 'size', true],
[/^#\w+/, 'code', true],
]
const stringPatterns = [
[/^\{/, 'open-term'],
[/^(?:\\[<>\{\}])+/, 'text', true, null, t => t.replace(/\\/g, '')],
[/^[^\{>]+/, 'text', true],
[/^>/, 'close-text'],
]
function lex(str) {
const tokens = []
// tokens.push({ form: 'nest', text: 'start' })
// tokens.push({ form: 'open-paren' })
let indents = [0]
let nesting = 0
let matched = false
let isString = false
while (str.length) {
if (str[0] == '\n') {
str = str.substr(1)
while (true) {
let match = str.match(/^ *\n/)
if (match) {
str = str.substr(match[0].length)
} else {
break
}
}
if (str.match(/^ *$/)) {
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
while (indents.length) {
tokens.push({
form: 'close-paren'
})
indents.pop()
}
break
}
if (!matched) {
continue
}
let match = str.match(/^ +/)
let newIndent = match ? match[0].length : 0
if (match) str = str.substr(match[0].length)
if (newIndent % 2 != 0) throw new Error('Indentation error')
let oldIndent = indents[indents.length - 1]
if (newIndent === oldIndent) {
while (nesting) {
tokens.push({
form: 'close-paren'
})
nesting--
}
// foo bar
// foo bar
// foo bar
tokens.push({
form: 'slot'
})
} else if (newIndent > oldIndent) {
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
if (newIndent - oldIndent != 2) throw new Error('Indentation error')
// foo bar
// foo bar baz
// foo bar
tokens.push({
form: 'slot'
})
indents.push(newIndent)
nesting = 0
} else {
if (Math.abs(newIndent - oldIndent) % 2 != 0) throw new Error('Indentation error')
while (nesting) {
tokens.push({
form: 'close-paren'
})
nesting--
}
let diff = (oldIndent - newIndent) / 2
while (diff) {
tokens.push({
form: 'close-paren'
})
diff--
indents.pop()
}
tokens.push({
form: 'slot'
})
// indents.push(newIndent)
}
} else {
let p = isString ? stringPatterns : patterns
x:
for (let pattern of p) {
let match = str.match(pattern[0])
if (match) {
matched = true
let text = match[0]
let attrs = {
form: pattern[1]
}
if (pattern[1] === 'open-text') {
isString = true
}
if (pattern[1] === 'close-text' || pattern[1] === 'open-term') {
isString = false
} else if (pattern[1] === 'close-term') {
isString = true
}
if (pattern[3]) {
nesting++
}
if (pattern[2]) attrs.text = pattern[4] ? pattern[4](text) : text
tokens.push(attrs)
str = str.substr(text.length)
break x
}
}
}
}
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
while (indents.length) {
tokens.push({
form: 'close-paren'
})
indents.pop()
}
// tokens.push({ form: 'close-paren' })
return tokens
}
Is there anything repetitive that could be abstracted out or simplified? Or some way to make that first branch clean?
Also, it would be good to know if there is a way to do this without doing str = str.substr(...), which seems like a performance problem, but that might be tangential to the main question.
Basically, I'm wondering how this could be streamlined and simplified so it is more along the lines of the complexity of the original-ish JSON parser, which just does basic operations. This lexer, meanwhile, has a lot of extra code to deal with the nesting.
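For the substr concern, one common alternative is to keep a position index and use sticky (y-flag) regexes, so the string is never copied. This is only a rough sketch, not code from the repo; the names and the tiny pattern table here are hypothetical:

```javascript
// Sketch: scan with an index and sticky regexes instead of str.substr().
// A sticky regex (/.../y) matches only at exactly re.lastIndex, so we can
// advance a position counter instead of slicing the string each time.
const patterns = [
  [/[a-z][a-z0-9\-]*/y, 'nest'],
  [/-?\d+\.\d+/y, 'line'],
  [/-?\d+/y, 'size'],
  [/ +/y, 'space'],
]

function scan(str) {
  const tokens = []
  let pos = 0
  while (pos < str.length) {
    let matched = false
    for (const [re, form] of patterns) {
      re.lastIndex = pos              // try to match at the current position
      const m = re.exec(str)
      if (m) {
        tokens.push({ form, text: m[0] })
        pos = re.lastIndex            // advance past the match, no copying
        matched = true
        break
      }
    }
    if (!matched) throw new Error(`Unexpected character at ${pos}`)
  }
  return tokens
}
```

The full pattern table and token forms from the lexer above could be ported over the same way; only the `^` anchors change to `y` flags.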
An example following the readme will take this:
link a/b[c/d][e]/f[g[h[i/j]]]
text <foo {bar <hello {random}>, <world>} baz>
another bar
link foo/bar, foo/baz, a/b/c
link <foo>, 123, #u123, 3.14
And the lexer will output:
[
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'nest', text: 'a/b[c/d][e]/f[g[h[i/j]]]' },
{ form: 'slot' },
{ form: 'nest', text: 'text' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'foo ' },
{ form: 'open-term' },
{ form: 'nest', text: 'bar' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'hello ' },
{ form: 'open-term' },
{ form: 'nest', text: 'random' },
{ form: 'close-term' },
{ form: 'close-text' },
{ form: 'slot' },
{ form: 'open-text' },
{ form: 'text', text: 'world' },
{ form: 'close-text' },
{ form: 'close-term' },
{ form: 'text', text: ' baz' },
{ form: 'close-text' },
{ form: 'close-paren' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'another' },
{ form: 'open-paren' },
{ form: 'nest', text: 'bar' },
{ form: 'close-paren' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'nest', text: 'foo/bar' },
{ form: 'slot' },
{ form: 'nest', text: 'foo/baz' },
{ form: 'slot' },
{ form: 'nest', text: 'a/b/c' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'foo' },
{ form: 'close-text' },
{ form: 'slot' },
{ form: 'size', text: '123' },
{ form: 'slot' },
{ form: 'code', text: '#u123' },
{ form: 'slot' },
{ form: 'line', text: '3.14' },
{ form: 'close-paren' }
]
This will then be fed to a parser for further compilation. But this question is only about how to improve on the lexer.
Grammar
I don't have an actual BNF grammar yet, but here is a brief explanation. "Terms" are words separated by hyphens, like foo, foo-bar, or foo-bar-baz, made of all lowercase letters or numbers (they can't start with a number). Text (strings) look like <hello I'm some text>, with optional templating like <I am a {term}>. Terms can be used like functions too, so you can do <I am a {function-term(arg1)}>. You can include the angle brackets in text with backslash escaping: \< or \>. Then there are integers, floats, and "codes" like #x000000, where # is the prefix, x means hex, and then the code follows. There are also binary (b), octal (o), and unicode (u) codes. Then the main part is the indentation. Terms are nested into a tree based on 2-space indentation. There can be multiple blank lines between nested terms. The indentation serves the same function/purpose as parentheses; they result in the same AST. So this:
i am
a tree
is the same as i(am, a(tree)). Finally, you can have array-like/hash-like syntax like the example above, with link a/b[c/d][e]/f[g[h[i/j]]]. That "nest" is always a leaf node. See the parser in the repo for how that is handled; it is just consumed as one block in the lexer. That is pretty much it. The lexer basically processes/handles the indentation and the terms, giving the parser what can be treated as a tree, so the parser doesn't have to figure out the indentation logic too.
I mainly want to know how to simplify handling the indentation logic. Here's a real-world example of some of the code.
-
I have very little knowledge about lexers, but can it be that this lexer is doing too much semantic interpretation that should belong in the parser? Aside from that, I have a lot of difficulty matching the input to the output. Could you explain the grammar? – RoToRa, Feb 7, 2022 at 15:24
-
The linked readme above explains the expected parser output, but an example of the input text is here. It is basically a tree of terms with leaves being numbers or strings, similar to CoffeeScript in some ways. Parentheses can be used in place of indentation, but indentation is preferred. Even when the parser comes into the picture, the main question is how to simplify handling the indentation stuff. I don't have an actual BNF-syntax grammar yet. – Lance Pollard, Feb 9, 2022 at 15:03
-
I added a description of the grammar, let me know if that helps. – Lance Pollard, Feb 9, 2022 at 15:14
1 Answer
Questions
Let's look at the second question first:
Or somehow make that first branch clean?
It appears that you already started to analyze that lex
function. At 137 lines it is a massive function to read through, let alone update. Think of anyone reading the code, including your future self. It would be wise to break that function up into smaller functions. This will also help with testing.
One option would be to use OOP - even if only one instance of an object is needed that object could hold the state in properties instead of referencing variables in a large function. Then that monstrous lex
function can be broken up into smaller methods that use properties to update the state.
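As a minimal sketch of that shape (all names here are invented for illustration, and the main tokenizing loop is elided):

```javascript
// Illustrative sketch only: lexer state held in properties, with the
// repeated close-paren loops pulled out into small named methods.
class Lexer {
  constructor(str) {
    this.str = str
    this.tokens = []
    this.indents = [0]
    this.nesting = 0
  }

  pushCloseParen() {
    this.tokens.push({ form: 'close-paren' })
  }

  // Close any parens opened by nesting on the current line.
  closeNesting() {
    for (; this.nesting > 0; this.nesting--) {
      this.pushCloseParen()
    }
  }

  // Close one paren per remaining indentation level.
  closeIndents() {
    for (; this.indents.length; this.indents.pop()) {
      this.pushCloseParen()
    }
  }

  lex() {
    // ...the main tokenizing loop would go here, split into more methods...
    this.closeNesting()
    this.closeIndents()
    return this.tokens
  }
}
```

Each method then reads and updates the shared state instead of juggling local variables inside one 137-line function, and each can be unit-tested in isolation.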
Is there anything repetitive which could be abstracted out or simplified?
Yes, there are quite a few repetitive blocks. For example, the following two blocks appear both within the block starting on line 45 (i.e. if (str.match(/^ *$/)) {
) as well as just before the return statement:
while (nesting > 1) {
  tokens.push({
    form: 'close-paren'
  })
  nesting--
}
while (indents.length) {
  tokens.push({
    form: 'close-paren'
  })
  indents.pop()
}
Those could be re-written as for
loops. For example:
for (; nesting > 1; nesting--) {
tokens.push({
form: 'close-paren'
})
}
for (; indents.length; indents.pop()) {
tokens.push({
form: 'close-paren'
})
}
Additionally, the following three lines appear eight times:
tokens.push({
  form: 'close-paren'
})
Making a separate function/method for that seems excessive (since it would lead to additional function calls just to call the push method on an array). It could be declared once before the function:
const closeParenToken = {
form: 'close-paren'
};
And then referenced whenever such a token needs to be pushed into the tokens. If having every pushed token reference the same object is an issue, then various techniques could be used, e.g. spreading into a new object: {...closeParenToken}
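To make the trade-off concrete, here is a tiny sketch of the shared object versus a spread copy:

```javascript
// Sharing one token object vs. spreading into a fresh copy.
const closeParenToken = { form: 'close-paren' }

const tokens = []
tokens.push(closeParenToken)         // every push references the same object
tokens.push({ ...closeParenToken })  // shallow copy, if shared mutation is a concern

console.log(tokens[0] === closeParenToken)  // true
console.log(tokens[1] === closeParenToken)  // false
```

If downstream code never mutates tokens, sharing the one constant is fine and avoids allocating a new object per close-paren.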
Other review points
Most variables are declared using let. Some could instead be declared with const, e.g. indents, the match on line 41, the match on line 62, newIndent, oldIndent, etc. It is wise to default to const unless you know re-assignment is necessary, and only then use let. This will help avoid accidental re-assignment and other bugs.
-
These are great points, thanks, why didn't I think of that lol. What about the ugliness of some of the code, could it be simplified or removed in some way, like the hardcoded checks such as if (pattern[1] === 'open-text') {, or the indentation-related code in each block around indentation? I guess I just don't like how this function feels haha. It doesn't feel generic/standard; it feels like everything is hardcoded. But I will start by abstracting out some of the repetitive stuff, and moving to const where I can. – Lance Pollard, Feb 10, 2022 at 1:20
-
Have you learned about or considered using destructuring assignment? That might allow something other than pattern[1], e.g. if the loop were updated from for (let pattern of p) { to something like for (let [pattern, token, setText, nest, textFunction] of p) { – Feb 11, 2022 at 6:38
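To illustrate that comment with a runnable sketch (the field names such as hasText are invented for the example, not from the repo):

```javascript
// Destructuring the pattern tuples gives the loop body descriptive names
// instead of numeric indices like pattern[1] and pattern[2].
const patterns = [
  [/^\(/, 'open-paren', false],
  [/^[a-z][a-z0-9\-]*/, 'nest', true],
]

function firstToken(str) {
  for (const [regex, form, hasText] of patterns) {
    const match = str.match(regex)
    if (match) {
      const token = { form }
      if (hasText) token.text = match[0]
      return token
    }
  }
  return null
}
```

The same destructuring would work for the original five-element tuples; only the list of names in the for-of header grows.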