I finally got this lexer working; the full implementation is here. How can the lex function be simplified? The goal is to take a tree of indentation-based text and convert it into parenthesis-based opening/closing text, for easier parsing down the road. But the logic is pretty messy, and I'm not sure how to make it elegant.
const patterns = [
[/^[a-z][a-z0-9\-\/\[\]]*/, 'nest', true],
[/^\(/, 'open-paren'],
[/^ /, 'open-paren', null, true],
[/^\)/, 'close-paren'],
[/^</, 'open-text'],
[/^>/, 'close-text'],
[/^\{/, 'open-term'],
[/^\}/, 'close-term'],
[/^\[/, 'open-nest'],
[/^\]/, 'close-nest'],
[/^[@\/\.][^\s]*/, 'text', true],
[/^, */, 'slot'],
[/^-?\d+\.\d+/, 'line', true],
[/^-?\d+/, 'size', true],
[/^#\w+/, 'code', true],
]
const stringPatterns = [
[/^\{/, 'open-term'],
[/^(?:\\[<>\{\}])+/, 'text', true, null, t => t.replace(/\\/g, '')],
[/^[^\{>]+/, 'text', true],
[/^>/, 'close-text'],
]
function lex(str) {
const tokens = []
// tokens.push({ form: 'nest', text: 'start' })
// tokens.push({ form: 'open-paren' })
let indents = [0]
let nesting = 0
let matched = false
let isString = false
while (str.length) {
if (str[0] == '\n') {
str = str.substr(1)
while (true) {
let match = str.match(/^ *\n/)
if (match) {
str = str.substr(match[0].length)
} else {
break
}
}
if (str.match(/^ *$/)) {
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
while (indents.length) {
tokens.push({
form: 'close-paren'
})
indents.pop()
}
break
}
if (!matched) {
continue
}
let match = str.match(/^ +/)
let newIndent = match ? match[0].length : 0
if (match) str = str.substr(match[0].length)
if (newIndent % 2 != 0) throw new Error('Indentation error')
let oldIndent = indents[indents.length - 1]
if (newIndent === oldIndent) {
while (nesting) {
tokens.push({
form: 'close-paren'
})
nesting--
}
// foo bar
// foo bar
// foo bar
tokens.push({
form: 'slot'
})
} else if (newIndent > oldIndent) {
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
if (newIndent - oldIndent != 2) throw new Error('Indentation error')
// foo bar
// foo bar baz
// foo bar
tokens.push({
form: 'slot'
})
indents.push(newIndent)
nesting = 0
} else {
if (Math.abs(newIndent - oldIndent) % 2 != 0) throw new Error('Indentation error')
while (nesting) {
tokens.push({
form: 'close-paren'
})
nesting--
}
let diff = (oldIndent - newIndent) / 2
while (diff) {
tokens.push({
form: 'close-paren'
})
diff--
indents.pop()
}
tokens.push({
form: 'slot'
})
// indents.push(newIndent)
}
} else {
let p = isString ? stringPatterns : patterns
x:
for (let pattern of p) {
let match = str.match(pattern[0])
if (match) {
matched = true
let text = match[0]
let attrs = {
form: pattern[1]
}
if (pattern[1] === 'open-text') {
isString = true
}
if (pattern[1] === 'close-text' || pattern[1] === 'open-term') {
isString = false
} else if (pattern[1] === 'close-term') {
isString = true
}
if (pattern[3]) {
nesting++
}
if (pattern[2]) attrs.text = pattern[4] ? pattern[4](text) : text
tokens.push(attrs)
str = str.substr(text.length)
break x
}
}
}
}
while (nesting > 1) {
tokens.push({
form: 'close-paren'
})
nesting--
}
while (indents.length) {
tokens.push({
form: 'close-paren'
})
indents.pop()
}
// tokens.push({ form: 'close-paren' })
return tokens
}
Is there anything repetitive that could be abstracted out or simplified? Or some way to make that first branch clean?
Also, it would be good to know if there is a way to do this without doing str = str.substr(...), which seems like a performance problem, but that might be tangential to the main question.
Basically, I'm wondering how this could be streamlined and simplified so it is more along the lines of the complexity of the original-ish JSON parser, which just does basic operations. This lexer, meanwhile, has a lot of extra code to deal with the nesting.
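For the substr concern, one common alternative is to keep a position index and use sticky (y-flag) regexes, so the string is never copied. This is only a rough sketch, not code from the repo; the names and the tiny pattern table here are hypothetical:

```javascript
// Sketch: scan with an index and sticky regexes instead of str.substr().
// A sticky regex (/.../y) matches only at exactly re.lastIndex, so we can
// advance a position counter instead of slicing the string each time.
const patterns = [
  [/[a-z][a-z0-9\-]*/y, 'nest'],
  [/-?\d+\.\d+/y, 'line'],
  [/-?\d+/y, 'size'],
  [/ +/y, 'space'],
]

function scan(str) {
  const tokens = []
  let pos = 0
  while (pos < str.length) {
    let matched = false
    for (const [re, form] of patterns) {
      re.lastIndex = pos              // try to match at the current position
      const m = re.exec(str)
      if (m) {
        tokens.push({ form, text: m[0] })
        pos = re.lastIndex            // advance past the match, no copying
        matched = true
        break
      }
    }
    if (!matched) throw new Error(`Unexpected character at ${pos}`)
  }
  return tokens
}
```

The full pattern table and token forms from the lexer above could be ported over the same way; only the `^` anchors change to `y` flags.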
An example following the readme will take this:
link a/b[c/d][e]/f[g[h[i/j]]]
text <foo {bar <hello {random}>, <world>} baz>
another bar
link foo/bar, foo/baz, a/b/c
link <foo>, 123, #u123, 3.14
And the lexer will output:
[
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'nest', text: 'a/b[c/d][e]/f[g[h[i/j]]]' },
{ form: 'slot' },
{ form: 'nest', text: 'text' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'foo ' },
{ form: 'open-term' },
{ form: 'nest', text: 'bar' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'hello ' },
{ form: 'open-term' },
{ form: 'nest', text: 'random' },
{ form: 'close-term' },
{ form: 'close-text' },
{ form: 'slot' },
{ form: 'open-text' },
{ form: 'text', text: 'world' },
{ form: 'close-text' },
{ form: 'close-term' },
{ form: 'text', text: ' baz' },
{ form: 'close-text' },
{ form: 'close-paren' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'another' },
{ form: 'open-paren' },
{ form: 'nest', text: 'bar' },
{ form: 'close-paren' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'nest', text: 'foo/bar' },
{ form: 'slot' },
{ form: 'nest', text: 'foo/baz' },
{ form: 'slot' },
{ form: 'nest', text: 'a/b/c' },
{ form: 'close-paren' },
{ form: 'slot' },
{ form: 'nest', text: 'link' },
{ form: 'open-paren' },
{ form: 'open-text' },
{ form: 'text', text: 'foo' },
{ form: 'close-text' },
{ form: 'slot' },
{ form: 'size', text: '123' },
{ form: 'slot' },
{ form: 'code', text: '#u123' },
{ form: 'slot' },
{ form: 'line', text: '3.14' },
{ form: 'close-paren' }
]
This will then be fed to a parser for further compilation. But this question is only about how to improve on the lexer.
Grammar
I don't have an actual BNF grammar yet, but here is a brief explanation. "Terms" are words separated by hyphens, like foo, foo-bar, or foo-bar-baz, made of all lowercase letters or numbers (they can't start with a number). Text (strings) look like <hello I'm some text>, with optional templating like <I am a {term}>. Terms can be used like functions too, so you can do <I am a {function-term(arg1)}>. You can include the angle brackets in text with backslash escaping: \< or \>. Then there are integers, floats, and "codes" like #x000000, where # is the prefix, x means hex, and then the code follows. There are also binary (b), octal (o), and unicode (u) codes. Then the main part is the indentation. Terms are nested into a tree based on 2-space indentation. There can be multiple blank lines between nested terms. The indentation serves the same function/purpose as parentheses; they result in the same AST. So this:
i am
a tree
is the same as i(am, a(tree)). Finally, you can have array-like/hash-like syntax like the example above, with link a/b[c/d][e]/f[g[h[i/j]]]. That "nest" is always a leaf node. See the parser in the repo for how that is handled; it is just consumed as one block in the lexer. That is pretty much it. The lexer basically processes/handles the indentation and the terms, giving the parser what can be treated as a tree, so the parser doesn't have to figure out the indentation logic too.
I mainly want to know how to simplify handling the indentation logic. Here's a real-world example of some of the code.
-
I have very little knowledge about lexers, but can it be that this lexer is doing too much semantic interpretation that should belong in the parser? Aside from that, I have a lot of difficulty matching the input to the output. Could you explain the grammar? – RoToRa, Feb 7, 2022 at 15:24
-
The linked readme above explains the expected parser output, but an example of the input text is here. It is basically a tree of terms with leaves being numbers or strings, similar to CoffeeScript in some ways. Parentheses can be used in place of indentation, but indentation is preferred. Even when the parser comes into the picture, the main question is how to simplify handling the indentation stuff. I don't have an actual BNF-syntax grammar yet. – Lance Pollard, Feb 9, 2022 at 15:03
-
I added a description of the grammar, let me know if that helps. – Lance Pollard, Feb 9, 2022 at 15:14
1 Answer
Questions
Let's look at the second question first:
Or somehow make that first branch clean?
It appears that you already started to analyze that lex
function. At 137 lines it is a massive function to read through, let alone update. Think of anyone reading the code, including your future self. It would be wise to break that function up into smaller functions. This will also help with testing.
One option would be to use OOP - even if only one instance of an object is needed that object could hold the state in properties instead of referencing variables in a large function. Then that monstrous lex
function can be broken up into smaller methods that use properties to update the state.
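As a minimal sketch of that shape (all names here are invented for illustration, and the main tokenizing loop is elided):

```javascript
// Illustrative sketch only: lexer state held in properties, with the
// repeated close-paren loops pulled out into small named methods.
class Lexer {
  constructor(str) {
    this.str = str
    this.tokens = []
    this.indents = [0]
    this.nesting = 0
  }

  pushCloseParen() {
    this.tokens.push({ form: 'close-paren' })
  }

  // Close any parens opened by nesting on the current line.
  closeNesting() {
    for (; this.nesting > 0; this.nesting--) {
      this.pushCloseParen()
    }
  }

  // Close one paren per remaining indentation level.
  closeIndents() {
    for (; this.indents.length; this.indents.pop()) {
      this.pushCloseParen()
    }
  }

  lex() {
    // ...the main tokenizing loop would go here, split into more methods...
    this.closeNesting()
    this.closeIndents()
    return this.tokens
  }
}
```

Each method then reads and updates the shared state instead of juggling local variables inside one 137-line function, and each can be unit-tested in isolation.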
Is there anything repetitive which could be abstracted out or simplified?
Yes, there are quite a few repetitive blocks. For example, the following two blocks appear both within the block starting on line 45 (i.e. if (str.match(/^ *$/)) {
) as well as just before the return statement:
while (nesting > 1) {
  tokens.push({
    form: 'close-paren'
  })
  nesting--
}
while (indents.length) {
  tokens.push({
    form: 'close-paren'
  })
  indents.pop()
}
Those could be re-written as for
loops. For example:
for (; nesting > 1; nesting--) {
tokens.push({
form: 'close-paren'
})
}
for (; indents.length; indents.pop()) {
tokens.push({
form: 'close-paren'
})
}
Additionally, the following three lines appear eight times:
tokens.push({
  form: 'close-paren'
})
Making a separate function/method for that seems excessive (since it would lead to additional function calls just to call the push method on an array). It could be declared once before the function:
const closeParenToken = {
form: 'close-paren'
};
And then referenced whenever such a token needs to be pushed into the tokens. If having every pushed token reference the same object is an issue, then various techniques could be used, e.g. spreading into a new object: {...closeParenToken}
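To make the trade-off concrete, here is a tiny sketch of the shared object versus a spread copy:

```javascript
// Sharing one token object vs. spreading into a fresh copy.
const closeParenToken = { form: 'close-paren' }

const tokens = []
tokens.push(closeParenToken)         // every push references the same object
tokens.push({ ...closeParenToken })  // shallow copy, if shared mutation is a concern

console.log(tokens[0] === closeParenToken)  // true
console.log(tokens[1] === closeParenToken)  // false
```

If downstream code never mutates tokens, sharing the one constant is fine and avoids allocating a new object per close-paren.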
Other review points
Most variables are declared using let. Some could instead be declared with const, e.g. indents, the match on line 41, the match on line 62, newIndent, oldIndent, etc. It is wise to default to const unless you know re-assignment is necessary, and only then use let. This will help avoid accidental re-assignment and other bugs.
-
These are great points, thanks, why didn't I think of that lol. What about the ugliness of some of the code, could it be simplified or removed in some way, like the hardcoded checks such as if (pattern[1] === 'open-text') {, or the indentation-related code in each block around indentation? I guess I just don't like how this function feels haha. It doesn't feel generic/standard; it feels like everything is hardcoded. But I will start by abstracting out some of the repetitive stuff, and moving to const where I can. – Lance Pollard, Feb 10, 2022 at 1:20
-
Have you learned about or considered using destructuring assignment? That might allow something other than pattern[1], e.g. if the loop were updated from for (let pattern of p) { to something like for (let [pattern, token, setText, nest, textFunction] of p) { – Feb 11, 2022 at 6:38
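To illustrate that comment with a runnable sketch (the field names such as hasText are invented for the example, not from the repo):

```javascript
// Destructuring the pattern tuples gives the loop body descriptive names
// instead of numeric indices like pattern[1] and pattern[2].
const patterns = [
  [/^\(/, 'open-paren', false],
  [/^[a-z][a-z0-9\-]*/, 'nest', true],
]

function firstToken(str) {
  for (const [regex, form, hasText] of patterns) {
    const match = str.match(regex)
    if (match) {
      const token = { form }
      if (hasText) token.text = match[0]
      return token
    }
  }
  return null
}
```

The same destructuring would work for the original five-element tuples; only the list of names in the for-of header grows.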