Tuesday, March 17, 2009
The implementation of Factor's regexp library
I've been working on Factor's regular expression library, initially written by Doug Coleman, for the past few weeks. Recently, the library became good enough that I've pushed it to Factor's main repository. The latest Factor binaries have this new library.
The library uses a standard algorithm of converting a regular expression into an NFA, and then converting that into a DFA which can be executed. This is a tradeoff: the generated code will be faster than what you would get from a backtracking search or an NFA interpreter, but it takes exponential time, in the worst case, to generate the DFA. I might revisit this later.
The main features missing now are:
- Possessive and reluctant matching
- Group capture
- Unicode support, in the form of UTS 18 level 1 compliance with some level 2 features
Right now, I'm working on Unicode support. Regexps already use Unicode strings, because all strings in Factor represent a sequence of Unicode code points, but not many Unicode properties are exposed now. I plan on working on this, and implementing more Unicode algorithms and properties, to reach level 1 compliance.
The rest of this article is an overview of how the regexp engine works. It is implemented as a series of passes, where the first pass takes a string as input, and the last pass outputs a word which runs the code of the regexp. In this way, it is rather like any other compiler, where the parse tree, DFA table and NFA table are just intermediate representations.
The parser
The parser is implemented with Chris Double's packrat parsing library. This makes it not very efficient, but the time spent in parsing is much less than the time spent in later processing stages, so the cost isn't very large. Things like /a{2,4}/ are expanded into the equivalent, but simpler, form /aa|aaa|aaaa/.
If I were working only with ASCII, then ranges would be expanded into disjunctions as well, but the alphabet is far too big for that. Instead, something like /[ac-z]/ is represented in the syntax tree as an item, a character class object, representing the information that it matches the character a, or something in the range c-z. For a character class like /[^\dc]/, an object is created which represents a character which is not a digit or c.
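To make the rewrite concrete, here is a minimal Python sketch of the bounded-repetition expansion, operating on strings for illustration only (the real parser rewrites packrat syntax-tree nodes, and expand_repetition is a hypothetical name, not the library's):

    def expand_repetition(term, low, high):
        # /a{2,4}/ becomes /aa|aaa|aaaa/: later passes never see counters
        return "|".join(term * k for k in range(low, high + 1))

    print(expand_repetition("a", 2, 4))  # aa|aaa|aaaa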
Constructing an NFA
From the syntax tree, a nondeterministic finite-state automaton is built. The algorithm is described here, and there is nothing special about the implementation.
Lookahead, lookbehind and anchors (like $ and ^) expand to entries in the syntax tree called tagged epsilons. When these are encountered in building the NFA, an epsilon transition is created which is annotated with this information.
Negation is implemented here. If a negation syntax node is encountered, then the NFA builder constructs an NFA for the enclosed term, disambiguates it, converts it to a DFA, minimizes it, and attaches it back to the larger NFA that is being constructed.
Disambiguation
As I previously described, since the implementation doesn't iterate over every element of the alphabet, there needs to be a procedure to make transitions over characters have disjoint labels. Transitions are labeled by sets of characters, and the NFA produced by the previous pass may have a state whose outgoing transitions are labeled by intersecting sets.
The best way I've thought of doing this is to compute all of the intersections of all of the edge labels, basically forming a Venn diagram. This, unfortunately, takes exponential time and space. But I see no way of avoiding it when compiling a regular expression like /\p{letter}a|[0-9a-z]b|\p{script=latin}c|.../, where a large number of incomparable character classes are used.
I implemented a small optimization for this: numbers (i.e. literal characters) are set aside at the beginning and treated specially, so no work is wasted intersecting them with other classes. The complexity of the algorithm stays exponential, but instead of being exponential in the total number of character classes in the regexp, it is exponential in just the number of non-literal classes.
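Here is a rough Python sketch of that Venn-diagram step, with plain sets standing in for character classes (the actual implementation works on class-algebra objects, but the exponential iteration over subsets of labels is the same idea):

    from itertools import chain, combinations

    def disjoint_regions(labels):
        # labels: a list of sets standing in for character classes.
        # Returns {indices: chars}, where chars are exactly the characters
        # inside every label in indices and outside all of the others.
        indices = range(len(labels))
        regions = {}
        subsets = chain.from_iterable(
            combinations(indices, r) for r in range(1, len(labels) + 1))
        for subset in subsets:
            inside = set.intersection(*(labels[i] for i in subset))
            others = [labels[i] for i in indices if i not in subset]
            region = inside - set().union(*others) if others else inside
            if region:
                regions[frozenset(subset)] = region
        return regions

    print(disjoint_regions([set("abc"), set("bcd")]))
    # {frozenset({0}): {'a'}, frozenset({1}): {'d'}, frozenset({0, 1}): {'b', 'c'}}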
Constructing a DFA
This is also a standard algorithm. My only modification is to support the tagged epsilon transitions created by lookaround and anchors. I described the modification in a previous blog post.
Minimization
Next, the resulting DFA is minimized. I wrote about regexp minimization before. The algorithm had to be modified slightly to allow for the conditional transitions introduced by processing lookaround in the previous step.
Compilation to Factor
At this point, we have a nice minimal DFA with disjoint outward transitions. Translating it into Factor code is actually quite easy. For each state, we make a gensym. The gensym takes as arguments a string and an index. If the index is at the end of the string, the word returns a boolean, indicating whether the current state is an accepting state. If the index is not at the end of the string, the current character is found, and the word figures out which transition to take. A transition is taken by incrementing the index and then making a tail call to another state word.
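As a sketch of the generated code's shape, here is a hypothetical two-state DFA for /ab*/ rendered in Python; the real output is one Factor word per state, and the recursive calls there are proper tail calls:

    def state0(s, i):
        if i == len(s):
            return False              # state 0 is not an accepting state
        if s[i] == 'a':
            return state1(s, i + 1)   # take the transition: advance and jump
        return False                  # no transition on this character

    def state1(s, i):
        if i == len(s):
            return True               # state 1 is an accepting state
        if s[i] == 'b':
            return state1(s, i + 1)
        return False

    print(state0("abbb", 0))  # True
    print(state0("ba", 0))    # False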
The strategy for finding the right transition is somewhat complicated. First, the literal transitions (over constant characters) are partitioned out from the non-literal transitions. The literal transitions are formed into a case statement, where the default case handles non-literal transitions.
Non-literal transitions are all boolean expressions, built with the class algebra described below. They have a number of logic variables (classes of characters). So we can build a truth table over the logic variables, and test each condition exactly once to figure out which transition to take. For example, in the regexp /\p{script=latin}b|\p{lower}c/, the DFA will have three transitions from the start: one over characters which are Latin script and lower-cased, one over characters which are lower-cased but not in Latin script, and one over characters which are in Latin script but not lower-cased. Rather than having the compiled DFA check the composite classes directly (which would duplicate work, since whether a character is lower-cased or in Latin script would be looked up multiple times), the compiler builds nested if statements that figure out the composite class while testing each property only once. This leads directly to the transition.
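A sketch of that nested-if dispatch in Python, with hypothetical is_latin and is_lower predicates standing in for the real character-class tests; each property is checked exactly once, and the pair of answers selects the transition:

    def next_transition(ch, is_latin, is_lower):
        if is_latin(ch):
            if is_lower(ch):
                return "latin-and-lower"   # may be followed by b or c
            return "latin-not-lower"       # may be followed by b
        if is_lower(ch):
            return "lower-not-latin"       # may be followed by c
        return None                        # no transition: the match fails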
Class algebra
There is a system for simplifying the intersections built in disambiguation, as well as character class literals. It is built on simplifying logical expressions formed with and, or, and not. The atoms are true (the whole set of characters), false (the empty set), and all possible character classes.
There are constructors for these three logical operations, and then a few simple tactics are used to reduce them. Reducing the expression to simplest form is equivalent to circuit minimization. A friend told me that this is on the second level of the polynomial hierarchy. So I'll just live with the heuristics.
The not constructor is simple. If it's given true, it outputs false. False to true. If it's given a negation as input, it returns the contents. If it's given an and class, it uses De Morgan's law and negates each entry, returning an or. And vice versa.
The and/or constructors are slightly more complicated. I will describe how the and constructor works; the or constructor can be easily derived using De Morgan's law. The input is a sequence of classes, and we want to get their intersection:
- If the input contains intersection (and) classes, these are flattened into the larger sequence.
- The sequence is sorted into categories: integers (literal characters), negations of integers, simple classes (like the class of digits), negations of those, union (or) classes, and booleans.
- Delete true from the booleans, if it's there, as it cannot affect the outcome. If there is a false among the booleans, then the answer is false.
- If there is more than one integer, the answer is immediately false. If there is exactly one integer, then the answer is that integer if it is contained in all of the other classes, and false otherwise.
- Now we are working with a sequence which has no integer literals and no true or false. If there is a simple class and the negation of that same class, their intersection is false, so the entire expression is false.
- Not-integers whose integer is contained in an existing not-simple class are redundant, and can be removed.
- Finally, the or classes within the and class can be simplified when they share logic variables with other terms of the and class: each shared variable can be substituted with true. For example, and(lowercase, or(lowercase, latin)) simplifies to just lowercase: substituting true for lowercase in the or expression gives or(true, latin), which is true, and and(lowercase, true) is lowercase.
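A toy Python sketch of a few of these tactics, with strings for simple classes, integers for literal characters, and tuples for not/and expressions (the membership tests against real character classes, and the or-substitution step, are elided):

    def make_and(terms):
        flat = []
        for t in terms:                            # flatten nested ands
            if isinstance(t, tuple) and t and t[0] == "and":
                flat.extend(t[1:])
            else:
                flat.append(t)
        flat = [t for t in flat if t is not True]  # true is the identity
        if any(t is False for t in flat):
            return False                           # false annihilates
        for t in flat:                             # x AND not(x) = false
            if ("not", t) in flat:
                return False
        if len({t for t in flat if isinstance(t, int)}) > 1:
            return False                           # two distinct characters
        flat = list(dict.fromkeys(flat))           # drop duplicates
        if not flat:
            return True
        return flat[0] if len(flat) == 1 else ("and",) + tuple(flat)

    print(make_and(["lowercase", ("not", "lowercase")]))  # False
    print(make_and([True, "lowercase", "latin"]))  # ('and', 'lowercase', 'latin')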
Previously, class algebra was not this strong in what reductions it did. This caused problems. For example, nested negations (useful in implementing conjunction) would result in multiple nested disambiguations, which would cause a very fast blowup in the size of the DFA. Now, running disambiguation twice gives the same results as running it once. (In other words, disambiguation is idempotent.) At least I think so.
Conclusion
Implementing regular expressions took a lot more thinking than I expected, but I'm satisfied with the result so far. Unlike traditional regular expression engines, where pathological regexps make matching slow, in this system pathological regexps make compilation slow. That seems more acceptable to me. Factor's extensible syntax allows me to make regexp literals, which compile before the program runs, even though regexps are a library rather than built in.
If I were starting from scratch, I might instead use the algorithm of constructing a DFA directly from a regular expression using regexp derivatives. It's described in this paper [PDF]. I'm not sure how performance in practice compares, but it implements negation and conjunction in a much cleaner way, and basically eliminates the need for minimization. Most importantly, it allows for a variation of disambiguation which avoids exponential behavior in many cases.
In a later blog post, I'll talk about the API that I expose for regexps.
I haven't actually implemented \p{script=foo} yet, but I will very soon.
Thursday, March 5, 2009
Naive lookaround in a DFA
Note: If you don't understand what's below, read this for background on implementing regular expressions with DFAs.
I just implemented lookahead and lookbehind in the Factor regexp module. This lets you write regular expressions like /(?<=foo)bar/, to match "bar" when it's preceded by "foo". I couldn't think of a better way to do it for the general case, so I just did it the naive way. There should be a better way, though, because lookahead and lookbehind don't extend the set of possible strings matched. This algorithm isn't so good because it makes things worse than linear time in the general case.
For each lookahead or lookbehind clause, a little regular expression is compiled. This regular expression annotates an epsilon transition. If I had an NFA interpreter, the interpreter would just run the little regular expression on the input string, starting at the current point, whenever it wants to know if it can use the epsilon transition. I'm using a DFA, so I needed to modify the subset construction to make this work.
What needs to be modified is the procedure to get the epsilon closure of an NFA state. Instead of returning a set of NFA states, this procedure should return a set of states and the conditions for reaching each state. Usually, there will be no conditions, but sometimes there will be a conjunction or disjunction of lookarounds. It could be the conjunction in a regular expression like /(?<=foo)(?=...bar)baz/, and it could be the disjunction in a regular expression like /((?<=foo)|(?=...bar))bar/, and since I'm trying to be fully general, all of this is supported.
The epsilon closure procedure is usually something like this*. Start with your state on a list, and look at all of the epsilon transitions outwards. Add their targets to your list. Now, if anything on your list has more epsilon transitions outwards, add those targets too. Keep going until there's nothing to add.
With the modification: Start with your state on a list, with the associated information that there are no conditions that have to be met for it. Then, look at all of the outward epsilon transitions from all the states on your list. For each new transition that you're considering, the requirement to get from the original state to the target of the transition is to meet the requirement of the source, plus the requirement of the transition. If you already had a way to get to the target of the transition, then now there are two conditions, and either one can be used. Keep repeating this until examining epsilon transitions doesn't change anything.
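A minimal Python sketch of the modified closure, under assumed data structures: eps maps a state to a list of (target, condition) epsilon edges, where condition is None for a plain edge or an opaque lookaround tag; the result maps each reachable state to the set of alternative condition-sets under which it can be reached (a disjunction of conjunctions):

    def epsilon_closure(state, eps):
        closure = {state: {frozenset()}}   # reachable under no conditions
        changed = True
        while changed:                     # iterate to a fixpoint
            changed = False
            for src, cond_sets in list(closure.items()):
                for target, cond in eps.get(src, ()):
                    step = frozenset() if cond is None else frozenset([cond])
                    new = {cs | step for cs in cond_sets}
                    known = closure.setdefault(target, set())
                    if not new <= known:
                        known |= new       # another way to reach target
                        changed = True
        return closure

    # A conditional edge 0 -> 1 tagged "(?<=foo)", then a plain edge 1 -> 2:
    print(epsilon_closure(0, {0: [(1, "(?<=foo)")], 1: [(2, None)]}))
    # {0: {frozenset()}, 1: {frozenset({'(?<=foo)'})}, 2: {frozenset({'(?<=foo)'})}}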
Now, a little post-processing can turn the result of this into a bunch of nested conditionals, which can quickly tell you what states you can get to given which conditions are met. In the usual case, where there are no conditions, this tree is just a single leaf, a list of states. From here, transitions in the DFA go not to sets of states but to trees of conditions and states. The start state is also one of these trees.
The DFA interpreter** needs a little bit of extra logic to test the conditions and traverse the tree. The interpreter gives the condition the input string and index, and the condition can do whatever it wants with that information. I've implemented anchors (e.g. $ and ^) this way. They could be lookaround, but it's easier to just implement them directly as a predicate on the string and index.
*No, I don't implement it this way, because it would be a lot of repeated work, but this expresses the idea.
** Actually, everything compiles down to Factor code, which is compiled with the optimizing compiler if you want.
Monday, February 23, 2009
Regular languages with large alphabets
In implementing regular expressions, the formal theory of regular languages can be useful, especially when compiling regexps to DFAs. But the standard definition of a regular language is a little inconvenient when working with Unicode. In a standard DFA, a transition is over a single character. But for the range
/[0円-\u0f423f]/, I don't really want to compile a DFA with a million transition edges!
A slightly modified formalism is useful here, where transitions in NFAs and DFAs happen over sets of letters, rather than individual letters. Then, things like ranges and character classes (e.g. \p{Lower}) are represented as sets which annotate transition edges.
It's perfectly straightforward to translate such a regular expression into an NFA with the standard construction, and the typical algorithm for executing an NFA by running all possible states in parallel works. For DFAs, though, there's a small complication: we have to remove ambiguity about where to go next.
Say you have the regexp /\p{InLatin}b|\p{Lower}c/. This matches strings like "ab", "ac", "Ab", and "πc", but not "Ac" or "πb". The simple textbook algorithm for regular expressions would have me expand /\p{InLatin}/ out to /a|A|b|B|.../, and expand /\p{Lower}/ to /a|π|.../. This strategy would work, but the size of the resulting NFA and DFA would be gigantic.
What we actually want to do is change the two outward transitions from the start state--transitions over the sets
InLatin and Lower--to transitions over InLatin ∩ Lower, InLatin - Lower and Lower - InLatin. Since these are disjoint, there's no ambiguity about which one to take. In general, for a state with n outward transitions, you have to look at 2^n possibilities, since for each subset of the transitions, you have to make a transition for characters which are in each of those transition groups, but not in any of the others.
Implemented naively, this would make the size of DFAs blow up. For the regexp
/ab|cd/, you'd have a transition on characters that are a but not c, characters that are c but not a, and characters that are both c and a. Fortunately, it's simple to work out a system which recognizes that a - c = a, c - a = c and c ∩ a = ∅. With this in place, the resulting DFA (which didn't have any ambiguity in the first place) is just like what it would be without the system, but the DFA for /ab|\p{Lower}c/ has two transitions from the start state: one over a, and one over lower-cased letters that aren't a.
I've implemented all of this in the Factor regexp library. If you want to see the code, it's in the main repository, in the regexp branch.
PS. When you're considering transitions over sets, it's possible to consider "regular" languages over an infinite alphabet. It might be convenient to think of Unicode as infinite, since it's so large. But it's easy to prove that for any such "regular" language, there is a string homomorphism to a finite-alphabet regular language where a string is in the original language if and only if its homomorphic image is in the smaller regular language. So, in this way, it's a mathematically boring formalism to study. Other people have studied regular language formalisms with infinite alphabets that actually do have more power--they have the power to compare characters for equality in certain contexts. But that's completely different.
Wednesday, February 18, 2009
DFA minimization
I've been working on the regexp vocabulary by Doug Coleman, cleaning it up and adding new features with the goal of making it usable for the xmode syntax highlighter. This means compatibility with most of Java's regexp syntax.
The implementation strategy Doug and I are using is standard textbook stuff, the way Lex works: from the regular expression, a nondeterministic finite automaton (NFA) is constructed, and this is converted to a deterministic finite automaton (DFA). This can be used to efficiently match the string in linear time. Russ Cox wrote a good introduction to all of this.
There's a limitation: backreferences (like /(.*)1円/) are not supported, since they're incompatible with the NFA/DFA model. But there's no good way to implement backreferences anyway: parsing with them is NP-complete. Perl uses a backtracking model where backreferences are easy to implement, but in certain cases the backtracking gets out of hand and performance is worse than linear.
Today, I worked out the code to minimize DFAs. The DFA minimization algorithm is really nice, so I thought I would share it with you. The implementation is just 65 lines of Factor code, which is in the Factor pastebin.
The issue is that sometimes, naively constructing a DFA gives you duplicate states which are really the same thing. For example, if you have two final states which each have no outward transitions, they can be consolidated. If you have the regular expression /ac|bc/, then there's a DFA for this in just 3 states. The naive way, however, would give you 5 states.
What we want is to partition the states into sets of states that are all the same, and then use only one of these. In mathematical language, we want to create an equivalence relation and quotient out by it. Here's how we figure out which states are the same:
- Assume that all states are the same*.
- Separate the final states from the non-final states.
- Repeat the following until it doesn't make a change in the partitioning:
- Separate any two states which have a transition with the same label to two states which are separated.
Once this doesn't change anything, the states have been divided into the sets that we want.
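A compact Python sketch of that fixpoint, assuming the DFA is given as a set of states, a set of final states, and a delta dict from (state, label) to state; it uses the plain textbook initial split rather than the outward-transition variant described in the footnote:

    def minimize_partition(states, finals, delta):
        labels = {l for (_, l) in delta}
        blocks = {frozenset(finals), frozenset(states - finals)} - {frozenset()}
        changed = True
        while changed:
            changed = False
            for block in list(blocks):
                for label in labels:
                    # group the block's states by which block this label reaches
                    groups = {}
                    for s in block:
                        target = delta.get((s, label))
                        dest = next((b for b in blocks if target in b), None)
                        groups.setdefault(dest, set()).add(s)
                    if len(groups) > 1:        # the label separates the block
                        blocks.remove(block)
                        blocks |= {frozenset(g) for g in groups.values()}
                        changed = True
                        break
                if changed:
                    break
        return blocks

    # Naive 5-state DFA for /ac|bc/: 0 -a-> 1, 0 -b-> 2, 1 -c-> 3, 2 -c-> 4
    delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'c'): 3, (2, 'c'): 4}
    print(minimize_partition({0, 1, 2, 3, 4}, {3, 4}, delta))
    # three blocks: states 1 and 2 merge, as do final states 3 and 4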
Interestingly, this algorithm is pretty similar to optimistic global value numbering, a compiler optimization. Optimistic value numbering is a technique for eliminating duplicated computations. It works by assuming that all registers hold the same value, and there is an iteration until fixpoint tries to separate registers which are actually different from each other. When faced with loops, this can catch more than so-called pessimistic value numbering, which first assumes that everything is different, until it can prove that two registers hold the same value.
In the simple, controlled environment of a DFA, it's been proved that this minimization actually produces the smallest possible DFA matching the same set of strings. It's even been shown that, for a given regular language, the minimal DFA recognizing it is unique up to isomorphism. Such a nice analysis isn't possible in the more complicated world of compiler optimizations, however, where better optimizations than GVN exist.
*Actually, we assume that any two states with the same labeled outward transitions are the same. For example, if a state A goes to state B on character x and state C on character y, and state D goes to E on either x or y, then we'll assume at the beginning that A and E are the same. This is a simple modification I came up with to deal with the effectively infinite alphabet of Unicode, since it would be impossible to compare the transition on each Unicode code point.
Sunday, May 18, 2008
Writings on regexp group capture
So, in researching regular expression group capture, I had a little bit of trouble. It turns out that some people call it "capture groups", others call it "submatch extraction" and some people call it "subexpression match". In Google, it looks like "submatch extraction" gets the most research hits, and "subexpression match" is the most broadly used.
That behind me, I'm not the first one to come up with an algorithm for group capture in regular expressions in linear time and space. (My algorithm was, basically: annotate the NFA states which lie on a group boundary, then turn this into a DFA which marks a location in the string when that state could be entered. Run this, and then run the same thing on the reverse regular expression, putting the string in backwards, and find the intersection between the possible points of group boundary. Then, get the first possible group boundary point for each one, or the last. This can be proven correct easily in the case of one boundary point: if a proposed boundary is in the set marked for the forward pass and the backward pass, then the part before the boundary matches the first part of the regexp, and the part after the boundary matches the second part.)
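As an illustration of the intersection idea for a single boundary, here is a quadratic-time Python sketch using the standard re module; the algorithm sketched above computes the same sets in linear time, with one forward DFA pass and one pass of the reversed regexp over the reversed string:

    import re

    def boundary_points(s, left, right):
        # all split points i where s[:i] matches left and s[i:] matches right,
        # i.e. the candidate group boundaries for /(left)(right)/
        forward = {i for i in range(len(s) + 1) if re.fullmatch(left, s[:i])}
        backward = {i for i in range(len(s) + 1) if re.fullmatch(right, s[i:])}
        return sorted(forward & backward)

    print(boundary_points("0011", "0*", "0*1*"))  # [0, 1, 2]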
Actually, there's been a bit of research here over the past 20 years. I haven't read the following papers very closely (though I plan to), but for anyone interested in understanding how to process regular expressions efficiently to get a parse tree, here are a few interesting papers:
- Extending Regular Expressions with Context Operators and Parse Extraction by Steven Kearns, 1991. This does something like the algorithm I was developing, but it's further thought-out.
- Efficiently building a parse tree from a regular expression by Danny Dubé, Marc Feeley, 2000. This goes into more depth on building parse trees, but their algorithm is apparently less efficient than the one just below.
- Efficient submatch addressing for regular expressions [PDF] by Ville Laurikari, 2001. This is someone's Master dissertation, so it's easier to read and presents background information. The formal model of a tagged NFA is introduced. Benchmarks are provided, showing the system to be much faster than other widely used libraries.
- Greedy Regular Expression Matching by Alain Frisch, Luca Cardelli, 2004. This takes an interesting axiomatic approach to the issue, and develops a different way to resolve ambiguity.
All of these papers go about submatch extraction in somewhat difficult ways. I hope I helped someone avoid a difficult literature search like I had.
Update: It seems the best way to do a literature search is to blog about something, and have commenters give you relevant papers. Here's one by Burak Emir describing how to get the shortest match (think non-greedy, but globally optimal) with group capture, taking advantage of transformations of regexes. Thanks, Alain Frisch!
Saturday, May 10, 2008
Parsing with regular expressions and group capture
Update: This idea is completely not new. See Ville Laurikari's master's thesis, Efficient Submatch Addressing for Regular Expressions, especially chapter 2.
Though I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.
However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.
Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.
So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.
This isn't something that I've implemented, but I have a pretty good idea for the design of such a system, and I wanted to share it with you. First, a little background.
DFAs and regular expressions
I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.
What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.
The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally.
A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.
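The standard translation is the subset construction. A small Python sketch, leaving out epsilon transitions for brevity, where delta maps (NFA state, character) to a set of successor states:

    from collections import deque

    def nfa_to_dfa(start, finals, delta):
        alphabet = {c for (_, c) in delta}
        start_set = frozenset([start])
        transitions, dfa_finals = {}, set()
        queue = deque([start_set])
        while queue:
            states = queue.popleft()
            if states in transitions:
                continue                   # this subset is already processed
            transitions[states] = {}
            if states & finals:
                dfa_finals.add(states)     # contains an accepting NFA state
            for c in alphabet:
                nxt = frozenset(t for s in states for t in delta.get((s, c), ()))
                if nxt:
                    transitions[states][c] = nxt
                    queue.append(nxt)
        return start_set, dfa_finals, transitions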
Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, it's possible to construct regexes which support negation, conjunction, lookahead and other interesting things.
The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.
Regular expressions in practice in parsing today
The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemma and the Myhill-Nerode theorem. The problem is just: is the given string a member of the language? What languages are regular?
This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.
More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.
Group capture with regexes
One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like s/^(1*)(0*)$/2ドル1ドル/ to swap ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)1ドル/) because it's only used in subsequent code, to figure out what got matched to what part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.
Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) It's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists (2) They're not part of the formalism of regular expressions as previously expressed.
Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any other way than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet, but rather an algorithm that can be performed incrementally, getting little pieces (characters, say) of the input and doing computation as the input is received, without storing the whole thing and running the algorithm all at once.
So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1, where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex processes the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where the subset construction that forms a DFA from an NFA yields a subset containing exactly one state.
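Here is a hypothetical, hand-built Python sketch for (0*)(1*): it simulates the two live NFA states (state 0 inside the first group, state 1 past the boundary, reachable by epsilon) and records the boundary at the moment the live set collapses:

    def boundary_of_determinism(s):
        # match s against /0*1*/; return where group (0*) ends, or None
        live = {0, 1}
        boundary = None
        for i, c in enumerate(s):
            nxt = set()
            if 0 in live and c == '0':
                nxt |= {0, 1}          # stay in 0*, epsilon into 1*
            if 1 in live and c == '1':
                nxt.add(1)
            if not nxt:
                return None            # no match at all
            if 0 in live and 0 not in nxt:
                boundary = i           # determinism: the group just ended
            live = nxt
        return len(s) if 0 in live else boundary

    print(boundary_of_determinism("0011"))  # 2: the groups are "00" and "11"
    print(boundary_of_determinism("10"))    # None: not in the language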
I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness.
Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions on different things, which can be built to parse regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for syntax is something like EBNF syntax, as Chris Double's PEGs library in Factor does. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel :>>, to indicate that when the expression afterwards can be matched by the regex, it is, as soon as possible (basically).)
Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name for it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast low-level parsing into fast, high-level parsing of regular languages. I hope it works.
Though I'd rather avoid it, string parsing is a crucial part of programming. Since we're more than 60 years into the use of modern computers, it seems like we should have a pretty good handle on how to build abstractions over parsing. Indeed, there are tons of great tools out there, like GNU Yacc, Haskell's Parsec, newer Packrat-based parsers like Factor's EBNF syntax for PEGs, and a bunch of other high level parsing libraries. These libraries are relatively easy to use once you understand the underlying structure (each one parses a different subset of context-free grammars), because they expose the programmer to a tree-like view of the string.
However, these incur too much overhead to be used for certain domains, like the parsing that goes on in an HTTP server or client. They're really overkill when, as in the case of HTTP interchanges, what you're dealing with is a regular language and processing can be done on-line. (I'll get back later to what I mean by those two things.) The main tools that exist to deal with this are Lex and Ragel.
Ragel seems like a really interesting solution for this domain. The entire description of parsing is eventually put in one regular expression, which is compiled to a DFA, where states and transitions can be annotated by actions. Fine-grained control is given to limit non-determinism. But the user must be acutely aware of how regular expressions correspond to DFAs in order to use the abstraction. So it is somewhat leaky. Also, it's difficult to get a tree-like view on things: actions are used purely for their side effect.
So, here's an idea: let's find a middle ground. Let's try to use regular expressions, with all of their efficiency advantages, but get an abstract tree-like view of the grammar and an ability to use parsing actions like high-level abstractions allow. Ideally, the user won't have to know about the implementation beyond two simple facts: regular languages can't use general recursion, and nondeterminism should be minimized.
This isn't something that I've implemented, but I have a pretty good idea for the design of such as system, and I wanted to share it with you. First, a little background.
DFAs and regular expressions
I'm taking a computer science class about this right now, so I'm gonna be totally pedantic. When I say regular expression, I mean an expression that describes a regular language. Perl regexes aren't regular expressions (and Larry Wall knows this). If you don't feel like putting on your theoretician's hat, this blog post will be mostly meaningless.
What's a regular language? First off, a language is a set of strings. We care about infinite sets of strings, since finite sets are trivial to represent. If a string is in the language, that means that the language matches the string, intuitively. A regular language is one which can be represented by a deterministic finite automaton (DFA) without extensions, also called a finite state machine (FSM) for some reason. Many useful languages are regular, and many are not.
The idea of a DFA is a finite number of states and a transition function, which takes the current state and a character of a string and returns the next state. The transition function is defined on all states and all characters in the alphabet. There is a set of final states, and if the string runs out when the machine is in a final state, then the string is accepted. The language of the DFA is the set of strings accepted by the DFA. For a given DFA, that DFA can be run in linear time with respect to the length of the input string and constant space. It can also be run "on-line", that is, without the whole string in advance, going incrementally.
A related construction is an NFA, or nondeterministic finite automaton. Imagine the previous idea, but instead of a transition function, there is a transition relation. That is, for any character and current state, there are zero or more next states to go to, and the NFA always picks the right one. This is called nondeterminism (at least that's what it means here). Amazingly, NFAs can accept only regular languages and nothing more, because NFAs can be translated into DFAs. Basically, you build a DFA which picks all possible states at once, given all possible paths through the NFA. Potentially, though, there's an exponential blowup in the number of states.
Every regular expression can be converted into an equivalent NFA, which can be converted into a DFA, which can then be converted back into a regular expression. They're all equivalent. So then what's a regular expression? There are different ways to define it. One is that you can build up a regular expression from the following elements: the epsilon regex (matching the empty string), the empty regex (matching nothing), single character regexes (matching just a single character), concatenation (one followed by another), disjunction (or) and the Kleene star (0 or more copies of something). Counterintuitively, regexes can also support negation, conjunction, lookahead and other interesting operators without gaining any power beyond the regular languages.
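To make that list of constructors concrete, here's the same algebra written out as a tiny AST in Python; this is just an illustration, not the representation Factor's regexp library actually uses.

from dataclasses import dataclass

class Regex: pass

@dataclass
class Epsilon(Regex): pass   # matches only the empty string

@dataclass
class Empty(Regex): pass     # matches nothing at all

@dataclass
class Char(Regex):           # matches one particular character
    c: str

@dataclass
class Concat(Regex):         # one regex followed by another
    left: Regex
    right: Regex

@dataclass
class Union(Regex):          # disjunction: either alternative
    left: Regex
    right: Regex

@dataclass
class Star(Regex):           # Kleene star: zero or more copies
    inner: Regex

# (0|1)*1, the binary strings ending in 1:
ends_in_one = Concat(Star(Union(Char("0"), Char("1"))), Char("1"))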
The most important distinction from Perl regexes is that regular expressions cannot contain backreferences, because these are provably impossible to express in a DFA. It's impossible to parse something with backreferences in the same linear time and constant space that you get from regexes which are regular. In fact, parsing patterns with backreferences is NP-hard and not believed possible in polynomial time (with respect to the length of the input string). Since regular expressions which are regular give us such nice properties, I'm going to stick to them.
Regular expressions in practice in parsing today
The formal study of regular languages is a basically solved problem within the formalism itself: they are equivalent to DFAs, and satisfy a convenient set of properties summarized by the pumping lemma and the Myhill-Nerode theorem. The problem is just: is the given string a member of the language? Which languages are regular?
This was solved in the 1950s and 1960s, and the basic results are in most introductory compiler books. Those books use the solution to build lexers, like Lex. Lex basically takes a list of regular expressions, each associated with an action, finds one of them to match maximally with the current input, executes the associated action on the portion that matches, and then repeats with the rest of the string. This is useful to build lexers, but the programmer has very little context for things, so it's difficult to use for much else.
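As a rough illustration of the Lex strategy, here's a toy maximal-munch tokenizer in Python, with the standard re module standing in for the generated DFA; the rule names and patterns are made up.

import re

RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME",   re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("SPACE",  re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # keep the longest match; earlier rules win ties
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError("no rule matches at position %d" % pos)
        name, m = best
        if name != "SPACE":            # the "action" for SPACE is to skip it
            yield name, m.group()
        pos = m.end()                  # repeat on the rest of the string

print(list(tokenize("foo 42 bar")))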
More recently, Ragel has been used as a way to parse more complicated things using regular expressions. Its strategy is to turn its compile-time input into one big regular expression, annotated with actions on certain states or transitions. The actions are fragments of C code, and they form the processing power of the machine. However, their behavior can get a little unintuitive if too much nondeterminism is used, so Ragel provides a bunch of tools to limit that. Also, Ragel lets you explicitly specify a DFA through transitions, which seems useful but low-level.
Group capture with regexes
One of the most useful features of Perl's regular expressions is group capture. By this, I mean how you can do something like
s/^(1*)(0*)$/2ドル1ドル/ to swap the ones and zeros in a string. This is different from backreferences (like the non-regular expression /(.*)1ドル/) because it's only used in subsequent code, to figure out what got matched to which part of the regex. It doesn't parse any languages which aren't regular, but it's a useful tool for processing.
Curiously, this has been ignored both by academics and DFA implementors so far. I hypothesize that it's been ignored by theorists for two reasons: (1) it's easy to confuse with backreferences, which make the language non-regular, which is totally uninteresting to theorists, and (2) it's not part of the formalism of regular expressions as previously expressed.
Implementors of (non-Perl-based) regular expression-based parsing mechanisms tend to avoid group capture because, in the general case, it's not fast enough and can't be done on-line. Also, as far as I can tell, it hasn't been implemented any way other than interpreting an NFA, using backtracking, and keeping track of where the parser is within the regex to determine group boundaries. This would be terrible for the domain of Lex and Ragel. By "on-line" I don't mean on the internet, but rather an algorithm that can be performed incrementally: it gets little pieces (characters, say) of the input and does its computation as the input is received, without storing the whole thing and running the algorithm all at once.
So how can we do group capture on-line? Well, in general, we can't. Consider the regular expression (1*)1 where you're trying to capture the group 1*. As the input is being processed, we don't know when we've gotten to the end of the group until the entire input is over, since if there are two more 1's, then the first one must be in the group. However, in many cases group capture can in fact be done on-line, as in (0*)(1*), where the groups captured are 0* and 1*. As the regex processes the string, it knows that, if there is a match, the group boundary is just before the first 1. This can be formalized as a "boundary of determinism": a point where the subset construction used to form a DFA from the NFA yields a subset containing exactly one state.
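For instance, here's a hand-written on-line scanner for (0*)(1*), sketched in Python: the moment the first 1 arrives, only one possibility remains, so the first group can be emitted without buffering the rest of the input. This is a special case written by hand to show the idea, not the general algorithm.

def capture_groups(chars):
    zeros, ones, seen_one = [], [], False
    for ch in chars:
        if ch == "0" and not seen_one:
            zeros.append(ch)
        elif ch == "1":
            seen_one = True            # the group boundary is decided right here
            ones.append(ch)
        else:
            raise ValueError("input is not in (0*)(1*)")
    return "".join(zeros), "".join(ones)

print(capture_groups("000111"))   # ('000', '111')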
I believe this can handle most cases of group capture in practice, if the regular expression is well-written, but surely not all of them. I have an idea for how to do group capture in the few remaining circumstances, but unfortunately it takes linear space and it's not online. I'll blog about it once I have a proof of correctness.
Hierarchical parsing using group capture
Using this group capture mechanism, we can build a hierarchical parsing mechanism with actions attached at different levels, which parses regular languages in a higher-level way. Regular expressions can't use arbitrary recursion like context-free grammars can, so the parse tree will be of fixed size, but it could still be useful. In designing this, I'm thinking specifically about making a SAX-like XML parser. It'd be awkward to write everything out as one big regular expression, but split into smaller things, each with their own little steps in processing, it could be much more elegant. My goal for the syntax is something like EBNF, as in Chris Double's PEGs library for Factor. Here's some future pseudocode for how it could look in parsing an XML tag, simplified. (In this code, > is used like Ragel's :>>, to indicate that when the expression after it can be matched by the regex, it is, as soon as possible (basically).)
REG: tag
chars = "&" entity:any*> ";" [[ entity lookup-entity ]]
| any
string = "\"" str:chars> "\"" [[ str ]]
| "'" str:chars> "'" [[ str ]]
xml-name = name-start-char name-char*
attribute = name:xml-name ws "=" ws str:string [[ name str 2array ]]
tag = "<" ws closer?:("/"?) name:xml-name attrs:(attribute*) ws contained?:("/"?) ws ">" [[ ... ]]
Conclusion
Though I haven't implemented this yet, and probably shouldn't even be talking about it, I'm really excited about this idea. I even came up with a stupid little name for it: Hegel, both for High-level Ragel and because it represents the synthesis of the dialectic (as described by Hegel) of slower, higher-level parsing and fast, low-level parsing into fast, high-level parsing of regular languages. I hope it works.
Tuesday, April 29, 2008
Potential ideas to explore
I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...
Possible extensions to Inverse
I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint solving system, and giving access to the old input when inverting something, as in the Harmony project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem like they're coming from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)
Constraint solving
The idea of resolving constraints is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even if enough facts are known to, potentially, figure out what everything is. For example, a fully general constraint-solving system would have to be able to solve Diophantine equations, which are known to be undecidable in general.
So what can constraint solving get you in Inverse? Well, imagine an inverse to
bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work. A way for
[ first ] [ second ] bi to work would be using the following steps:
- Initialize a logic variable X as unbound
- Unify X with the information, "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least 1.
- Unify X with the information, "the second element is what's on the top of the stack (at runtime)". Now it's know that X is a sequence of length at least two.
- From the information we have about X, produce a canonical representation, since the inverted quotation is over: an array of the minimum possible length.
This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.
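Here's a very rough sketch, in Python, of what that logic-variable bookkeeping could look like; every name in it is made up for illustration, and nothing here comes from Inverse itself.

class PartialSequence:
    """A partially known sequence: each constraint pins one index to a value."""
    def __init__(self):
        self.known = {}                          # index -> value
    def constrain(self, index, value):
        if self.known.get(index, value) != value:
            raise ValueError("contradictory constraints")
        self.known[index] = value
    def canonical(self):
        # the shortest array consistent with everything we know
        length = max(self.known, default=-1) + 1
        return [self.known.get(i) for i in range(length)]

# Inverting [ first ] [ second ] bi with 1 and 2 on the stack:
x = PartialSequence()
x.constrain(0, 1)     # "the first element is 1"
x.constrain(1, 2)     # "the second element is 2"
print(x.canonical())  # [1, 2], the same answer first2's hard-coded inverse gives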
Formally, you can think of Inverse as already a reasonable constraint solving system, for a limited problem domain. Given [ f ], and the statement about stacks A and B that f(A) = B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But, a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.
The Harmony Project
The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.
Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)
So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.
Backtracking
In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. It's simplest to use the Maybe monad, and that corresponds to how pattern matching systems normally work. But if the List monad is used instead, then you easily get backtracking. This could be ported to Factor either by using monads or, maybe easier, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant in things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.
Garbage collection research ideas
Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
- Figure out how efficient garbage collection on Non-Uniform Memory Access systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept on the RAM closest to the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around to different places in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.
- Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements—over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long it takes for these algorithms to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory use pattern.
- Garbage collector implementation in proof-carrying code. There are a couple invariants that garbage collectors have, that must be preserved. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started when forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked to be accurate, is given with the code) isn't new; it's mostly been used to prove memory consistency safety given untrusted code. But maybe it could be used to prove that a GC implementation is correct.
These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.
Ragel-style state machines in Factor
In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be written as a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.
Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Packrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.
I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.
(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)
Monday, December 10, 2007
Multiline string literals in Factor
It's always annoyed me somewhat that Factor strings can only be on one line and that there was no mechanism for anything like "here documents" like Perl has. So I decided to write it myself. At this point, I don't really need it and have forgotten what I wanted it for, but it was still a fun exercise.
I started out thinking I should do something similar to how some other languages do it: write a word, maybe called
<<<, which is followed by some string which is used to delineate a multiline string literal expression. But I realized this wouldn't be the most idiomatic way in Factor. First, if you're making a multiline string literal, why would you ever have it be within a word? For constants like this, it's considered best practice to put them in their own separate words. Second, why do I need to give the option of choosing your own ending? What's wrong with just using a semicolon, like other Factor definitions?
Putting this together, I came up with the following syntax:
STRING: something
Hey, I can type all the text I want here
And it can be over multiple lines
But without the backslash-n escape sequence!
;
The final line, the semicolon, has to have no whitespace before or after it. This allows for practically any useful string to be written this way. The syntax will compile into something like this:
: something
"Hey, I can type all the text I want here\nAnd it can be over multiple lines\nBut without the backslash-n escape sequence!" ;
With a previous parser design, multiline string literals like this were impossible, but now they can be done in 11 lines. I packaged this up and put it in my repository under
extra/multiline so others can use it.
Using the internals of the parser, the first word advances the parser state one line and returns the text of the new line.
: next-line-text ( -- str )
lexer get dup next-line line-text ;
The next two words do the bulk of the parsing. They advance the parser's current line until reaching a line consisting only of ";", and then advance the line one more time. While moving forward, a string is compiled consisting of all of the lines, interspersed with newlines.
: (parse-here) ( -- )
next-line-text dup ";" =
[ drop lexer get next-line ] [ % "\n" % (parse-here) ] if ;
: parse-here ( -- str )
[ (parse-here) ] "" make 1 head*
lexer get next-line ;
Finally, the word
STRING: puts it all together, defining a new word using a string gathered with parse-here.
: STRING:
CREATE dup reset-generic
parse-here 1quotation define-compound ; parsing
There are downsides to having an extremely flexible syntax like Factor. Things can be less predictable when libraries can alter the fundamentals of syntax. It'd be impossible to create a fixed BNF description of Factor syntax. Additionally, the particular structure of Factor sometimes encourages syntax extension that's heavily dependent on the details of the current implementation. But the upside is that we can do things like this. I think it's worth it.
Sunday, September 23, 2007
Using bignums efficiently
The other day, Doug Coleman got on the
#concatenative IRC channel complaining of a horrible bug: when he put a 100,000 digit prime number in a file and then tried to load the file, Factor hung. Doug speculated that this was a compiler bug, but I had another idea: the parser wasn't processing bignums efficiently. First, a little background. This article presumes some basic knowledge of computational complexity and big-O notation, which you should read up on if you don't already know about them.
Bignums and performance
A 'bignum' is Factor's term (grabbed from Lisp terminology) for an arbitrary size integer bigger than the standard integer. Integers which do fit in a word (actually, a word minus 3 bits for the header) are called 'fixnums'. On any platform that Factor currently supports, you can count on the fact that a number smaller than 2^29 will be a fixnum, and a number bigger than 2^61-1 will be a bignum.
In most situations, this is an irrelevant implementation detail. In Factor, bignums are used with the same functions as fixnums (and all other builtin numeric types). But there is a subtle performance issue. On fixnums, it takes (basically) constant time--O(1)--to do (basically) any simple math operation. This includes addition, multiplication, division, exponentiation, square roots, etc: all of these operations take basically the same amount of time on any fixnum. We can make this claim because all numbers are fairly small, and there's a short maximum bound on the time they take, even if it varies a little bit. In designing algorithms, programmers take advantage of this frequently.
However, with bignums, math operations take O(n) or more time, where n is the number of digits (bits) in the larger number. If you have two integers of arbitrary length, the only thing you can do to add them is, basically, the addition algorithm you learned in first grade, iterating through the string from least significant to most significant bit. The best possible time for this kind of iteration is proportional to the number of bits--O(n). Multiplication, division and other operations take even more time. For purposes of analysis, let's say multiplication is O(n*log n) where n is the number of digits in the bigger number, and exponentiation is O(d log d), where d is the number of digits in the result. (These are slightly faster than the real times, but give us a good enough estimate while leaving the math mostly simple.)
To be efficient in processing bignums, this additional time for processing must be taken into account. It's very easy to write something which works instantly on fixnums but hangs almost indefinitely on large enough bignums, but there is usually a better way.
string>number
So, I suspected that Doug's code was slow because of a naive implementation of
string>number, one which is not optimized for bignums. Looking recursively through the code, I can see that the conversion from digits to an integer takes place in the word digits>integer:
: string>number ( str -- n/f ) 10 base> ;
: base> ( str radix -- n/f )
{
{ [ 47 pick member? ] [ string>ratio ] }
{ [ 46 pick member? ] [ drop string>float ] }
{ [ t ] [ string>integer ] }
} cond ;
: string>integer ( str radix -- n/f )
swap "-" ?head
>r string>digits 2dup valid-digits?
[ digits>integer r> [ neg ] when ] [ r> 3drop f ] if ;
: digits>integer ( radix seq -- n )
0 rot [ swapd * + ] curry reduce ;
Basically, what this does is, for each item in the given sequence, an accumulator (starting at 0) is multiplied by the radix, and then the item is added to the accumulator. An example invocation of
digits>integer, which returns the number 1234:
10 { 1 2 3 4 } digits>integer
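For readers who don't know Factor, here's roughly the same fold written in Python: multiply the accumulator by the radix, then add the next digit, and repeat.

def digits_to_integer(radix, digits):
    n = 0
    for d in digits:
        n = n * radix + d    # shift the accumulator left by one digit, add the new one
    return n

print(digits_to_integer(10, [1, 2, 3, 4]))   # 1234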
Let's look at the computational complexity of running digits>integer, in a world where only fixnums exist. In this world,
* and + run in constant time. Running digits>integer with a d-digit number will do d additions and d multiplications, for a total of d*(O(1)+O(1)) = O(d) time.
O(d) time is optimal, since the string's length is d in the first place, and we need to iterate through all of its characters.
But, if we assume that all arithmetic takes place in bignums, the calculation gets a little more complicated, and the time a bit worse. All together, O(d(d+1)/2) = O(d^2) time is spent in addition, and O((d log(d))(d log(d)+1)/2) = O(d^2 log(d)^2) time is spent in multiplication. The latter dominates the time, so the total complexity is O(d^2 log(d)^2). This is even worse than quadratic! There must be a better way.
Minimizing intermediate bignums
The problem here is that the intermediate bignums are too big. In parsing "1234", the accumulator first contains 0, then 1, then 12, then 123 and finally 1234. So the sum of the intermediate number lengths is d(d+1)/2 = O(d^2).
But here's another method: split the string in two equal parts, parse each of them individually, then combine the results together. To combine the results together, the first string has to be shifted left by the length of the second string (using the appropriate radix!). You can write base cases for strings of length 0 and 1, which shouldn't be split. (The value of
_ { } digits>integer is 0 and _ { n } digits>integer is n.) An example: to do
10 { 1 2 3 4 } digits>integer, split it into 10 { 1 2 } digits>integer and 10 { 3 4 } digits>integer. By induction, let's assume that those intermediate calculations produce 12 and 34. Now, the value 12 must be multiplied by 10^2 = 100, since { 3 4 } is two digits long. Finally, add 1200 and 34, and you get 1234!
The analysis for this is almost identical to that of mergesort or quicksort. For a string holding an 8-digit number, there are four main steps: the step processing the individual numbers (really, 8 steps of constant time), then the step combining two numbers of 1 digit each (4 steps of 2x time), then the step combining two of those numbers, 2 digits each (2 steps of 4x time), and the final step of adding the two four-digit numbers together (1 step of 8x time). If you generalize it, there's a total of log_2(d) + 1 steps of time O(d), yielding a total of O(d log d).
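Here's the divide-and-conquer idea sketched in Python (the actual Factor implementation appears below, in the Implementation section): convert each half recursively, then shift the front half left by the length of the back half, in the given radix.

def digits_to_integer_dc(radix, digits):
    if len(digits) == 0:
        return 0
    if len(digits) == 1:
        return digits[0]
    mid = len(digits) // 2
    front, back = digits[:mid], digits[mid:]
    # shift the front half left by len(back) digits, then add the back half
    return digits_to_integer_dc(radix, front) * radix ** len(back) + \
           digits_to_integer_dc(radix, back)

print(digits_to_integer_dc(10, [1, 2, 3, 4]))   # 1234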
Actually...
It's a little bit more complicated than that. O(d log d) is something like the complexity for summing a list, resulting in a bignum. But I ignored the other, more expensive operation: the left shift. A base two shift would be O(s+d), where s is the amount shifted over, and d is the number of digits in the thing being shifted. With a base two shift, the complexity O(d log d) would still be valid.
But this shift has an arbitrary radix (usually 10). This is done by calculating the radix raised to the power of the shift, and then multiplying that by the number which needs to be shifted. This takes a bit more time. Counting up the time taken for multiplication and exponentiation in the same manner as addition, we get a total time of O(d*log d*log (d log d)).
Implementation
In our implementation of this function, it'd be best if we could just go into
math.parser, the vocabulary that defines digits>integer, and redefine just that word. This redefinition would be propagated to all the words that use it, all the way up to the Factor parser. Fortunately, Factor explicitly allows this kind of invasion. Just make sure that, after loading the code, everything is recompiled! Otherwise, the change might not be propagated. Here's the code you need:
USING: math kernel math.functions sequences combinators ;
IN: digits
: partition ( seq -- front end )
[ length 2 /i ] keep cut ;
IN: math.parser
DEFER: digits>integer
: split-parse ( radix seq -- n )
partition [
rot [ swap digits>integer ] curry 2apply
] 3keep nip
length ^ swapd * + ;
: digits>integer ( radix seq -- n )
dup length {
{ 0 [ 2drop 0 ] }
{ 1 [ nip first ] }
[ drop split-parse ]
} case ;
Loading this code makes parsing large bignums dramatically faster, though smaller numbers are a little bit slower. The easiest way to load the code is to put it in path
extra/digits/digits.factor, and then run the line USE: digits in the listener.
So remember, boys and girls
Whenever doing mathematical calculations that might involve bignums, it's always important to remember the computational complexity of various mathematical operations. If you forget them, a very doable problem can suddenly become intractable.
A technical note about complexity: (for the nit-picky readers among you)
In truth, the best known algorithm for bignum multiplication takes O(n log(n) log(log(n))) time, using fast Fourier transforms, which I don't yet understand. (Well, actually there's one of time O(n log(n) 2^(log*(n))), which is even faster, but no one uses that yet.) Therefore, exponentiation should take O(d log(d) log(log(d))) time, where d is the size of the final result. This is because the algorithm's time is dominated by the final doubling.
I felt that it was appropriate to use O(d log(d)) as an approximation of O(d log(d) log(log(d))), since the double log function grows very slowly, and it clutters up the math with no tangible result. For this analysis, it doesn't hurt anyone to elide it. If I were publishing this in some respectable academic thing (hah! as if that makes sense!), I'd change it at the expense of clarity.
Sunday, July 15, 2007
Messing around at the Factor REPL
Read Eval Print Loops, or REPLs, are really useful, I've found. One of my favorite uses, beyond prototyping, is playing around and solving problems. Often these are math problems, but for slightly more complicated things, I need string processing. Simple string processing things like this also make an easy example for explaining Factor to beginners like you. (If you're not a Factor beginner, you may want to stop reading.)
My younger brother is learning how to type right now. My dad found Tux Typing, a GPL typing program for Windows and Linux. So far, it's great; there's just one problem: when you use the word mode (where you have to type words before they fall to the bottom of the screen, creating horrible crashing sounds) there are only about 35 words in the long word setting. My dad asked me to fix this, and since it was a simple little task, I agreed.
I started by figuring out the file format for the dictionaries. It was easy: the first line was the title (I chose "Huge words") and after that, the words were listed in allcaps, separated by Unix line endings. Next, I copied the text of the Wikipedia entry Economy of the United States into Wordpad and saved it to the desktop. Then I downloaded Factor to the computer I was working on and fired up the REPL.
The first thing in making the word list is getting a list of space-separated things. So I made a file reader object and got an array of all the lines. I joined these lines with a space, then split everything separated by a space (separating both lines and words on the same line).
"article.txt" <file-reader> lines " " join " " split
Now an array of words is lying on top of the stack. There are a bunch of operations we need to do to manipulate this, and Factor's sequence combinators help make it easier. So I made sure that each word had at least three letters in it:
[ length 3 >= ] subset
And I put all the words in upper case:
[ >upper ] map
And I made sure that each character of each word was an upper case letter, to filter out apostrophes, numbers, and other similar things:
[ [ LETTER? ] all? ] subset
And finally, I made sure there were no duplicates:
prune
So, to join this together in the correct file format and add a title, I used
"Huge words" add* "\n" join
yielding a string. To write this string back to the original file, all I needed was
"article.txt" <file-writer> [ print ] with-stream
And that's it! All in all, 10-15 minutes work and I got the game working with a good word list. But the REPL made it a lot easier.
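For comparison, here's roughly the same pipeline written out in Python; the file name and title are the ones from above, and the comments show the corresponding Factor words.

with open("article.txt") as f:
    words = " ".join(f.read().splitlines()).split(" ")            # lines " " join " " split
words = [w for w in words if len(w) >= 3]                         # [ length 3 >= ] subset
words = [w.upper() for w in words]                                # [ >upper ] map
words = [w for w in words if all("A" <= c <= "Z" for c in w)]     # [ [ LETTER? ] all? ] subset
words = list(dict.fromkeys(words))                                # prune (drop duplicates, keep order)
with open("article.txt", "w") as f:                               # <file-writer> [ print ] with-stream
    print("\n".join(["Huge words"] + words), file=f)              # "Huge words" add* "\n" join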
Update: I didn't have the <file-reader> and <file-writer> displayed properly before, but it's fixed now. Thanks Sam!