Tuesday, February 24, 2009
Draft paper about inverse
I recently submitted an abstract about inverse to an undergraduate computer science conference, and I just got word that it was accepted. I've written a draft paper, available on factorcode.org, and I'd like to hear your comments on it. I've written it for a completely general audience, not assuming knowledge of functional programming or stack-based programming languages. I didn't get into formal semantics or anything like that, partly because it'd make the paper too long and partly because I don't completely understand the point.
Speaking of undergraduate research, if you're interested in doing research in practical computer science this summer, I recommend applying to Harvey Mudd's CS REU. Applications are due Sunday. But only if you're a US citizen or permanent resident! The funding agencies don't like foreigners. (This is actually a serious problem for several of my friends, who have difficulty finding the same summer academic and internship opportunities because they are actively discriminated against as international students. I hope that, one day, discrimination based on citizenship is as illegal as discrimination against racial minorities or women, but we are very far away from that today. Now I'm getting off-topic.)
Tuesday, April 29, 2008
Potential ideas to explore
I haven't written in a while, and it's a little hard to get started back up, so here are just a bunch of random ideas in my head that I'd like to share with you guys. Sorry if it's a little incoherent...
Possible extensions to Inverse
I've been thinking about possible ways to generalize my system for concatenative pattern matching, currently in extra/inverse. There are two ways to go about it: making a more general constraint-solving system, and giving access to the old input when inverting something, as in the Harmony Project. A third way is to add backtracking (in a different place than constraint solving would put it). To someone familiar with Inverse, these might seem to come from nowhere, but they're actually very closely related. (To someone not familiar with it, see my previous blog post describing Inverse.)
Constraint solving
The idea of constraint solving is to figure out as much as you can about a situation given certain facts. This is easy in some cases, but impossible in others, even when enough facts are known to, potentially, figure out what everything is. For example, Diophantine equations can be expressed in a fully general constraint-solving system, but solving them is known to be undecidable in general.
So what can constraint solving get you in Inverse? Well, imagine an inverse to bi. It's not difficult to make one within the current framework, but some information is lost: everything must be completely determined. Think about inverting [ first ] [ second ] bi. Inverting this should get the same result as first2 (which has a hard-coded inverse right now, inverting to 2array). But it won't work.
A way for [ first ] [ second ] bi to work would be the following steps:
- Initialize a logic variable X as unbound.
- Unify X with the fact "the first element is what's second from the top of the stack (at runtime)". Now it's known that X is a sequence of length at least one.
- Unify X with the fact "the second element is what's on the top of the stack (at runtime)". Now it's known that X is a sequence of length at least two.
- Once the inverted quotation is over, produce a canonical representation from the information we have about X: an array of the minimum possible length.
This isn't easy to do in general, but it should be possible, in theory. It'd be extremely cool if it worked out.
Formally, you can think of Inverse as already a reasonable constraint-solving system for a limited problem domain: given [ f ] and the statement f(A) = B about stacks A and B, and given B, find a possible value for A. The strategy used right now is mathematically sound, and I hope to write it up some day. But a more general use of logic variables is possible: explicit logic variables in code. This could be used to make a better-integrated logic language in Factor.
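Here's a minimal sketch, in Haskell, of the bookkeeping the steps above suggest. This is my own illustration, not code from Inverse, and the representation (known elements plus a lower bound on the length) and all names are hypothetical:

import qualified Data.Map as M

-- Partial knowledge about a sequence: elements known at certain indices,
-- plus a lower bound on the length.
data SeqInfo a = SeqInfo { known :: M.Map Int a, minLen :: Int }

unbound :: SeqInfo a
unbound = SeqInfo M.empty 0

-- Unify with the fact "the element at index i is x"; fail on a conflict.
unifyAt :: Eq a => Int -> a -> SeqInfo a -> Maybe (SeqInfo a)
unifyAt i x (SeqInfo k n) = case M.lookup i k of
  Just y | y /= x -> Nothing
  _               -> Just (SeqInfo (M.insert i x k) (max n (i + 1)))

-- Canonical representation: an array of the minimum possible length,
-- failing if any element in range is still unknown.
canonical :: SeqInfo a -> Maybe [a]
canonical (SeqInfo k n) = mapM (`M.lookup` k) [0 .. n - 1]

-- unifyAt 0 'a' unbound >>= unifyAt 1 'b' >>= canonical  ==>  Just "ab"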
The Harmony Project
The Harmony Project, led by Benjamin C. Pierce, is an attempt to solve the "view-update problem" using a new programming language and type system which is largely invertible. The view-update problem is that we want to convert different storage formats into an abstract representation, manipulate that representation and put it back without duplicating code about the representation. Everything operates on edge-labeled trees.
Within the Harmony framework, it's possible to do all your work in bijections (one-to-one onto functions, similar but not identical to the domain of Inverse right now), but there's extra power included: the function to put the abstract representation back into the original form has access to the original. This adds a huge amount of power, giving the possibility of conditionals and recursion, in limited cases. Also, it gives the power to ignore certain things about the surface structure when looking at the abstract form. (Harmony also has ideas about tree merging, and of course a new type system, but I'm not as interested in that right now.)
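To make the shape concrete, here is a minimal Haskell sketch of the get/put signature described above. It's my own illustration of the general idea, not Harmony's actual formalism:

data Lens s v = Lens
  { get :: s -> v        -- extract the abstract view from the concrete form
  , put :: v -> s -> s   -- put an updated view back, with access to the original
  }

-- Example: a view of the first component of a pair; put preserves the second.
fstLens :: Lens (a, b) a
fstLens = Lens fst (\v (_, b) -> (v, b))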
So far, only relatively trivial things have been made with Harmony, but the idea looks really useful, though there are two problems: (1) I don't really understand it fully (like constraints) and (2) I have no idea how it can fit together with Inverse as it is right now.
Backtracking
In Mark Tullsen's paper on first-class patterns, there was an interesting idea that Inverse could adopt. Tullsen used monads to sequence the patterns. The simplest choice is the Maybe monad, which corresponds to how pattern matching systems normally work. But if the list monad is used instead, you easily get backtracking. This could be ported to Factor either by using monads or, maybe more easily, by using continuations. Years ago, Chris Double implemented amb in Factor using continuations, though the code won't work anymore. The sequencing and backtracking I'm talking about is relevant to things like switch statements, rather than undo itself. I'm not sure if it'd actually be useful in practice.
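A small Haskell sketch of the difference (my example, not one from Tullsen's paper): the same MonadPlus code gives first-match semantics at type Maybe, and all alternatives, i.e. backtracking, at the list type:

import Control.Monad (MonadPlus, mplus, mzero)

match :: MonadPlus m => Int -> m String
match 0 = return "zero"
match _ = mzero

-- Try one pattern, falling through to an alternative.
fallback :: MonadPlus m => Int -> m String
fallback n = match n `mplus` return "other"

-- fallback 0 :: Maybe String  ==>  Just "zero"        (first match wins)
-- fallback 0 :: [String]      ==>  ["zero", "other"]  (all alternatives)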
Garbage collection research ideas
Because the summer's coming up, and I'll be participating in Harvey Mudd's Garbage Collection REU, I've been coming up with a few research ideas. The suggested one is to continue with the work of previous years' REUs and think about simplifiers and collecting certain persistent data structures and weak hashtables, but here are a couple more:
- Figure out how efficient garbage collection on Non-Uniform Memory Access (NUMA) systems can work. The problem (if it is a problem) is that plain old garbage collection on multiprocessor NUMA systems isn't as fast as it could be, because a chunk of memory allocated for a thread may be far away from where it's used. One way to ensure locality is to give each processor (at least) its own heap, where the heap is guaranteed to be stored in the closest memory. But if data needs to be shared between processors, this can be too limiting. A piece of data can be kept in the RAM closest to the processor which made the allocating call, but maybe it'd be beneficial to collect data on which processor is using which data, and dynamically move data around in RAM to put it closest to where it's used. A related issue is maximizing locality when actually performing the tracing in the GC, which I have no ideas about.
- Run a real benchmark comparing several GC algorithms. Probably the most annoying thing for programming language implementors trying to pick a good GC algorithm is that there's no comprehensive benchmark to refer to. No one really knows which algorithm is the fastest, so there are two strategies remaining: pick the one that sounds the fastest, or do trial and error among just a few. Each paper about a new algorithm reports speed improvements, but only over significantly older algorithms. It'd be a big project, but I think it's possible to make a good benchmark suite and test how long these algorithms take to run, in terms of absolute throughput and pause length and frequency, given different allocation strategies. If it's possible, it'd be nice to know what kind of GC performs best given a particular memory-use pattern.
- Garbage collector implementation in proof-carrying code. There are a couple of invariants that garbage collectors must preserve. For example, the user can't be exposed to any forwarding pointers, and a new garbage collection can't be started while forwarding pointers exist. The idea of proof-carrying code (an explicit proof, which is type-checked for accuracy, is given along with the code) isn't new; it's mostly been used to prove the memory safety of untrusted code. But maybe it could be used to prove that a GC implementation is correct.
These ideas are really difficult, but I think they're interesting, and with four other smart people working with me, maybe in a summer we can do something really cool, like this or whatever other idea they come up with.
Ragel-style state machines in Factor
In my Automata and Computability class at Carleton, we've been studying (what else) finite automata, and it got me thinking about regular expressions and their utility in Factor. By regular expression, I mean an expression denoting a regular language: a real, academic regexp. A regular language is one that can be recognized by a deterministic finite automaton (finite state machine). Hopefully, I'll explain more about this in a future blog post.
Anyway, if you've heard of Ragel, it's basically what I want to do. But the form it'd take is basically the same as PEGs (Chris Double's Packrat parser), with the one restriction that no recursion is allowed. In return for this restriction, there is no linear space overhead. Basically everything else, as far as I know, could stay the same.
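For concreteness, here's a tiny Haskell sketch (my illustration, not the planned Factor API) of why restricting to regular languages removes the space overhead: running a DFA is a single pass with constant state, no stack and no backtracking:

import Data.List (foldl')

-- A deterministic finite automaton with states s over symbols c.
data DFA s c = DFA
  { start  :: s
  , step   :: s -> c -> s
  , accept :: s -> Bool
  }

-- Running a DFA is just a fold over the input.
runDFA :: DFA s c -> [c] -> Bool
runDFA (DFA s0 delta acc) = acc . foldl' delta s0

-- Example: strings over 'a' and 'b' with an even number of 'a's.
evenAs :: DFA Bool Char
evenAs = DFA True (\st c -> if c == 'a' then not st else st) id

-- runDFA evenAs "abab"  ==>  True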
I'm thinking I'll redo the XML parser with this. The SAX-like view will be done with this regular languages parser (since all that's needed is a tokenizer), and then that can be formed into a tree using PEGs (since linear space overhead is acceptable there). Linear space overhead, by the way, is unacceptable for the SAX-like view, since it should be usable for extremely large documents that couldn't easily fit in memory all at once.
(By the way, I know Ragel also allows you to explicitly make state charts, but I won't include this until I see a place where I want to use it.)
Labels: factor, garbage collection, parsing, pattern matching
Friday, July 27, 2007
Prior art
A few weeks ago, I outlined an implementation of pattern matching in Factor. But just recently, I found a paper with a remarkably similar pattern matching system in Haskell. It's entitled "First Class Patterns" by Mark Tullsen, published in 2000 with a grant from the NSF.
On the surface, it looks completely different, but underneath, it's the same. This Haskell pattern matching system defines inverses for constructor words, with Constructor as the normal constructor and Constructor# as the inverse. The inverse is failable and can be represented as a -> Maybe b: Just x indicates a successful inverse, yielding the data x, and Nothing indicates failure. This is analogous to what I did with exceptions. You can define your own inverse for functions as well as constructors, much like what I did with define-inverse.
Tullsen did something really cool when he generalized pattern matching to MonadPlus m => a -> m b, allowing more complex things like backtracking using the list monad. You could also give your own semantics for pattern sequencing. I haven't investigated this fully, but it may be possible to do backtracking pattern matching in Factor too, using call/cc and, from that, amb. Another extension that we both may be able to use is more involved unification semantics along the lines of Prolog or Erlang.
A couple of things about it annoyed me, though. First, it doesn't provide a good-looking syntax and wouldn't work as a drop-in replacement for existing Haskell pattern matching. Second, the implementation requires a language extension for the syntax to work properly. This is a symptom of a general problem with Haskell: the language isn't as extensible as people like to think it is. It is much more common to see a paper which proposes a language extension than a paper which proposes an idea and provides a Template Haskell implementation of it. My last problem with the paper is that it skips over multiargument (curried) constructors, saying "Curried constructors are allowed, but the pattern law is much more straightforward when constructors are uncurried."
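In today's notation, the core of the idea is easy to sketch (my example, not code from the paper): a pattern is a failable inverse a -> Maybe b, and generalizing Maybe to any MonadPlus is what opens the door to backtracking:

import Control.Monad (MonadPlus, mzero)

-- A failable inverse for Cons, playing the role of Tullsen's Cons#:
unconsP :: [a] -> Maybe (a, [a])
unconsP (x:xs) = Just (x, xs)
unconsP []     = Nothing

-- The same pattern generalized over the monad; at type [] it backtracks.
unconsM :: MonadPlus m => [a] -> m (a, [a])
unconsM (x:xs) = return (x, xs)
unconsM []     = mzero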
But if you're interested in models of pattern matching, the paper is definitely worth a read. It's rigorous where my blog post made blind implicit assertions that something works correctly. And it provides several combinators for processing and composing patterns in a manner somewhat broader than existing pattern matching techniques. It's also interesting to look at how the type system of Haskell interacts with pattern matching, and how pattern matching based on inverses works out in an applicative language.
Anyway, I just thought that it was necessary for people to know that I wasn't the first person to come up with that version of pattern matching.
Update: I kinda rushed this blog post out initially, so I went back and edited it.
Saturday, June 30, 2007
Concatenative pattern matching
Finally, after more than two years of flopping around in my brain, I've implemented a kind of concatenative pattern matching in Factor. Not sure how useful it is, but it was fun. And this blog is called useless Factor for a reason: because I explain useless stuff. I'm not giving software engineering lectures here.
Anyway, it's not really specific to Factor; it'd be fine to implement in Joy too, or basically any other dynamically typed concatenative language where quotations can be manipulated as lists/arrays (though the outcome probably wouldn't be as flexible in Joy, because of Factor's word properties). The basic concept is simple, just requiring a rethinking of pattern matching. Maybe I misunderstood it, but I always thought of ML-style pattern matching as doing two things: matching numeric literals for equality and examining instances of algebraic datatypes. Based on these principles, Chris Double made a great pattern matching library, currently residing in extra/match. But I've always had a problem with it, due to its heavy reliance on named variables. That doesn't detract from its utility; it just bothers me.
A more concatenative solution requires a reanalysis of the problem. Viewed more generally, pattern matching is the inverse of computation. What do I mean by that? Well, take a simple example in Haskell:
import Prelude hiding (sum)

data List a = Nil | Cons a (List a)
sum :: Num a => List a -> a
sum Nil = 0
sum (Cons hd tl) = hd + sum tl
example :: Int
example = sum (Cons 4 (Cons 2 (Cons 5 Nil)))
(If this doesn't make sense, you should read this terse but good introduction to Haskell.)
So, you can think of sum as "undoing" the Cons function when it was called, yielding the arguments. And when Nil was used, it "undoes" that (returning no arguments). When you think of pattern matching in this way, it makes it easier to translate things into a useful form for concatenative languages.
Here's a simple, useless Haskell example of the simplest form of pattern matching, just plain old undoing:
Cons 1 x = Cons 1 2
This sets the variable x to 2. It does this in a number of steps: first, it constructs the Cons 1 2, evaluating the right-hand side. Then, it moves on to the left-hand side. It checks that the value is a Cons; otherwise it fails. Next, it destructures the Cons, giving the values 1 and 2. With these values, it tries to match the 1 against 1 (succeeding) and the 2 against x (binding x to 2). With slightly altered code, the match would fail, yielding a runtime error:
Cons 5 x = Cons 1 2
Obviously, 1 does not match 5, so this fails.
So it looks like the operation here is an unconditional undo. How could we implement this in a concatenative language? It's simple: just reverse the quotation and undo each word in order. Let's take the simple example above and translate it into Factor:
TUPLE: cons hd tl ;
1 2 <cons> [ 1 swap <cons> ] undo
This constructs a cons cell, and then undoes the construction operation. It matches the cons-hd to 1, and the cons-tl is left on the stack at the end of the undone operation. If, for some reason, we didn't have a cons cell, or if the cons-hd wasn't 1, there would be a runtime error, something along the lines of "Unification failed".
So, the inverse of [ 1 swap <cons> ] is the inverse of <cons> followed by the inverse of swap followed by the inverse of 1. The inverse of <cons> just checks to see if the value is a cons. If yes, it returns the values of the slots (except the delegate). If no, it fails. The inverse of swap is, simply, swap. And the inverse of 1 is to pop the top of the stack, failing if it is not 1. So, all put together, the inverse code could be written as:
: undo-cons-swap-1 ( cons -- tl )
dup cons? [ fail ] unless [ cons-hd ] keep cons-tl ! inverse of <cons>
swap ! inverse of swap
1 = [ fail ] unless ; ! inverse of 1
(In reality, it's a bit more complicated when you include unbound variables, so you could undo, for example, dup drop, but let's not get into that just now.)
A second operation is testing whether a match succeeds or fails, returning a boolean. Imagine the Haskell code:
cons1 :: List Int -> Bool
cons1 (Cons 1 _) = True
cons1 _          = False
Below, I'll discuss another way to implement this, but one thing that's possible is to just have [ 1 swap <cons> ] matches?. This attempts to match the item at the top of the stack against the undone operation, returning t if it succeeds and f otherwise. Regardless of success or failure, everything else left on the stack by the undo operation is dropped.
The only real challenge here is what to do with errors. A unification failure should be caught and returned as f, while, say, a stack underflow error should be propagated. In Factor, recover catches all exceptions indiscriminately. To get this behavior, I had to explicitly rethrow errors which weren't unification failures.
Imagine this syntax in Factor for the original Haskell sum function:
TUPLE: nil ;
TUPLE: cons hd tl ;
: sum ( list -- sum )
{
{ [ <nil> ] [ 0 ] }
{ [ <cons> ] [ sum + ] }
} which ;
: example ( -- num )
4 2 5 <nil> <cons> <cons> <cons> sum ;
which is the name I chose for a pattern-matching case: it chooses which of the options matches the data. It's like a Haskell case statement. For each 2array, it tries to undo the first part. If it succeeds, the associated quotation is executed. Otherwise, it moves on to the next 2array. The code it compiles to looks something like this:
: sum ( list -- sum )
[ [ <nil> ] undo 0 ] [
[ [ <cons> ] undo sum + ]
[ no-match ] recover-fail
] recover-fail ;
where recover-fail is a word which is like recover but propagates all errors which are not unification failures.
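For comparison, the policy is easy to state in Haskell terms (my sketch; UnificationFailed and matchesIO are hypothetical names, not part of extra/inverse): catching a specific exception type lets every other error propagate untouched:

import Control.Exception

data UnificationFailed = UnificationFailed deriving Show
instance Exception UnificationFailed

-- Unification failures become False; any other exception propagates.
handler :: UnificationFailed -> IO Bool
handler _ = return False

matchesIO :: IO () -> IO Bool
matchesIO act = (act >> return True) `catch` handler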
This should all generate relatively efficient code. But even if it weren't for that, I'm glad that I got this idea written out, because it's just conceptually interesting.
The code is at extra/inverse if you want to look at it. I haven't written documentation yet.
Note: If you are a Haskeller, you were probably horribly offended by my explanations of Haskell pattern matching. For a correct description of what's going on, see this. I oversimplified and made it sound a bit more imperative than it is.
PS. The reason I say this is concatenative pattern matching is that patterns can be concatenated (backwards). For example, [ 1 swap <cons> ] undo is the same as [ <cons> ] undo [ 1 swap ] undo, which is the same as [ <cons> ] undo [ swap ] undo [ 1 ] undo. This grouping is associative. So it follows the laws of concatenative languages, as outlined by Manfred von Thun. By comparison, pattern matching along the lines of extra/match is applicative, applying constructor functions to arguments, which are constants or variables. My goal here was really more to show that concatenative pattern matching is possible than to build something useful.
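This concatenation law has a tidy Haskell analogue (my sketch, using Maybe on plain values rather than a stack): failable inverses compose with Kleisli composition, in reverse order, and that composition is associative:

import Control.Monad ((<=<))

half :: Int -> Maybe Int          -- inverse of doubling
half n = if even n then Just (n `div` 2) else Nothing

pred1 :: Int -> Maybe Int         -- inverse of adding one
pred1 n = Just (n - 1)

-- The inverse of (\x -> (x + 1) * 2) runs the inverses in reverse order:
undoBoth :: Int -> Maybe Int
undoBoth = pred1 <=< half

-- undoBoth 8  ==>  Just 3, and (3 + 1) * 2 == 8
-- (f <=< g) <=< h == f <=< (g <=< h): the grouping is associative.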