Friday, October 19, 2018
Lisp, mud, and wikis
APL is like a diamond. It has a beautiful crystal structure; all of its parts are related in a uniform and elegant way. But if you try to extend this structure in any way — even by adding another diamond — you get an ugly kludge. LISP, on the other hand, is like a ball of mud. You can add any amount of mud to it [...] and it still looks like a ball of mud!
— R1RS (1978) page 29, paraphrasing remarks attributed to Joel Moses. (The merits of the attribution are extensively discussed in the bibliography of the HOPL II paper on Lisp by Steele and Gabriel (SIGPLAN Notices 28 no. 3 (March 1993), p. 268), but not in either freely-available on-line version of that paper.)
Modern programming languages try to support extension through regimentation, a trend popularized with the structured languages of the 1970s; but Lisp gets its noted extensibility through lack of regimentation. It's not quite that simple, of course, and the difference is why the mud metaphor works so well. Wikis are even muddier than Lisp, which I believe is one of the keys to their power. The Wikimedia Foundation has been trying lately to make wikis cleaner (with, if I may say so, predictably crippling results); but if one tries instead to introduce a bit of interaction into wiki markup (changing the texture of the mud, as it were), crowd-sourcing a wiki starts to look like a sort of crowd-sourced programming — with some weird differences from traditional programming. In this post I mean to explore wiki-based programming and try to get some sense of how deep the rabbit hole goes.
This is part of a bigger picture. I'm exploring how human minds and computers (more generally, non-sapiences) compare, contrast, and interact. Computers, while themselves not capable of being smart or stupid, can make us smarter or stupider (post). Broadly: They make us smarter when they cater to our convenience, giving full scope to our strengths while augmenting our weaknesses; as I've noted before (e.g. yonder), our strengths are closely related to language, which I reckon is probably why textual programming languages are still popular despite decades of efforts toward visual programming. They make us stupider when we cater to their convenience, which typically causes us to judge ourselves by what they are good at and find ourselves wanting. For the current post, I'll "only" tackle wiki-based programming.
I'll wend my way back to the muddiness/Lisp theme further down; first I need to set the stage with a discussion of the nature of wikis.
Contents
Wiki markup
Wiki philosophy
Nimble math
Brittle programming
Mud
Down the rabbit-hole
Wiki perspective
Wiki markup
Technically, a wiki is a collection of pages written in wiki markup, a simple markup language designed for specifying multi-page hypertext documents, characterized by very low-overhead notations, making it exceptionally easy both to read and to write. This, however, is deceptive, because how easy wiki markup is isn't just a function of the ergonomics of its rules of composition, but also of how it's learned in a public wiki community — an instance of the general principle that the technical and social aspects of wikis aren't separable. Here's what I mean.
In wiki markup (I describe wikimedia wiki markup here), to link to another page on the wiki locally called "foo", you'd write [[foo]]; to link to it from some other text "text" instead of from its own name, [[foo|text]]. Specify a paragraph break by a blank line; italics ''baz'', boldface '''bar'''. Specify section heading "quux" by ==quux== on its own line, or ===quux=== for a subsection, ====quux==== for a subsubsection, etc.; a bulleted list by putting each item on its own line starting with *. That's most of wiki markup right there. These rules can be so simple because specifying a hypertext document doesn't usually require specifying a lot of complicated details (unlike, say, the TeX markup language where the point of the exercise is to specify finicky details of typesetting).
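For instance (a made-up snippet, with invented page names, purely to show the texture):

==History==
The '''frobnitz''' was first described by [[Jane Doe]] in ''Annals of Frobnication''. See also:
* [[frobnitz (disambiguation)|other senses of the word]]
* [[widget]]

Nothing in that snippet needs explaining beyond the rules just given, which is rather the point.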
But this core of wiki markup is easier than the description above makes it sound. Why? Because as a wiki user — at least, on a big pre-existing wiki such as Wikipedia, using a traditional raw-markup-editing wiki interface — you probably don't learn the markup primarily by reading a description like that, though you might happen across one. Rather, you start by making small edits to pages that have already been written by others and, in doing so, you happen to see the markup for other things near the thing you're editing. Here it's key that the markup is so easy to read that this incidental exposure to examples of the markup supports useful learning by osmosis. (It should be clear, from this, that bad long-term consequences would accrue from imposing a traditional WYSIWYG editor on wiki users, because a traditional WYSIWYG editor systematically prevents learning-by-osmosis during editing — because preventing the user from seeing how things are done under-the-hood is the purpose of WYSIWYG.)
Inevitably, there will be occasions when more complicated specifications are needed, and so the rules of wiki markup do extend beyond the core I've described. For example, embedding a picture on the page is done with a more elaborate variant of the link notation. As long as these occasional complications stay below some practical threshold of visual difficulty, though (and as long as they remain sufficiently rare that newcomers can look at the markup and sort out what's going on), the learning-by-osmosis effect continues to apply to them. You may not have, say, tinkered with an image on a page before, but perhaps you've seen examples of that markup around, and even if they weren't completely self-explanatory you probably got some general sense of them, so when the time comes to advance to that level of tinkering yourself, you can figure out most or all of it without too much of the terrible fate of reading a help page. Indeed, you may do it by simply copying an example from elsewhere and making changes based on common sense. Is that cargo-cult programming? Or, cargo-cult markup? Maybe, but the markup for images isn't actually all that complicated, so there probably isn't an awful lot of extra baggage you could end up carrying around — and it's in the nature of the wiki philosophy that each page is perpetually a work in progress, so if you can do well enough as a first approximation, you or others may come along later and improve it. And you may learn still more by watching how others improve what you did.
Btw, my description of help pages as a terrible fate isn't altogether sarcastic. Wikis do make considerable use of help pages, but for a simple reason unrelated to the relative effectiveness of help pages. The wiki community are the ones who possess know-how to perform expert tasks on the wiki, so the way to capture that knowledge is to crowd-source the capture to the wiki community; and naturally the wiki community captures it by doing the one thing wikis are good for: building hypertext documents. However, frankly, informational documentation is not a great way to pass on basic procedural knowledge; the strengths of informational documentation lie elsewhere. Information consumers and information providers alike have jokes about how badly documentation works for casual purposes — from the consumer's perspective, "when all else fails, read the instructions"; from the producer's, "RTFM".
Sometimes things get complicated enough that one needs to extend the markup. For those cases, there's a notation for "template calls", which is to say, macro calls; I'll have more to say about that later.
Wiki philosophy
Here's a short-list of key wiki philosophical principles. They more-or-less define, or at least greatly constrain, what it is to be a wiki. I've chosen them with an evident bias toward relevance for the current discussion, and without attempting a comprehensive view of wiki philosophy — although I suspect most principles that haven't made my list are not universal even to the wikimedian sisterhood (which encompasses considerable variation from the familiar Wikipedian pattern).
Each page is perpetually a work in progress; at any given moment it may contain some errors, and may acquire new ones, which might be fixed later. Some pages may have some sort of "completed" state after which changes to them are limited, such as archives of completed discussions; but even archives are merely instances of general wiki pages and, on a sufficiently long time scale, may occasionally be edited for curational purposes.
Pages are by and for the People. This has (at least that I want to emphasize) two parts to it.
Anyone can contribute to a page. Some kinds of changes may have to be signed off on by users the community has designated as specially trusted for the purpose (whatever the purpose is, in that instance); but the driving force is still input from the broader public.
The human touch is what makes wikis worthwhile. Any automation introduced into a wiki setting must be strictly in the service of human decisions, always preserving — or, better, enhancing — the human quality of input. (One could write a whole essay on this point, someone probably should, and quite possibly someone has; but for now I'm just noting it.) People will be more passionate about donating their volunteer labor to a task that gives more scope for their creativity, and to a project that visibly embraces idealistic belief in the value of the human touch; and will lack passion for a task, and project, where their input is apparently no more valued than that of a bot.
Page specification is universally grounded in learn-by-osmosis wiki markup. I suspect this is often overlooked in discussions of wiki philosophy because the subject is viewed from the inside, where the larger sweep of history may be invisible. Frankly, looking back over the past five decades or so of the personal-computing revolution from a wide perspective, I find it glaringly obvious that this is the technical-side sine qua non of the success of wikis.
There is also a meta-principle at work here, deep in the roots of all these principles: a wiki is a human self-organizing system. The principles I've named provide the means for the system to self-organize; cripple them, and the system's dynamic equation is crippled. But this also means that we cannot expect to guide wikis through a conventional top-down approach (which is, btw, another reason why help pages don't work well on a wiki). Only structural rules that guide the self-organization can shape the wiki, and complicated structural rules will predictably bog down the system and create a mess; so core simplicity is the only way to make the wiki concept work.
The underlying wiki software has some particular design goals driven by the philosophical principles.
Graceful degradation. This follows from pages being works in progress; the software platform has to take whatever is thrown at it and make the best use of what's there. This is a point where it matters that the actual markup notations are few and simple: hopefully, most errors in a wiki page will be semantic and won't interfere with the platform's ability to render the result as it was intended to appear. Layout errors should tend to damp out rather than cascade, and it's always easier to fix a layout problem if it results in some sensible rendering in which both the problem and its cause are obvious.
Robustness against content errors. Complementary to graceful degradation: while graceful degradation maximizes positive use of the content, robustness minimizes negative consequences. The robustness design goal is driven both by pages being works in progress and by anyone being able to contribute, in that the system needs to be robust both against consequences of things done by mistake and against consequences of things done maliciously.
Radical flexibility. Vast, sweeping flexibility; capacity to express anything users can imagine, and scope for their imaginations. This follows from the human touch, the by and for the People nature of wikis; the point of the entire enterprise is to empower the users. To provide inherent flexibility and inherent robustness at the deep level where learning-by-osmosis operates and structural rules guide a self-organizing system, is quite an exercise in integrated design. One is reminded of the design principle advocated (though not entirely followed for some years now) by the Scheme reports, to design "not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary"; except, the principle is made even more subtle because placing numeric bounds on operations, in a prosaic technical sense, is often a desirable measure for robustness against content errors, so one has to find structural ways to enable power and flexibility in the system as a whole that coexist in harmony with selective robustness-driven numeric bounds.
Authorization/quality control. This follows from the combination of anyone being able to contribute with need for robustness (both malicious and accidental). A wiki community must be able to choose users who are especially trusted by the community; if it can't do that, it's not a community. Leveraging off that, some changes to the wiki can be restricted to trusted users, and some changes once made may have some sort of lesser status until they've been approved by a trusted user. These two techniques can blur into each other, as a less privileged user can request a change requiring privileges and, depending on how the interface works, the request process might look very similar to making a change subject to later approval.
As software platform design goals for a wiki, imho there's nothing particularly remarkable, let alone controversial, about these.
Nimble math
Shifting gears, consider abstraction in mathematics. In my post some time back on types, I noted
In mathematics, there may be several different views of things any one of which could be used as a foundation from which to build the others. That's essentially perfect abstraction, in that from any one of these levels, you not only get to ignore what's under the hood, but you can't even tell whether there is anything under the hood. Going from one level to the next leaves no residue of unhidden details: you could build B from A, C from B, and A from C, and you've really gotten back to A, not some flawed approximation of it[.]
Suppose we have a mathematical structure of some sort; for example, matroids (I had the fun some years back of taking a class on matroids from a professor who was coauthoring a book on the subject). There are a passel of different ways to define matroids, all equivalent but not obviously so (the Wikipedia article uses the lol-worthily fancy term cryptomorphic). Certain theorems about matroids may be easier to prove using one or another definition, but once such a theorem has been proven, it doesn't matter which definition we used when we proved it; the theorem is simply true for matroids regardless of which equivalent definition of matroid is used. Having proven it we completely forget the lower-level stuff we used in the proof, a characteristic of mathematics related to discharging the premise in conditional proof (which, interestingly, seems somehow related to the emergence of antinomies in mathematics, as I remarked in an earlier post).
Generality in mathematics is achieved by removing things; it's very easy to do, just drop some of the assumptions about the structures you're studying. Some of the theorems that used to hold are no longer valid, but those drop out painlessly. There may be a great deal of work involved in figuring which of the lost theorems may be re-proven in the broader setting, with or without weakening their conclusions, but that's just icing on the cake; nothing about the generalization can diminish the pre-existing math, and new results may apply to special cases as facilely as matroid theorems would leap from one definition to another. Specialization works just as neatly, offering new opportunities to prove stronger results from stronger assumptions without diminishing the more general results that apply.
Brittle programming
There seems to be a natural human impulse to seek out general patterns, and then want to explain them to others. This impulse fits in very well with Terrence Deacon's vision of the human mind in The Symbolic Species (which I touched on tangentially in my post on natural intelligence) as synthesizing high-level symbolic ideas for the purpose of supporting a social contract (hence the biological importance of sharing the idea once acquired). This is a familiar driving force for generalization in mathematics; but I see it also as an important driving force for generalization in programming. We've got a little program to do some task, and we see how to generalize it to do more (cf. this xkcd); and then we try to explain the generalization. If we were explaining it to another human being, we'd probably start to explain our new idea in general terms, "hand-waving". As programmers, though, our primary dialog isn't with another human being, but with a computer (cf. Philip Guo's essay on User Culture Versus Programmer Culture). And computers don't actually understand anything; we have to spell everything out with absolute precision — which is, it seems to me, what makes computers so valuable to us, but also trips us up because in communicating with them we feel compelled to treat them as if they were thinking in the sense we do.
So instead of a nimble general idea that synergizes with any special cases it's compatible with, the general and specific cases mutually enhancing each other's power, we create a rigid framework for specifying exactly how to do a somewhat wider range of tasks, imposing limitations on all the special cases to which it applies. The more generalization we do, the harder it is to cope with the particular cases to which the general system is supposed to apply.
It's understandable that programmers who are also familiar with the expansive flexibility of mathematics might seek to create a programming system in which one only expresses pure mathematics, hoping thereby to escape the restrictions of concrete implementation. Unfortunately, so I take from the above, the effort is misguided because the extreme rigidity of programming doesn't come from the character of what we're saying — it comes from the character of what we're saying it to. If we want to escape the problem of abstraction, we need a strategy to cope with the differential between computation and thought.
Mud
Lisp [...] made me aware that software could be close to executable mathematics.
— ACM Fellow Profile of L. Peter Deutsch.
Part of the beauty of the ball-of-mud quote is that there really does seem to be a causal connection between Lisp's use of generic, unencapsulated, largely un-type-checked data structures and its extensibility. The converse implication is that we should expect specialization, encapsulation, and aggressive type-checking of data structures — all strategies extensively pursued in various combinations in modern language paradigms — each to retard language extension.
Each of these three characteristics of Lisp serves to remove some kind of automated restriction on program form, leaving the programmer more responsible for judging what is appropriate — and thereby shifting things away from close conversation with the machine, enabling (though perhaps not actively encouraging) more human engagement. The downside of this is likely obvious to programming veterans: it leaves things more open to human fallibility. Lisp has a reputation as a language for good programmers. (As a remark attributed to Dino Dai Zovi puts it, "We all know that Lisp is the best language around, but in the hands of most it becomes like that scene in Fantasia when Mickey Mouse gets the wand.")
Even automatic garbage collection and bignums can be understood as reducing the depth of the programmer's conversation with the computer.
I am not, just atm, taking sides on whether the ability to define specialized data types makes a language more or less "general" than doing everything with a single amorphous structure (such as S-expressions). If you choose to think of a record type as just one example of the arbitrary logical structures representable using S-expressions, then a minimal Lisp, by not supporting record types, is declining to impose language-level restrictions on the representing S-expressions. If, on the other hand, you choose to think of a cons cell as just one example of a record type (with fields car and cdr), then "limiting" the programmer to a single record type minimizes the complexity of the programmer's forced interaction with the language-level type restrictions. Either way, the minimal-Lisp approach downplays the programmer's conversation with the computer, leaving the door open for more human engagement.
Wikis, of course, are pretty much all about human engagement. While wiki markup minimizes close conversation with the computer, the inseparable wiki social context does actively encourage human engagement.
Down the rabbit-hole
At the top of the post I referred to what happens if one introduces a bit of interaction into wiki markup. But, as with other features of wiki markup, this what-if cannot be separate from the reason one would want to do it. And as always with wikis, the motive is social. The point of the exercise — and this is being tried — is to enhance the wiki community's ability to capture their collective expertise at performing wiki tasks, by working with and enhancing the existing strengths of wikis.
The key enabling insight for this enhancement was that by adding a quite small set of interaction primitives to wiki markup (in the form, rather unavoidably, of a small set of templates), one can transform the whole character of wiki pages from passive to active. Most of this is just two primitives: one for putting text input boxes on a page, and another for putting buttons on the page that, when clicked, take the data in the various input boxes from the page and send it somewhere — to be transformed, used to initialize text boxes on another page, used to customize the appearance of another page, or, ultimately, used to devise an edit to another page. Suddenly it's possible for a set of wiki pages to be, in effect, an interactive wizard for performing some task, limited only by what the wiki community can collectively devise.
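To give a flavor of what this looks like in markup — with hypothetical template names here, not the ones actually used in the experiment — a page might contain something like

{{dialog input|name=headline|default=Untitled}}
{{dialog button|label=Preview|send=headline|to=Project/preview}}

so that clicking the button sends whatever has been typed into the headline field off to another page that does something with it. The particular notation isn't the point; what matters is that it's still ordinary learn-by-osmosis template markup.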
Notice the key wiki philosophical principles still in place: each page perpetually a work in progress, by and for the people, with specification universally grounded in learn-by-osmosis wiki markup. It seems reasonable to suppose, therefore, that the design principles for the underlying wiki software, following from those wiki philosophical principles, ought also to remain in place. (In the event, the linked "dialog" facility is, in fact, designed with graceful degradation, robustness, and flexibility in mind and makes tactical use of authorization.)
Consider, though, the nature of the new wiki content enabled by this small addition to the primitive vocabulary of wiki markup. Pages with places to enter data and buttons that then take the entered data and, well, do stuff with it. At the upper end, as mentioned, a collection of these would add up to a crowd-sourced software wizard. The point of the exercise is to bring the wiki workflow to bear on growing these things, to capture the expertise uniquely held by the wiki community; so the wiki philosophical principles and design principles still apply, but now what they're applying to is really starting to look like programming; and whereas all those principles were pretty tame when we were still thinking of the output of the process as hypertext, applying them to a programming task can produce some startling clashes with the way we're accustomed to think about programming — as well as casting the wiki principles themselves in a different light than in non-interactive wikis.
Wikis have, as noted, a very mellow attitude toward errors on a wiki page. Programmers as a rule do not. The earliest generation of programmers seriously expected to write programs that didn't have bugs in them (and I wouldn't bet against them); modern programmers have become far more fatalistic about the occurrences of bugs — but one attitude that has, afaics, never changed is the perception that bugs are something to be stamped out ASAP once discovered. Even if some sort of fault-tolerance is built into a computer system, the gut instinct is still that a bug is an inherent evil, rather than a relatively less desirable state. It's a digital rather than analog, discrete rather than continuous, view. Do wikis want to eliminate errors? Sure; but it doesn't carry the programmer's sense of urgency, of need to exterminate the bug with prejudice. Some part of this is, of course, because a software bug is behavioral, and thus can actually do things so its consequences can spread in a rapid, aggressive way that passive data does not... although errors in Wikipedia are notorious for the way they can spread about the infosphere — at a speed more characteristic of human gossip than computer processing.
The interactive-wiki experiment was expected from the outset to be, in its long-term direction, unplannable. Each individual technical step would be calculated and taken without probing too strenuously what step would follow it. This would be a substantially organic process; imagine a bird adding one twig at a time to a nest, or a painter carefully thinking out each brush stroke... or a wiki page evolving through many small edits.
In other words, the interactivity device —meant to empower a wiki community to grow its own interactive facilities in much the same way the wiki community grows the wiki content— would itself be developed by a process of growth. Here it's crucial —one wants to say, vital— that the expertise to be captured is uniquely held by the wiki community. This is also why a centralized organization, such as the Wikimedia Foundation, can't possibly provide interactive tools to facilitate the sorts of wiki tasks we're talking about facilitating: the centralized organization doesn't know what needs doing or how to do it, and it would be utterly unworkable to have people petitioning a central authority to please provide the tools to aid these things. The central authority would be clueless about what to do at every single decision point along the way, from the largest to the smallest decision. The indirection would be prohibitively clumsy, which is also why wiki content doesn't go through a central authority: wikis are a successful content delivery medium because when someone is, say, reading a page on Wikipedia and sees a typo, they can just fix it themselves, in what turns out to be a very straightforward markup language, rather than going through a big, high-entry-cost process of petitioning a central authority to please fix it. What it all adds up to, on the bottom line, is the basic principle that wikis are for the People, and that necessarily applies just as much to knowledge of how to do things on the wiki as it does to the provided content.
Hence the on-wiki tasks need to be semi-automated, not fully automated: they're subject to concerns similar to drive-by-wire or fly-by-wire, in which Bad Things Happen if the interface is designed in a way that cuts the human operator out of the process. Not all the problems of fly/drive-by-wire apply (I've previously ranted, here and there, on misdesign of fly-by-wire systems); but the human operator needs not only to be directly told some things, but also to be given peripheral information. Sapient control of the task is essential, and in the case of wikis is an important part of the purpose of the whole exercise; and peripheral information is a necessary enabler for sapient control: the stuff the human operator catches in the corner of their eye allows them to recognize when things are going wrong (sometimes through subliminal clues that low-bandwidth interfaces would have excluded, sometimes through contextual understanding that automation lacks); and peripheral information also enables the training-by-osmosis that makes the user more able to do things and recognize problems subliminally and have contextual understanding.
These peculiarities are at the heart of the contrast with conventional programming. Most forms of programming clearly couldn't tolerate the mellow wiki attitude toward bugs; but we're not proposing to do most forms of programming. An on-wiki interactive assistant isn't supposed to go off and do things on its own; that sort of rogue software agent, which is the essence of the conventional programmer's zero-tolerance policy toward bugs, would be full automation, not semi-automation. Here the human operator is supposed to be kept in the loop and well-informed about what is done and why and whatever peripheral factors might motivate occasional exceptions, and be readily empowered to deviate from usual behavior. And when the human operator thinks the assistant could be improved, they should be empowered to change or extend it.
At this point, though, the rabbit hole rather abruptly dives very deep indeed. In the practical experiment, as soon as the interaction primitives were on-line and efforts began to apply them to real tasks, it became evident that using them was fiendishly tricky. It was really hard to keep track of all the details involved, in practice. Some of that, of course, has to be caused by the difficulty of overlaying these interactions on top of a wiki platform that doesn't natively support them very well (a politically necessary compromise, to operate within the social framework of the wikimedian sisterhood); but a lot of it appears to be due to the inherent volatility wiki pages take on when they become interactive. This problem, though, suggests its own solution. We've set out to grow interactive assistants to facilitate on-wiki tasks. The growing of interactive assistants is itself an on-wiki task. What we need is a meta-assistant, a semi-automated assistant to help users with the creation and maintenance of semi-automated assistants.
It might seem as if we're getting separated from the primitive level of wiki markup, but in fact the existence of that base level is a necessary stabilizing factor. Without a simple set of general primitives underneath, defining an organic realm of possibilities for sapient minds to explore, any elaborate high-level interface naturally devolves into a limited range of possibilities anticipated by the designer of the high-level interface (not unlike the difference between a multiple-choice quiz and an essay question; free-form text is where sapient minds can run rings around non-sapient artifacts).
It's not at all easy to design a meta-assistant to aid managing high-level design of assistants while at the same time not stifling the sapient user's flexibility to explore previously unimagined corners of the design space supported by the general primitives. Moreover, while pointedly not limiting the user, the meta-assistant can't help guiding the user, and while this guidance should be generally minimized and smoothed out (lest the nominal flexibility become a sudden dropping off into unsupported low-level coding when one steps off the beaten path), the guidance also has to be carefully chosen to favor properties of assistants that promote success of the overall interactive enterprise; as a short list,
- Avoid favoring negative features such as inflexibility of the resulting assistants.
- Preserve the reasoning behind exceptions, so that users aren't driven to relitigate exceptional decisions over and over until someone makes the officially "unexceptional" choice.
- Show the user, unobnoxiously, what low-level manipulations are performed on their behalf, in some way that nurtures learning-by-osmosis. (On much the same tack, assistants should be designed —somehow or other— to coordinate with graceful degradation when the interactivity facilities aren't available.)
- The shape of the assistants has to coordinate well with the strategy that the meta-assistant uses to aid the user in coping with failures of behavior during assisted operations — as the meta-assistant aids in both detecting operational problems and in adjusting assistants accordingly.
- The entire system has to be coordinated to allow recovery of earlier states of assistants when customizations to an assistant don't work out as intended — which becomes extraordinarily fraught if one contemplates customizations to the meta-assistant (at which point, one might consider drawing inspiration from the Lispish notion of a reflective tower; I'm deeply disappointed, btw, to find nothing on the wikimedia sisterhood about reflective towers; cf. Wand and Friedman's classic paper [pdf]).
Wiki perspective
As noted earlier, the practical experiment is politically (thus, socially) constrained to operate within the available wikimedia platform, using strictly non-core facilities —mostly, JavaScript— to simulate interactive primitives. Despite some consequent awkward rough spots at the interface between interactivity and core platform, a lightweight prototype seemed appropriate to start with because the whole nature of the concept appeared to call for an agile approach — again, growing the facility. However, as a baseline of practical interactive-wiki experience has gradually accumulated, it is by now possible to begin envisioning what an interactive-core wiki platform ought to look like.
Beyond the basic interactive functionality itself, there appear to be two fundamental changes wanted to the texture of the platform interface.
The basic interaction functionality is mostly about moving information around. Canonically, interactive information originates from input fields on the wiki page, each specified in the wiki markup by a template call (sometimes with a default value specified within the call); and interactive information is caused to move by clicking a button, any number of which may occur on the wiki page, each also specified in the wiki markup by a template call. Each input field has a logical name within the page, and each button names the input fields whose data it is sending, along with the "action" to which the information is to be sent. There are specialized actions for actually modifying the wiki —creating or editing a page, or more exotic things like renaming, protecting, or deleting a page— but most buttons just send the information to another page, where two possible things can happen to it. Entirely within the framework of the interactive facility, incoming information to a page can be used to initialize an input field of the receiving page; but, more awkwardly (because it involves the interface between the lightweight interactive extension and the core non-interactive platform), incoming information to a page can be fed into the template facility of the platform. This wants a bit of explanation about wiki templates.
Wiki pages are (for most practical purposes) in a single global namespace, all pages fully visible to each other. A page name can be optionally separated into "fields" using slashes, a la UNIX file names, which is useful in keeping track of intended relationships between pages, but with little-to-no use of relative names: page names are almost always fully qualified. A basic "template call" is delimited on the calling page by double braces ("{{}}") around the name of the callee; the contents of the callee are substituted (in wiki parlance, transcluded) into the calling page at the point of call. To nip potential resource drains (accidental or malicious) in the bud, recursion is simply disallowed, and a fairly small numerical bound is placed on the depth of nested calls. Optionally, a template can be parameterized, with arguments passed through the call, using a pipe character ("|") to separate the arguments from the template name and from each other; arguments may be explicitly named, or unnamed in which case the first unnamed parameter gets the name "1", the second "2", and so on. On the template page, a parameter name delimited by triple curly braces ("{{{}}}") is replaced by the argument with that name when the template is transcluded.
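A concrete (if contrived) illustration: a template page — say, Template:Greeting — containing

Hello, {{{1}}}, welcome to {{{place}}}!

could be called from another page as {{Greeting|Pat|place=Wikibooks}}, transcluding the text "Hello, Pat, welcome to Wikibooks!"; the unnamed argument Pat gets the parameter name 1, while place is passed by name.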
In conventional programming-language terms, wiki template parameters are neither statically nor dynamically scoped; rather, all parameter names are strictly local to the particular template page on which they occur. This is very much in keeping with the principle of minimizing conversation with the machine; the meaning-as-template of a page is unaffected by the content of any other page whatever. Overall, the wiki platform has a subtle flavor of dynamic scoping about it, but not from parameter names, nor from how pages name each other; rather, from how a page names itself. The wiki platform provides a "magic word" —called like a template but performing a system service rather than transcluding another page— for extracting the name of the current page, {{PAGENAME}} (this'll do for a simple explanation). But here's the kicker: wiki markup {{PAGENAME}} doesn't generate the name of the page on which the markup occurs, but rather, the name of the page currently being displayed. Thus, when you're writing a template, {{PAGENAME}} doesn't give you the name of the template you're writing, but the name of whatever page ultimately calls it (which might not even be the name of the page that directly calls the one you're writing). This works out well in practice, perhaps because it focuses you on the act of display, which is after all the proper technical goal of wiki markup. (There was, so I understand, a request long ago from the wikimedia user community for a magic word naming the page on which the markup occurs, but the platform developers declined the request; apparently, though we've already named here a couple of pretty good design reasons not to have such a feature —minimal conversation, and display focus— the developers invoked some sort of efficiency concern.)
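For illustration (invented page names again): if a page Where consists of just "You are reading {{PAGENAME}}.", then viewing that page directly displays "You are reading Where.", while a page named Sandbox that transcludes it displays "You are reading Sandbox." — the magic word tracks the page being displayed, not the page on which the markup was written.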
Getting incoming parameters from the dialog system to the template system can be done, tediously under-the-hood, by substituting data into the page — when the page is being viewed by means of an "action" (the view action) rather than directly through the wiki platform. This shunting of information back-and-forth between dialog and template is awkward and falls apart on some corner cases. Presumably, if the interactive facility were integrated into the platform, there ought to be just one kind of parameter, and just one kind of page-display so that the same mechanism handles all parameters in all cases. This, however, implies the first of the two texture changes indicated to the platform. The wikimedia API supports a request to the server to typeset a wiki text into html, which is used by the view action when shunting information from dialog parameters to template parameters; but, crucially, the API only supports this typesetting on an unadorned wiki text — that is, a wiki text without parameters.
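(For the curious: the API request in question is roughly of the shape api.php?action=parse&text=...wiki text to typeset... — treat that as a sketch, the details vary — and there is no slot in such a request for supplying arguments to the text being typeset, which is precisely the limitation at issue.)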
To properly integrate interactivity into the platform, both the fundamental unit of modular wiki information, and the fundamental unit of wiki activity, have to change, from unadorned to parameterized. The basic act of the platform is then not displaying a page, but displaying a page given a set of arguments. This should apply, properly, to both the API and the user interface. Even without interactivity, it's already awkward to debug a template because one really wants to watch what happens as the template receives arguments, and follow the arguments as they flow downward and results flow back upward through nested template calls; which would require an interface oriented toward the basic unit of a wiki text plus arguments, rather than just a wiki text. Eventually, a debugging interface of this sort may be constructed on top of the dialog facility for wikimedia; in a properly integrated system, it would be basic platform functionality.
The second texture change I'm going to recommend to the platform is less obvious, but I've come to see it as a natural extension of shifting from machine conversation to human.
Wiki template calls are a simple process (recall the absence of complicated scoping rules) well grounded in wiki markup; but even with those advantages, as nested call depth increases, transclusion starts to edge over into computational territory. The Wikimedia Foundation has apparently developed a certain corporate dislike for templates, citing caching difficulties (which they've elected not to try to mitigate as much as they might) and computational expense, and ultimately turning to an overtly computational (and, by my reckoning, philosophically anti-wiki) use of Lua to implement templates under-the-hood.
As a source of inspiration, though, consider transitive closures. Wikibooks (another wikimedia sister of Wikinews and Wikipedia) uses one of these, where the entire collection of about 3000 books is hierarchically arranged in about 400 subjects, and a given book when filed in a subject has to be automatically put in all the ancestors of that subject. Doing this with templates, without recursion, it can still be managed by a chain of templates calling each other in sequence (with some provision that if the chain isn't long enough, human operator intervention can be requested to lengthen it); but then, there's also the fixed bound on nesting depth. In practice the fixed bound only supports a tower of six or seven nested subjects. This could, of course, be technically solved the Foundation's way, by replacing the internals of the template with a Lua module which would be free to use either recursion or iteration, at the cost of abandoning some central principles of wiki philosophy; but there's another way. If we had, for each subject, a pre-computed list of all its ancestors, we wouldn't need recursion or iteration to simply file the book in all of them. So, for each subject keep a list of its ancestors tucked away in an annex to the subject page; and let the subject page check its list of ancestors, to make sure it's the same as what you get by merging its parents with its parents' lists of ancestors (which the parents, of course, are responsible for checking); and if the check fails, request human operator intervention to fix it — preferably, with a semi-automated assistant to help. If someone changes the subject hierarchy, recomputing various ancestor lists then needs human beings to sign off on the changes (which is perhaps just as well, since changing the subject hierarchy is a rather significant act that ought to draw some attention).
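A tiny worked instance of the bookkeeping, with invented subject names: suppose subject Lisp is filed under Programming, which is filed under Computing. The annex for Lisp lists its ancestors (Programming, Computing), which is just its parent Programming merged with Programming's own annex (Computing); filing a book under Lisp then also files it under Programming and Computing simply by reading that list, with no recursion and no deep nesting of template calls. If someone reparents Programming elsewhere in the hierarchy, the annexes for Programming and for Lisp stop passing their checks, and human operators are asked to sign off on the recomputed lists.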
This dovetails nicely into a likely strategy for avoiding bots —fully automated maintenance-task agents, which mismatch the by and for the People wiki principle— by instead providing for each task a semi-automated assistant and a device for requesting operator intervention. So, what if we staggered all template processing into these sorts of semi-automated steps? This might also dovetail nicely with parameterizing the fundamental unit of modular wiki information, as a deferred template call is just this sort of parameterized modular unit.
There are some interesting puzzles to work out around the edges of this vision. Of particular interest to me is the template call mechanism. On the whole it works remarkably well for its native purpose; one might wonder, though, whether it's possible (and whether it's desirable, which does not go without saying) to handle non-template primitives more cleanly, and whether there is anything meaningful to be applied here from the macro/fexpr distinction. The practical experiment derives an important degree of agility from the fact that new "actions" can be written on the fly in JavaScript (not routinely, of course, but without the prohibitive political and bureaucratic hurdles of altering the wikimedia platform). Clearly there must be a line drawn somewhere between the province of wiki markup presented to the wiki community, and other languages used to implement the platform; but the philosophy I've described calls for expanding the wiki side of that boundary as far as possible, and if one is rethinking the basic structure of the wiki platform, that might be a good moment to pause and consider what might be done on that front.
As a whole, though, this prospect for an inherently interactive wiki platform seems to me remarkably coherent, sufficiently that I'm very tempted to thrash out the details and make it happen.
Monday, August 29, 2016
Interpreted programming languages
Last night I drifted off while reading a Lisp book.
— xkcd 224.
It's finally registered on me that much of the past half century of misunderstandings and confusions about the semantics of Lisp, of quotation, and, yes, of fexprs, can be accounted for by failure to recognize there is a theoretical difference between an interpreted programming language and a compiled programming language. If we fail to take this difference into account, our mathematical technology for studying compiled programming languages will fail when applied to interpreted languages, leading to loss of coherence in language designs and a tendency to blame the language rather than the theory.
Technically, a compiler translates the program into an executable form to be run thereafter, while an interpreter figures out what the program says to do and just does it immediately. Compilation allows higher-performance execution, because the compiler takes care of reasoning about source-code before execution, usually including how to optimize the translation for whatever resources are prioritized (time, space, peripherals). It's easy to suppose this is all there is to it; what the computer does is an obvious focus for attention. One might then suppose that interpretation is a sort of cut-rate alternative to compilation, a quick-and-dirty way to implement a language if you don't care about performance. I think this misses some crucial point about interpretation, some insight to be found not in the computer, but in the mind of the programmer. I don't understand that crucial insight clearly enough — yet — to focus a whole blog post on it; but meanwhile, there's this theoretical distinction between the two strategies which also isn't to be found in the computer's-eye view, and which I do understand enough about to focus this blog post on.
It's not safe to say the language is the same regardless of which way it's processed, because the language design and the processing strategy aren't really independent. In principle a given language might be processed either way; but the two strategies provide different conceptual frameworks for thinking about the language, lending themselves to different language design choices, with different purposes — for which different mathematical properties are of interest and different mathematical techniques are effective. This is a situation where formal treatment has to be considered in the context of intent. (I previously blogged about another case where this happens, formal grammars and term-rewriting calculi, which are both term-rewriting systems but have nearly diametrically opposite purposes; over thar.)
I was set onto the topic of this post recently by some questions I was asked about Robert Muller's M-Lisp. My dissertation mentions Muller's work only lightly, because Muller's work and mine are so far apart. However, because they seem to start from the same place yet lead in such different directions, one naturally wants to know why, and I've struck on a way to make sense of it: starting from the ink-blot of Lisp, Muller and I both looked to find a nearby design with greater elegance — and we ended up with vastly different languages because Muller's search space was shaped by a conceptual framework of compilation while mine was shaped by a conceptual framework of interpretation. I will point out, below, where his path and mine part company, and remark briefly on how this divergence affected his destination.
The mathematical technology involved here, I looked at from a lower-level perspective in an earlier post. It turns out, from a higher-level perspective, that the technology itself can be used for both kinds of languages, but certain details in the way it is usually applied only work with compiled languages and, when applied to interpreted languages, result in the trivialization of theory noted by Wand's classic paper, "The Theory of Fexprs is Trivial".
Contents
M-language
Compilation
Universal program
S-language
Half a century's worth of misunderstandings and confusions
M-language
I'll explore this through a toy programming language which I'll then modify, starting with something moderately similar to what McCarthy originally described for Lisp (before it got unexpectedly implemented).
This is a compiled programming language, without imperative features, similar to λ-calculus, for manipulating data structures that are nested trees of atomic symbols. The syntax of this language has two kinds of expressions: S-expressions (S for Symbolic), which don't specify any computation but merely specify constant data structures — trees of atomic symbols — and M-expressions (M for Meta), which specify computations that manipulate these data structures.
S-expressions, the constants of the language, take five forms:
S ::= s | () | #t | #f | (S . S) .
That is, an S-expression is either a symbol, an empty list, true, false, or a pair (whose elements are called, following Lisp tradition, its car and cdr). A symbol name is a sequence of one or more upper-case letters. (There should be no need, given the light usage in this post, for any elaborate convention to clarify the difference between a symbol name and a nonterminal such as S.)
I'll assume the usual shorthand for lists, (S ...) ≡ (S . (...)), so for example (FOO BAR QUUX) ≡ (FOO . (BAR . (QUUX . ()))); but I won't complicate the formal grammar with this shorthand since it doesn't impact the abstract syntax.
The forms of M-expressions start out looking exactly like λ-calculus, then add on several other compound forms and, of course, S-expressions which are constants:
M ::= x | [λx.M] | [M M] | [if M M M] | [car M] | [cdr M] | [cons M M] | [eq? M M] | S .
The first form is a variable, represented here by nonterminal x. A variable name will be a sequence of one or more lower-case letters. Upper- versus lower-case letters is how McCarthy distinguished between symbols and variables in his original description of Lisp.
Variables, unlike symbols, are not constants; rather, variables are part of the computational infrastructure of the language, and any variable might stand for an arbitrary computation M.
The second form constructs a unary function, via λ; the third applies a unary function to a single argument. if expects its first argument to be boolean, and returns its second if the first is true, third if the first is false. car and cdr expect their argument to be a pair and extract its first and second element respectively. cons constructs a pair. eq? expects its two arguments to be S-expressions, and returns #t if they're identical, #f if they're different.
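A small example, just to fix ideas using the constructs described above: [if [eq? [car (FOO . BAR)] FOO] [λx.x] (BAZ)] is an M-expression that extracts the car of the constant pair (FOO . BAR), compares it with the symbol FOO, and accordingly yields either the identity function or the constant list (BAZ).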
Compound unary functions, constructed by λ, are almost first-class. They can be returned as the values of expressions, and they can be assigned to variables; but as the grammar is set out, they cannot appear as elements of data structures. A pair expression is built up from two S-expressions, and a compound unary function is not an S-expression. McCarthy's original description of Lisp defines S-expressions this way; his stated purpose was only to manipulate trees of symbols. Trained as I am in the much later Lisp culture of first-class functions and minimal constraints, it felt unnatural to follow this narrower definition of S-expressions; but in the end I had to define it so. I'm trying to reproduce the essential factors that caused Lisp to come out the way it did, and, strange to tell, everything might have come out differently if not for the curtailed definition of S-expressions.
(One might ask, couldn't we indirectly construct a pair with a function in it using cons? A pair with a function in it would thus be a term that can't be represented directly by a source expression. This point likely doesn't matter directly to how things turned out; but fwiw, I suspect McCarthy didn't have in mind to allow that, no. It's entirely possible he also hadn't really thought about it yet at the preliminary stage of design where this detail had its impact on the future of Lisp. It's the sort of design issue one often discovers by playing around with a prototype — and in this case, playing around with a prototype is how things got out of hand; more on that later.)
Compilation
Somehow we want to specify the meanings of programs in our programming language. Over the decades, a number of techniques for formal PL semantics have been entertained. One of the first things tried was to set up a term-rewriting machine modeled roughly on λ-calculus, that would perform small finite steps until it reached a final state; that was Peter Landin's notion, when Christopher Strachey set him onto the problem in the 1960s. It took some years to get the kinks out of that approach, and meanwhile other techniques were tried — such as denotational semantics, meta-circular evaluation — but setting up a term-rewriting calculus has been quite a popular technique since the major technical obstacles to it were overcome. Of the three major computational models tied together in the 1930s by the Church-Turing thesis, two of them were based on term-rewriting: Turing machines, which were the convincing model, the one that lent additional credibility to the others when they were all proven equivalent; and λ-calculus, which had mathematical elegance that the nuts-and-bolts Turing-machine model lacked. The modern "small-step" rewriting approach to semantics (as opposed to "big-step", where one deduces how to go in a single step from start to finish) does a credible job of combining the strengths of Turing-machines and λ-calculus; I've a preference for it myself, and it's the strategy I'll use here. I described the technique in somewhat more depth in my previous post on this material.
Small-step semantics applies easily to this toy language because every intermediate computational state of the system is naturally represented by a source-code expression. That is, there is no obvious need to go beyond the source-code grammar we've already written. Some features that have been omitted from the toy language would make it more difficult to limit all computational states to source expressions; generally these would be "stateful" features, such as input/output or a mutable store. Landin used a rewriting system whose terms (computational states) were not source-code expressions. One might ask whether there are any language features that would make it impossible to limit computational states to source expressions, and the answer is essentially yes, there are — features related not to statefulness, but to interpretation. For now, though, we'll assume that all terms are source expressions.
We can define the semantics of the language in reasonably few rewriting schemata.
[[λx.M1] M2] → M1[x ← M2]
[if #t M1 M2] → M1
[if #f M1 M2] → M2
[car (S1 . S2)] → S1
[cdr (S1 . S2)] → S2
[cons S1 S2] → (S1 . S2)
[eq? S S] → #t
[eq? S1 S2] → #f if the Sk are different.
Rewriting relation → is the compatible closure of these schemata; that is, for any context C and terms Mk, if M1 → M2 then C[M1] → C[M2]. Relation → is also Church-Rosser: although a given term M may be rewritable in more than one way, any resulting difference can always be eliminated by later rewriting. That is, the reflexive transitive closure →* has the diamond property: if M1 →* M2 and M1 →* M3, then there exists M4 such that M2 →* M4 and M3 →* M4.
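A small illustration of the diamond property, using the schemata above: the term [[λx.[cons x x]] [car (FOO . BAR)]] has two redexes. Rewriting the outer one gives [cons [car (FOO . BAR)] [car (FOO . BAR)]]; rewriting the inner one instead gives [[λx.[cons x x]] FOO]; and a few more steps bring either result to [cons FOO FOO] and thence to (FOO . FOO).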
Formal equality of terms, =, is the symmetric closure of →* (thus, the reflexive symmetric transitive compatible closure of the schemata, which is to say, the least congruence containing the schemata).
Another important relation is operational equivalence, ≅. Two terms are operationally equivalent just if replacing either by the other in any possible context preserves the observable result of the computation. M1 ≅ M2 iff for every context C and S-expression S, C[M1] ↦* S iff C[M2] ↦* S. (Fwiw, relation ↦ is what the computation actually does, versus → which is anything the rewriting calculus could do; → is compatible Church-Rosser and therefore nice to work with mathematically, but ↦ is deterministic and therefore we can be sure it does what we meant it to. Another way of putting it is that → has the mathematical character of λ-calculus while ↦ has the practical character of Turing-machines.)
Operational equivalence is exactly what must be guaranteed in order for an optimizing compiler to safely perform a local source-to-source transformation: as long as the two terms are operationally equivalent, the compiler can replace one with the other in any context. The rewriting calculus is operationally sound if formal equality implies operational equivalence; then the rewriting calculus can supply proofs of operational equivalence for use in optimizing compilation.
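For a tiny example: the schema instance [if #t M1 M2] → M1 gives the formal equality [if #t M1 M2] = M1; so if the calculus is operationally sound — which one would hope to prove — then [if #t M1 M2] ≅ M1, licensing a compiler to fold away the test in any context, whatever the discarded branch M2 may be.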
Before moving on, two other points of interest about operational equivalence:
Operational equivalence of S-expressions is trivial; that is, S1 and S2 are operationally equivalent only if they are identical. This follows immediately by plugging the trivial context into the definition of operational equivalence, C[Sk] ≡ Sk. Thus, in every non-trivial operational equivalence M1 ≅ M2, at least one of the Mk is not an S-expression.
All terms in the calculus — all M-expressions — are source expressions; but if there were any terms in the calculus that were not source expressions, they would be irrelevant to a source-to-source optimizer; however, if these non-source terms could be usefully understood as terms in an intermediate language used by the compiler, an optimizer might still be able to make use of them and their formal equalities.
McCarthy's Lisp language was still in its infancy when the project took an uncontrollable turn in a radically different direction than McCarthy had envisioned going with it. Here's what happened.
A standard exercise in theory of computation is to construct a universal Turing machine, which can take as input an encoding of an arbitrary Turing machine T and an input w to T, and simulate what T would do given input w. This is an extremely tedious exercise; the input to a Turing machine looks nothing like the control device of a Turing machine, so the encoding is highly intrusive, and the control device of the universal machine is something of an unholy mess. McCarthy set out to lend his new programming language mathematical street cred by showing that not only could it simulate itself with a universal program, but the encoding would be much more lucid and the logic simpler in contrast to the unholy mess of the universal Turing machine.
The first step of this plan was to describe an encoding of programs in the form of data structures that could be used as input to a program. That is to say, an encoding of M-expressions as S-expressions. Much of this is a very straightforward homomorphism, recursively mapping the non-data M-expression structure onto corresponding S-expressions; for our toy language, encoding φ would have
- φ(x) = symbol s formed by changing the letters of its name from lower-case to upper-case. Thus, φ(foo) = FOO.
- φ([λx.M]) = (LAMBDA φ(x) φ(M)).
- φ([M1 M2]) = (φ(M1) φ(M2)).
- φ([if M1 M2 M3]) = (IF φ(M1) φ(M2) φ(M3)).
- φ([car M]) = (CAR φ(M)).
- φ([cdr M]) = (CDR φ(M)).
- φ([cons M1 M2]) = (CONS φ(M1) φ(M2)).
- φ([eq? M1 M2]) = (EQ φ(M1) φ(M2)).
(This encoding ignores the small detail that certain symbol names used in the encoding — LAMBDA, IF, CAR, CDR, CONS, EQ — must not also be used as variable names, if the encoding is to behave correctly. McCarthy seems not to have been fussed about this detail, and nor should we be.)
For a proper encoding, though, S-expressions have to be encoded in a way that unambiguously distinguishes them from the encodings of other M-expressions. McCarthy's solution was
φ(S) = (QUOTE S).
Now, in some ways this is quite a good solution. It has the virtue of simplicity, cutting the Gordian knot. It preserves the readability of the encoded S-expression, which supports McCarthy's desire for a lucid encoding. The main objection one could raise is that it isn't homomorphic; that is, φ((S1 . S2)) is not built up from φ(S1) and φ(S2).
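To make the encoding concrete, here's a minimal sketch of φ in Scheme, under an assumed representation of M-expressions as tagged lists; the tag names (var, lam, app, quo, and so on) and helper names are my own inventions for illustration, not anything McCarthy used.
(import (scheme base) (scheme char) (scheme cxr))
;; Hypothetical representation of toy M-expressions as tagged lists:
;;   (var foo), (lam x M), (app M1 M2), (if M1 M2 M3), (car M), (cdr M),
;;   (cons M1 M2), (eq M1 M2), (quo S).
(define (upcase-symbol s)
  (string->symbol (string-upcase (symbol->string s))))
(define (phi m)
  (case (car m)
    ((var)  (upcase-symbol (cadr m)))                      ; phi(x) = X
    ((lam)  (list 'LAMBDA (upcase-symbol (cadr m)) (phi (caddr m))))
    ((app)  (list (phi (cadr m)) (phi (caddr m))))
    ((if)   (list 'IF (phi (cadr m)) (phi (caddr m)) (phi (cadddr m))))
    ((car)  (list 'CAR (phi (cadr m))))
    ((cdr)  (list 'CDR (phi (cadr m))))
    ((cons) (list 'CONS (phi (cadr m)) (phi (caddr m))))
    ((eq)   (list 'EQ (phi (cadr m)) (phi (caddr m))))
    ((quo)  (list 'QUOTE (cadr m)))))                      ; phi(S) = (QUOTE S)
;; (phi '(app (lam x (var x)) (quo (A B))))  =>  ((LAMBDA X X) (QUOTE (A B)))
Note how the quo case just drops the S-expression in whole — the non-homomorphic shortcut — while every other case recurs on subparts.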
As McCarthy later recounted, they had expected to have plenty of time to refine the language design before it could be implemented. (The FORTRAN compiler, after all, had been a massive undertaking.) Meanwhile, to experiment with the language they began hand-implementing particular functions. The flaw in this plan was that, because McCarthy had been so successful in demonstrating a universal Lisp function eval with simple logic, it wasn't difficult to hand-implement eval; and, because he had been so successful in making the encoding lucid, this instantly produced a highly usable Lisp interpreter. The sudden implementation precipitated a user community and substantial commitment to specifics of what had been a preliminary language design.
All this might have turned out differently if the preliminary design had allowed first-class functions to be elements in pairs. A function has to be encoded, homomorphically, which would require a homomorphic encoding of pairs, perhaps φ((M1 . M2)) = (CONS φ(M1) φ(M2)); once we allow arbitrary M-expressions within the pair syntax, (M1 . M2), that syntax itself becomes a pair constructor and there's really no need for a separate cons operator in the M-language; then CONS can encode the one constructor. One might then reasonably restrict QUOTE to base cases; more, self-encode () #t and #f, leaving only the case of encoding symbols, and rename QUOTE to SYMBOL. The encoding would then be fully homomorphic — but the encodings of large constant data structures would become unreadable. For example, fairly readable constant structure
((LAMBDA X (X Y)) FOO)
would encode through φ as
(CONS (CONS (SYMBOL LAMBDA) (CONS (SYMBOL X) (CONS (CONS (SYMBOL X) (CONS (SYMBOL Y) ())) ()))) (CONS (SYMBOL FOO) ())) .
That didn't happen, though.
The homomorphic, non-QUOTE encoding would naturally tend to produce a universal function with no practical potential for meta-programming. In theory, one could use the non-homomorphic QUOTE encoding and still not offer any native meta-programming power. However, the QUOTE-based encoding means there are data structures lying around that look exactly like working executable code except that they happen to be QUOTEd. In practice, the psychology of the notation makes it rather inevitable that various facilities in the language would allow a blurring of the line between data and code. Lisp, I was told when first taught the language in the 1980s, treats code as data. Sic: I was not told Lisp treats data as code, but that it treats code as data.
In other words, Lisp had accidentally become an interpreted language; a profoundly different beast than the compiled language McCarthy had set out to create, and one whose character naturally suggests a whole different set of features that would not have occurred to someone designing a compiled language in 1960. There were, of course, some blunders along the way (dynamic scope is an especially famous one, and I would rate the abandonment of fexprs in favor of macros as another of similar magnitude); but in retrospect I see all that as part of exploring a whole new design space of interpreted features. Except that over the past three decades or so the Lisp community seems to have somewhat lost track of its interpreted roots; but, I'll get back to that.
Of interest:
In S-expression Lisp, all source expressions are S-expressions. It is no less true now than before that an operational equivalence M1 ≅ M2 can only be nontrivial if at least one of the Mk is not an S-expression; but now, if the Mk are source expressions, we can be absolutely certain that they are both S-expressions. So if the operational equivalence relation is restricted to source expressions, it's trivial. This isn't disastrous; it just means that, in order to have nontrivial theory, we are going to have to have some terms that are not source expressions (as Landin did, though for a different reason); and if we choose to compile the language, we won't be allowed to call our optimizations "local source-to-source" (any given optimization could be one or the other, but not both at once).
This is the fork in the road where Muller and I went our separate ways. Muller's M-Lisp, taking the compiled-language view, supposes that S-expressions are encoded homomorphically, resulting in a baseline language with no native meta-programming power. He then considers how to add some meta-programming power to the resulting language. However, practical meta-programming requires the programmer be able to write lucid code that can also be manipulated as data; and the homomorphic encoding isn't lucid. So instead, meta-programming in Muller's extended M-Lisp uses general M-expressions directly (rather than their encodings). If an M-expression turns out to be wanted as data, it then gets belatedly encoded — with the drawback that the M-expression can't be rewritten until the rewriting schemata can tell it won't be needed as data. This causes difficulties with operational equivalence of general M-expressions; in effect, as the burden of meta-programming is shifted from S-expressions to general M-expressions, it carries along with it the operational-equivalence difficulties that had been limited to S-expressions.
McCarthy hadn't finished the details of M-expressions, so S-expression Lisp wasn't altogether an encoding of anything; it was itself, leaving its implementors rather free to invent it as they went along. Blurring the boundary between quoted data and unquoted code provided meta-programming facilities that hadn't been available in compiled languages (essentially, I suggest, a sort of flexibility we enjoy in natural languages). In addition to QUOTE itself (which has a rather fraught history entangled with first-class functions and dynamic scope, cf. §3.3.2 of my dissertation), from very early on the language had fexprs, which are like LAMBDA-constructed functions except that they treat their operands as data — as if the operands had been QUOTEd — rather than evaluating them as code (which may later be done, if desired, explicitly by the fexpr using EVAL). In 1963, macros were added — not the mere template-substitution macros found in various other languages, but macros that treat their operands as data and perform an arbitrary computation to generate a data structure as output, which is then interpreted as code at the point of call.
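By way of illustration — in the notation of Kernel's $vau, a much later language in this tradition, rather than 1960-vintage Lisp — a fexpr receives its operands as data and may explicitly evaluate whatever it builds from them:
($define! $quote                 ; a fexpr: its operand is received as data
   ($vau (x) #ignore x))
($define! $swap-args             ; a fexpr that builds a new form from its
   ($vau (f a b) env             ; operands and only then evaluates it
      (eval (list f b a) env)))
Here ($swap-args cons 1 2) would construct the form (cons 2 1) and evaluate it in the caller's environment, yielding (2 . 1). This is only a sketch of the idea, not the historical 1960s facilities.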
But how exactly do we specify the meanings of programs in this interpreted S-expression language? We could resort to the meta-circular evaluator technique; this is a pretty natural strategy since an evaluator is exactly what we have as our primary definition of the language. That approach, though, is difficult to work with mathematically, and in particular doesn't lend itself to proofs of operational equivalence. If we try to construct a rewriting system the way we did before, we immediately run into the glaring practical problem that the same representation is used for executable code, which we want to have nontrivial theory, and passive data which necessarily has perfectly trivial theory. That is, as noted earlier, all source expressions are S-expressions and operational equivalence of S-expressions is trivial. It's possible to elaborate in vivid detail the theoretical train wreck that results from naive application of the usual rewriting semantics strategy to Lisp with quotation (or, worse, fexprs); but this seems to be mainly of interest if one is trying to prove that something can't be done. I'm interested in what can be done.
If what you want is nontrivial theory, that could in principle be used to guide optimizations, this is not difficult to arrange (once you know how; cf. my past discussion of profundity index). As mentioned, all nontrivial operational equivalences must have at least one of the two terms not a source expression (S-expression), therefore we need some terms that aren't source expressions; and our particular difficulty here is having no way to mark a source expression unmistakably as code; so, introduce a primitive context that says "evaluate this source expression". The new context only helps with operational equivalence if it's immune to QUOTE, and no source expression is immune to QUOTE, so that's yet another way to see that the new context must form a term that isn't a source expression.
Whereas the syntax of the compiled M-language had two kinds of terms — constant S-expressions and computational M-expressions — the syntax of the interpreted S-language will have three kinds of terms. There are, again, the "constant" terms, the S-expressions, which are now exactly the source expressions. There are the "computational" terms, which are needed for the actual work of computation; these are collectively shaped something like λ-calculus. We expect a big part of the equational strength of the rewriting calculus to reside in these computational terms, roughly the same equational strength as λ-calculus itself, and therefore of course those terms have to be entirely separate from the source expressions which can't have nontrivial equational theory. And then there are the "interpretation" terms, the ones that orchestrate the gradual conversion of source expressions into computational expressions. The code-marking terms are of this sort. The rewriting rules involving these "interpretation" terms will amount to an algorithm for interpreting source code.
This neat division of terms into three groups won't really be as crisp as I've just made it sound. Interpretation is by nature a gradual process whose coordination seeps into other parts of the grammar. Some non-interpretation terms will carry along environment information, in order to make it available for later use. This blurring of boundaries is perhaps another part of the flexibility that (I'll again suggest) makes interpreted languages more similar to natural languages.
I'll use nonterminal T for arbitrary terms. Here are the interpretation forms.
T ::= [eval T T] | ⟨wrap T⟩ | e .
Form [eval T1 T2] is a term that stands for evaluating a term T1 in an environment T2. This immediately allows us to distinguish between statements such as
S1 ≅ S2
[eval S1 e0] ≅ [eval S2 e0]
∀e, [eval S1 e] ≅ [eval S2 e] .
The first proposition is the excessively strong statement that S-expressions Sk are operationally equivalent — interchangeable in any context — which can only be true if the Sk are identically the same S-expression. The second proposition says that evaluating S1 in environment e0 is operationally equivalent to evaluating S2 in environment e0 — that is, for all contexts C and all S-expressions S, C[[eval S1 e0]] ↦* S iff C[[eval S2 e0]] ↦* S. The third proposition says that evaluating S1 in any environment e is operationally equivalent to evaluating S2 in the same e — which is what we would ordinarily have meant, in a compiled language, if we said that two executable code (as opposed to data) expressions Sk were operationally equivalent.
The second form, ⟨wrap T⟩, is a wrapper placed around a function T, that induces evaluation of the operand passed to the function. If T is used without such a wrapper (and presuming T isn't already a wrapped function), it acts directly on its unevaluated operand — that is, T is a fexpr.
The third form, e, is simply an environment. An environment is a series of symbol-value bindings, ⟪sk←Tk⟫; there's no need to go into gory detail here (though I did say more in a previous post).
The computational forms are, as mentioned, similar to λ-calculus with some environments carried along.
T ::= x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩ .
Here we have a variable, a combination, and two kinds of function. Form ⟨λx.T⟩ is a function that substitutes its operand for x in its body T. Variant ⟨εx.T⟩ substitutes its dynamic environment for x in its body T.
Form [combine T1 T2 T3] is a term that stands for combining a function T1 with an operand T2 in a dynamic environment T3. The dynamic environment is the set of bindings in force at the point where the function is called; as opposed to the static environment, the set of bindings in force at the point where the function is constructed. Static environments are built into the bodies of functions by the function constructor, so they don't show up in the grammar. For example, [eval (LAMBDA X FOO) e0] would evaluate to a function with static environment e0, of the form ⟨wrap ⟨λx.[eval FOO ⟪...⟫]⟩⟩ with the contents of e0 embedded somewhere in the ⟪...⟫.
Putting it all together,
T ::= x | [combine T T T] | ⟨λx.T⟩ | ⟨εx.T⟩ | [eval T T] | ⟨wrap T⟩ | e | S .
The rewriting schemata naturally fall into two groups, those for internal computation and those for source-code interpretation. (There are of course no schemata associated with the third group of syntactic forms, the syntactic forms for passive data, because passive.) The computation schemata closely resemble λ-calculus, except with the second form of function used to capture the dynamic environment (which fexprs sometimes need).
[combine ⟨λx.T1⟩ T2 T3] → T1[x ← T2]
[combine ⟨εx.T1⟩ T2 T3] → [combine T1[x ← T3] T2 T3] .
The interpretation schemata look very much like the dispatching logic of a Lisp interpreter.
[eval d T] → d if d is an empty list, boolean, λ-function, ε-function, or environment
[eval s e] → lookup(s,e) if symbol s is bound in environment e
[eval (T1 T2) T3] → [combine [eval T1 T3] T2 T3]
[combine ⟨wrap T1⟩ T2 T3] → [combine T1 [eval T2 T3] T3] .
(There would also be some atomic constants representing primitive first-class functions and reserved operators such as if, and schemata specifying what they do.)
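As a small worked example — relying on a δ-rule for reserved operator LAMBDA of the shape suggested by the ⟨wrap ⟨λx.[eval FOO ⟪...⟫]⟩⟩ example above, writing ⟪X ← x, ...⟫ for a local environment binding parameter X to variable x on top of the bindings of e0, and supposing purely for illustration that e0 binds FOO to #t:
[eval ((LAMBDA X X) FOO) e0]
→ [combine [eval (LAMBDA X X) e0] FOO e0]
→+ [combine ⟨wrap ⟨λx.[eval X ⟪X ← x, ...⟫]⟩⟩ FOO e0]
→ [combine ⟨λx.[eval X ⟪X ← x, ...⟫]⟩ [eval FOO e0] e0]
→ [combine ⟨λx.[eval X ⟪X ← x, ...⟫]⟩ #t e0]
→ [eval X ⟪X ← #t, ...⟫]
→ #t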
As I remarked earlier, Lisp as we know it might not have happened — at least, not when and where it did — if McCarthy had thought to allow first-class functions to occur in pairs. The thing is, though, I don't think it's all that much of an "accident". He didn't think to allow first-class functions to occur in pairs, and perhaps the reason we're likely to think to allow them today is that our thinking has been shaped by decades of the free-wheeling attitude fostered by the language that Lisp became because he didn't think to then. The actual sequence of events seems less unlikely than one might first suppose.
Researchers trying to set up semantics for Lisp have been led astray, persistently over the decades, by the fact that the primary Lisp constructor of first-class functions is called LAMBDA. Its behavior is not that of calculus λ, exactly because it's entangled with the process of interpreting Lisp source code. This becomes apparent when contemplating rewriting calculi for Lisp of the sort I've constructed above (and have discussed before on this blog): When you evaluate a LAMBDA expression you get a wrapped function, one that explicitly evaluates its operand and then passes the result to a computational function; that is, passes the result to a fexpr. Scan that: ordinary Lisp functions do not correspond directly to calculus λ-abstractions, but fexprs do correspond directly to calculus λ-abstractions. In its relation to Lisp, λ-calculus is a formal calculus of fexprs.
Much consternation has also been devoted to the perceived theoretical difficulty presented by Lisp's quotation operator (and presented in more extreme form by fexprs), because it presents a particular context that can distinguish any two S-expressions placed into it: (QUOTE S1) and (QUOTE S2) are observably distinct whenever the Sk are distinct from each other. Yet, this observation only makes sense in a compiled programming language. Back in the day, it would have been an unremarkable observation that Lisp only has syntax for data structures, no syntax at all for control. Two syntactically distinct Lisp source expressions are operationally non-equivalent even without any quotation or fexpr context, because they don't represent programs at all; they're just passive data structures. The context that makes a source expression code rather than data is patently not in the source; it's in what program you send the source expression to. Conventional small-step operational semantics presumes the decision to compile, along with a trivial interpretive mapping between source expressions and internal computational terms (so the interpretive mapping doesn't have to appear explicitly in the rewriting schemata). It is true that, without any such constructs as quotation or fexprs, there would be no reason to treat the language as interpreted rather than compiled; but once you've crossed that Rubicon, the particular constructs like quotation or fexprs are mere fragments of, and can be distractions from, the main theoretical challenge of defining the semantics of an interpreted language.
The evolution of Lisp features has itself been a long process of learning how best to realize the flexibility offered by interpreted language. Fexprs were envisioned just about from the very beginning — 1960 — but were sabotaged by dynamic scope, a misfeature that resulted from early confusion over how to handle symbol bindings in an interpreter. Macros were introduced in 1963, and unlike fexprs they lend themselves to preprocessing at compile time if one chooses to use a compiler; but macros really ought to be much less mathematically elegant... except that in the presence of dynamic scope, fexprs are virulently unstable. Then there was the matter of first-class functions; that's an area where Lisp ought to have had a huge advantage; but first-class functions don't really come into their own without static scope (The Art of the Interpreter noted this) — and first-class functions also present a difficulty for compilers (which is why procedures in ALGOL were second-class). The upshot was that after Lisp 1.5, when Lisp splintered into multiple dialects, first-class functions went into eclipse until they reemerged in the mid-1970s when Scheme introduced static scope into the language. Fexprs held on longer but, ironically, were finally rejected by the Lisp community in 1980 — just a little ahead of the mainstream adoption of Scheme's static-scope innovation. So for the next twenty years and more, Lisp had static scope and first-class functions, but macros and no fexprs. Meanwhile, EVAL — key workhorse of meta-linguistic flexibility — was expunged from the new generation of mainstream Lisps and has had great difficulty getting back in.
The latter half of Lisp's history has been colored by a long-term trend in programming language design as a whole. I've alluded to this several times above. I have no specific sources to suggest for this; it's visible in the broad sweep of what languages were created, what research was done, and I've sensed it through my interactions with the Lisp community over the past thirty years. When I learned Lisp in the mid-1980s, it was from the Revised Maclisp Manual, Saturday Evening Edition (which I can see a few feet away on my bookshelf as I write this, proof that manuals can be well-written). Maclisp was a product of the mentality of the 1970s. Scheme too was a product of that mentality. And what comes through to me now, looking back, isn't the differences between those languages (different though they are), but that those people knew, gut deep, that Lisp is an interpreted language — philosophically, regardless of the technical details of the language processing software. The classic paper I cited above for the relationship between first-class functions and static scope was one of the "lambda" papers associated with the development of Scheme: "The Art of the Interpreter". The classic textbook — the Wizard Book — that emerged from the Scheme design is Structure and Interpretation of Computer Programs.
But then things changed. Compilation had sometimes intruded into Lisp design, yes (with unfortunate results, as I've mentioned), but the intrusion became more systematic. Amongst Scheme's other achievements it had provided improved compilation techniques, a positive development but which also encouraged greater focus on the challenges of compilation. We refined our mathematical technology for language semantics of compiled languages, we devised complex type systems for use with compiled languages, more and more we designed our languages to fit these technologies — and as Lisp didn't fit, more and more we tweaked our Lisp dialects to try to make them fit. Of course some of the indigenous features of Lisp couldn't fit, because the mathematical tools were fundamentally incompatible with them (no pun intended). And somewhere along the line, somehow, we forgot, perhaps not entirely but enough, that Lisp is interpreted. Second-class syntax has lately been treated more and more as if it were a primary part of the language, rather than a distraction from the core design. Whatever merits such languages have, wholeheartedly embracing the interpretation design stance is not among them.
I'm a believer in trying more rather than less. I don't begrudge anyone their opportunity to follow the design path that speaks to them; but not all those paths speak to me. Second-class syntax doesn't speak to me, nor recasting Lisp into a compiled language. I'm interested in compiling Lisp, but want the language design to direct those efforts rather than the other way around. To me, the potential of interpretation beckons; the exciting things we've already found on that path suggest to me there's more to find further along, and the only way to know is to follow the path and see. To do that, it seems to me we have to recognize that it is a distinct path, the distinct philosophy of interpretation; and, in company with that, we need to hone our mathematical technology for interpreted languages.
These are your father's parentheses
Elegant weapons
for a more... civilized age.— xkcd 297.
Saturday, March 1, 2014
Continuations and term-rewriting calculi
Thinking is most mysterious, and by far the greatest light upon it that we have is thrown by the study of language.— Benjamin Lee Whorf.
In this post, I hope to defuse the pervasive myth that continuations are intrinsically a whole-computation device. They aren't. I'd originally meant to write about the relationship between continuations and delimited continuations, but find that defusing the myth is prerequisite to the larger discussion, and will fill up a post by itself.
To defuse the myth, I'll look at how continuations are handled in the vau-control-calculus. Explaining that calculus involves explaining the unconventional way vau-calculi handle variables. So, tracing back through the tangle of ideas to find a starting point, I'll begin with some remarks on the use of variables in term-rewriting calculi.
While I'm extracting this high-level insight from the lower-level math of the situation, I'll also defuse a second common misapprehension about continuations, that they are essentially function-like. This is a subtle point: continuations are invoked as if they were functions, and traditionally appear in the form of first-class functions, but their control-flow behavior is orthogonal to function application. This point is (as I've just demonstrated) difficult even to articulate without appealing to lower-level math; but it arises from the same lower-level math as the point about whole-computation, so I'll extract it here with essentially no additional effort.
ContentsPartial-evaluation variables
Partial-evaluation variables
Continuations by global rewrite
Control variables
Insights
From the lasting evidence, Alonzo Church had a thing about variables. Not as much of a thing as Haskell Curry, who developed a combinatorial calculus with no variables at all; but Church did feel, apparently, that a meaningful logical proposition should not have unbound variables in it. He had an elegant insight into how this could be accomplished: have a single binding construct —which for some reason he called λ— for the variable parameter in a function definition, and then —I quite enjoyed this— you don't need additional binding constructs for the existential and universal quantifiers, because you can simply make them higher-order functions and leave the binding to their arguments. For his quantifiers Π and Σ, Π(F,G) meant for all values v such that F(v) is true, G(v) is true; and Σ(F) meant there exists some value v such that F(v) is true. The full elegance of this was lost because only the computational subset of his logic survived, under the name λ-calculus, so the quantifiers fell by the wayside; but the habit of a single binding construct has remained.
In computation, though, I suggest that the useful purpose of λ-bound variables is partial evaluation. This notion dawned on me when working out the details of elementary vau-calculus. Although I've blogged about elementary vau-calculus in an earlier post, there I was looking at a different issue (explicit evaluation), and took variables for granted. Suppose, though, that one were centrally concerned only with capturing the operational semantics of Lisp (with fexprs) in a term-rewriting calculus at all, rather than capturing it in a calculus that looks as similar as possible to λ-calculus. One might end up with something like this:
T ::= S | s | (T . T) | [wrap T] | A
A ::= [eval T T] | [combine T T T]
S ::= d | () | e | [operative T T T T]
e ::= ⟪ B* ⟫
B ::= s ← T
Most of this is the same as in my earlier post (explicit evaluation), but there are three differences: the internal structure of environments (e) is described; operatives have a different structure, which is fully described; and there are no variables.
Wait. No variables?
Here, a term (T) is either a self-evaluating term (S), a symbol (s), a pair, an applicative ([wrap T], where T is the underlying combiner), or an active term (A). An active term is the only kind of term that can be the left-hand side of a rewrite rule: it is either a plan to evaluate something in an environment, or a plan to invoke a combiner with some operands in an environment. A self-evaluating term is either an atomic datum such as a number (d), or nil, or an environment (e), or an operative — where an operative consists of a parameter tree, an environment parameter, a body, and a static environment. An environment is a delimited list of bindings (B), and a binding associates a symbol (s) with an assigned value (T).
The rewrite rules with eval on the left-hand side are essentially just the top-level logic of a Lisp evaluator:
[eval S e] → S
[eval s e] → lookup(s,e) if lookup(s,e) is defined
[eval (T1 . T2) e] → [combine [eval T1 e] T2 e]
[eval [wrap T] e] → [wrap [eval T e]]
That leaves two rules with combine on the left-hand side: one for combining an applicative, and one for combining an operative. Combining applicatives is easy:
[combine [wrap T0] (T1 ... Tn) e] → [combine T0 ([eval T1 e] ... [eval Tn e]) e]
Combining operatives is a bit more complicated. (It will help if you're familiar with how Kernel's $vau works; see here.) The combination is rewritten as an evaluation of the body of the operative (its third element) in a local environment. The local environment starts as the static environment of the operative (its fourth element); then the ordinary parameters of the operative (its first element) are locally bound to the operands of the combination; and the environment parameter of the operative (its second element) is locally bound to the dynamic environment of the combination.
[combine [operative T1 T2 T3 T4] V e] → [eval T3 match(T1,V) · match(T2,e) · T4] if the latter is defined
where match(X,Y) constructs an environment binding the symbols in definiend X to the corresponding subterms in Y; echild · eparent concatenates two environments, producing an environment that tries to look up symbols in echild, and failing that looks for them in eparent; and a value (V) is a term such that every active subterm is inside a self-evaluating subterm.
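For instance, supposing e0 binds $quote to the operative [operative (x) #ignore x ⟪⟫], and supposing match(#ignore,e) binds nothing — an assumption of mine about how #ignore is matched — we'd have
[eval ($quote foo) e0]
→ [combine [eval $quote e0] (foo) e0]
→ [combine [operative (x) #ignore x ⟪⟫] (foo) e0]
→ [eval x ⟪x ← foo⟫ · ⟪⟫ · ⟪⟫]
→ foo .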
Sure enough, there are no variables here. This calculus behaves correctly. However, it has a weak equational theory. Consider evaluating the following two expressions in a standard environment e0.
[eval ($lambda (x) (+ 0 x)) e0]
[eval ($lambda (x) (* 1 x)) e0]
Clearly, these two expressions are equivalent; we can see that they are interchangeable. They both construct an applicative that takes one numerical argument and returns it unchanged. However, the rewriting rules of the calculus can't tell us this. These terms reduce to
[wrap [operative (x) #ignore (+ 0 x) e0]]
[wrap [operative (x) #ignore (* 1 x) e0]]
and both of these terms are irreducible! Whenever we call either of these combiners, its body is evaluated in a local environment that's almost like e0; but within the calculus, we can't even talk about what will happen when the body is evaluated. To do so we would have to construct an active evaluation term for the body; to build the active term we'd need to build a term for the local environment of the call; and to build a term for that local environment, we'd need to bind x to some sort of placeholder, meaning "some term, but we don't know what it is yet".
A variable is just the sort of placeholder we're looking for. So let's add some syntax. First, a primitive domain of variables. We call this domain xp, where the "p" stands for "partial evaluation", since that's what we want these variables for (and because, it turns out, we're going to want other variables that are for other purposes). We can't put this primitive domain under nonterminal S because, when we find out later what a variable stands for, what it stands for might not be self-evaluating; nor under nonterminal A because what it stands for might not be active. So xp has to go directly under nonterminal T.
T ::= xp
We also need a binding construct for these variables. It's best to use elementary devices in the calculus, to give lots of opportunities for provable equivalences, rather than big monolithic devices that we'd then be hard-put to analyze. So we'll use a traditional one-variable construct, and expect to introduce other devices to parse the compound definiends that were handled, in the variable-less calculus, by function match.
S ::= ⟨λ xp.T⟩
governed by, essentially, the usual β-rule of λ-calculus:
[combine ⟨λ xp.T⟩ V e] → T[xp ← V]
That is, combine a λ-expression by substituting its operand (V) for its parameter (xp) in its body (T). Having decided to bind our variables xp one at a time, we use three additional operative structures to deliver the various parts of the combination one at a time (a somewhat souped-up version of currying): one structure for processing a null list of operands, one for splitting a dotted-pair operand into its two halves, and one for capturing the dynamic environment of the combination.
S ::= ⟨λ0.T⟩ | ⟨λ2.T⟩ | ⟨ε.T⟩
The corresponding rewrite rules are
[combine ⟨λ0.T⟩ () e] → T
[combine ⟨λ2.T0⟩ (T1 . T2) e] → [combine [combine T0 T1 e] T2 e]
[combine ⟨ε.T0⟩ T1 e] → [combine [combine T0 e ⟪⟫] T1 e]
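To see how these layers peel off during a call, here's how a curried structure of this kind — the shape the $lambda example below reduces to, setting aside the outer [wrap ...] that first induces evaluation of the operands — combines with a one-element operand list (V), for a value V, an environment e, and a body T in which the dummy y doesn't occur:
[combine ⟨ε.⟨λy.⟨λ2.⟨λx.⟨λ0.T⟩⟩⟩⟩⟩ (V) e]
→ [combine [combine ⟨λy.⟨λ2.⟨λx.⟨λ0.T⟩⟩⟩⟩ e ⟪⟫] (V) e]
→ [combine ⟨λ2.⟨λx.⟨λ0.T⟩⟩⟩ (V) e]
→ [combine [combine ⟨λx.⟨λ0.T⟩⟩ V e] () e]
→ [combine ⟨λ0.T[x ← V]⟩ () e]
→ T[x ← V]
The ε layer captures the dynamic environment, the β-rule for ⟨λy.…⟩ discards it (y being unused), λ2 splits the operand list, the β-rule for ⟨λx.…⟩ substitutes the operand, and λ0 checks off the empty tail.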
Unlike the variable-less calculus, where the combine rewrite rule initiated evaluation of the body of an operative, here evaluation of the body must be built into the body when the operative is constructed. This would be handled by the δ-rules (specialized operative-call rewrite rules) for evaluating function definitions. For example, for variables x,y and standard environment e0,
[eval ($lambda (x) (+ 0 x)) e0] →+ [wrap ⟨ε.⟨λy.⟨λ2.⟨λx.⟨λ0.[eval (+ 0 x) ⟪x ← x⟫ · e0]⟩⟩⟩⟩⟩]
Variable y is a dummy-variable used to discard the dynamic environment of the call, which is not used by ordinary functions. Variable x is our placeholder, in the constructed term to evaluate the body, for the unknown operand to be provided later.
The innermost redex (reducible expression) here, [eval (+ 0 x) ⟪x ← x⟫ · e0], can be rewritten through a series of steps,
[eval (+ 0 x) ⟪x ← x⟫ · e0]
  → [combine [eval + ⟪x ← x⟫ · e0] (0 x) ⟪x ← x⟫ · e0]
  → [combine [wrap +] (0 x) ⟪x ← x⟫ · e0]
  →+ [combine + (0 x) ⟪x ← x⟫ · e0]
Where we can go from here depends on additional information of one or another kind. We may have a rule that tells us the addition operative + doesn't use its dynamic environment, so that we can garbage-collect the environment,
  → [combine + (0 x) ⟪⟫]
If we have some contextual information that the value of x will be numeric, and a rule that zero plus any number is that number back again, we'd have
  → x
At any rate, we only have the opportunity to even start the partial evaluation of the body, and contemplate these possible further steps, because the introduction of variables allowed us to write a term for the partial evaluation in the first place.
[edit: I'd goofed, in this post, on the combination rule for λ0; it does not of course induce evaluation of T. Fixed now.]
Continuations by global rewrite
The idea of using λ-calculus to model programming language semantics goes back at least to Peter Landin in the early 1960s, but there are a variety of programming language features that don't fit well with λ-calculus. In 1975, Gordon Plotkin proposed a remedy for one of these features — eager argument evaluation, whereas ordinary λ-calculus allows lazy argument evaluation and thereby has different termination properties. Plotkin designed a variant calculus, the λv-calculus, and proved that on one hand λv-calculus correctly models the semantics of a programming language with eager argument evaluation, while on the other hand it is comparably well-behaved to traditional λ-calculus. Particularly, the calculus rewriting relation is compatible and Church-Rosser, and satisfies soundness and completeness theorems relative to the intended operational semantics. (I covered those properties and theorems a bit more in an earlier post.)
In the late 1980s, Matthias Felleisen showed that a technique similar to Plotkin's could be applied to other, more unruly kinds of programming-language behavior traditionally described as "side-effects": sequential control (continuations), and sequential state (mutable variables). This bold plan didn't quite work, in that he had to slightly weaken the well-behavedness properties of the calculus. In both cases (control and state), the problem is to distribute the consequences of a side-effect to everywhere it needs to be known; and Felleisen did this by having special constructs that would "bubble up" through the term, carrying the side-effect with them, until they encompassed the whole term, at which point there would be a whole-term rewriting rule to distribute the side-effect to everywhere it needed to go. The whole-term rewriting rules were the measure by which the well-behavedness of the calculus would fail, as whole-term rewriting isn't compatible.
For sequential control (our central interest here), Felleisen added two operators, C and A, to λv-calculus. The syntax of λv-calculus, before the addition, is just that of λ-calculus:
T ::= x | (λx.T) | (TT)
In place of the classic β-rule of λ-calculus, λv-calculus has βv, which differs in that the operand in the rule is a value (redexes have to be inside λ-terms):
((λx.T)V) → T[x ← V]
The operational semantics, which acts only on whole terms, uses (per Felleisen) an evaluation context E to uniquely determine which subterm is reduced:
E ::= ⎕ | (ET) | ((λx.T)E)
E[((λx.T)V)] ↦ E[T[x ← V]]
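To see the difference in termination properties concretely (a standard sort of example, not specific to Plotkin's paper): let Ω stand for ((λy.(yy))(λy.(yy))), which rewrites only to itself. Under unrestricted β, the term ((λx.(λz.z))Ω) rewrites in one step to (λz.z); under βv it doesn't, because Ω isn't a value, so the only available redex is Ω itself and evaluation never reaches a value — matching what an eager evaluator would actually do with such a program.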
For the control calculus, the term syntax adds A and C,
T ::= x | (λx.T) | (TT) | (AT) | (CT)
Neither of these operators has the semantics of call-with-current-continuation. Instead, (AT) means "abort the surrounding computation and just do T", while (CT) means "abort the surrounding computation and apply T to the (aborted) continuation". Although it's possible to build conventional call-with-current-continuation out of these primitive operators, the primitives themselves are obviously intrinsically whole-term operators. Operationally, evaluation contexts don't change at all, and the operational semantics has additional rules
E[(AT)] ↦ T
E[(CT)] ↦ T(λx.(AE[x])) for unused variable x
The compatible rewrite relation, →, has rules to move the new operators upward through a term until they reach its top level. The compatible rules for A are dead easy:
(AT1)T2 → AT1
V(AT) → AT
Evidently, though, once the A operator reaches the top of the term, the only way to get rid of it, so that computation can proceed, is a whole-term rewrite rule,
AT ᐅ T
The whole-term rule for C is easy too,
CT ᐅ T(λx.(Ax))
but the compatible rewrite rules for C are, truthfully, just a bit frightening:
(CT1)T2 → C(λx1.(T1(λx2.(A(x1(x2T2)))))) for unused xk
V(CT) → C(λx1.(T(λx2.(A(x1(Vx2)))))) for unused xk
This does produce the right behavior (unless I've written it out wrong!), but it powerfully perturbs the term structure; Felleisen's description of this as "bubbling up" is apt. Imho, it's quite a marvelous achievement, especially given the prior expectation that nothing of the kind would be possible — an achievement in no way lessened by what can now be done with a great deal of hindsight.
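To make the bubbling concrete, here's the easy case — an A migrating up through a small term and being disposed of at the top, using just the rules above (V any value, T1, T2 any terms):
V((AT1)T2) → V(AT1) → AT1 ᐅ T1
The C rules climb in the same way, but, as the schemata show, each step folds the surrounding application structure into the continuation being built, which is where the perturbation comes from.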
The perturbation effect appears to me, in retrospect, to be a consequence of describing the control flow of continuations using function-application structure. My own approach makes no attempt to imitate function-application and, seemingly as a result, its constructs move upward without the dramatic perturbation of Felleisen's C.
Various constraints can be tampered with to produce more well-behaved results. Felleisen later proposed to adjust the target behavior — the operational semantics — to facilitate well-behavedness, in work considered formative for the later notion of delimited continuations. The constraint I've tampered with isn't a formal condition, but rather a self-imposed limitation on what sort of answers can be considered: I introduce a new binding construct whose form doesn't resemble λ, and whose rewriting rules use a different substitution function than the β-rule.
Control variables
Consider the following Scheme expression:
(call/cc (lambda (c) (c 3)))
Presuming this is evaluated in an environment with the expected binding for call/cc, we can easily see it is operationally equivalent to 3. Moreover, our reasoning to deduce this is evidently local to the expression; so why should our formalism have to rewrite the whole surrounding term (perturbing it in the process) in order to deduce this?
Suppose, instead of Felleisen's strategy of bubbling-up a side-effect to the top level of a term and then distributing it from there, we were to bubble-up (or, at least, migrate up) a side-effect to some sort of variable-binding construct, and then distribute it from there by some sort of substitution function to all free occurrences of the variable within the binding scope. The only problem, then, would be what happens if the side-effect has to be distributed more widely than the given scope — such as if a first-class continuation gets carried out of the subterm in which it was originally bound — and that can be solved by allowing the binding construct itself to migrate upward in the term, expanding its scope as much as necessary to encompass all instances of the continuation.
I did this originally in vau-calculus, of course, but for comparison with Felleisen's A/C, let's use λv-calculus instead. Introduce a second domain of control variables, xc, disjoint from xp, and "catch" and "throw" term structures (κx.T) and (τx.T).
T ::= xp | (TT) | (λxp.T) | (κxc.T) | (τxc.T)
Partial-evaluation variables are bound by λ, control variables are bound by κ (catch). Control variables aren't terms; they can only occur free in τ-expressions, where they identify the destination continuation for the throw. κ and τ are evaluation contexts; that is,
E ::= ... | (κx.E) | (τx.E)
The rewrite rules for τ are pretty much the same as for Felleisen's A, except that there is now a compatible rewrite rule for what to do once the throw reaches its matching catch, rather than a whole-term rewrite for eliminating the A once it reaches the top of the term.
(τx.T1)T2 → τx.T1
V(τx.T) → τx.T
κx.(τx.T) → κx.T
What about rewrite rules for κ? One simple rule we need, in order to relate expressions with κ to expressions without, is "garbage collection":
κx.T → T if x does not occur free in T
We also want rules for κ to migrate upward —non-destructively— when it occurs in an evaluation context; but κ may be the target of matching τ expressions, and if we move the κ without informing a matching τ, that τ will no longer do what it was meant to. Consider a κ, poised to move upward, with a matching τ somewhere in its body (embedded in some context C that doesn't capture the control variable).
V(κx.C[τx.T])
If C happens to be an evaluation context, then it is possible for the τ to move upward to meet the κ and disappear; and, supposing x doesn't occur free in T, we'd have (VT). Even if C isn't an evaluation context, (τx.T) thus represents the potential to form (VT). If we move the κ over the V, then, in order for the τ to still represent the same potential it did before, we'd have to change it to (τx.(VT)). And this has to happen for every matching τ. So let's fashion a substitution function for control variables, T[x ← C] where C doesn't capture any variables:
y[x ← C] → y
(T1T2)[x ← C] → ((T1[x ← C])(T2[x ← C]))
(λy.T)[x ← C] → (λy.(T[x ← C])) where y isn't free in C
(κy.T)[x ← C] → (κy.(T[x ← C])) where y isn't x or free in C
(τy.T)[x ← C] → (τy.(T[x ← C])) if y isn't x
(τx.T)[x ← C] → (τx.C[T[x ← C]])
The "where" conditions are met by α-renaming as needed. Now we're ready to write our rewrite rules for moving κ upward:
(κx.T1)T2 → κx.(T1[x ← ⎕T2] T2) where x isn't free in T2
V(κx.T) → κx.(V(T[x ← V⎕])) where x isn't free in V
κy.(κx.T) → κy.(T[x ← y])
As advertised, κ moves upward without perturbing the term structure (contrast with the bubbling-up rules for C). If we need a first-class continuation, we can simply wrap τ in λ: (λy.(τx.y)). The δ-rule for call/cc would be
(call/cc T) → κx.(T(λy.(τx.y))) for unused x
If this occurs in some larger context, and the first-class continuation escapes into that larger context, then the matching κ will have had to move outward before it over some evaluation context E, and substitutions will have transformed the continuation to (λy.(τx.E[y])).
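Returning to the motivating example, the calculus now verifies locally that (call/cc (λc.(c 3))) comes to 3 — treating the numeral 3 as a value constant — without ever touching the surrounding term:
(call/cc (λc.(c 3)))
→ κx.((λc.(c 3))(λy.(τx.y)))
→ κx.((λy.(τx.y))3)
→ κx.(τx.3)
→ κx.3
→ 3
Every step here is an ordinary compatible rewrite; no whole-term rule is involved.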
Insights
The orthogonality of continuation control-flow to function application is, in the lower-level math, rather explicitly demonstrated by the way κ moves smoothly upward through the term, in contrast to the perturbations of the bubbling-up rules for C as it forcibly interacts with function-application structure. The encapsulation of a τ within a λ to form a first-class continuation seals the deal.
The notion that continuations are a whole-term phenomenon —or, indeed, that any side-effect is a whole-term phenomenon— breaks down under the use of floating binding-constructs such as κ, which doesn't require the side-effect to remain encapsulated within a particular subterm, but does allow it to do so and thus allows local reasoning about it to whatever extent its actual behavior remains local. Whether or not that makes traditional continuations "undelimited" is a question of word usage: the κ binding-construct is a delimiter, but a movable one.
As a matter of tangential interest here, the vau-calculus handling of sequential state involves two new kinds of variables and four new syntactic constructs (two of which are binders, one for each of the new kinds of variables). Here's a sketch: Mutable data is contained in symbol-value assignments, which in turn are attached to environments; the identity of an environment is a variable, and its binding construct defines the region of the term over which that environment may be accessible. Assignments are a separate, non-binding syntactic construct, which floats upward toward its matching environment-binding. When a symbol is evaluated in an environment, a pair of syntax elements are created at the point of evaluation: a query construct to seek an assignment for the given symbol in the given environment, which binds a query-variable, and within it a matching result construct with a free occurrence of the query-variable. The result construct is an indivisible term. The query is a compound, which migrates upward through the term looking for an assignment for the symbol in the environment. When the query encounters a matching assignment, the query is annihilated (rather as a τ meeting its matching κ), while performing a substitution that replaces all matching result-constructs with the assigned value (there may by this time be any number of matching result-constructs, since the result of the original lookup may have been passed about in anticipation of eventually finding out what its value is).
As a final, bemusing note, there's a curious analogy (which I footnoted in my dissertation) between variables in the full side-effectful vau-calculus, and fundamental forces in physics. The four forces of nature traditionally are gravity, electromagnetism, strong nuclear force, and weak nuclear force; one of these —gravity— is quite different from the others, with a peculiar sort of uniformity that the others lack (gravity is only attractive). Whilst in vau-calculus we have four kinds of variables (partial-evaluation, control, environment, and query), of which one —partial-evaluation— is quite different from the others, with a peculiar sort of uniformity that the others lack (each of the other kinds of variable has an α-renaming substitution to maintain hygiene, and one or more separate kinds of substitution to aid its purpose in rewriting; but partial-evaluation variables have only one kind of substitution, of which α-renaming is a special case).
[Note: I later explored the physics analogy in a separate post, here.]