Lambda the Ultimate - Comments
http://lambda-the-ultimate.org

My preferred design choices
http://lambda-the-ultimate.org/node/5598#comment-97053
Yes, Unicode kills the "contiguous memory" string model. However, a lot of string manipulation is search-and-replace which will change the length of your string anyway; I'm not sure how much is lost by switching the likes of tr// to copy instead of mutate.
If you really want, you can recover Boyer-Moore-type algorithms (i.e., search that relies on skipping a known number of array values in O(1) time) by picking (say) UTF-8 and working with the bytes.
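For instance (a minimal C sketch of my own, not anything from the comment): because UTF-8 is self-synchronizing, a byte-for-byte match of a needle that is itself complete UTF-8 always lands on codepoint boundaries, so plain byte search works, and a Boyer-Moore variant over the same bytes gets you the skipping behaviour:
<pre>
#include <stdio.h>
#include <string.h>

/* Byte-level search in UTF-8 text. Because UTF-8 is self-synchronizing,
 * a byte-for-byte match of a needle that is itself valid UTF-8 always
 * starts on a codepoint boundary, so plain strstr() is safe here.
 * (A Boyer-Moore variant over the same bytes adds the O(1) skips;
 * strstr keeps the sketch short.) */
int main(void) {
    const char *haystack = "na\xC3\xAFve caf\xC3\xA9";   /* "naïve café" */
    const char *needle   = "caf\xC3\xA9";                /* "café"       */
    const char *hit = strstr(haystack, needle);
    if (hit)
        printf("found at byte offset %td\n", hit - haystack);
    return 0;
}
</pre>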
Anyway, I'm pretty sure you can do all practical string processing with decent efficiency by reading strings in order. My preferred design choice is a string library built around immutable strings and in-order string processing, rather than trying to recover the mysterious lost era of ASCII.
I want my immutable strings to concatenate efficiently by reference, so yes: they will be backed (where necessary) with slow, tree-like data structures. However, trees are only needed for large strings, which can be stored as fairly large chunks, so most of your time can be spent processing bytes rather than wrangling trees.
All this is independent of "what even is a character, bro?". I admit that Unicode pedants who make an issue of the fact that the term "character" is ill-defined in the context of Unicode have a point, but I no longer find their policy prescription - to instead do everything with grapheme clusters, or extended grapheme clusters, or whatever becomes fashionable in future editions of Unicode - at all persuasive.
The fact is, codepoints are the only Unicode entity suitable to use as a basis for string processing. Mandating a representation is problematic. Clusters of various sorts are locale-dependent, and in any case are not stable between Unicode editions.
So, my string library will support processing strings by cluster, but those clusters will just be substrings (made up of codepoints). And it should be fully agnostic with respect to how those clusters are chosen.
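A rough sketch of the sort of interface I read that as describing (all names here are mine and hypothetical): a cluster is just a (pointer, length) substring, and the rule for where a cluster ends is a parameter:
<pre>
#include <stddef.h>
#include <stdio.h>

/* A "cluster" is just a substring; how it is delimited is up to the
 * caller-supplied boundary function, which returns the byte length of
 * the cluster starting at s (with n bytes remaining). */
typedef size_t (*cluster_len_fn)(const char *s, size_t n);

/* Simplest possible policy: one UTF-8 codepoint per cluster. */
static size_t one_codepoint(const char *s, size_t n) {
    size_t len = 1;
    while (len < n && ((unsigned char)s[len] & 0xC0) == 0x80)
        len++;                      /* absorb continuation bytes */
    return len;
}

static void for_each_cluster(const char *s, size_t n, cluster_len_fn next) {
    size_t i = 0;
    while (i < n) {
        size_t len = next(s + i, n - i);
        printf("cluster: %.*s\n", (int)len, s + i);
        i += len;
    }
}

int main(void) {
    const char *text = "a\xC3\xA9z";          /* "aéz": 4 bytes, 3 codepoints */
    for_each_cluster(text, 4, one_codepoint);
    return 0;
}
</pre>
Swapping in a real grapheme-cluster segmenter is just a different cluster_len_fn; the iteration code doesn't care how clusters are chosen.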
2025-11-04 23:32:04 +0000 | Bruce J. Bell | Is character as a type meaningless?

Delimited continuations don't parallelize well.
http://lambda-the-ultimate.org/node/5687#comment-97052
I have never seen a system with continuations that wasn't also single-threaded in all scopes where those continuations could be invoked.
As you say, it's hard to imagine how to take advantage of parallelism (of ANY kind, not just I/O) with that paradigm.
I'm working on something where procedures can "yield" flow of control back to the caller, giving the caller a re-entry point so that the caller can "re-enter" if the function ought to resume right after it yielded. This is a very restricted form of delimited continuation that I hope will parallelize a bit better, but it has some significant limitations.
First, the caller has to declare a "function environment" and pass it to the re-enterable function. That environment goes out of scope when the caller's environment goes out of scope. The reason for this is so that parallel users of the re-enterable function can each use a separate environment, allowing access in parallel, and so that the environments are guaranteed to get cleaned up when the program exits a scope from which the function could be called.
Second, the re-enterable function has to treat every "yield" as though it might be "exit," meaning it never does stuff outside its own environment (which the caller is keeping for it) that would need cleaning up after "yielding."
Third, function environments can't be returned up the stack as they can in most languages with continuations. They can be (and are) passed down the stack as arguments to re-enterable functions (or to functions that call re-enterable functions), but they exist strictly within the call tree of whatever allocated them, and whatever housekeeping they do gets done when the environment where they were allocated goes out of scope.
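A bare-bones C sketch of the shape I understand this to have (all names hypothetical; this is my reading, not the actual design): the caller owns the environment, passes it down, and every yield records a re-entry point in it:
<pre>
#include <stdio.h>

/* Caller-owned environment for a re-enterable function: it lives only as
 * long as the caller's scope and holds whatever state the function needs
 * to survive across a yield. (All names here are illustrative.) */
typedef struct {
    int next;     /* re-entry point: which yield to resume after */
    int i;        /* state the function keeps across yields      */
} fn_env;

/* Yields the squares 0, 1, 4, 9, ... one per call; returns 0 when done.
 * Every yield is treated as a possible exit: nothing outside env needs
 * cleanup afterwards. */
static int squares(fn_env *env, int limit, int *out) {
    switch (env->next) {
    case 0:
        for (env->i = 0; env->i < limit; env->i++) {
            *out = env->i * env->i;
            env->next = 1;          /* remember where to re-enter */
            return 1;               /* yield */
    case 1:;                        /* re-enter right after the yield */
        }
    }
    return 0;                       /* finished */
}

int main(void) {
    fn_env env = {0};               /* declared in the caller's scope */
    int v;
    while (squares(&env, 5, &v))    /* caller re-enters until done */
        printf("%d\n", v);
    return 0;
}
</pre>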
This is so "delimited" that I don't even know if you can really refer to them as continuations, but I think they will parallelize more easily than anything you really could call continuations.
2025-11-04 02:39:34 +0000 | Ray Dillinger | Haxl(-like "Monads") in terms of Delimited Continuations?

With Unicode, Character as a type is in fact meaningless.
http://lambda-the-ultimate.org/node/5598#comment-97051
That was my conclusion after reviewing Unicode.
There is a meaningful character type when you can guarantee that characters take a constant amount of space. In Unicode, the things we think of as characters - i.e., "grapheme clusters" - take varying numbers of codepoints to represent. Further, in most of Unicode's encoding forms, individual codepoints take a different number of bytes to represent depending on their value. You can't just random-access the 20th character in a string by relying on characters having a constant length; you actually have to count grapheme boundaries from the beginning of the buffer, turning an O(1) operation into an O(n) operation where n is the size of the string.
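For example, even just indexing by codepoint (never mind grapheme clusters, which additionally need the Unicode boundary rules and tables) is a scan from the start of the buffer. A minimal sketch, with names of my own choosing:
<pre>
#include <stdio.h>
#include <stddef.h>

/* Return the byte offset of codepoint number n (0-based) in UTF-8 text of
 * len bytes, or len if there are fewer than n codepoints. This is O(len):
 * there is no way to jump straight to codepoint n. */
size_t utf8_index(const char *s, size_t len, size_t n) {
    size_t i = 0;
    while (i < len && n > 0) {
        i++;                                              /* lead byte */
        while (i < len && ((unsigned char)s[i] & 0xC0) == 0x80)
            i++;                                          /* continuation bytes */
        n--;
    }
    return i;
}

int main(void) {
    const char *s = "a\xC3\xA9z";                         /* "aéz": 4 bytes, 3 codepoints */
    printf("codepoint 2 starts at byte %zu\n", utf8_index(s, 4, 2));   /* prints 3 */
    return 0;
}
</pre>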
There is a meaningful character type when you can guarantee that equality in representation bits is isomorphic to equality in the view of users. But in Unicode there are multiple ways to make the same canonical character.
There is a meaningful character type when you can guarantee that "normal" operations on characters do not change the space needed for representation. But in Unicode, changing case can change the number of codepoints required to represent a character. This problem is made worse by the variable-width encodings, because even character transformations that are one-to-one measured in codepoints change the storage length of strings in every encoding form except UTF-32.
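The stock example from Unicode's SpecialCasing data: U+00DF (ß) uppercases to the two-letter sequence "SS", so one codepoint becomes two, and a per-character toupper()/towupper() can't even express the mapping. A small self-contained illustration:
<pre>
#include <stdio.h>
#include <string.h>

/* Case mapping can change both codepoint count and storage size:
 * "ß" (U+00DF: 1 codepoint, 2 bytes in UTF-8, 4 bytes in UTF-32)
 * uppercases to "SS" (2 codepoints, 2 bytes in UTF-8, 8 bytes in
 * UTF-32). A real implementation consults Unicode's SpecialCasing
 * data; the C library's per-character functions cannot do this. */
int main(void) {
    const char *lower = "stra\xC3\x9F" "e";   /* "straße": 7 bytes, 6 codepoints */
    const char *upper = "STRASSE";            /* 7 bytes, 7 codepoints           */
    printf("%zu bytes -> %zu bytes\n", strlen(lower), strlen(upper));
    printf("6 codepoints -> 7 codepoints\n");
    return 0;
}
</pre>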
This annihilates the "contiguous memory" model of string handling - if you try storing a Unicode string in contiguous memory, changing the capitalization of a character can require copying the entire remainder of the string to move each codepoint after it by one, two, or seven positions toward the front or back. After you had to count from the front to even find the indexed character. Literally every routine O(1) operation, if implemented using "contiguous memory", becomes an O(n) operation in the length of the string.
Once you're dealing with Unicode, character sequences aren't "strings" any more in terms of data storage; at best they're trees, with each node of the tree giving the number of grapheme boundaries and the number of newlines (did we mention there are at least four different newline characters in Unicode?) in the subsequence managed by that tree node. Lots of O(1) operations become O(log N) operations. To be fair, that includes some operations that were O(N) under the contiguous storage model, like inserting a sequence in the middle of the string. But on average, this is a massive dump on efficiency.
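A sketch of the kind of node I take this to describe (field names are mine, not from the comment): each interior node caches counts for its whole subtree, so indexing by grapheme or by line becomes a descent rather than a scan:
<pre>
#include <stdio.h>
#include <stddef.h>

/* One node of a text tree ("rope"). Interior nodes cache, for their whole
 * subtree, the byte length, the count of grapheme clusters, and the count
 * of newlines, so "find grapheme #k" or "find line #k" is an O(log n)
 * descent instead of a scan. Leaves hold chunks of UTF-8. */
typedef struct text_node {
    struct text_node *left, *right;   /* NULL for leaves          */
    const char       *bytes;          /* leaf payload (UTF-8)     */
    size_t            n_bytes;        /* bytes in this subtree    */
    size_t            n_graphemes;    /* graphemes in subtree     */
    size_t            n_newlines;     /* newlines in subtree      */
} text_node;

/* Descend to the leaf containing grapheme index k (0-based); on return,
 * *k is the index within that leaf, which still has to be scanned. */
const text_node *find_grapheme(const text_node *t, size_t *k) {
    while (t->left) {
        if (*k < t->left->n_graphemes) {
            t = t->left;
        } else {
            *k -= t->left->n_graphemes;
            t = t->right;
        }
    }
    return t;
}

int main(void) {
    text_node a = { NULL, NULL, "hello\n", 6, 6, 1 };
    text_node b = { NULL, NULL, "world",   5, 5, 0 };
    text_node root = { &a, &b, NULL, 11, 11, 1 };
    size_t k = 7;                               /* grapheme #7 is 'o' in "world" */
    const text_node *leaf = find_grapheme(&root, &k);
    printf("leaf \"%s\", local index %zu\n", leaf->bytes, k);
    return 0;
}
</pre>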
There is a meaningful character type when you can guarantee that splitting a string yields two strings whose "lengths" in characters sum to the length of the original string. But in Unicode you can split strings and rejoin partial strings in ways that change the total number of characters, in ways that divide characters on codepoint boundaries instead of character boundaries, and in ways that leave an equal number of codepoints on each side of the split but an unequal number of characters, or vice versa. Every operation has to count grapheme boundaries, and every substring produced by any operation has to be eagerly normalized in order for subsequent operations to have consistent semantics.
There is a meaningful character type when you can give a definite answer to how much space a string of N characters will take to represent. In Unicode you cannot. By restricting yourself to a subset of Unicode you can impose a maximum, but the set of possible characters is literally infinite. Any restriction will result in an infinite number of characters that cannot be represented.
There is a meaningful character type when you can iterate exhaustively over subranges. In Unicode you can't do that. Codepoints have to be checked against a map to see whether they have even been assigned a meaning, many codepoints that do have a meaning aren't meaningful in context depending on their combining class, and the set of valid "grapheme clusters" can't even be enumerated. So, at best, iteration can only be approximated. No matter how you manage it you're going to have if/thens that slow your code down by orders of magnitude by causing instruction cache misses, maps hanging around putting pressure on your data cache, and lots of different code paths that constantly need to be swapped in and out of the instruction cache. Even approximating an iteration over, say, 100 Unicode characters is going to take 200 times as long as iterating over 100 characters of a fixed character set.
So, no. There is no meaningful character type any more with Unicode strings. And text is no longer a "string" or array of characters.
At best, you can have a "text" type, represented as string segments managed in a tree structure with overhead to keep track of boundaries, and achieve O(log N) performance (times an achingly huge constant) on most operations.
And individual characters? Uh, maybe you could leave off the tree nodes, but you're still going to have overhead to keep track of how many codepoints there are in it, keep in mind that the number of codepoints required to represent a character could change with any operation, and keep in mind that the number of CHARACTERS can change with a case change. For example, when you uppercase an eszett (ß), you get two capital 'S' characters, meaning either your "Character" suddenly becomes "Text", or you give up supporting case operations on individual characters. You might as well have a tree node above it, since that's exactly the kind of overhead tree nodes exist to handle in longer text.
So. Nope. With Unicode there's no such thing as a character type, or at least no advantage in representing it or handling it differently from handling a sequence of characters.
And sequences of characters, likewise, can't be stored as strings/arrays any more either. Now they are just "Text", they have a hideous overhead compared to "Strings," and "Character" is just a name for "Text" that happens to have a length of one.
2025-11-02 21:58:46 +0000 | Ray Dillinger | Is character as a type meaningless?

With Unicode, Character as a type is in fact meaningless.
http://lambda-the-ultimate.org/node/5598#comment-97050
Duplicate post redacted, please ignore.
2025-11-02 21:43:59 +0000 | Ray Dillinger | Is character as a type meaningless?

...people claim about C++
http://lambda-the-ultimate.org/node/5515#comment-97049
<blockquote>
...people claim about C++ that it imposes no overhead greater than C code.
</blockquote>
People claim <em>lots</em> of things they have never measured. Even things that nobody actually involved in the creation of C++ ever claimed. At most, they claimed comparable performance for code that did comparable things.
The thing that broke that claim was exception handling, which has non-local overheads. At best, the claim for exception handling was that you wouldn't execute exception handling code for exceptions unless exceptions were thrown. Unfortunately you still pay in i-cache residency unless code placement in the binary is very clever, and that's actually quite hard to do.
Exceptions also make quite a mess in the type system.
The late arrival of templating in C++ and Go was unfortunate. The absence meant that neither language got the option to do return-or-error cleanly. The pressure to keep C++ compatible with C was probably excessive, and it left Bjarne so buried that there wasn't time to incorporate templates or algebraic types early. Unfortunately, they are hard to introduce after the fact.
2025-10-28 20:55:27 +0000 | shap | Why Is SQLite Coded In C

Terminology counts
http://lambda-the-ultimate.org/node/5598#comment-97048
It's even more confusing if we don't use clear terminology. For example, characters are made up of code points; code points are not themselves characters.
Semantically, the choice of encoding (UTF-8 etc) is irrelevant. The thing that makes UTF-8 interesting is mainly the pun with ASCII, which only works if you know that the constituent letters fall within the ASCII range. The selected encoding changes total memory consumption dramatically, but that's orthogonal to type.
In Unicode, the thing that breaks glyphs (not characters) as a type is the combining constructs and the need for canonicalization. Either one, or both, makes it clear that strings are not a "sort", at which point their meaning as a ground type becomes hard to explain. This is part of why so many programming languages and runtimes take care to specify a canonical normalization. Which helps, but is not sufficient.
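Concretely (a self-contained illustration, no Unicode library needed): the precomposed and decomposed spellings of "é" are canonically equivalent but not byte-equal, so raw comparison is wrong unless both sides have been normalized first:
<pre>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Both spell "é": NFC uses U+00E9; NFD uses U+0065 + U+0301. */
    const char *nfc = "\xC3\xA9";          /* 2 bytes, 1 code point  */
    const char *nfd = "e\xCC\x81";         /* 3 bytes, 2 code points */
    printf("byte-equal? %s\n", strcmp(nfc, nfd) == 0 ? "yes" : "no");
    /* Prints "no": equality of glyphs requires normalizing first. */
    return 0;
}
</pre>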
If you somehow get past that, you can model strings of glyphs as a tumbler space (floating point is a large but length-bounded tumbler space).
2025-10-28 20:38:35 +0000 | shap | Is character as a type meaningless?

Yeah, been there...
http://lambda-the-ultimate.org/node/5515#comment-97046
Good point; I have rolled OO in C myself, both directly (in code that the program I was working on would call) and indirectly (in implementing compilers and interpreters for languages with OO features).
My general opinion of OO is that it's frequently a useful technique, no matter whether you're programming in an "OO language" or not. If you as an individual programmer want to use an OO style, you can do it in C, or Pascal, or Scheme with only marginally greater difficulty than you can in languages made for it.
The main benefit of OO languages is not for individual programmers, but for organizations and large teams. They serve as a way to keep programmers working on different functionality in the same project out of each other's hair to the greatest degree possible. If you can point twenty different engineers at twenty different problems and they can go off and work on them effectively with only a few meetings, a dozen or so emails a day, and occasional phone calls, your code organization is a success.
OO features being standardized in OO languages is a good way to facilitate that kind of organizational success. A standard set of OO features provides a reasonably general way to make the interaction of different parts of the code work according to an easily understood, knowable paradigm. And if that's shared knowledge, the engineers no longer have to discuss it all that much. Also, objects having properties like "private" that the compiler can enforce prevents a lot of wasted bandwidth over unexpected out-of-context calls corrupting data, and so on.
OTOH, if fifty engineers are each rolling their own OO in languages that don't have standard OO features, you get chaos where they have to communicate with each other much much more instead of much much less.
2025-10-07 04:24:24 +0000 | Ray Dillinger | Why Is SQLite Coded In C

Don't throw away a useful tool
http://lambda-the-ultimate.org/node/5515#comment-97045
I remember reading the code for an early GUI toolkit for X (Motif?). All in C, but explicitly building and using C++ style virtual tables.
It took me a while to figure out what was going on, because I hadn't learned C++ yet.
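The idiom looks roughly like this (a generic sketch of hand-rolled vtables in C, not the actual Motif/Xt code):
<pre>
#include <stdio.h>

/* Hand-rolled "virtual table": a struct of function pointers shared by
 * every instance of a class, plus per-instance data carrying a pointer
 * to it. This is the idiom C GUI toolkits used before (and instead of)
 * C++. */
typedef struct widget widget;

typedef struct {
    void (*draw)(widget *self);
    void (*destroy)(widget *self);
} widget_vtable;

struct widget {
    const widget_vtable *vt;   /* the "vptr" */
    int x, y;
};

/* A "subclass": same layout prefix, its own method implementations. */
static void button_draw(widget *self)    { printf("button at %d,%d\n", self->x, self->y); }
static void button_destroy(widget *self) { (void)self; }

static const widget_vtable button_vt = { button_draw, button_destroy };

int main(void) {
    widget b = { &button_vt, 10, 20 };
    b.vt->draw(&b);            /* "virtual" dispatch through the table */
    b.vt->destroy(&b);
    return 0;
}
</pre>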
I guess my point is, there is a time and a place for systematic use of indirect branching. GUI toolkits are a particular field where object orientation just makes sense, even if you have to roll it yourself.
2025-10-01 19:58:31 +0000 | Bruce J. Bell | Why Is SQLite Coded In C

Right. That's an example.
http://lambda-the-ultimate.org/node/5515#comment-97044
Opaquely indirected function calls seem worse and worse the more I deal with code that uses them. ESPECIALLY overloaded operators that invite the programmer to think s/he knows what the code means when it means something completely different.
[Redacted: nightmare story about overloaded "+" intended for array mathematics being called in place of overloaded "+" intended for string concatenation, uncaught because both were stored as byte vectors of compatible underlying type as seen by the third-party GPU library that redefined "+" when nobody was looking, resulting in strings of binary garbage emerging from unchanged code that was working a few days prior. It would have been much worse if it had been the other way, because a byte vector interpreted as the (numeric) result of array mathematics isn't as obviously garbage.
Operator overloading, at least in forms seen so far, is BAD. ]
That's an example of exactly the kind of thing I was talking about. Features baked into the extended runtime (like opaque indirected procedure calls) make the semantics much harder to check and verify, and impose a penalty in stack memory management overhead, not to mention compile time and optimization opportunities lost where things can't be proven.
Even if your code doesn't use them. I know people claim about C++ that it imposes no overhead greater than C code. But the C++ runtime environment has to support things the C runtime environment doesn't.
Even if the code you write is just C and runs at the same speed, that runtime environment you're linking it to is bigger, heavier, and a bit slower. Even if the code you write is just C, the C++ compiler can't optimize it as tightly because the C++ compiler has to assume the worst about what, in a C++ environment, could *POSSIBLY* be in some other file of code and change the meaning of the code it's looking at.
2025-10-01 14:31:46 +0000 | Ray Dillinger | Why Is SQLite Coded In C

Efficiency and testability
http://lambda-the-ultimate.org/node/5515#comment-97043
After coding the EROS kernel in C++, we switched to C for its successor. There were three reasons for the switch:
<ol>
<li>The EROS kernel threw *no* exceptions, but we paid a 30% performance penalty due to reduced iCache utilization because of exception handler code that could never be reached.
</li>
<li>Static model checking on languages that encourage opaquely indirected procedure calls (as in virtual functions) is <em>much</em> harder - the required conservative assumptions rapidly compromise useful results. (A small illustration follows this list.)
</li>
<li>The reduction in "helpful abstraction" made it much easier to understand what the code was actually going to do on the machine. In a security kernel, that's critical.
</li>
</ol>
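The illustration promised in point 2 (my own minimal example, not EROS code): with a direct call the analyzer knows exactly which function runs; with a call through a function pointer held in data - the C analogue of a virtual call - it must conservatively assume any function with that signature might run:
<pre>
#include <stdio.h>

/* Why opaque indirection hurts static checking: the direct call below has
 * a statically known callee; the indirect call's callee is whatever the
 * data happens to point at, so a checker (or optimizer) has to assume any
 * function of that type, with any of their side effects. */

static int twice(int x)  { return 2 * x; }
static int thrice(int x) { return 3 * x; }

struct ops { int (*apply)(int x); };

static int direct_path(int x)                        { return twice(x); }      /* callee known  */
static int indirect_path(const struct ops *o, int x) { return o->apply(x); }   /* callee opaque */

int main(void) {
    struct ops o = { thrice };   /* chosen at run time, as far as a checker knows */
    printf("%d %d\n", direct_path(5), indirect_path(&o, 5));
    return 0;
}
</pre>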
<p>
I've said we didn't use exceptions, but we also didn't use inheritance. Given this, essentially none of the C++ features had any value for us.
</p>
<p>
The main cost was a weaker sequence point specification, which wasn't a big issue for this particular code.
</p>
<p>It is an open but unresolved question whether Rust would be better or safer for this code base.</p>
2025-09-30 04:04:20 +0000 | shap | Why Is SQLite Coded In C

Really old issue, no progress?
http://lambda-the-ultimate.org/node/5657#comment-97041
This has been going on for *years*, and it's just not OK.
2025-09-29 23:41:43 +0000 | shap | HTTPS and logins to LtU.

Consequence of long tenure
http://lambda-the-ultimate.org/node/5515#comment-97040
I think this virtue is not a consequence of particularly good design, but of a long period of heavy use. C is polished, but from a long time in the tumbler. All the fragile corners have been knocked off.
C's current stability might not have been a foregone conclusion, but its decades as the de facto foundation of systems programming are the obvious explanation for its long-term backwards compatibility.
2025-09-27 11:22:55 +0000 | Bruce J. Bell | Why Is SQLite Coded In C

Anyone remember "Dataflow" languages?
http://lambda-the-ultimate.org/node/5621#comment-97039
I have often thought that spreadsheets could be used as a substrate or organizing principle for some kind of programming language. In fact one could make a good case that spreadsheets actually ARE a special case of programming language.
Human brains that don't do well with one-dimensional text with lots of nesting levels can often organize things better when presented with a table of values and a well-defined way to derive each value from the others.
Dataflow languages are by nature reactive and functional. Change an input, the computation gets done and intermediate values change as necessary, and the outputs change. Which is a lot like spreadsheets in the first place.
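A toy version of that reactive behaviour (my own sketch, nothing to do with any particular spreadsheet engine; all names invented): cells are either inputs or formulas over other cells, and changing an input just means re-evaluating everything downstream in dependency order:
<pre>
#include <stdio.h>

/* A toy dataflow sheet: a cell is either an input or a formula over two
 * other cells. recalc() re-evaluates cells in dependency order, which is
 * what "change an input and the outputs follow" amounts to. */
typedef struct cell cell;
struct cell {
    double value;
    cell  *a, *b;                          /* NULL for input cells */
    double (*formula)(double a, double b);
};

static double add(double a, double b) { return a + b; }
static double mul(double a, double b) { return a * b; }

static void recalc(cell **order, int n) {
    for (int i = 0; i < n; i++)
        if (order[i]->formula)
            order[i]->value = order[i]->formula(order[i]->a->value,
                                                order[i]->b->value);
}

int main(void) {
    cell price = { 10.0 }, qty = { 3.0 }, tax = { 0.07 };
    cell subtotal = { 0, &price, &qty, mul };
    cell taxed    = { 0, &subtotal, &tax, mul };
    cell total    = { 0, &subtotal, &taxed, add };
    cell *order[] = { &subtotal, &taxed, &total };

    recalc(order, 3);
    printf("total = %.2f\n", total.value);   /* 32.10 */

    qty.value = 5.0;                         /* change one input...    */
    recalc(order, 3);                        /* ...everything follows  */
    printf("total = %.2f\n", total.value);   /* 53.50 */
    return 0;
}
</pre>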
Thing is, even though I believe this idea is driven by a good and useful impulse, I don't think this particular idea is good design. It drops out of the dataflow paradigm by introducing exactly the features of linear programming languages that tend to give humans the most confusion and trouble. In a dataflow language, the way to achieve clear and consistent abstraction is with a "subroutine" that is just another page in the same language, and the way to do I/O with the underlying system is via "cells" that are defined in terms of values that may change in the external world, or that produce output to the external world when written.
2025-09-08 17:33:23 +0000 | Ray Dillinger | LAMBDA: The ultimate Excel worksheet function