Friday, April 30, 2010

Importance of terminology

An upcoming study of first-year computer science students has an interesting observation about terminology. It seems that their understanding of the computer science vocabulary is much worse than one would expect. You cannot blame the teachers because I'm sure they were unaware of the problem. You cannot blame the students for being `dumb' because nearly all of them could identify a `function body'. Yet fewer than one in three could tell you what is meant by a `defined name' or a `predicate'. Really! The results surprised me, but in thinking about it, I have a couple of personal anecdotes that highlight the problem.

Every year or so there used to be a heated debate on one or more usenet groups (remember usenet?) between the static and dynamic type factions. (I won't argue either side.) One year, however, I encountered someone who challenged me about my viewpoint. He said that he knew many people who held a similar viewpoint to mine, and he had a lot of respect for them. They weren't idiots. But he could not understand their rationale. His viewpoint made perfect logical sense to him, so how could another intelligent person disagree so violently? He didn't want to argue, he wanted to understand how anyone could find the alternative viewpoint to be as logical. (He assumed it must be so, or there wouldn't be so many respectable adherents to it.)

We discussed it back and forth for a while in a non-heated way and it dawned on me that the fundamental problem is simply one of terminology. `Static typers' and `dynamic typers' have completely different understandings of the word `type'. They cannot agree on the simplest things because they each think the other knows what he means by `type'.

For the sake of clarity, I'll lay out the definitions:
  • Type1 — an equivalence class for data. Two objects are of the ‘same type’ if operations on one could be (or could have been) performed on the other with analagous results. For example, ‘integer’ is a type because objects like ‘3’ and ‘42’ can both be added, subtracted, multiplied, and divided.
  • Type2 — a purely syntactic label that can be assigned to a code fragment. There are rules for assigning such labels to primitive fragments and rules for composing types as the code fragments are composed. Certain compositions are disallowed. A well-typed program has no disallowed compositions, and a type-checking program can verify this. By clever construction of the rules, certain useful runtime properties can be proven.
I think you can see that these two definitions have almost nothing to do with each other.

Once I understood this, I stopped using the word ‘type’ and started talking about ‘tags’ — manifest metadata attached to all objects in memory that describe what each object is. ‘Static typers’ understand the utility and benefit such things, they don't understand why anyone would want to forego compile-time ‘type checks’. They don't mean the annoying task of decorating every variable and procedure in the system with a ‘type specifier’, but rather having the compiler notice that your code said to vector-ref the result of a floating point divide and that most likely is not what you intended. Notwithstanding the contrarians who complain that maybe they did want to vector-ref a float, every Lisp hacker I know of appreciates getting early notification of these kind of errors (provided that false positives are very rare and it remains unnecessary to kowtow to the compiler).

Next anecdote later...

Thursday, April 29, 2010

Persistent store: a new record type

The object map allows us to abstract away the physical addresses where our objects are stored, but only if we can remember where we put it! We need another record type and things will get much more interesting.

A commit-record is a special record that we will write to the durable storage. The commit-record will contain (among other things) the physical address of the root of the object map. We'll require a little more functionality from the durable storage layer. When we first use a previously existing persistent store, we need to ask the durable storage layer to find the most recent valid and complete commit record. It can do this by starting at the newest record and reading backwards through earlier records until it finds one.

Since computers are unreliable, we have to take into account the possibility that the computer could crash in the middle of writing a commit record. When looking for the commit record on restart, we ignore any partially written commit records. When we're writing a commit record to the durable storage, we don't return control until we're sure that the record is durably saved and complete. If the computer crashes at any time before that last byte is written, we will recover using the previous valid commit record, and if it crashes at any time after that last byte is written we'll use this one.

There is an extra burder on the durable store to be able to find the most recent commit record, but we can take a different burden off the store. Instead of insisting that every record be durably written at the time we send it to the store, we can allow most of the records to be buffered until it is more convenient to write. We only insist that the buffers be flushed prior to writing the commit record. This makes a huge difference in the performance of the store.

There is a design decision to be made at this point. When we save durable objects, we can either send the object and the object-map records to the durable store immediattely, or we can defer sending them until we want to write the commit record. The eager strategy is more responsive when we are doing a lot of writes because we can start writing records while we're computing. The lazy strategy will reduce the total amount of traffic (every time we modify the object-map, we need to write new object-map records, if newer entries supersede records that haven't even been written, we can elide the write), but when we eventually do write the commit record, there will be a flurry of activity as we force the pending writes. This will cause a noticable pause at commit time.

There is another interesting design decision. In many cases, it is ok to lose a few seconds of work if the computer crashes. You wouldn't want to lose anything if you were logging financial transactions, but if you were checkpointing a text file, you might consider it ok if two or three of the most recently typed letters didn't get saved. When the computer restarted, you'd re-open the text file and find that what you were typing wasn't completely there, but was mostly there. You can re-enter the last couple of keystrokes. If this is acceptable — and it is for some applications and not for others — then it is not necessary for commit records to be immediately written to durable storage so long as they are written in a reasonably timely manner (within a second or two). The freedom to defer writing commit records like this will also improve performance quite a bit.


Exercise: Add commit records to your durable store and the ability to recover the most recent commit record when opening an existing store.

Tuesday, April 27, 2010

Open letter to the SEC


To Whom it may concern:

In the document entitled

 SECURITIES AND EXCHANGE COMMISSION
 17 CFR Parts 200, 229, 230, 232, 239, 240, 243 and 249
 Release Nos. 33-9117; 34-61858; File No. S7-08-10
 RIN 3235-AK37 
 ASSET-BACKED SECURITIES
a number of changes to Item 601 of Regulation S-K, Rules 101, 201, 202 and 305 of Regulation S-T, a new Rule 314 of Regulation S-T and changes to Form 8-K are proposed. These changes are proposed to accommodate the filing of the ‘waterfall’ computer program. The document requests comments on this proposal.



Is it appropriate for us to require most ABS issuers to file the waterfall computer program?

As mentioned in the proposal, ABS issuers are currently required to submit an informal, ‘English language’ description of the waterfall. A computer program would be far more formal and less likely to be ambiguous or misinterpreted.

Is it appropriate to require issuers to submit the waterfall computer program in a single programming language, such as Python, to give investors the benefit of a standardized process? If so, is Python the best choice or are there other open source programming language alternatives (such as PERL) that would be better suited for these purposes?

The proposal goes to great lengths to justify the choice of Python as the language in which the waterfall program should be expressed. Unfortunately, every one of the justifications is either incorrect or irrelevant.
  • Python is an open source interpreted programming language. — Python is more accurately described as a family of programming languages with several implementations. The Python Software Foundation supports one particular implementation, but both Microsoft and Nokia offer alternative implementations. The Python Software Foundation provides an interpreter, but there are versions of Python that are fully compiled, byte-code compiled, or JIT compiled.
  • Open source means that the source code is available to all users.— This is true, but irrelevant. INTERCAL is an open source language, yet it is completely unsuitable for the purposes of writing a waterfall program. On the other hand, the Microsoft implementation of ECMAScript is proprietary, yet ECMAScript is certainly a language to consider for waterfall programs.
  • An interpreted language is a programming language that requires an interpreter in the target computer for program execution. — Interpretation is a technique for implementing a language. Few languages require an interpreter, and as noted above, compilers for Python exist.
  • Since Python is an interpreted language that does not need to be compiled prior to running it, executable code would not need to be published on EDGAR. We define executable code in Rule 11 of Regulation S-T [17 CFR 239.11] as instructions to a computer to carry out operations that use features beyond the viewer's, reader's, or Internet browser's native ability to interpret and display HTML, PDF, and static graphic files. Such code may be in binary (machine language) or in script form. Executable code includes disruptive code.— A Python program is most certainly ‘executable’ by the definition supplied.
The proposal appears to be suggesting that safety is an important desideratum. This includes the safety of the EDGAR system as well as the safety of the investors that would presumably use the waterfall program. The proposal concludes that a language that is executed by an interpreter would be inherently safer than a compiled language. This is false. There has been a considerable amount of research into computer safety, and it is clear that safety is very hard to achieve even if it is the highest priority in the design. Provable safety is not, and never has been, a goal of Python.

To conclude, I believe that a formal description of the waterfall algorithms written as a computer program may be quite beneficial to investors. The proposal, however, fails to enumerate the desiderata for the language such a program would be written in. The proposal makes a few erroneous assertions and then puts forward Python as a suggested language. While Python is a popular language, it was never intended to be used as a critical component of such an important part of the U.S. economy.

I suggest that Securities and Exchange Commission, if they intend to pursue the path of requiring a formal waterfall program to be provided, follow a formal process for finding or developing a suitable language. First, the requirements for such a language need to be specified. Second, these requirements should be compared against existing and proposed languages. Finally, one or more of the languages should be adopted and implemented.

Sincerely,
Joe Marshall

More on persistence

I haven't forgotten about persistence, I've just been busy.

When I last blogged about it, I discussed the need for an object-map. The object-map maps logical object ids to the physical location that the durable store placed the object. We'll be putting a couple more kinds of things in the durable store, so at this point it is a good idea to add a record abstraction on top of the durable store. It is easier to show this than to describe it. Here are the actual contents of a test durable store:
;;; A persistent store. Do not edit.
(object 1 929 <string> 1 "zip")
(leaf 1 38)
(object 1 269 <string> 1 "zero?")
(leaf 1 82)
(branch 1 116 0 38)
(object 1 387 <string> 1 "yield-current-thread")
(leaf 1 148)
(branch 1 0 197 82)
(leaf 1 38)
(leaf 1 82)
(branch 1 242 230 148)
(object 1 535 <string> 1 "xsubstring-move!")
(leaf 1 277)
(branch 1 322 0 38)
(branch 1 242 335 148)
(object 1 295 <string> 1 "xsubstring-find-previous-char-in-set")
A ‘record’ is simply a list. The first element is one of object, leaf, or branch. The second element is the version number of the record layout in case I decide to change how the record is represented. The remaining elements are the contents of the record. For a leaf, it is simply the address of the object that this leaf represents. For a branch, it is the addresses of the left and right child and the address of the object the branch represents. (This is using a functional binary tree as detailed in the code.) An object record has four additional components. First is the object-id. Second is the data tag of the object. Third is a version number of the serialization method for the object and finally the serialized representation of the object itself.

This is clearly not the most parsimonious representation, and it could be made much more robust by including record length information, checksums or CRCs, and by ensuring that user data (such as those strings) cannot spoof a record boundary, but this is demonstration code, not production code.

In the next installment, we add another record type.

Friday, April 23, 2010

The Education of JRM

I learn so much from the Internet. Yesterday, someone posted a link to my post about Java. I got close to 3 orders of magnitude increase in traffic. Many people commiserated with me, and I thank them. Some, however, went the extra mile:
An ellipsis has three dots.
Mea culpa. I was sloppy. Fortunately, another reader defended me:
An ellipsis at the end of a sentence is three dots, followed by a period. He/she did it right.
Alas, if I did it right it was only by accident.
There should be a space between the ellipsis and the period. Putting four dots ala George Lucas is incorrect....
I'm flattered to be placed in such good company!

One reader pointed out that the problem might not lie in Java, but within me:
This is what reflection is for. If you can't solve this problem with reflection, then you don't really have reflection.
while others offered concrete advice:
Maybe you just need to learn to type faster.
and still others tactfully suggested I broaden my horizons:
ever code in C++? this isn't unique to Java, noob.
I thank you all for the advice and the spirit in which you offered it. I believe I'll reflect upon it this weekend — assuming I really have reflection. I will continue posting in the hope that I can be further edified.

Wednesday, April 21, 2010

Whenever I write code in Java....

Whenever I write code in Java I feel like I'm filling out endless forms in triplicate.

“Ok, sir, I'll just need your type signature here, here, and ... here. Now will this be everything, or...”

‘Well, I might need to raise an exception.‘

The compiler purses its lips.“An exception? Hmmm... let's see.... Yes, I think we can do that... I have the form over here... Yes, here it is. Now I need you to list all the exceptions you expect to raise here. Oh, wait, you have other classes? We'll have to file an amendment to them. Just put the type signature here, here, ... yes, copy that list of exceptions....
Student M overhears the argument between Students A and T. “What seems to be the problem?”

Student T explains, ‘We're at an impasse. I want to be able to change the semantics of the language...’

Student A says “... and I want the semantics to be stable.”

Student T says ‘It seems to me that if the program is a tool for solving a problem, then the language itself is just another tool. I should be able to make changes to the language if it helps solve the problem.’

Student A replies “And I have no problem with that in principle, but when I'm writing a program, I need to know what the language semantics are so I can have some idea about whether my program will work. If the semantics change after I write the program, I won't even know whether I have a program. Mangle the semantics all you want, but then tell me what they are and don't change them!”

Student T says ‘I'm not going to “mangle” the semantics. But I don't want to lock myself out of a solution for your convenience. I want to be able to take whatever baseline semantics we have and tweak them as appropriate, but I can't change the semantics before I write the program the changes them!’

Student A says “This is just impossible.”

Student M interrupts “Time out, everyone. Ok, now let me get this straight. You...” and he points to student A, “want the meaning of your code to be stable, and you...” (he points to student T) “ want to be able to tailor the semantics to suit you. That's simple! Why are you arguing?”

Students A and T look confused. ‘Simple?!’

Student M says, “Of course. You both agree that changing the language will change the semantics, and conversely changing the semantics changes the language...”

Students A and T nod.

Student M turns to student A “So you write your code in your language (let's call it L), and we'll be sure that L has stable, static semantics that never change.”

Student A says “That's what I'm talking about.”

Student T objects ‘Hold on! You're just completely ignoring me! What if I don't like the semantics?’

Student M turns to student T “I'm not done, yet. You want the freedom to morph language L into language L', which is very much like L, but with maybe a few changes...”

Student T interrupts ‘.. or a lot of changes. And not just L', I want L, L', L'', L''', whatever changes I think are necessary and whenever I think they are necessary.’

Student M continues “... or a lot of changes. I get you. Now here's the question: If I have a program in language L, but I have a machine that runs a different language — call it C — what do I do?”

Student T replies ‘Write an interpreter...’

Student A answers “ ... or a compiler.”

Student M says “Bingo! Problem solved. Simply use language L as the base language for an interpreter or compiler for programs in L' (or L'' or L''', etc.)”

Student A thinks for a moment then says “Works for me. If I need to understand code written in L', I can just compile it down to L.”

Student T looks dubious. ‘Wait a second. You're suggesting I write a compiler? What if I want several different language features to mix and match? I'm supposed to write compilers for each of them?!’

Student M shrugs “Well, you're the one that wants to change the language, you can't expect student A to write them. Besides, it isn't that much work.”

Student T protests ‘Not that much work? It's a compiler! Flow control, analysis, register allocation, code generation! I just want to tweak the language, not invent a whole new one from scratch!’

Student M counters “Who said you need to do all that? A compiler is simply a program that translates code from one language to another, right? If you can express the constructs of language L' (or L'' etc.) in terms of language L, then that's all you need to do. If you're really just tweaking the language, then your ‘compiler’ is mostly the identity function.“

‘And the parts that aren't?’

“Macros.”


Is student M's solution reasonable?
Subscribe to: Comments (Atom)

AltStyle によって変換されたページ (->オリジナル) /