Saturday, May 24, 2008
Removing duplicates from a sequence
Albert van der Horst posted to comp.lang.forth about an "elementary but surprisingly difficult problem"; here's a snippet:
An area given as (addr_start, addr_end) contains items
of length ``len''. I want to get rid of the duplicates.
Sorting the items is pretty easy, qsort can handle that.
Now eliminating the duplicates, we need only compare one
item to the next.
Something along the lines of compact ( start, end, len -- end' )
The new end should be returned, of course.
Even using variables/locals, I find this difficult.
The problem with Forth is that while it gives you all the building blocks needed to make high-level abstractions, it doesn't actually give you any pre-packaged ones and there's no culture of code sharing or library development in the Forth community either, so everybody has to reinvent basic sequence processing operations. In contrast, Factor's collection library makes this problem trivial.
Suppose S is our sequence, already sorted. First of all, S[0] is always in the resulting sequence, R. For n >= 1, S[n] is in the result R only if S[n] != S[n-1]. Suppose we make a new sequence, T, such that T[n] is the pair (S[n],S[n+1]). Then we can get our R by removing all pairs from T such that both elements are equal, then mapping over the result to take the second element from each pair, and finally prepending S[0] to this.
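Before looking at the Factor code, the same pipeline can be sketched in Python (a transliteration of the description above, not Factor's actual implementation; the function name is mine):

```python
# A direct Python transcription of the pairing idea: clump into consecutive
# pairs, drop pairs whose elements are equal, keep each surviving pair's
# second element, then prepend S[0].
def remove_duplicates_sorted(s):
    if not s:
        return []
    pairs = list(zip(s, s[1:]))                # T[n] = (S[n], S[n+1])
    kept = [b for (a, b) in pairs if a != b]   # drop equal pairs, take second
    return [s[0]] + kept                       # prepend S[0]
```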
Here is the Factor implementation:
[ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix
No variables, locals or stack shuffling needed.
Here is how the code works:
2 clump takes a sequence on the stack and gives a new sequence of '2 element clumps':
( scratchpad ) { 1 2 2 3 4 4 4 5 6 } 2 clump .
{
{ 1 2 }
{ 2 2 }
{ 2 3 }
{ 3 4 }
{ 4 4 }
{ 4 4 }
{ 4 5 }
{ 5 6 }
}
Next, all-equal? checks if all elements in an array are equal. We want to iterate over the above sequence of pairs, discarding those where both elements are equal; this is what [ all-equal? not ] filter achieves:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter .
{ { 1 2 } { 2 3 } { 3 4 } { 4 5 } { 5 6 } }
Next, we want to take the second element of each pair, with [ second ] map:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) 2 clump [ all-equal? not ] filter [ second ] map .
{ 2 3 4 5 6 }
Finally, we use bi because we do two things to the array; we first find the second element of each non-duplicate clump, then we take the first element of the original array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi . .
1
{ 2 3 4 5 6 }
Now that we have two things on the stack, in the right order, we call prefix to prepend that element to the new array:
( scratchpad ) { 1 2 2 3 4 4 4 5 6 }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ 1 2 3 4 5 6 }
Of course, this approach allocates some unnecessary intermediate structure. However, it took me about 20 seconds to write, it worked the first time, and I didn't have to mess around with variables, locals, or stack operations.
It works with other data types also:
( scratchpad ) { "a man" "a man" "a plan" "a canal" "panama" "panama" }
( scratchpad ) [ 2 clump [ all-equal? not ] filter [ second ] map ] [ first ] bi prefix .
{ "a man" "a plan" "a canal" "panama" }
Update: Daniel Ehrenberg came up with an even shorter version; figuring it out is an exercise for the reader:
[ 2 clump [ = not ] assoc-filter values ] [ first ] bi prefix
Now, removing duplicates is so common that Factor already has a word in the sets vocabulary named prune which does exactly this, but with a different algorithm. If you sort then iterate, you get an O(n log n) algorithm, and it also depends on there being an intrinsic order among the elements. Instead, you can use a hashtable; if you have good hashcodes, then in theory you get an
O(n) algorithm (but the constant factors might be worse). The idea is that instead you iterate over your sequence, and check if each element is in the hashtable. If it's not there, it's a new element, and you flag it as such then add it to the hashtable. If it's already in the hashtable, you skip it.
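The hashtable approach described above can be sketched in Python, where a set plays the role of the hashtable (a model of what prune does, not its actual implementation):

```python
# Order-preserving duplicate removal in one pass: O(n) expected time,
# and no ordering requirement on the elements.
def prune(seq):
    seen = set()
    result = []
    for x in seq:
        if x not in seen:      # new element: flag it and keep it
            seen.add(x)
            result.append(x)
    return result
```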
Thursday, May 22, 2008
SSL/TLS support added
Some time last year, Elie Chaftari implemented an OpenSSL binding in Factor (and by the way, Elie, good luck with Medioset!). I took this binding and extended it into a high-level SSL library which integrates with Factor's streams and sockets. The effort involved was non-trivial, because OpenSSL's nonblocking I/O behavior is somewhat confusing and not very well documented. So far it only works on Unix. Integration with the Windows overlapped I/O system is still pending.
I will start by presenting a simple client/server application that uses standard sockets, then convert it to use SSL.
We will implement an application which sends the current time of day, formatted as an RFC822 string, to the user.
First of all, how do we print such a string? We get the current time with
now, which pushes a timestamp object on the stack, then convert it to an RFC822 string with timestamp>rfc822:
( scratchpad ) USING: calendar calendar.format ;
( scratchpad ) now timestamp>rfc822 print
Thu, 22 May 2008 01:25:18 -0500
We want to send this to every client that connects, so we use the
with-server combinator from the io.server vocabulary. It takes four parameters:
- A sequence of addresses to listen on. We create these addresses by passing a port number to the internet-server word. It gives us back a sequence of addresses:
( scratchpad ) 9670 internet-server .
{
T{ inet6 f "0:0:0:0:0:0:0:0" 9670 }
T{ inet4 f "0.0.0.0" 9670 }
}
- A service name for logging purposes. We pass "daytime", which means that connections will be logged to the logs/daytime/1.log file in the Factor directory. We can then invoke rotate-logs to move 1.log to 2.log, and so on. Log analysis tools are available too. I've discussed this in a prior blog post.
- An I/O encoding specifier for client connections. We only care about ASCII, so we pass ascii.
- Finally, a quotation which is run for each client connection, in a new thread. Our quotation just prints the current time:
[ now timestamp>rfc822 print ]
Here is a
daytime-server word, with the complete with-server form:
USING: calendar.format calendar io.server io.encodings.ascii ;
: daytime-server ( -- )
9670 internet-server
"daytime"
ascii
[ now timestamp>rfc822 print ]
with-server ;
We can start our server as follows:
USE: threads
[ daytime-server ] in-thread
Now, let's test it with
telnet:
slava$ telnet localhost 9670
Trying ::1...
Connected to localhost.
Escape character is '^]'.
Thu, 22 May 2008 01:32:17 -0500
Connection closed by foreign host.
Works fine. The log file
logs/daytime/1.log will now have logged something like the following:
[2008-05-22T01:32:17-05:00] NOTICE accepted-connection: { T{ inet6 f "0:0:0:0:0:0:0:1" 50916 } }
Now, let's write a client for this service. We'll write the client using the
with-client combinator from io.sockets. It takes three parameters:
- An address specifier. This is the server to connect to. We're going to connect to port 9670 on localhost, so we pass "localhost" 9670 <inet>.
- An encoding specifier. Again, we're only interested in 7-bit ASCII, so we pass ascii.
- The final parameter is a quotation. We wish to read a line of input from the server and leave it on the stack, so we pass [ readln ].
Here is the complete
with-client form:
USING: io io.sockets io.encodings.ascii ;
"localhost" 9670 <inet> ascii [ readln ] with-client
We're not quite done though; we should really parse the timestamp from its RFC822 string representation back into a timestamp object. We can do this with the
rfc822>timestamp word. Here is the complete daytime-client word, refactored to take a host name from the stack:
USING: calendar.format io io.sockets io.encodings.ascii ;
: daytime-client ( hostname -- timestamp )
9670 <inet> ascii [ readln ] with-client
rfc822>timestamp ;
Let's test our program, using the
describe word to get a verbose description of the timestamp returned:
( scratchpad ) "localhost" daytime-client describe
timestamp instance
"delegate" f
"year" 2008
"month" 5
"day" 22
"hour" 1
"minute" 37
"second" 47
"gmt-offset" T{ duration f 0 0 0 -5 0 0 }
So we're done! Here is the complete program:
USING: calendar.format calendar
io io.server io.sockets
io.encodings.ascii ;
IN: daytime
: daytime-server ( -- )
9670 internet-server
"daytime"
ascii
[ now timestamp>rfc822 print ]
with-server ;
: start-daytime-server ( -- )
[ daytime-server ] in-thread ;
: daytime-client ( hostname -- timestamp )
9670 <inet> ascii [ readln ] with-client
rfc822>timestamp ;
You can put this in a file named
work/daytime/daytime.factor in the Factor directory, then enter the following in the listener:
( scratchpad ) USE: daytime
Loading P" resource:work/daytime/daytime.factor"
( scratchpad ) start-daytime-server
( scratchpad ) "localhost" daytime-client describe
So we have a simple client/server application with support for multiple connections and logging.
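For readers more comfortable with a mainstream language, here is a rough Python model of the same service, with the socket and threading modules standing in for io.server: a background server thread sends one RFC822 timestamp per connection, and the client reads one line and parses it back. The structure mirrors the Factor version, but all names here are my own:

```python
# Sketch of the daytime service in Python: one timestamp per connection.
import socket
import threading
from email.utils import formatdate, parsedate_to_datetime

def daytime_server(sock):
    # accept connections forever; send one RFC822-formatted line each
    while True:
        conn, _ = sock.accept()
        with conn:
            conn.sendall((formatdate(localtime=True) + "\r\n").encode("ascii"))

def daytime_client(host, port):
    # read a single line from the server and parse it back into a datetime
    with socket.create_connection((host, port)) as s:
        line = s.makefile("r", encoding="ascii").readline().strip()
    return parsedate_to_datetime(line)

sock = socket.socket()
sock.bind(("127.0.0.1", 0))   # any free port
sock.listen()
threading.Thread(target=daytime_server, args=(sock,), daemon=True).start()
```

A client would then call `daytime_client("127.0.0.1", sock.getsockname()[1])` and get back a timezone-aware datetime.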
Let's add SSL support! There are four things to change.
- We need to add io.sockets.secure to our USING: list:
USING: calendar.format calendar
io io.server io.sockets io.sockets.secure
io.encodings.ascii ;
- We change daytime-server to accept SSL connections by replacing internet-server with secure-server; we also change the port number:
: daytime-server ( -- )
9671 secure-server
"daytime"
ascii
[ now timestamp>rfc822 print ]
with-server ;
The secure-server word is much like internet-server in that it takes a port number and outputs a list of addresses, except this time the addresses are secure addresses, which means that a server socket listening on one will accept SSL connections:
( scratchpad ) 9671 secure-server .
{
T{ secure f T{ inet6 f "0:0:0:0:0:0:0:0" 9671 } }
T{ secure f T{ inet4 f "0.0.0.0" 9671 } }
}
- Next, we change start-daytime-server to establish SSL configuration parameters. We need two things here: Diffie-Hellman key exchange parameters, and a certificate for the server.
Diffie-Hellman key exchange parameters can be generated with a command-line tool:
openssl dhparam -out dh1024.pem 1024
There are several ways to go about obtaining a server certificate. You can either fork out some money to a CA such as VeriSign or GoDaddy (prices range from 15ドル to 3000ドル), or you can generate a self-signed certificate, for example by following these directions. Once you have the two files, dh1024.pem and server.pem, place them inside work/daytime/, and modify start-daytime-server to wrap the daytime-server call with the with-secure-context combinator. This combinator takes a secure-config object from the stack, and we fill in the slots of this object with the pathnames of the two files we just generated:
: start-daytime-server ( -- )
[
<secure-config>
"resource:work/daytime/dh1024.pem" >>dh-file
"resource:work/daytime/server.pem" >>key-file
[ daytime-server ] with-secure-context
] in-thread ;
Note that you must nest with-secure-context inside in-thread, not vice versa, because in-thread returns immediately, and with-secure-context destroys the context after its quotation returns.
- Finally, we change daytime-client to establish SSL connections. It suffices to change <inet> to <inet> <secure>, and use the new port number we defined in daytime-server. This makes with-client establish an SSL connection:
: daytime-client ( hostname -- timestamp )
9671 <inet> <secure> ascii [ readln ] with-client
rfc822>timestamp ;
However, by default, the client will validate the server's certificate, and unless you're using a certificate signed by a root CA, this will fail.
To disable verification, we can wrap client calls in an SSL configuration with verification disabled:
<secure-config> f >>verify [ "localhost" daytime-client ] with-secure-context
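For comparison, Python's ssl module expresses the same idea: build a configuration object with verification turned off, then use it when connecting (a sketch of the client side only; the wrap_socket call is shown as a comment since it needs a live server):

```python
import ssl

# Build a client-side SSL context with certificate verification disabled --
# the moral equivalent of <secure-config> f >>verify for self-signed servers.
ctx = ssl.create_default_context()
ctx.check_hostname = False           # don't match the server's hostname
ctx.verify_mode = ssl.CERT_NONE      # accept self-signed certificates

# ctx.wrap_socket(sock, server_hostname="localhost") would then establish
# an SSL connection without validating the certificate chain.
```

Note that check_hostname must be cleared before verify_mode can be set to CERT_NONE.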
Here is our complete daytime client/server vocabulary which uses SSL:
USING: calendar.format calendar
io io.server io.sockets io.sockets.secure
io.encodings.ascii ;
IN: daytime
: daytime-server ( -- )
9671 secure-server
"daytime"
ascii
[ now timestamp>rfc822 print ]
with-server ;
: start-daytime-server ( -- )
[
<secure-config>
"resource:work/daytime/dh1024.pem" >>dh-file
"resource:work/daytime/server.pem" >>key-file
[ daytime-server ] with-secure-context
] in-thread ;
: daytime-client ( hostname -- timestamp )
9671 <inet> <secure> ascii [ readln ] with-client
rfc822>timestamp ;
: daytime-client-no-verify ( hostname -- timestamp )
#! Use this if your server has a self-signed certificate
<secure-config> f >>verify
[ daytime-client ] with-secure-context ;
That's it.
There is one more topic to cover, and that is mixed secure/insecure servers. Often you want to listen for insecure connections on one port, and secure connections on another. We can achieve this easily enough by concatenating the results of
internet-server and secure-server:
( scratchpad ) 9670 internet-server 9671 secure-server append .
{
T{ inet6 f "0:0:0:0:0:0:0:0" 9670 }
T{ inet4 f "0.0.0.0" 9670 }
T{ secure f T{ inet6 f "0:0:0:0:0:0:0:0" 9671 } }
T{ secure f T{ inet4 f "0.0.0.0" 9671 } }
}
The server will now listen on all four addresses, and the quotation can differentiate between insecure and secure clients by checking if the value of the
remote-address variable is an instance of secure or not, using the secure? word.
So what is missing? Several things:
- As I've mentioned at the beginning, SSL sockets don't yet work on Windows. This will be addressed soon.
- Session resumption is not supported, although this is relatively easy to implement; I just wanted to get everything else working first.
- Client-side authentication isn't supported either. Again, this is pretty easy, and I'll address it soon.
- While it is very easy to create client and server SSL sockets (just wrap your addresses with
<secure>), there is currently no way to upgrade a socket that has already been opened to an SSL socket. This is needed for SMTP/TLS support. I haven't thought of a nice API for this; as soon as I do, I will implement it.
- Better control is needed over cipher suites and certificate validation.
Tuesday, May 20, 2008
What's up with Digg.com?
I was testing Factor's
http.client library with digg.com, and I was just getting timeouts.
Connecting with telnet now:
$ telnet digg.com 80
Trying 64.191.203.30...
Connected to digg.com.
Escape character is '^]'.
GET / HTTP/1.1
host: digg.com
Connection closed by foreign host.
So
telnet knows the connection is being dropped, but Factor doesn't detect this. I thought this might be a bug in Factor, but with all other sites I've tried, Factor correctly detects a closed connection, whereas with Digg, read() on the socket always returns -1 with errno set to EAGAIN. I'm not even sure how telnet detects the connection is being dropped.
To make matters even more confusing, Java doesn't detect the socket was closed either; the following testcase hangs:
import java.net.*;
import java.io.*;
public class DiggTest
{
public static void main(String[] args) throws IOException
{
Socket s = new Socket("digg.com",80);
Writer w = new OutputStreamWriter(s.getOutputStream());
Reader r = new InputStreamReader(s.getInputStream());
w.write("GET / HTTP/1.1\r\nHost: www.digg.com\r\n\r\n");
w.flush();
int ch = r.read();
}
}
It turns out a TCP RESET packet is being sent, but for whatever reason, Factor and Java don't receive any notification of it, whereas telnet does.
I'm completely stumped at this point. Anybody have an idea of what's going on?
Update: As Jason points out in the comments, adding a User-Agent makes it work. I was aware of this already, but that's not the interesting part; what puzzles me is why Factor (and Java) don't pick up on the socket being closed.
Wednesday, May 14, 2008
Improved destructors merged into core; working on SSL binding
I don't have anything earth-shattering to blog about at this stage, so this post is mostly just a disorganized progress report for the last week or so.
Last September, Doug Coleman implemented a destructors abstraction for Factor. At the start of this year, I implemented a generic resource disposal protocol. I combined these two concepts into a single vocabulary and moved it into the core.
Over the last few months, destructors have proved their worth over and over again when dealing with C libraries. Doug's post above describes them in great detail.
Here I will skim over the general concept, together with an overview of the new word names.
Suppose you need to call a C function which takes an array of three structures. The array itself can be allocated in the Factor heap, as a byte array; however, the structures must be allocated in the malloc heap, because the garbage collector doesn't know about pointers inside byte arrays. So we can start by writing the following code:
FUNCTION: int our_c_function ( SHAPE* shapes ) ;
"SHAPE" malloc-object
"SHAPE" malloc-object
"SHAPE" malloc-object
3array >c-void*-array
our_c_function 0 < [ "Error return from C function" throw ] when
This is useless because it leaks memory every time it is called.
Let's keep the array around and deallocate them when we're done:
"SHAPE" malloc-object
"SHAPE" malloc-object
"SHAPE" malloc-object
3array [
>c-void*-array our_c_function
0 < [ "Error return from C function" throw ] when
] keep
[ free ] each
This is still problematic. Any one of the
malloc-object calls can fail, and we're throwing an exception too. The solution is to use destructors:
[
"SHAPE" malloc-object &free
"SHAPE" malloc-object &free
"SHAPE" malloc-object &free
3array >c-void*-array our_c_function
0 < [ "Error return from C function" throw ] when
] with-destructors
The
&free word schedules an allocated block of memory for deallocation when the with-destructors word returns, whether or not an error was thrown.
The more general word is &dispose, which can take any object implementing the generic resource disposal protocol. And this is where destructors and disposal meet. The with-disposal combinator is now just shorthand for with-destructors:
[ X ] with-disposal == [ &dispose X ] with-destructors
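The same idiom exists elsewhere; for instance, Python's contextlib.ExitStack plays the role of with-destructors, with stack.callback as a rough analogue of &free: cleanups registered during the block run when the block exits, whether or not an error was thrown. This is an analogy, not a translation, and the malloc_object/free stand-ins below are fake:

```python
from contextlib import ExitStack

freed = []

def malloc_object(name):
    return name            # stand-in for a malloc'd block

def free(obj):
    freed.append(obj)      # stand-in for deallocation

try:
    with ExitStack() as stack:
        # each callback is the analogue of &free: scheduled now, run on exit
        a = malloc_object("a"); stack.callback(free, a)
        b = malloc_object("b"); stack.callback(free, b)
        c = malloc_object("c"); stack.callback(free, c)
        raise RuntimeError("Error return from C function")
except RuntimeError:
    pass

# all three blocks were released despite the error, in LIFO order
```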
Recent language changes, such as destructors, the new slot accessors, inheritance, and the thread refactoring, have really cleaned up the Windows I/O code in particular. It used to be absolutely hideous with too much stack shuffling and repetition, but after my next set of patches are pushed out, it is pretty close to exemplary Win32 API usage -- and much cleaner than anything you'd write in C, which is a pretty good accomplishment I think, given that Win32 is designed around C. I will blog about the Windows I/O code as soon as I finish up with the next set of changes.
What motivated moving destructors into the core is working on OpenSSL bindings. The OpenSSL API is pretty complex, especially so when doing non-blocking I/O, and I've had to overhaul the Unix I/O code to be able to plug it in. While overhauling the code I noticed a few places where my error handling logic was convoluted, and even a resource leak, so I used destructors to clean them up.
I will blog about the new
io.sockets.secure vocabulary when the OpenSSL binding is done. For now, I can show off OpenSSL checksum support. A few weeks ago I removed some duplication between our CRC32, MD5, SHA1 and SHA2 implementations by implementing a generic checksum protocol. Since then Doug also added an Adler32 implementation. The idea behind the protocol is that you have three words, checksum-bytes, checksum-stream, and checksum-file, each one taking some kind of object and a singleton representing the checksum algorithm. For example:
"factor.image" md5 checksum-file .
=> B{ 177 55 229 87 205 67 139 188 136 33 111 38 3 0 32 23 }
However, if you try the Factor MD5 implementation you will notice it is rather slow. While it will become faster over time, OpenSSL already has a fast implementation. Thanks to the checksum protocol and some FFI fun, you can use it:
"factor.image" openssl-md5 checksum-file .
=> B{ 177 55 229 87 205 67 139 188 136 33 111 38 3 0 32 23 }
There is also
openssl-sha1. In fact, any checksum supported by OpenSSL's EVP_* functions can be used, as follows:
"factor.image" "SHA256" <openssl-checksum> checksum-file
The code is in
checksums.openssl.
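The shape of the protocol is easy to model in Python, with hashlib standing in for both the native and OpenSSL-backed implementations; the three functions below mirror the three protocol words, and an algorithm name plays the role of the singleton (illustrative names, not Factor's):

```python
import hashlib

# The three protocol operations, each taking a data source plus an
# algorithm selector that picks the implementation.
def checksum_bytes(data, algorithm):
    return hashlib.new(algorithm, data).digest()

def checksum_stream(stream, algorithm):
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(8192), b""):
        h.update(chunk)          # feed the stream incrementally
    return h.digest()

def checksum_file(path, algorithm):
    with open(path, "rb") as f:
        return checksum_stream(f, algorithm)
```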
Saturday, May 10, 2008
Garbage collection throughput improvements
The problem
About a month ago, I made some changes to the garbage collector. These changes included drastically decreasing the default nursery size from 16mb to 1mb. This turned out to have a negative effect on performance, with GC consuming as much as 60% of run time on some tests.
Instead of making the default nursery larger, which would just mask the problem until we hit workloads with 16 times as much allocation (or when the compiler began to generate code that was 16 times faster, or any combination of the above), I decided to investigate the root cause of the problem.
Turns out that every minor GC was taking on the order of a few milliseconds at the minimum, which is too much time. Most of the time was spent scanning the cards array for marked cards. Even if there weren't any marked cards, with a 64mb heap and 64 byte card size, that's a million array entries to check.
Faster card scanning
The first thing I did was change how the card scan loop operates. Instead of iterating over the array a byte at a time, and checking if each byte satisfies the card mark mask, we iterate over the array four bytes at a time, checking each group of four bytes against the mask. GCC should really be doing this optimization for me, because it's a simple case of vectorization, but it doesn't, so I just wrote the code out by hand:
u32 *quad_ptr;
u32 quad_mask = mask | (mask << 8) | (mask << 16) | (mask << 24);
for(quad_ptr = (u32 *)first_card; quad_ptr < (u32 *)last_card; quad_ptr++)
{
if(*quad_ptr & quad_mask)
{
F_CARD *ptr = (F_CARD *)quad_ptr;
int card;
for(card = 0; card < 4; card++)
{
if(ptr[card] & mask)
{
collect_card(&ptr[card],gen,here);
ptr[card] &= ~unmask;
}
}
}
}
This improved card scan performance by a factor of 3, which makes sense: we do four times fewer loop iterations, but at some point you still have to hit memory, which dampens the improvement a bit.
This optimization improved performance, but not sufficiently so. I realized that card scanning hurt the most in benchmarks which didn't write into objects in the old generation at all; for example, a benchmark which operates on bignums, where the intermediate values are extremely short-lived, fills up the nursery very rapidly, and every minor GC checks all cards, which not only fills up the cache with junk, but doesn't achieve anything useful as no cards will be marked.
On the other hand, a pointer recording approach where all stores are written into a buffer is also unacceptable, because in code with a high rate of mutation, you pay a hefty price every time the buffer overflows. So I had to find a compromise.
Introducing card decks
My solution is something I haven't seen done anywhere else before, but it is pretty trivial if you think about it. It amounts to two-level card marking. We have an array of cards, where each card maps to a small amount of memory: 64 bytes in my case, but it could be anything; and an array of card decks. Each card deck corresponds to 1024 cards, and if a card is marked, then its deck must be marked too. This way, on a minor GC, we scan the deck array first, and then only check cards corresponding to marked decks.
This complicates the write barrier, because now on every store we have to mark a card and a deck, not just a card as before. I decided to counterweight this by simplifying the write barrier in another way. Formerly, it would read the card, "or" it with a mask, and store it back. This was done because the card contained an additional piece of information, and that is the offset within the card where the first object begins. Cards have two mark bits, denoting a possible pointer from this generation to either aging space or the nursery, and the other six bits were reserved for the offset of the first object. By moving object offsets (I call them "allot markers") into a separate array, I removed a read operation from the write barrier; now it simply has to write the mark mask to the card, without caring about its existing contents. This also freed up two bits in the allot marker, allowing me to switch to 256 byte cards (and 256 kilobyte card decks).
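The two-level scheme can be modeled in a few lines (a Python sketch of the idea, with an arbitrary deck size; the real implementation works on raw byte arrays in C):

```python
CARDS_PER_DECK = 1024

class CardTable:
    def __init__(self, n_cards):
        self.cards = bytearray(n_cards)
        self.decks = bytearray((n_cards + CARDS_PER_DECK - 1) // CARDS_PER_DECK)

    def mark(self, card):
        # write barrier: mark the card and its deck unconditionally,
        # no read-modify-write needed
        self.cards[card] = 1
        self.decks[card // CARDS_PER_DECK] = 1

    def scan(self):
        # minor GC: skip whole decks that contain no marked cards
        marked = []
        for d, deck in enumerate(self.decks):
            if deck:
                start = d * CARDS_PER_DECK
                stop = min(start + CARDS_PER_DECK, len(self.cards))
                for c in range(start, stop):
                    if self.cards[c]:
                        marked.append(c)
        return marked
```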
These improvements really helped on benchmarks which allocated large numbers of extremely short-lived objects; however, on benchmarks which allocated longer-lived objects, the improvements were still major but not as drastic. It turns out that if objects survive long enough to be copied from the nursery into aging space, then this copying can begin to dominate running time.
Low-level optimizations
The inner loop of the GC looked like so:
void copy_handle(CELL *handle)
{
CELL pointer = *handle;
if(!immediate_p(pointer) && should_copy(pointer))
*handle = copy_object(pointer);
}
CELL collect_next(CELL scan)
{
do_slots(scan,copy_handle);
if(collecting_gen == TENURED)
do_code_slots(scan);
return scan + untagged_object_size(scan);
}
This is very elegant code;
do_slots is a higher-order function which applies a function pointer to each slot of the object. However, GCC doesn't try to optimize higher-order code at all; after all, it is a C compiler, not an ML compiler, and it isn't Sufficiently Smart either! So I rewrote the above function by manually inlining do_slots() and replacing the function pointer invocation with the body of copy_handle(). The next thing I realized was that should_copy() contains a large series of conditional tests which did not depend on its input parameters, only on global variables whose values were invariant for the duration of the collection:
/* test if the pointer is in the generation being collected, or a younger one */
INLINE bool should_copy(CELL untagged)
{
if(in_zone(newspace,untagged))
return false;
if(collecting_gen == TENURED)
return true;
else if(HAVE_AGING_P && collecting_gen == AGING)
return !in_zone(&data_heap->generations[TENURED],untagged);
else if(HAVE_NURSERY_P && collecting_gen == NURSERY)
return in_zone(&nursery,untagged);
else
{
critical_error("Bug in should_copy",untagged);
return false;
}
}
So again, I did a bit of hand-optimization. If
foo() does not depend on the loop variable or on the side effects of the body, then the following two forms are equivalent:
while(A)
{
if(foo()) B else C;
}
if(foo())
{
while(A) B;
}
else
{
while(A) C;
}
In the optimizing compiler literature, this is known as loop unswitching.
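As a tiny self-contained illustration of the transformation (this is not Factor's GC code; the function and names are made up for the example):

```c
#include <assert.h>

/* Before unswitching: the loop-invariant condition "flag" is
   tested on every iteration. */
int sum_switched(const int *xs, int n, int flag)
{
    int sum = 0;
    for(int i = 0; i < n; i++)
    {
        if(flag) sum += xs[i];   /* B */
        else     sum -= xs[i];   /* C */
    }
    return sum;
}

/* After unswitching: the test is hoisted out of the loop, leaving
   two straight-line loops with no branch in the body. */
int sum_unswitched(const int *xs, int n, int flag)
{
    int sum = 0;
    if(flag)
        for(int i = 0; i < n; i++) sum += xs[i];
    else
        for(int i = 0; i < n; i++) sum -= xs[i];
    return sum;
}
```

Both versions compute the same result; the unswitched one trades code size for a branch-free inner loop, exactly the trade-off made in collect_next() below.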
The new collect_next() is much longer, but also much faster. It looks like this:
CELL collect_next(CELL scan)
{
CELL *obj = (CELL *)scan;
CELL *end = (CELL *)(scan + binary_payload_start(scan));
obj++;
CELL newspace_start = newspace->start;
CELL newspace_end = newspace->end;
if(HAVE_NURSERY_P && collecting_gen == NURSERY)
{
CELL nursery_start = nursery.start;
CELL nursery_end = nursery.end;
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& (pointer >= nursery_start && pointer < nursery_end))
*obj = copy_object(pointer);
}
}
else if(HAVE_AGING_P && collecting_gen == AGING)
{
F_ZONE *tenured = &data_heap->generations[TENURED];
CELL tenured_start = tenured->start;
CELL tenured_end = tenured->end;
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end)
&& !(pointer >= tenured_start && pointer < tenured_end))
*obj = copy_object(pointer);
}
}
else if(collecting_gen == TENURED)
{
for(; obj < end; obj++)
{
CELL pointer = *obj;
if(!immediate_p(pointer)
&& !(pointer >= newspace_start && pointer < newspace_end))
*obj = copy_object(pointer);
}
do_code_slots(scan);
}
else
critical_error("Bug in collect_next",0);
return scan + untagged_object_size(scan);
}
That's a hell of a code explosion, but it made a real difference to the performance of the garbage collector.
Benchmarks
The first benchmark just allocates 3-element arrays in a loop:
: garbage 1 2 3 3array ;
: garbage-loop 150000000 [ garbage drop ] times ;
[ garbage-loop ] time
Here are the results with different nursery sizes; all times are in milliseconds and I ran each benchmark multiple times to ensure I was getting reliable results:
| | Before | After |
|---|---|---|
| 1mb nursery: | 8908 | 4461 |
| 2mb nursery: | 6455 | 4451 |
| 4mb nursery: | 5336 | 4582 |
| 8mb nursery: | 4779 | 4582 |
The second benchmark allocates larger arrays in a loop:
: garbage 100 f <array> ;
: garbage-loop 15000000 [ garbage drop ] times ;
[ garbage-loop ] time
The results are similar:
| | Before | After |
|---|---|---|
| 1mb nursery: | 10418 | 2158 |
| 2mb nursery: | 6364 | 2178 |
| 4mb nursery: | 4966 | 3223 |
| 8mb nursery: | 4249 | 3294 |
An interesting phenomenon appears after the GC changes: a larger nursery actually begins to hurt performance. Dan and I hypothesized that the larger nursery hurts locality, and that with the older GC, poor algorithms masked this effect.
The next benchmark allocates a lot of temporary arrays but stores them into larger arrays which are somewhat longer lived:
: garbage 100 f <array> ;
: garbage-loop 150 [ 10000 [ drop garbage ] map drop ] times ;
[ garbage-loop ] time
The size of the aging generation makes a difference here so I tried the benchmark with more parameters:
| | Before | After |
|---|---|---|
| 1mb nursery/2mb aging: | 8693 | 5525 |
| 2mb nursery/2mb aging: | 7318 | 5255 |
| 4mb nursery/2mb aging: | 3939 | 2628 |
| 8mb nursery/2mb aging: | 2314 | 1526 |
| 1mb nursery/8mb aging: | 3455 | 1493 |
| 4mb nursery/4mb aging: | 3024 | 1894 |
| 4mb nursery/4mb aging: | 3157 | 1577 |
| 8mb nursery/8mb aging: | 1533 | 1003 |
The above benchmarks are somewhat unrealistic; however, other benchmarks showed speedups too, some more drastic than others:
| Benchmark | Before | After |
|---|---|---|
| SHA1 | 12198 | 7282 |
| Random | 18989 | 14459 |
| Sort | 15685 | 13476 |
| Raytracer | 27700 | 12966 |
| Mandelbrot | 4633 | 1847 |
Before I shrank the nursery, the runtimes of these benchmarks looked much like they do now; it was only a month ago, when the nursery got smaller, that GC time began to dominate benchmarks like these. However, I like having a small nursery, and making it smaller forced me to optimize the garbage collector for higher throughput under constrained-memory conditions.
Looking ahead
The next big change I will be making to the garbage collector is switching to using mark/sweep for the old generation. This will reduce memory consumption by almost 50%.
The other side of this coin is compiler optimizations. If the compiler performed more advanced static analysis, it could eliminate much of the dynamic memory allocation in the above benchmarks, cutting GC time further. Of course, if we have a great compiler and a great GC, we'll just have great performance all around.
Monday, May 05, 2008
I/O changes, and process pipeline support
I made some improvements to the I/O system today.
Default stream variables
The stdio variable has been replaced by input-stream and output-stream, and there are four new words: with-input-stream, with-output-stream, with-input-stream* and with-output-stream*. The first two close the stream afterwards; the latter two do not. The with-stream and with-stream* words are still around; they expect a duplex stream, unpack it, and bind both variables.
I've changed many usages of with-stream to use one of the unidirectional variants instead. This means that you can now write code like the following:
"foo.txt" utf8 [
"blah.txt" utf8 [
... read from the first file, write to the second file,
using read and write ...
] with-file-writer
] with-file-reader
Before you had to use this really ugly "design pattern":
"foo.txt" utf8 <file-reader> [
"blah.txt" utf8 <file-writer> [
<duplex-stream> [
...
] with-stream
] with-disposal
] with-disposal
Speaking of duplex streams, because they're not used by anything in the core anymore, I have moved them to extra. They are still used by <process-stream> and <client>. I added <process-reader> and <process-writer> words for those cases where you only want a pipe in one direction; they express intent better.
Pipes
The <process-stream> word has been around for a while, and it used pipes internally, but they were not exposed in a nice, cross-platform way, until now. The io.pipes vocabulary contains two words:
- <pipe> creates a new pipe and wraps a pair of streams around it. The streams are packaged into a single duplex stream; any data written to the stream can be read back from the same stream (presumably, in a different thread). This is implemented with native pipes on both Unix and Windows.
- run-pipeline takes a sequence of quotations or launch descriptors, and runs them all in parallel with input wired up as if it were a Unix shell pipe. For example,
{ "cat foo.txt" "grep x" "sort" "uniq" } run-pipeline
corresponds to the following shell command:
cat foo.txt | grep x | sort | uniq
In addition, being able to place process objects and quotations in the pipeline gives you a lot of expressive power.
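Under the hood, wiring up such a pipeline on Unix comes down to pipe(), fork() and dup2(). Here is a minimal C sketch of the idea (this is not Factor's implementation; run_pipeline2 is a hypothetical helper that connects exactly two commands the way a shell does):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "left | right": each child gets one end of the pipe wired to
   its stdin/stdout, then execs its command. */
static void run_pipeline2(char *const left[], char *const right[])
{
    int fds[2];
    if(pipe(fds) < 0) { perror("pipe"); exit(1); }

    if(fork() == 0)              /* left-hand process: stdout -> pipe */
    {
        dup2(fds[1],STDOUT_FILENO);
        close(fds[0]); close(fds[1]);
        execvp(left[0],left);
        _exit(127);              /* exec failed */
    }
    if(fork() == 0)              /* right-hand process: stdin <- pipe */
    {
        dup2(fds[0],STDIN_FILENO);
        close(fds[0]); close(fds[1]);
        execvp(right[0],right);
        _exit(127);
    }
    close(fds[0]); close(fds[1]);  /* parent keeps no pipe ends */
    wait(NULL); wait(NULL);        /* reap both children */
}
```

run-pipeline generalizes this pattern to a whole sequence of processes and quotations, creating one pipe per adjacent pair.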
Appending process output to files
The io.launcher vocabulary supports the full array of input and output redirection features, and now, pipelines. There was one missing component: redirecting process output to a file opened for appending. Now this is possible. The following Factor code:
<process>
"do-stuff">>command
"log.txt" <appender>>>stderr
run-process
Corresponds to this shell command:
do-stuff 2>> log.txt
Of course, shell script is a DSL for launching processes, so it is more concise than Factor. However, Factor's io.launcher library now supports all of the features that the shell does, and it's pretty easy to build a shell command parser using Chris Double's PEG library, which translates shell commands to sequences of Factor process descriptors in a pipeline.
And now, I present a concise illustration of the difference between the Unix philosophy and the Windows philosophy. Here we have two pieces of code, which do the exact same thing: create a new pipe, open both ends, and return a pair of handles.
Unix:
USING: system alien.c-types kernel unix math sequences
qualified io.unix.backend io.nonblocking ;
IN: io.unix.pipes
QUALIFIED: io.pipes
M: unix io.pipes:(pipe) ( -- pair )
2 "int" <c-array>
dup pipe io-error
2 c-int-array> first2
[ [ init-handle ] bi@ ] [ io.pipes:pipe boa ] 2bi ;
Windows:
USING: alien alien.c-types arrays destructors io io.windows libc
windows.types math.bitfields windows.kernel32 windows namespaces
kernel sequences windows.errors assocs math.parser system random
combinators accessors io.pipes io.nonblocking ;
IN: io.windows.nt.pipes
: create-named-pipe ( name -- handle )
{ PIPE_ACCESS_INBOUND FILE_FLAG_OVERLAPPED } flags
PIPE_TYPE_BYTE
1
4096
4096
0
security-attributes-inherit
CreateNamedPipe
dup win32-error=0/f
dup add-completion
f <win32-file> ;
: open-other-end ( name -- handle )
GENERIC_WRITE
{ FILE_SHARE_READ FILE_SHARE_WRITE } flags
security-attributes-inherit
OPEN_EXISTING
FILE_FLAG_OVERLAPPED
f
CreateFile
dup win32-error=0/f
dup add-completion
f <win32-file> ;
: unique-pipe-name ( -- string )
[
"\\\\.\\pipe\\factor-" %
pipe counter #
"-" %
32 random-bits #
"-" %
millis #
] "" make ;
M: winnt (pipe) ( -- pipe )
[
unique-pipe-name
[ create-named-pipe dup close-later ]
[ open-other-end dup close-later ]
bi pipe boa
] with-destructors ;
The Windows API makes things difficult for no reason. Anonymous pipes do not support overlapped I/O, so you have to open a named pipe with a randomly-generated name (I'm not making this up, many other frameworks do the same thing and Microsoft even recommends this approach).
On the flipside, the nice thing about supporting both Unix and Windows is that I get to come up with true high-level abstractions that make sense, instead of being able to get away with having thin wrappers over Unix system calls as so many other language implementations do. For example, Ocaml's idea of "high-level I/O" is some basic POSIX bindings, together with incomplete emulation of Unix semantics on Windows. Look forward to writing a page of code if you want to map a file into memory or run a process and read its output into a string. And of course Java doesn't support pipes, I/O redirection for launching processes, or file system change monitoring, at all.
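To underline the Unix half of that contrast: the anonymous-pipe system call really is a one-liner. A minimal C illustration (make_pipe is just a wrapper name for this sketch):

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* The entire Unix anonymous-pipe API: one call fills in two file
   descriptors, fds[0] for reading and fds[1] for writing. */
int make_pipe(int fds[2])
{
    return pipe(fds);
}
```

Everything the Windows version does with named pipes, random names, overlapped flags and security attributes is done here by a single system call.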