Okay, first let's establish a baseline. I'm on a 32-bit system now, so the allocation figures and also the timings are somewhat different from those on a 64-bit system, but the trends are mostly the same. To see what we're dealing with, let's not compile for profiling, but just plain with optimisations (-O2).

I have created a repository with all versions of the code I ran. I took the liberty of removing all superfluous seqs, and slightly changed applyLoop, to

applyLoop i v
  | i == n    = v
  | otherwise = applyLoop (i + 1) $ f i v

which changes the result if apply is initially called with a zero argument (the new version immediately returns the original argument, while the old one looped until i reached -1 after wrap-around of the counter), but for positive arguments the semantics are unchanged. For very small positive n, the one extra comparison the new version does might make a measurable difference if f is inlined and trivial; for nontrivial n or f the extra comparison is negligible, and the new version avoids the code duplication the old one incurred.
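For context, this is roughly how applyLoop sits inside apply (a reconstruction - only the applyLoop body above is quoted from the actual code, which is in the question and in the repository):

-- Hypothetical reconstruction of the surrounding driver: n iterations
-- of the step function f, threaded through an accumulator.
apply :: Int -> (Int -> a -> a) -> a -> a
apply n f v0 = applyLoop 0 v0
  where
    applyLoop i v
      | i == n    = v
      | otherwise = applyLoop (i + 1) $ f i v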

That gives

$ ./test3 +RTS -s
array (0,1) [(0,0),(1,1)]
 560,049,112 bytes allocated in the heap
 32,160 bytes copied during GC
 42,632 bytes maximum residency (2 sample(s))
 22,904 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 1083 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s
 Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 1.35s ( 1.35s elapsed)
 GC time 0.02s ( 0.02s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 1.37s ( 1.37s elapsed)
 %GC time 1.3% (1.3% elapsed)
 Alloc rate 415,944,041 bytes per MUT second
 Productivity 98.6% of total user, 98.4% of total elapsed

in contrast to

$ ./test4 +RTS -s
(0,1)
 47,492 bytes allocated in the heap
 1,756 bytes copied during GC
 42,632 bytes maximum residency (1 sample(s))
 18,808 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
 Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.04s ( 0.04s elapsed)
 GC time 0.00s ( 0.00s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.04s ( 0.04s elapsed)
 %GC time 0.5% (0.6% elapsed)
 Alloc rate 1,182,455 bytes per MUT second
 Productivity 98.9% of total user, 103.1% of total elapsed

from test4. The difference is enormous. Well, let's look at the core to find out why, and how we can reduce it.
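(The core listings below are GHC's simplifier output; something along the lines of

$ ghc -O2 -fforce-recomp -ddump-simpl test4.hs > test4.core

will reproduce them - the exact flags are an assumption, not the recorded invocation.)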

First, the core for using test4:

Rec {
Main.main_$s$wapplyLoop [Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.Prim.Word#
     -> GHC.Prim.Word#
     -> (# GHC.Word.Word32, GHC.Word.Word32 #)
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL]
Main.main_$s$wapplyLoop =
  \ (sc :: GHC.Prim.Int#)
    (sc1 :: GHC.Prim.Word#)
    (sc2 :: GHC.Prim.Word#) ->
    case sc of wild {
      __DEFAULT ->
        Main.main_$s$wapplyLoop
          (GHC.Prim.+# wild 1)
          (GHC.Prim.plusWord# sc2 (__word 1))
          (GHC.Prim.minusWord# sc1 (__word 1));
      10000000 ->
        (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #)
    }
end Rec }

You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, and it does no allocation. But note that seqing x and y in test4 is necessary here to aid the strictness analyser. Without it, you get a loop using boxed Word32s that builds up huge thunks. The seqs that were originally in apply and in the function (\_ x -> x `seq` testN x) passed to it make no difference whatsoever.
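For reference, a sketch of the shape test4 presumably has (a reconstruction from the core above - the plusWord#/minusWord# on the two components - not the question's verbatim code):

import Data.Word (Word32)

-- The seqs force both components before the result pair is built,
-- which is what lets the strictness analyser unbox the whole loop.
test4 :: (Word32, Word32) -> (Word32, Word32)
test4 (x, y) = x `seq` y `seq` (y + 1, x - 1)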

So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?

The applyLoop worker in this case begins with

case GHC.ST.runSTRep
       @ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32)
       (\ (@ s) (s :: GHC.Prim.State# s) ->
          let {
            n# [Dmd=Just L] :: GHC.Prim.Int#
            [LclId, Str=DmdType]
            n# = GHC.Prim.sizeofByteArray# ww4 } in
          case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) ->
          case {__pkg_ccall array-0.4.0.1 memcpy forall s.
                  GHC.Prim.MutableByteArray# s
                  -> GHC.Prim.ByteArray#
                  -> GHC.Prim.Word#
                  -> GHC.Prim.State# GHC.Prim.RealWorld
                  -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}

allocating a new ByteArray# - which is pretty fast, GHC's runtime is very good at such things, but nevertheless, 10 million allocations add up to non-negligible time - and copying the contents of the old array over. The __pkg_ccall is a fairly fast C call, but it is considerably slower than a Haskell call, and calling over to C to copy 8 bytes is wasteful (the call takes much longer than the copying). I think - though I can't verify at the moment - that on a 64-bit system this is done without calling out to C, which would be much faster. Replacing the thaw with a manual copying loop over an unsafeNewArray_ of the appropriate size (manual_copy.hs) brings the time down to 0.65 seconds - but increases the allocation figures:

$ ./manual_copy +RTS -s
array (0,1) [(0,0),(1,1)]
 760,049,112 bytes allocated in the heap
 41,424 bytes copied during GC
 42,632 bytes maximum residency (2 sample(s))
 22,904 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 1469 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s
 Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.62s ( 0.63s elapsed)
 GC time 0.02s ( 0.02s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.64s ( 0.65s elapsed)
 %GC time 3.3% (3.2% elapsed)
 Alloc rate 1,226,997,169 bytes per MUT second
 Productivity 96.6% of total user, 95.3% of total elapsed

Note that this would be different for a large array; there, calling out to the C standard library's memcpy would pay off.
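The manual loop might look roughly like this (a hypothetical sketch - the real version is manual_copy.hs in the repository; copyThaw is my name for it):

import Data.Word (Word32)
import Data.Ix (rangeSize)
import Data.Array.Unboxed (UArray, bounds)
import Data.Array.ST (STUArray)
import Data.Array.Base (unsafeNewArray_, unsafeAt, unsafeWrite)
import Control.Monad.ST (ST)

-- Stand-in for thaw: allocate an uninitialised mutable array of the
-- same size and copy the elements over in Haskell, instead of going
-- through the memcpy ccall.
copyThaw :: UArray Int Word32 -> ST s (STUArray s Int Word32)
copyThaw a' = do
    a <- unsafeNewArray_ (bounds a')
    let n = rangeSize (bounds a')
        go i | i >= n    = return a
             | otherwise = unsafeWrite a i (unsafeAt a' i) >> go (i + 1)
    go 0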

Using unsafeThaw here (unsafeThaw.hs) avoids allocating new ByteArray#s (the underlying raw arrays that carry the payload of UArrays) each iteration, along with the involved copying, so it unsurprisingly speeds up the program and reduces allocation. A minimal sketch of what that variant might look like (my reconstruction; the actual code is in the repository):
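import Data.Word (Word32)
import Data.Array.Unboxed (UArray)
import Data.Array.ST (STUArray)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.Unsafe (unsafeThaw, unsafeFreeze)
import Control.Monad.ST (ST, runST)

-- The thaw reuses the existing ByteArray#, so the only per-iteration
-- allocation left is the fresh UArray wrapper that unsafeFreeze
-- builds - see the core excerpt further down.
test3 :: UArray Int Word32 -> UArray Int Word32
test3 arr = runST (change arr) where
  change a' = do
    a <- unsafeThaw a' :: ST s (STUArray s Int Word32)
    x0 <- unsafeRead a 0
    x1 <- unsafeRead a 1
    unsafeWrite a 0 (x1 + 1)
    unsafeWrite a 1 (x0 - 1)
    unsafeFreeze a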

$ ./unsafeThaw +RTS -s
array (0,1) [(0,0),(1,1)]
 400,049,112 bytes allocated in the heap
 27,596 bytes copied during GC
 42,632 bytes maximum residency (2 sample(s))
 22,904 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 764 colls, 0 par 0.01s 0.01s 0.0000s 0.0001s
 Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.28s ( 0.28s elapsed)
 GC time 0.01s ( 0.01s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.29s ( 0.29s elapsed)
 %GC time 4.4% (4.3% elapsed)
 Alloc rate 1,448,146,787 bytes per MUT second
 Productivity 95.5% of total user, 95.6% of total elapsed

So the speedup is about 4.7× compared to the original (1.37s down to 0.29s) - not bad, but still a far cry from the test4 time - and the allocation is reduced by 160 million bytes (10 million times 8 bytes for the two Word32s plus 8 bytes of bookkeeping [the size] for the ByteArray#), but it is still huge.

That is because

of _ { (# ipv4, ipv5 #) ->
(# ipv4,
   Data.Array.Base.UArray
     @ GHC.Types.Int @ GHC.Word.Word32 ww1 ww2 ww3 ipv5 #)
}
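For reference, the constructor being rebuilt here is defined in Data.Array.Base of the array version in use (0.4.0.1, as the __pkg_ccall above shows) as

data UArray i e = UArray !i !i !Int ByteArray#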

in each iteration it creates a new UArray Int Word32: it allocates a new UArray constructor, and pointers to the components - lower bound, upper bound, number of elements - and the ByteArray#. We can avoid that by returning the argument array - after we have sneakily changed the contents of its ByteArray# (unsafeThaw2.hs):

test3 :: UArray Int Word32 -> UArray Int Word32
test3 arr = runST (change arr) where
  change a' = do
    -- reuse the existing ByteArray# in place of a copy
    a <- Data.Array.Unsafe.unsafeThaw a' :: ST s (STUArray s Int Word32)
    x0 <- unsafeRead a 0
    x1 <- unsafeRead a 1
    unsafeWrite a 0 (x1 + 1)
    unsafeWrite a 1 (x0 - 1)
    -- hand back the original immutable wrapper: no new UArray is built
    return a'

which results in

$ ./unsafeThaw2 +RTS -s
array (0,1) [(0,0),(1,1)]
 80,049,100 bytes allocated in the heap
 6,828 bytes copied during GC
 42,632 bytes maximum residency (2 sample(s))
 22,904 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 151 colls, 0 par 0.00s 0.00s 0.0000s 0.0001s
 Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.17s ( 0.17s elapsed)
 GC time 0.00s ( 0.00s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.17s ( 0.17s elapsed)
 %GC time 1.7% (1.7% elapsed)
 Alloc rate 480,150,792 bytes per MUT second
 Productivity 98.1% of total user, 98.0% of total elapsed

another nearly twofold speedup, and the allocation figures are reduced by 80%.

It takes so much longer and allocates so much more than the tight loop mostly because each iteration costs a GHC.ST.runSTRep, which is NOINLINE and inhibits a worker-wrapper transform of test3 that would allow cheaper access to the ByteArray#.

You can get much better performance if you change the type of apply to be monadic and use a monadic version of test3, e.g. (cleanST.hs):

import Data.Word
import Data.Array.Unboxed
import Data.Array.ST
import Data.Array.Base
import Control.Monad.ST

test0 :: STUArray s Int Word32 -> ST s (STUArray s Int Word32)
test0 arr = do
    x0 <- unsafeRead arr 0
    x1 <- unsafeRead arr 1
    unsafeWrite arr 0 (x1 + 1)
    unsafeWrite arr 1 (x0 - 1)
    return arr

applyM :: Monad m => Int -> (Int -> a -> m a) -> a -> m a
applyM n f v0 = do
    let loop i v
          | i < n = do
              v' <- f i v
              loop (i + 1) v'
          | otherwise = return v
    loop 0 v0

main :: IO ()
main = let arr = listArray (0,1) [0,1] :: UArray Int Word32
       in print $ runSTUArray $ do
            marr <- thaw arr
            applyM 10000000 (\_ a -> test0 a) marr

that comes close to the tight loop:

$ ./cleanST +RTS -s
array (0,1) [(0,0),(1,1)]
 49,148 bytes allocated in the heap
 1,756 bytes copied during GC
 42,632 bytes maximum residency (1 sample(s))
 18,808 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
 Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.05s ( 0.06s elapsed)
 GC time 0.00s ( 0.00s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.06s ( 0.06s elapsed)
 %GC time 0.4% (0.4% elapsed)
 Alloc rate 905,470 bytes per MUT second
 Productivity 99.2% of total user, 97.3% of total elapsed

but is of course still slower, because the array reads and writes go to the L1 cache, while test4 operated completely in registers.

If you don't want to change the type of apply, I know no clean way of coming close to the performance of the tight loop.

If you are prepared to get your hands really dirty, you can use low-level primitives (low_level.hs),

{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}
import Data.Word
import Data.Array.Unboxed
import Data.Array.Base
import GHC.Base

test3 :: UArray Int Word32 -> UArray Int Word32
test3 arr = change arr where
  -- the swap-and-bump kernel, written directly against the primops
  swp mbarr s =
    case readWord32Array# mbarr 0# s of
      (# s1, w0 #) ->
        case readWord32Array# mbarr 1# s1 of
          (# s2, w1 #) ->
            case writeWord32Array# mbarr 0# (w1 `plusWord#` 1##) s2 of
              s3 -> writeWord32Array# mbarr 1# (w0 `minusWord#` 1##) s3
  -- coerce the frozen ByteArray# to a mutable one and update the
  -- "immutable" array in place
  change arr@(UArray _ _ _ ba) =
    case swp (unsafeCoerce# ba) realWorld# of
      !s -> arr

which gives more or less the same figures as cleanST,

$ ./low_level +RTS -s
array (0,1) [(0,0),(1,1)]
 49,112 bytes allocated in the heap
 1,756 bytes copied during GC
 42,632 bytes maximum residency (1 sample(s))
 18,808 bytes maximum slop
 1 MB total memory in use (0 MB lost due to fragmentation)
 Tot time (elapsed) Avg pause Max pause
 Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s
 Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s
 INIT time 0.00s ( 0.00s elapsed)
 MUT time 0.05s ( 0.06s elapsed)
 GC time 0.00s ( 0.00s elapsed)
 EXIT time 0.00s ( 0.00s elapsed)
 Total time 0.06s ( 0.06s elapsed)
 %GC time 0.4% (0.4% elapsed)
 Alloc rate 928,001 bytes per MUT second
 Productivity 99.0% of total user, 95.1% of total elapsed

and more-or-less equivalent core (cleanST's is cleaner), but is really as dirty as it looks. I cannot recommend going down that route except in extreme circumstances.
