Map performance - use pattern match on size to reduce recursive function calls #1069
Thanks for testing this out. The improvements seem decent, but I've always found this undocumented special case strange and I'm not sure we want to proliferate this pattern. If we really want to avoid visiting `Tip`s, why not do it properly? For `foldMap` this can look like:
```haskell
foldMap :: Monoid m => (a -> m) -> Set a -> m
foldMap _ Tip = mempty
foldMap f (Bin _ x l r) = go x l r
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip)            -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr)            -> f x <> go rx rl rr
      (Tip, Tip)                       -> f x
```
But I haven't measured the effects of such a change.
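For what it's worth, the restructured traversal is behaviour-preserving: it combines the elements in the same in-order sequence and only skips the calls on `Tip`. Here is a quick sanity check against the naive version, using a simplified stand-in for the internal tree type (not the real `Data.Set` internals):

```haskell
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- Naive version: recurses into every Tip.
foldMapSimple :: Monoid m => (a -> m) -> Set a -> m
foldMapSimple _ Tip = mempty
foldMapSimple f (Bin _ x l r) = foldMapSimple f l <> f x <> foldMapSimple f r

-- "No tip" version: matches on the children before recursing.
foldMapNoTip :: Monoid m => (a -> m) -> Set a -> m
foldMapNoTip _ Tip = mempty
foldMapNoTip f (Bin _ x0 l0 r0) = go x0 l0 r0
  where
    go x l r = case (l, r) of
      (Bin _ lx ll lr, Bin _ rx rl rr) -> go lx ll lr <> f x <> go rx rl rr
      (Bin _ lx ll lr, Tip)            -> go lx ll lr <> f x
      (Tip, Bin _ rx rl rr)            -> f x <> go rx rl rr
      (Tip, Tip)                       -> f x

-- Unbalanced BST insert, good enough to generate assorted tree shapes.
insert :: Ord a => a -> Set a -> Set a
insert x Tip = Bin 1 x Tip Tip
insert x t@(Bin s y l r)
  | x < y = Bin (s + 1) y (insert x l) r
  | x > y = Bin (s + 1) y l (insert x r)
  | otherwise = t

main :: IO ()
main = do
  let t = foldr insert Tip [5, 3, 8, 1, 4, 7, 9, 2, 6 :: Int]
  -- Both traversals yield the elements in the same (sorted) order.
  print (foldMapSimple (: []) t == foldMapNoTip (: []) t)  -- prints True
```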
The version you posted is about 10 to 15% slower than what's there already. See this commit, which adds the change to `foldMap`.
It's worth asking why it's slower: pattern matching on the strict packed integer should be very fast, as there's no pointer chasing or indirection involved, while pattern matching on the tips does require some indirection.
From the comment in the link above, eliding 2/3 of the recursive calls to `Tip` is a big deal, because there are a lot of `Tip`s: effectively about as many as there are items in the map.
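That count is easy to see: every binary tree with n `Bin` nodes has exactly n + 1 `Tip`s, so a fold that recurses into every child visits roughly twice as many nodes as there are elements. A quick check, using a simplified stand-in for the internal tree type (not the real `Data.Set` internals):

```haskell
-- Simplified stand-in for the internal tree type; the size field is
-- unused here, so it is filled with a dummy value.
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- Build an (unbalanced) tree with n elements; balance is irrelevant
-- when counting constructors.
spine :: Int -> Set Int
spine n = foldr (\x t -> Bin 0 x Tip t) Tip [1 .. n]

bins :: Set a -> Int
bins Tip = 0
bins (Bin _ _ l r) = 1 + bins l + bins r

tips :: Set a -> Int
tips Tip = 1
tips (Bin _ _ l r) = tips l + tips r

main :: IO ()
main = print [ (bins t, tips t) | n <- [0, 1, 7, 100], let t = spine n ]
-- prints [(0,1),(1,2),(7,8),(100,101)]: always one more Tip than Bin
```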
Maybe we should add a GHC style note about this optimisation.
Edit: I've written one.
Can you share more on this? This cannot be universally true, there are trade-offs here. It also disagrees with my benchmarks below.
It does not, thanks to GHC's pointer tagging. There should be little difference, if any, between checking if an unboxed int is 1 and checking the pointer tag. Let's reason about the implementations to see what we can expect.

**Analysis**

We consider three implementations:
Let's consider two classes of trees, each an extreme case for leaves.
Now we track metrics that might contribute to the cost.
For case 1 we have
For case 2 we have
We can see
We definitely cannot conclude that […]

**Benchmarks**

I benchmarked the implementations on a simple fold (equivalent to the code below):

```haskell
simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
```

Timings on my machine with GHC 9.10.1:
**Full source code**

```haskell
{- cabal:
build-depends: base, deepseq, tasty-bench
-}
{-# LANGUAGE BangPatterns, MagicHash #-}
{-# OPTIONS_GHC -O2 -fproc-alignment=64 #-}
{-# OPTIONS_GHC -ddump-simpl -ddump-cmm -ddump-to-file #-}

import Control.DeepSeq
import System.IO
import Data.Bits
import Test.Tasty.Bench
import GHC.Exts

main :: IO ()
main = do
  defaultMain
    [ bgroup ("depth" <> show d)
        [ env (pure (make1 d)) $ \t ->
            bgroup "case1"
              [ bench "simple" $ whnf simple' t
              , bench "notip" $ whnf notip' t
              , bench "bin1" $ whnf bin1' t
              ]
        , env (pure (make2 d)) $ \t ->
            bgroup "case2"
              [ bench "simple" $ whnf simple' t
              , bench "notip" $ whnf notip' t
              , bench "bin1" $ whnf bin1' t
              ]
        ]
    | d <- [6, 12, 18]
    ]

---------------

data Set a = Bin !Int !a !(Set a) !(Set a) | Tip deriving Show

instance NFData a => NFData (Set a) where
  rnf = rwhnf -- good enough for Set Int

make1 :: Int -> Set Int
make1 d = go 0 d
  where
    -- d >= 1
    go !off 1 = Bin 1 off Tip Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-1) - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

make2 :: Int -> Set Int
make2 d = go 0 d
  where
    -- d >= 2
    go !off 2 = Bin 2 (off+1) (Bin 1 off Tip Tip) Tip
    go off d = Bin sz x (go off (d-1)) (go (x+1) (d-1))
      where
        lsz = bit (d-3) * 3 - 1
        x = off + lsz + 1
        sz = 1 + lsz + lsz

---------------

wrap :: (Set a -> Int#) -> Set a -> ()
wrap f s = case f s of _ -> ()

simple' = wrap simple

simple :: Set a -> Int#
simple (Bin _ _ l r) = simple l +# (1# +# simple r)
simple Tip = 0#

notip' = wrap notip

notip :: Set a -> Int#
notip Tip = 0#
notip (Bin sz x l r) = go sz x l r
  where
    go !_sz !_x l r = case l of
      Bin lsz lx ll lr -> case r of
        Bin rsz rx rl rr -> go lsz lx ll lr +# (1# +# go rsz rx rl rr)
        Tip -> go lsz lx ll lr +# 1#
      Tip -> case r of
        Bin rsz rx rl rr -> 1# +# go rsz rx rl rr
        Tip -> 1#

bin1' = wrap bin1

bin1 :: Set a -> Int#
bin1 (Bin 1 _ _ _) = 1#
bin1 (Bin _ _ l r) = bin1 l +# (1# +# bin1 r)
bin1 Tip = 0#
```

The results don't disagree with the analysis.
I haven't yet thought about what we should do based on this information, but I hope the analysis and benchmarks are reasonable.
You need to add an UNPACK pragma.

```haskell
data Set a = Bin {-# UNPACK #-} !Int !a !(Set a) !(Set a) | Tip deriving Show
```
Edit: removed incorrect numbers.
I'm definitely in agreement with the images; I've considered the same things.
With regards to benchmarking, I wrote up different versions of foldMap for `Data.Map` proper and used the benchmark suite included.
Without an UNPACK pragma, there's an extra pointer chase.
(I think the last line in the table should be 4n-1, n, 3n-1, n).
> You need to add an UNPACK pragma.

It's not necessary, because `-funbox-small-strict-fields` is enabled by `-O`.
> (I think the last line in the table should be 4n-1, n, 3n-1, n).

True, fixed!
> With regards to benchmarking, I wrote up different versions of foldMap for `Data.Map` proper and used the benchmark suite included.

But there are no foldMap benchmarks in https://github.com/haskell/containers/blob/master/containers-tests/benchmarks/Map.hs?
Regarding your numbers, the allocations metric is surprising. There should be zero allocations, which I can confirm when I run my file above with `+RTS -T`. Do check that you're running with `-O` or `-O2`.
Oh yeah, I added one for testing; I just imported Data.Monoid and used Sum.
Hmm, my numbers above weren't run with all optimisations on.
Checking if the size is 1 takes one comparison and one conditional branch. Checking if both children are `Tip` requires two comparisons and two conditional branches. I would assume the comparisons are essentially free. Does the extra branch tie up branch prediction circuits?
Hmm. I'm getting significant differences with different versions of GHC.
I'm going to take some time putting together some more numbers (and maybe try to stop doing them on a laptop).
For posterity:
```haskell
defaultMain
  [ bench "foldMap" $ nf (foldMap Sum) m
  , bench "foldMapWithKey" $ nf (M.foldMapWithKey (const Sum)) m
  ]
```
So I performed my initial analysis using GHC 8.10.7, where the bin 1 trick was the fastest.
GHC 8.10.7:

```
All
  foldMapWithKeySimple: OK
    24.4 μs ± 2.2 μs
  foldMapWithKeyBin1: OK
    18.6 μs ± 1.7 μs
  foldMapWithKeyNoTip: OK
    21.9 μs ± 1.2 μs
```
But, in the 9 series, it looks like it has reversed (and the naïve case has improved too; it's closer to the bin 1 check case).
GHC 9.10.1:

```
All
  foldMapWithKeySimple: OK
    19.4 μs ± 848 ns
  foldMapWithKeyBin1: OK
    18.8 μs ± 1.6 μs
  foldMapWithKeyNoTip: OK
    13.1 μs ± 766 ns
```
This is using the benchmark suite with the following additions:
```haskell
foldMapWithKey :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKey f = go
  where
    go Tip = mempty
    go (Bin 1 k v _ _) = f k v
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKey #-}

foldMapWithKeySimple :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeySimple f = go
  where
    go Tip = mempty
    go (Bin _ k v l r) = go l `mappend` (f k v `mappend` go r)
{-# INLINE foldMapWithKeySimple #-}

foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip)               -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr)               -> f k x <> go rk rx rl rr
      (Tip, Tip)                             -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
```
```haskell
defaultMain
  [ bench "foldMapWithKeySimple" $ nf (M.foldMapWithKeySimple (const Sum)) m
  , bench "foldMapWithKeyBin1" $ nf (M.foldMapWithKey (const Sum)) m
  , bench "foldMapWithKeyNoTip" $ nf (M.foldMapWithKeyNoTip (const Sum)) m
  ]
```
I'll redo the filter, map, mapMaybe benchmarks on a later GHC and report back.
These are my numbers with GHC 9.10.
The "simple" cases are those in master currently, which don't use the bin 1 optimisation.
For `foldMap`, I think we should probably go for the "no tip" version. It's much faster with a modern GHC.
For `map`, `filter`, `foldl'`, and `mapMaybe`, I think using the Bin 1 optimisation is worth it; it's such a small amount of additional code (and is simple), and the improvements are pretty substantial.

What do you think?
```
foldMapWithKeySimple: OK
  19.9 μs ± 1.8 μs
foldMapWithKeyBin1: OK
  18.4 μs ± 1.4 μs
foldMapWithKeyNoTip: OK
  13.6 μs ± 742 ns
map: OK
  32.3 μs ± 2.9 μs
map simple: OK
  41.6 μs ± 2.0 μs
filter: OK
  47.4 μs ± 2.0 μs
filterSimple: OK
  58.5 μs ± 5.4 μs
foldlWithKey': OK
  11.5 μs ± 898 ns
foldlWithKey'simple: OK
  15.8 μs ± 1.5 μs
mapMaybeWithKey: OK
  99.5 μs ± 5.2 μs
mapMaybeWithKeySimple: OK
  109 μs ± 7.4 μs
```
Which is the "no tip" version?
No Tip is Soumik's alternate traversal above.
```haskell
foldMapWithKeyNoTip :: Monoid m => (k -> a -> m) -> Map k a -> m
foldMapWithKeyNoTip _ Tip = mempty
foldMapWithKeyNoTip f (Bin _ k0 x0 l0 r0) = go k0 x0 l0 r0
  where
    go k x l r = case (l, r) of
      (Bin _ lk lx ll lr, Bin _ rk rx rl rr) -> go lk lx ll lr <> f k x <> go rk rx rl rr
      (Bin _ lk lx ll lr, Tip)               -> go lk lx ll lr <> f k x
      (Tip, Bin _ rk rx rl rr)               -> f k x <> go rk rx rl rr
      (Tip, Tip)                             -> f k x
{-# INLINE foldMapWithKeyNoTip #-}
```
Prior to GHC 9.4, the Bin 1 technique is faster for `foldMap`. It seems to me that GHC 9.4's improvements to pointer tagging (tag inference in particular) make a huge improvement in this case.
GHC 9.2.4:

```
foldMapWithKeySimple: OK
  24.6 μs ± 2.0 μs
foldMapWithKeyBin1: OK
  17.7 μs ± 1.0 μs
foldMapWithKeyNoTip: OK
  19.4 μs ± 607 ns
```

GHC 9.4.7:

```
foldMapWithKeySimple: OK
  19.7 μs ± 924 ns
foldMapWithKeyBin1: OK
  18.4 μs ± 870 ns
foldMapWithKeyNoTip: OK
  13.0 μs ± 715 ns
```
It seems plausible that rethinking more traversals along these lines could work well, but they are more complex, and my attempts with `foldl'` were not fruitful (2x slower than the naïve case, let alone the bin 1 version).
So I think just taking the 25% improvements with the Bin 1 trick for `map` and `filter`, and getting the full benefit of GHC's improvements on `foldMap`, seems like a good compromise.

In terms of this PR, all functions which are touched are improvements (and it doesn't touch `foldMap` anyway).
I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending construction functions, they produce extremely balanced trees: only 1 node has a single child and 50% match the Bin 1 condition.

I'm going to test these metrics against shuffled inputs. But it might be all bullshit (including the initial use of it?).
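The 50% figure is what perfect balance predicts: a perfect tree of 2^d − 1 nodes has 2^(d−1) size-1 leaves, which is just over half of all `Bin` nodes. A quick count on a synthetic perfect tree, using a simplified stand-in for the internal type:

```haskell
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

-- Perfect tree of depth d, holding 2^d - 1 dummy elements, with
-- correct size fields.
perfect :: Int -> Set ()
perfect 0 = Tip
perfect d = Bin (2 ^ d - 1) () sub sub
  where sub = perfect (d - 1)

countBin :: Set a -> Int
countBin Tip = 0
countBin (Bin _ _ l r) = 1 + countBin l + countBin r

-- Nodes matching the Bin 1 condition, i.e. with two Tip children.
countBin1 :: Set a -> Int
countBin1 Tip = 0
countBin1 (Bin 1 _ _ _) = 1
countBin1 (Bin _ _ l r) = countBin1 l + countBin1 r

main :: IO ()
main = print (countBin1 (perfect 10), countBin (perfect 10))
-- prints (512,1023): just over 50% of Bin nodes match the Bin 1 case
```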
I'm sorry guys, this has been a real roundabout.
> I'm sorry guys, this has been a real roundabout.

More experimentation is good! Don't worry about it.
> I did some analysis of the maps used in the benchmark suite. Because they use the optimised ascending construction functions, they produce extremely balanced trees... only 1 node has a single child and 50% match the Bin 1 condition.
>
> I'm going to test these metrics against shuffled inputs.

Sounds good. I think we should update our Set/Map benchmarks to run on randomly structured trees instead of the very specific `fromList [1..n]` wherever appropriate. It would be more representative of an "average" case.
Creating the Set/Map from a shuffled list should do it, as you said. I can make a PR for it this weekend. If you want to beat me to it, go for it. I added a few random tests in #1056, some of that code will be useful.
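One dependency-free way to get a randomly structured tree is to build from a deterministically "shuffled" key sequence. This is only a sketch of the idea, not code from the benchmark suite; the odd constant is arbitrary and purely illustrative:

```haskell
import Data.Bits ((.&.))
import qualified Data.Map.Strict as M

-- x -> x * c `mod` 2^k is a permutation of [0 .. 2^k - 1] for any odd
-- constant c, so this enumerates every key once, in a scrambled order.
pseudoShuffle :: Int -> [Int]
pseudoShuffle k = [ (i * 0x9E3779B9) .&. (2 ^ k - 1) | i <- [0 .. 2 ^ k - 1] ]

main :: IO ()
main = do
  -- Insertion order is scrambled, so the tree is built by ordinary
  -- rebalancing inserts rather than the optimised ascending path.
  let m = M.fromList [ (x, x) | x <- pseudoShuffle 10 ]
  print (M.size m)  -- prints 1024: all 2^10 keys present
```

A real PR would presumably shuffle with a proper random generator; the point here is only that the insertion order, not the key set, determines the tree shape.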
Just a drive-by comment: if this general path ends up being a win overall, then it may be worth considering a redesign of the datatype to avoid/reduce the occurrence of `Tip` in the first place. (Of course, doing so would require a whole lot of benchmarking to make sure to avoid regressions elsewhere.) I did a lot of that sort of datatype redesign when developing my bytestring-trie, and it can yield significant wins so long as the algorithms can avoid a proliferation of special-case branching.
That's a good idea, I've opened #1073 to track it.
This change incorporates the idea already used for `foldMap` and `traverse`, where we pattern match on the size of "leaf nodes". If the size is 1, we already know both sides are tips, and we don't need to recurse.

Of particular interest is `mapMaybe`, where we can also omit the extra function calls to `link`.

Benchmarks show improvements around 10 to 20% for most operations affected, though `foldl'` seems closer to 30%.

Some (very) cursory analysis says that about 1/3 of Bin nodes in a tree should have no children, hence changes up to 30% do seem plausible.

Pattern matching on a strict and unboxed `Int` should be very fast.
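The trick's safety rests on the size invariant: a `Bin` whose size field is 1 can only have two `Tip` children, so the extra clause is purely a shortcut. A minimal check that the shortcut fold agrees with the plain one, using boxed `Int` instead of `Int#` to keep the sketch simple, and a stand-in type rather than the real internals:

```haskell
data Set a = Bin !Int !a !(Set a) !(Set a) | Tip

size :: Set a -> Int
size Tip = 0
size (Bin s _ _ _) = s

-- Unbalanced BST insert that maintains the size invariant.
insert :: Ord a => a -> Set a -> Set a
insert x Tip = Bin 1 x Tip Tip
insert x t@(Bin _ y l r)
  | x < y = bin y (insert x l) r
  | x > y = bin y l (insert x r)
  | otherwise = t
  where bin v a b = Bin (1 + size a + size b) v a b

simple :: Set a -> Int
simple Tip = 0
simple (Bin _ _ l r) = simple l + 1 + simple r

-- Size 1 means both children are Tip, so skip both recursive calls.
bin1 :: Set a -> Int
bin1 (Bin 1 _ _ _) = 1
bin1 (Bin _ _ l r) = bin1 l + 1 + bin1 r
bin1 Tip = 0

main :: IO ()
main = do
  let t = foldr insert Tip [10, 4, 17, 1, 6, 12, 20, 5 :: Int]
  print (simple t, bin1 t)  -- prints (8,8)
```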