Make Set/Map balance calls faster #1053

New issue

Closed

#1056

Closed

#1056

Labels

Map Set performance

@meooow25

Description

@meooow25

meooow25

opened

on Oct 12, 2024

I've been looking at #980 which improves insertion times by ~22% in the no-rebalance case. This made me wonder: why is rebalancing slow?

Let's look at Set because it is simpler. Map is similar.
From the code, we can see that balanceL and balanceR are marked as
NOINLINE.

containers/containers/src/Data/Set/Internal.hs

Lines 1885 to 1906 in 9b1d9d4

balanceR :: a -> Set a -> Set a -> Set a

balanceR x l r = case l of

Tip -> case r of

Tip -> Bin 1 x Tip Tip

(Bin _ _ Tip Tip) -> Bin 2 x Tip r

(Bin _ rx Tip rr@(Bin _ _ _ _)) -> Bin 3 rx (Bin 1 x Tip Tip) rr

(Bin _ rx (Bin _ rlx _ _) Tip) -> Bin 3 rlx (Bin 1 x Tip Tip) (Bin 1 rx Tip Tip)

(Bin rs rx rl@(Bin rls rlx rll rlr) rr@(Bin rrs _ _ _))

| rls < ratio*rrs -> Bin (1+rs) rx (Bin (1+rls) x Tip rl) rr

| otherwise -> Bin (1+rs) rlx (Bin (1+size rll) x Tip rll) (Bin (1+rrs+size rlr) rx rlr rr)

(Bin ls _ _ _) -> case r of

Tip -> Bin (1+ls) x l Tip

(Bin rs rx rl rr)

| rs > delta*ls -> case (rl, rr) of

(Bin rls rlx rll rlr, Bin rrs _ _ _)

| rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr

| otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)

(_, _) -> error "Failure in Data.Set.balanceR"

| otherwise -> Bin (1+ls+rs) x l r

{-# NOINLINE balanceR #-}

The NOINLINEs were added in b96ffeb citing code bloat issues. I think that's reasonable, balanceL and balanceR are moderately large functions. But let's try inlining them to see what happens.

We'll use the existing insert benchmark, which is pretty good for this purpose. It repeatedly inserts a maximum element into the set. This is close enough to typical usage (repeated insertion) but also a bad case (because it always makes one branch heavier). Benchmarks use GHC 9.6.6.

Current
 insert: OK
 684 μs ± 33 μs, 2.6 MB allocated, 55 KB copied, 7.0 MB peak memory
INLINE balanceR
 insert: OK
 463 μs ± 20 μs, 2.6 MB allocated, 56 KB copied, 7.0 MB peak memory, 32% less than baseline

That's a significant improvement, but it might come at the cost of code bloat.

But, there is a compromise:
Most calls to balanceR will be on already balanced trees. Performing rebalancing is the rare case. Using some unsafe IO, I gathered some counts to show this:

L.foldl' (flip insert) empty [1..100]

-- 1 to 50
-- rebalances
0 0 1 0 1 1 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1
0 1 2 2 3 3 3 3 4 4 4 4 5 5 4 4 5 5 5 5 6 6 5 5 6 6 6 6 7 7 5 5 6 6 6 6 7 7 6 6 7 7 7 7 8 8 6 6 7 7
-- calls to balanceR
-- 51 to 100
-- rebalances
1 0 1 2 1 0 1 1 1 0 1 4 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 4 1 0 1 1 1 0
7 7 8 8 7 7 8 8 8 8 9 9 6 6 7 7 7 7 8 8 7 7 8 8 8 8 9 9 7 7 8 8 8 8 9 9 8 8 9 9 9 9 10 10 7 7 8 8 8 8
-- calls to balanceR

Putting it another way, the number of calls to balanceR is O(n log n), but the number of rebalancings is O(n) (amortized O(1) per insert).

Which brings me to the suggestion: only inline the fast no-rebalance branch of balanceR. And to make that smaller, only the no-rebalance Bin-Bin branch.

--- a/containers/src/Data/Set/Internal.hs
+++ b/containers/src/Data/Set/Internal.hs
@@ -1881,7 +1881,14 @@ balanceL x l r = case r of
 -- balanceR is called when right subtree might have been inserted to or when
 -- left subtree might have been deleted from.
 balanceR :: a -> Set a -> Set a -> Set a
-balanceR x l r = case l of
+balanceR x l r = case (l, r) of
+ (Bin ls _ _ _, Bin rs _ _ _)
+ | rs <= delta*ls -> Bin (1+ls+rs) x l r
+ _ -> balanceR' x l r
+{-# INLINE balanceR #-}
+
+balanceR' :: a -> Set a -> Set a -> Set a
+balanceR' x l r = case l of
 Tip -> case r of
 Tip -> Bin 1 x Tip Tip
 (Bin _ _ Tip Tip) -> Bin 2 x Tip r
@@ -1894,14 +1901,12 @@ balanceR x l r = case l of
 (Bin ls _ _ _) -> case r of
 Tip -> Bin (1+ls) x l Tip
- (Bin rs rx rl rr)
- | rs > delta*ls -> case (rl, rr) of
+ (Bin rs rx rl rr) -> case (rl, rr) of
 (Bin rls rlx rll rlr, Bin rrs _ _ _)
 | rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr
 | otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)
 (_, _) -> error "Failure in Data.Set.balanceR"
- | otherwise -> Bin (1+ls+rs) x l r
-{-# NOINLINE balanceR #-}
+{-# NOINLINE balanceR' #-}

With this, we get

INLINE Bin-Bin no-rebalance
 insert: OK
 500 μs ± 30 μs, 2.6 MB allocated, 55 KB copied, 7.0 MB peak memory, 26% less than baseline

TL;DR:

To make balanceL/R calls faster, we can

Inline them. This might cause some code bloat. How bad would this be, I don't know. This improves insert by 31%.
Or, inline the "fast path" of Bin-Bin no-rebalance. A small increase in code size for a large share of the improvement above (26%).

Edit: I should mention that I expect many Set/Map functions will benefit from this, but at the moment I have tested out the idea only on insert.

Metadata

Assignees

No one assigned

Labels

Map Set performance

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make Set/Map balance calls faster #1053

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions