refactor: allow switching to bitpack inside RLE #5595
Code Review
Summary: This PR adds logic to prefer bitpacking over RLE when bitpacking produces smaller output. The approach is sound and the test coverage is good.
P1 Issue: Estimation formula may underestimate bitpacking size
In estimate_inline_bitpacking_bytes, the calculation appears to assume all chunks are full 1024-element chunks. The words_per_chunk is hardcoded to 1 (for the bit-width header), but the actual implementation in InlineBitpacking::bitpack_chunked stores the bit-width as a single element of type T (e.g., 8 bytes for u64), not 1 byte.
Looking at compression.rs:241-247:
    let words_per_chunk: u128 = 1;
    let word_bytes: u128 = (bits / 8) as u128;
    // ...
    let packed_words = (1024u128 * bit_width) / (bits as u128);
    total_words = total_words.saturating_add(words_per_chunk.saturating_add(packed_words));
This correctly accounts for the header as 1 word (element) per chunk plus the packed data words. However, the comparison should be bitpack_bytes < rle_bytes not bitpack_bytes < rle_bytes returning None for RLE. The current logic is:
- If bitpacking is smaller than RLE, skip RLE (return None)
- Then bitpacking will be tried separately
This seems correct, but I'd suggest adding a brief comment in try_rle_for_mini_block explaining that we're checking if bitpacking would be better to avoid selecting RLE when it's not optimal.
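For reference, the estimate and the comparison discussed above can be sketched as follows. This is a minimal, hypothetical reconstruction, not the PR's code: the function names `estimate_bitpacked_bytes` and `prefer_bitpacking`, their signatures, and the rounding of a partial final chunk up to a full 1024-element chunk are all assumptions made for illustration.

```rust
/// Hypothetical stand-in for the estimator discussed above (not the PR's code).
/// Estimates the bytes needed to bitpack `num_values` values of a `bits`-wide
/// integer type at `bit_width` bits per value.
/// Assumption: every chunk is costed as a full 1024-element chunk.
fn estimate_bitpacked_bytes(num_values: u64, bits: u32, bit_width: u32) -> u128 {
    const CHUNK: u128 = 1024;
    let word_bytes = (bits / 8) as u128;
    // Round up so that a partial final chunk is counted as a full one.
    let num_chunks = (num_values as u128 + CHUNK - 1) / CHUNK;
    // The bit-width header occupies one element of the value type per chunk.
    let words_per_chunk: u128 = 1;
    let packed_words = (CHUNK * bit_width as u128) / bits as u128;
    num_chunks
        .saturating_mul(words_per_chunk.saturating_add(packed_words))
        .saturating_mul(word_bytes)
}

/// Hypothetical decision helper: true when RLE should be skipped so that the
/// later bitpacking check can handle the data.
fn prefer_bitpacking(bitpack_bytes: u128, rle_bytes: u128) -> bool {
    bitpack_bytes < rle_bytes
}
```

Under these assumptions, 2048 u64 values at 3 bits each cost 2 chunks × (1 header word + 48 packed words) × 8 bytes = 784 bytes, so RLE would be skipped only if its own estimate exceeded 784 bytes.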
Suggestion (not blocking)
Consider adding a comment in the RLE function explaining the fallback to bitpacking check:
    // If bitpacking would produce smaller output than RLE, skip RLE
    // and let the subsequent bitpacking check handle compression.
Test coverage
The new test test_low_cardinality_prefers_bitpacking_over_rle is well-designed and validates the behavior. The modification to test_rle_encoding_verification to use i32::MIN values ensures RLE is still tested when bitpacking can't help (high bit set means bit-width equals type width).
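To illustrate why i32::MIN defeats bitpacking, here is a small sketch of a bit-width calculation. The helper `required_bit_width` is hypothetical, and it assumes the packer derives the width from the unsigned bit pattern of each value (no frame-of-reference or zigzag step), which matches the "high bit set" reasoning above but is not confirmed against the implementation.

```rust
/// Hypothetical helper (not from the PR): the number of bits needed to hold
/// every value in `values` when each i32 is reinterpreted as its u32 bit pattern.
fn required_bit_width(values: &[i32]) -> u32 {
    // OR all bit patterns together; the highest set bit determines the width.
    let combined = values.iter().fold(0u32, |acc, &v| acc | v as u32);
    32 - combined.leading_zeros()
}
```

With this helper, a run of i32::MIN values needs all 32 bits because the sign bit is set, so bitpacking saves nothing and RLE remains the better choice. Small non-negative values such as 0 through 3 need only 2 bits.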
LGTM with the optional suggestion above.
In some cases, we initially considered using RLE but ultimately found that the data is better stored with bitpacking. This PR implements that change.
[Table: int_score compressed size (bytes) and int_score vs Parquet (ratio) for RLE_DICTIONARY (plus RLE, PLAIN, SNAPPY), rle, and inline_bitpacking]
Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.