refactor: allow switching to bitpack inside RLE #5595

Open
Xuanwo wants to merge 1 commit into main
from xuanwo/int-score-compress

Collaborator

@Xuanwo Xuanwo commented Dec 30, 2025

In some cases, we initially considered using RLE but ultimately found that the data is better stored with bitpacking. This PR implements that change.

| Metric | Parquet (reference) | Lance (before change) | Lance (after change) | Delta (after vs before) |
| --- | --- | --- | --- | --- |
| int_score compressed size (bytes) | 56,035 | 377,838 | 71,556 | -306,282 (-81.06%) |
| int_score vs Parquet (ratio) | 1.00x | 6.74x | 1.28x | -5.47x |
| Lance chosen encoding (hint) | RLE_DICTIONARY (plus RLE, PLAIN, SNAPPY) | rle | inline_bitpacking | n/a |
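
For readers who want to re-derive the ratio and delta columns from the byte counts, here is a minimal sketch. The function names `ratio_vs_reference` and `delta` are hypothetical and are not part of the PR; only the three byte counts come from the table.

```rust
/// Ratio of a compressed size to the Parquet reference size.
/// (Hypothetical helper, for illustration only.)
fn ratio_vs_reference(size_bytes: u64, reference_bytes: u64) -> f64 {
    size_bytes as f64 / reference_bytes as f64
}

/// Signed change from `before` to `after`, returned as
/// (difference in bytes, difference as a percentage of `before`).
/// (Hypothetical helper, for illustration only.)
fn delta(before: u64, after: u64) -> (i64, f64) {
    let diff = after as i64 - before as i64;
    (diff, 100.0 * diff as f64 / before as f64)
}
```

Calling `delta(377_838, 71_556)` gives the byte and percentage change in the first row, and `ratio_vs_reference` applied to each Lance size against 56,035 gives the second row.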

Parts of this PR were drafted with assistance from Codex (with gpt-5.2) and fully reviewed and edited by me. I take full responsibility for all changes.

Contributor

Code Review

Summary: This PR adds logic to prefer bitpacking over RLE when bitpacking produces smaller output. The approach is sound and the test coverage is good.

P1 Issue: Estimation formula may underestimate bitpacking size

In estimate_inline_bitpacking_bytes, the calculation appears to assume all chunks are full 1024-element chunks. The words_per_chunk is hardcoded to 1 (for the bit-width header), but the actual implementation in InlineBitpacking::bitpack_chunked stores the bit-width as a single element of type T (e.g., 8 bytes for u64), not 1 byte.

Looking at compression.rs:241-247:

let words_per_chunk: u128 = 1;
let word_bytes: u128 = (bits / 8) as u128;
// ...
let packed_words = (1024u128 * bit_width) / (bits as u128);
total_words = total_words.saturating_add(words_per_chunk.saturating_add(packed_words));

This does account for the header as one word (one element of type T) per chunk plus the packed data words, so the header is not undercounted. The comparison that follows is bitpack_bytes < rle_bytes, and when it holds the RLE path returns None. The current logic is:

  • If bitpacking is smaller than RLE, skip RLE (return None)
  • Then bitpacking will be tried separately

This seems correct, but I'd suggest adding a brief comment in try_rle_for_mini_block explaining that we're checking if bitpacking would be better to avoid selecting RLE when it's not optimal.
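
The estimate and the decision discussed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in compression.rs: the names `estimate_bitpacked_bytes` and `should_skip_rle` are hypothetical, the per-chunk bit widths are passed in as a slice for simplicity, and every chunk is assumed to be a full 1024-value chunk.

```rust
/// Values per bitpacking chunk, matching the 1024 in the excerpt above.
const CHUNK_LEN: u128 = 1024;

/// Hypothetical estimate of the inline-bitpacked size in bytes.
/// `bits` is the width of the integer type (8, 16, 32 or 64) and
/// `chunk_bit_widths` holds the bit width needed by each chunk.
/// Assumption: every chunk is counted as a full 1024-value chunk.
fn estimate_bitpacked_bytes(bits: u32, chunk_bit_widths: &[u32]) -> u128 {
    let word_bytes = (bits / 8) as u128;
    let mut total_words: u128 = 0;
    for &bit_width in chunk_bit_widths {
        // One header word per chunk stores the bit width as an element of type T.
        let header_words: u128 = 1;
        // Packed payload, measured in words of the element type.
        let packed_words = (CHUNK_LEN * (bit_width as u128)) / (bits as u128);
        total_words = total_words.saturating_add(header_words.saturating_add(packed_words));
    }
    total_words.saturating_mul(word_bytes)
}

/// Hypothetical decision rule: skip RLE only when bitpacking is strictly smaller,
/// leaving the later bitpacking path to pick up the data.
fn should_skip_rle(bitpack_bytes: u128, rle_bytes: u128) -> bool {
    bitpack_bytes < rle_bytes
}
```

For one chunk of u64 values that need 3 bits each, the sketch counts 1 header word plus 1024 * 3 / 64 = 48 packed words, so 49 words of 8 bytes, or 392 bytes. Because the comparison is strict, a tie keeps RLE.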

Suggestion (not blocking)

Consider adding a comment in the RLE function explaining the fallback to bitpacking check:

// If bitpacking would produce smaller output than RLE, skip RLE
// and let the subsequent bitpacking check handle compression.

Test coverage

The new test test_low_cardinality_prefers_bitpacking_over_rle is well-designed and validates the behavior. The modification to test_rle_encoding_verification to use i32::MIN values ensures RLE is still tested when bitpacking can't help (high bit set means bit-width equals type width).
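
To illustrate why i32::MIN values leave bitpacking with nothing to save, here is a small sketch. It assumes the bit width is taken from the unsigned bit pattern of the largest value, which is what the "high bit set" remark suggests; `required_bit_width` is a hypothetical helper, not a function from the PR.

```rust
/// Hypothetical helper: bits needed to store every value when each i32 is
/// reinterpreted as its unsigned bit pattern (assumption, see lead-in).
fn required_bit_width(values: &[i32]) -> u32 {
    values
        .iter()
        // 32 minus the leading zero count is the position of the highest set bit.
        .map(|&v| 32 - (v as u32).leading_zeros())
        .max()
        .unwrap_or(0)
}
```

Under that assumption, a low-cardinality column such as `[0, 1, 2, 3]` needs 2 bits per value, while any column containing `i32::MIN` (bit pattern 0x8000_0000) needs the full 32 bits, so RLE remains the better candidate for long runs of it.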

LGTM with the optional suggestion above.


codecov bot commented Dec 30, 2025

Codecov Report

❌ Patch coverage is 98.21429% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-encoding/src/compression.rs | 98.21% | 1 Missing ⚠️ |
