Conversation
pitrou
commented
Mar 5, 2026
What's the point of this?
rok
commented
Mar 5, 2026
The intent was to define a variable sized list column type without repetition/definition levels. I suppose vector repetition level would address exactly this. We could reuse this PR for the purpose or just close it.
pitrou
commented
Mar 5, 2026
> The intent was to define a variable sized list column type without repetition/definition levels
Why would it be any better than a LIST column? VECTOR is presumably for fixed-size lists...
rok
commented
Mar 5, 2026
We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.
pitrou
commented
Mar 5, 2026
> We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.
I think that's already possible if you have a LIST group node whose child node is REQUIRED.
rok
commented
Mar 5, 2026
Even with required elements, LIST still needs repetition levels, and offsets must be derived by decoding those levels (at least over the target range), rather than read directly?
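To make the point above concrete, here is a minimal sketch of how a reader recovers list offsets from the levels. It assumes a single-level list column with the schema shown in the comment (maximum repetition level 1, maximum definition level 1); `offsets_from_levels` is a hypothetical helper written for illustration, not part of any Parquet library.

```python
# Assumed schema (illustrative):
#   required group v (LIST) { repeated group list { required float element; } }
# Max repetition level = 1, max definition level = 1.

def offsets_from_levels(rep_levels, def_levels, max_def_level=1):
    """Derive Arrow-style list offsets by scanning the level streams."""
    offsets = [0]
    for rep, dfn in zip(rep_levels, def_levels):
        if rep == 0:
            # Repetition level 0 starts a new row, so open a new list
            # at the current value position.
            offsets.append(offsets[-1])
        if dfn == max_def_level:
            # A value is physically present only at the maximum
            # definition level; an empty list has a lower level.
            offsets[-1] += 1
    return offsets


# Rows: [1.0, 2.0, 3.0], [], [4.0]
rep_levels = [0, 1, 1, 0, 0]
def_levels = [1, 1, 1, 0, 1]
offsets = offsets_from_levels(rep_levels, def_levels)
print(offsets)  # [0, 3, 3, 4]
```

The offsets fall out only after every level entry up to the target row has been decoded, which is the cost being discussed here. Note that even with required elements, the definition level is still needed to tell an empty list from a one-element list.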
pitrou
commented
Mar 5, 2026
Well, yes, that's how Parquet works. Trying to stuff lists of opaque byte arrays doesn't sound like a tremendous idea to me.
rok
commented
Mar 5, 2026
Right. This would make the format less optimizable at the element level. What would the other downsides be?
pitrou
commented
Mar 5, 2026
The question is more whether the upsides are worth it. This hasn't been demonstrated.
@rahil-c posted some performance findings on the mailing list, e.g. this table (I think it's all about fixed-size lists). It would be nice to have something like your vector proposal for lists.
@pitrou @rok These are the details of the experiment I tried locally, writing some vectors to a Parquet file as a LIST of FLOAT versus backing them with a FIXED_LEN_BYTE_ARRAY, and trying different encodings and compression codecs. Note that the experiment was done from the perspective of what Parquet users can try today:
https://lists.apache.org/thread/q9b2lbz8h9loodpzso98wnj1x2tcr20h
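For readers unfamiliar with the FIXED_LEN_BYTE_ARRAY approach mentioned above, the following is a rough sketch of the idea using only the Python standard library. It is not the code from the experiment: `DIM`, `pack_vector`, and `unpack_vector` are hypothetical names, and the little-endian float32 layout is an assumption chosen for illustration.

```python
import struct

DIM = 4                  # hypothetical vector dimension
VALUE_WIDTH = 4 * DIM    # bytes per float32 vector


def pack_vector(vec):
    """Serialize one vector as little-endian float32 bytes."""
    assert len(vec) == DIM
    return struct.pack(f"<{DIM}f", *vec)


def unpack_vector(buf):
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{DIM}f", buf))


vectors = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.0, -1.0]]

# Every value has the same width, so the column is one flat buffer.
column = b"".join(pack_vector(v) for v in vectors)

# Row i is found by arithmetic alone, with no levels to decode.
row = 1
second = unpack_vector(column[row * VALUE_WIDTH:(row + 1) * VALUE_WIDTH])
print(second)  # [0.5, 0.25, 0.0, -1.0]
```

The trade-off raised earlier in the thread still applies: once the floats are packed this way, the file format sees only opaque bytes, so element-level encodings and statistics are not available.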
pitrou
commented
Mar 5, 2026
This is off-topic as this PR is for VARIABLE_SIZE_LIST, not FIXED_SIZE_LIST.
rok
commented
Mar 5, 2026
Yes, but the performance gains are likely indicative of what would be possible here. I suppose we'd best see the FIXED_SIZE_LIST debate play out first before continuing here.
This is to split VARIABLE_SIZE_LIST proposal from #241 as suggested here.