GH-437: [Format] Specify VARIABLE_SIZE_LIST Logical type #438

Draft
rok wants to merge 1 commit into
apache:master from
rok:VARIABLE_SIZE_LIST

Conversation

@rok rok commented Jun 24, 2024

Member

This splits the VARIABLE_SIZE_LIST proposal out of #241, as suggested here.

GitHub issue

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and classes in the PR contain Javadoc that explains what they do

pitrou commented Mar 5, 2026

Member

What's the point of this?

rok commented Mar 5, 2026

Member Author

The intent was to define a variable sized list column type without repetition/definition levels. I suppose vector repetition level would address exactly this. We could reuse this PR for the purpose or just close it.

pitrou commented Mar 5, 2026

Member

> The intent was to define a variable sized list column type without repetition/definition levels

Why would it be any better than a LIST column? VECTOR is presumably for fixed-size lists...

rok commented Mar 5, 2026

Member Author

We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.

pitrou commented Mar 5, 2026

Member

> We would want a VECTOR-like design that would allow variable-size lists without per-element definition levels.

I think that's already possible if you have a LIST group node whose child node is REQUIRED.

rok commented Mar 5, 2026

Member Author

Even with required elements, LIST still needs repetition levels, and offsets must be derived by decoding those levels (at least over the target range), rather than read directly?
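
A minimal sketch of the derivation mentioned above, assuming a single-level schema of the form `required group values (LIST) { repeated group list { required float element; } }`, so the maximum repetition level and the maximum definition level are both 1. The function name `offsets_from_levels` is hypothetical and is not taken from any Parquet implementation; it only illustrates that list offsets are reconstructed by scanning the levels rather than read directly.

```python
def offsets_from_levels(rep_levels, def_levels, max_def_level=1):
    """Rebuild list offsets from Dremel-style levels (illustrative only).

    Assumes one level of nesting with required elements:
    - rep level 0 marks the start of a new list
    - def level == max_def_level marks a real element
    - def level < max_def_level marks an empty list (no element stored)
    """
    offsets = []
    count = 0  # number of element values seen so far
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            offsets.append(count)  # a new list begins at this value index
        if definition == max_def_level:
            count += 1
    offsets.append(count)  # closing offset for the last list
    return offsets


# Levels for the lists [[1.0, 2.0, 3.0], [], [4.0]]
rep = [0, 1, 1, 0, 0]
dfn = [1, 1, 1, 0, 1]
print(offsets_from_levels(rep, dfn))
```

Every level entry has to be visited to produce the offsets, which is the cost being pointed at here.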

pitrou commented Mar 5, 2026

Member

Well, yes, that's how Parquet works. Trying to stuff lists into opaque byte arrays doesn't sound like a tremendous idea to me.

rok commented Mar 5, 2026

Member Author

Right. This would make the format less optimizable at the element level. What would the other downsides be?

pitrou commented Mar 5, 2026

Member

The question is more whether the upsides are worth it. This hasn't been demonstrated.

rok commented Mar 5, 2026

Member Author

@rahil-c posted some performance findings on the mailing list, e.g. this table (I think it's all about fixed-size lists). It would be nice to have a form like your vector proposal for lists.

rahil-c commented Mar 5, 2026


@pitrou @rok These are the details of an experiment I ran locally, writing vectors to a Parquet file as a LIST of FLOAT versus backing them with a FIXED_LEN_BYTE_ARRAY, and trying different encodings and compression codecs. Note that the experiment was done from the perspective of what Parquet users can try today:
https://lists.apache.org/thread/q9b2lbz8h9loodpzso98wnj1x2tcr20h
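
As a rough illustration of what backing a vector with a FIXED_LEN_BYTE_ARRAY could look like (this is not the code from the linked experiment), the sketch below packs a fixed-dimension float32 vector into exactly `4 * dim` bytes. The helper names `pack_vector` and `unpack_vector` and the little-endian byte order are assumptions made for the example.

```python
import struct


def pack_vector(values, dim):
    """Pack `dim` float32 values into a fixed-length byte string (4 * dim bytes)."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} values, got {len(values)}")
    # "<" = little-endian (an assumption), "f" = 32-bit float
    return struct.pack(f"<{dim}f", *values)


def unpack_vector(blob, dim):
    """Inverse of pack_vector: decode a 4 * dim byte string into floats."""
    return list(struct.unpack(f"<{dim}f", blob))


blob = pack_vector([1.0, 2.5, -0.25], dim=3)
print(len(blob), unpack_vector(blob, dim=3))
```

With this layout the reader needs no repetition or definition levels per element, but the values are opaque bytes to the file format, so element-level encodings and statistics no longer apply.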

pitrou commented Mar 5, 2026

Member

This is off-topic as this PR is for VARIABLE_SIZE_LIST, not FIXED_SIZE_LIST.

rahil-c reacted with thumbs up emoji

rok commented Mar 5, 2026

Member Author

Yes, but the performance gains are likely indicative of what would be possible here. I suppose we'd best see the FIXED_SIZE_LIST debate play out first before continuing here.
