feat: Add LargeList support#438
As for the ASF Generative Tooling Guidance:
Anthropic's Commercial Terms still state:
Anthropic agrees that Customer (a) retains all rights to its Inputs, and (b) owns its Outputs.
So, I can confirm that:
- The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
- The output is not copyrightable subject matter (and would not be even if produced by a human).
- No third party materials are included in the output.
kou
commented
May 20, 2026
@kylebarron @supermacro @pmaciolek @GeorgeLeePatterson Could you review this?
(You're in related issue/PR: #70 #299)
Pull request overview
This PR adds end-to-end support for the Arrow LargeList data type (64-bit offsets via BigInt64Array) to the JavaScript bindings, including IPC round-tripping, visitor dispatch, builders, and test coverage.
Changes:
- Introduces `Type.LargeList` and `LargeList<T>` with 64-bit offset handling, plus `DataType.isLargeList()` and `makeData()` support.
- Wires `visitLargeList()` across core visitors (get/set/iterator/indexOf, assemblers/loaders, type comparator, JSON + FlatBuffers type/vec assembly) and IPC field-type decoding.
- Adds `LargeListBuilder` and expands tests to include LargeList in generated-data, builder, and visitor matrices.
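To make the 64-bit offset layout concrete, here is a minimal TypeScript sketch of how one list can be located through an offsets buffer of either width. The names `narrowOffset` and `getListValues` are hypothetical and are not the library's API, and the sketch assumes a flat `Float64Array` child rather than a real Arrow `Data` object.

```typescript
// Hypothetical helper (not apache-arrow's API): narrow an offset to a JS number,
// throwing instead of silently losing precision.
function narrowOffset(value: number | bigint): number {
    if (typeof value === 'number') {
        return value;
    }
    if (value > BigInt(Number.MAX_SAFE_INTEGER) || value < BigInt(Number.MIN_SAFE_INTEGER)) {
        throw new TypeError(`Offset ${value} is outside the safe integer range`);
    }
    return Number(value);
}

// Return the child values that belong to list `index`.
// List uses Int32Array offsets; LargeList uses BigInt64Array offsets.
function getListValues(
    offsets: Int32Array | BigInt64Array,
    child: Float64Array,
    index: number,
): Float64Array {
    const begin = narrowOffset(offsets[index]);
    const end = narrowOffset(offsets[index + 1]);
    return child.subarray(begin, end); // zero-copy view into the child buffer
}

const child = new Float64Array([1, 2, 3, 4, 5]);
const listOffsets = new Int32Array([0, 2, 5]);
const largeListOffsets = new BigInt64Array([BigInt(0), BigInt(2), BigInt(5)]);

// Both offset widths address the same child range for list 1 (values 3, 4, 5).
console.log(Array.from(getListValues(listOffsets, child, 1)));
console.log(Array.from(getListValues(largeListOffsets, child, 1)));
```

Only the offset width differs between the two calls; the narrowing happens once at the boundary, which is the pattern the overview describes for the merged get/set helpers.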
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/type.ts` | Adds `LargeList<T>` type and `DataType.isLargeList()` guard. |
| `src/enum.ts` | Adds `Type.LargeList = 21`. |
| `src/data.ts` | Adds `LargeListDataProps` + `MakeDataVisitor.visitLargeList` (BigInt64 offsets). |
| `src/visitor.ts` | Adds `visitLargeList` dispatch and dtype inference support. |
| `src/ipc/metadata/message.ts` | Decodes LargeList field types from IPC metadata. |
| `src/visitor/*` | Implements/registrations for `visitLargeList` across visitors (loader/assembler/get/set/etc.). |
| `src/util/buffer.ts` | Fixes `rebaseValueOffsets` to work with `BigInt64Array`. |
| `src/builder/largelist.ts` | Introduces `LargeListBuilder`. |
| `src/builder.ts` / `src/interfaces.ts` / `src/visitor/builderctor.ts` | Wires `LargeListBuilder` into builder/type mappings and ctor selection. |
| `src/Arrow.ts` / `src/Arrow.dom.ts` | Exports `LargeList` and `LargeListBuilder` as public API. |
| `test/generate-test-data.ts` + unit tests | Adds LargeList generators and includes LargeList in test matrices. |
Good catch, fixed
Implement full support for the LargeList data type, which uses 64-bit offsets (BigInt64Array) instead of 32-bit offsets, enabling list values larger than 2GB.

Type and data:
- Add LargeList type class with BigInt64Array offset support, DataType.isLargeList guard, and Type.LargeList enum entry
- Add MakeDataVisitor.visitLargeList and LargeListDataProps overload in data.ts (widens 32-bit offsets via toBigInt64Array)
- Map LargeList through TypeToDataType, TypeToBuilder, and DataTypeToBuilder in interfaces.ts

Visitors (read/write/compare/mutate):
- visitor.ts: base visitLargeList + Type.LargeList dispatch in getVisitFnByTypeId and inferDType
- get.ts: merge getList/getLargeList into a single helper using bigIntToNumber at the offset boundary (works for both Int32Array and BigInt64Array offsets)
- set.ts: same merge for setList; register visitLargeList
- iterator.ts, indexof.ts: register visitLargeList (vectorIterator / indexOfValue work unchanged)
- typecomparator.ts: widen compareList to List | LargeList and register for visitLargeList (structural comparison is offset-width agnostic)
- typeassembler.ts, jsontypeassembler.ts: emit LargeList flatbuffer node and JSON name
- vectorloader.ts: visitLargeList mirrors visitList; base readOffsets honors OffsetArrayType (BigInt64Array for LargeList)
- vectorassembler.ts: generalize assembleListVector to cover LargeList via bigIntToNumber, register visitLargeList
- jsonvectorassembler.ts: visitLargeList emits OFFSET via bigNumsToStrings, matching LargeUtf8 / LargeBinary
- ipc/metadata/message.ts: decodeFieldType handles Type.LargeList

Builders:
- New src/builder/largelist.ts (LargeListBuilder), mirroring ListBuilder with BigInt() for offset accumulation and Number() coercion when passing the start index to child.set
- Widen VariableWidthBuilder bound to include LargeList in builder.ts
- builderctor.ts: GetBuilderCtor.visitLargeList returns LargeListBuilder
- Export LargeListBuilder from Arrow.ts and Arrow.dom.ts

Latent bug fix:
- util/buffer.ts: rebaseValueOffsets now coerces its number offset to BigInt when the offsets array is BigInt64Array. Previously a non-zero offset on a 64-bit offsets array would TypeError on bigint += number; fix is required for LargeList IPC writes with non-zero slice offsets and also fixes the same latent issue on LargeUtf8 / LargeBinary

Tests:
- generate-test-data.ts: factor a shared generateListLike helper used by both generateList (Int32Array offsets) and generateLargeList (BigInt64Array offsets); truncate min/max in createVariableWidthOffsets64 so fractional stride values don't RangeError in BigInt()
- generated-data-tests.ts: LargeList case added to the matrix
- builders/builder-tests.ts: LargeListBuilder entry added alongside ListBuilder / FixedSizeListBuilder / MapBuilder
- visitor-tests.ts: visitLargeList added to BasicVisitor / FeatureVisitor plus describe entries; fix missing comma in the import list that would have broken compilation

API surface:
- Export LargeList and LargeListBuilder from Arrow.ts and Arrow.dom.ts

The implementation follows existing code patterns. All tests pass.

Closes apache#70

Co-Authored-By: Claude Code <noreply@anthropic.com>
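The latent bug described in the commit message comes from a JavaScript rule: `bigint` and `number` cannot be mixed in arithmetic, so adding a `number` to an element of a `BigInt64Array` throws a `TypeError` at runtime. The following is a minimal sketch of the coercion idea only. `shiftOffsets` is a hypothetical stand-in, not the actual `rebaseValueOffsets` implementation, and its signature is an assumption.

```typescript
// Hypothetical stand-in for the behaviour described above; not the library's code.
// Shifts every offset by `delta` and returns a new array of the same width.
function shiftOffsets(
    offsets: Int32Array | BigInt64Array,
    delta: number,
): Int32Array | BigInt64Array {
    if (offsets instanceof BigInt64Array) {
        // bigint and number cannot be mixed, so coerce the delta once up front.
        const bigDelta = BigInt(delta);
        const shifted64 = new BigInt64Array(offsets.length);
        for (let i = 0; i < offsets.length; i++) {
            shifted64[i] = offsets[i] + bigDelta;
        }
        return shifted64;
    }
    const shifted32 = new Int32Array(offsets.length);
    for (let i = 0; i < offsets.length; i++) {
        shifted32[i] = offsets[i] + delta;
    }
    return shifted32;
}

// The same call works for both offset widths.
const shiftedLarge = shiftOffsets(new BigInt64Array([BigInt(0), BigInt(3)]), 10);
const shiftedSmall = shiftOffsets(new Int32Array([0, 3]), 10);
console.log(shiftedLarge, shiftedSmall);
```

Coercing the delta once keeps the loop body free of per-element conversions and leaves the 32-bit path untouched.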
6246493 to
050a2ec
Compare
@domoritz
domoritz
left a comment
Could you compare with #299 and explain what's different?
This is a real, but pre-existing bug. I will open a separate PR for it. I am not doing the suggested workaround here.
This bug is unrelated to this PR and does not block it; it also affects multiple other types.
Tracked in #439
> Could you compare with #299 and explain what's different?
I started this PR by repeating the original #299 verbatim, and then building on it for full parity of every feature available for the List type, while considering feedback to the original PR.
The scope of #299 - a LargeList type stub, not even quite an MVP: you could manually build a LargeList column and read individual values out of one already in memory. Anything beyond that - saving it to an Arrow file, loading it in IPC binary or JSON form, comparing two LargeList schemas, modifying values, or constructing one through the normal Builder API - was either silently broken or missing entirely.
The scope of this PR - LargeList as a fully-featured column type, on par with List. You can build one, read and write values, compare schemas, iterate, slice, serialize it to the Arrow IPC binary or JSON form, and round-trip it back - all with safe handling of 64-bit offsets and clear errors (rather than silent corruption) if a value would exceed JavaScript's safe integer range. Existing List, LargeUtf8, and LargeBinary types also get a latent slicing-on-write bug fixed as a side effect.
Added on top of the original PR
- IPC write path: `VectorAssembler.visitLargeList` + generalized `assembleListVector` with `bigIntToNumber` coercion
- IPC read path: `VectorLoader.visitLargeList`
- JSON form: `JSONVectorAssembler.visitLargeList` (OFFSET via `bigNumsToStrings`)
- Type comparison: `TypeComparator.visitLargeList` (widened `compareList` to `List | LargeList`)
- Mutation: `SetVisitor.visitLargeList` (merged into `setList` via `bigIntToNumber`)
- Iterator interface declaration: completes `IteratorVisitor` typing
- LargeListBuilder: new `src/builder/largelist.ts` with `BigInt()` accumulation and `bigIntToNumber` for safe child-cursor narrowing
- Builder plumbing: widened `VariableWidthBuilder` bound; `GetBuilderCtor.visitLargeList`; `LargeList` / `LargeListBuilder` mapped into `TypeToDataType`, `TypeToBuilder`, `DataTypeToBuilder`
- Public API: exported `LargeListBuilder` from `Arrow.ts` / `Arrow.dom.ts`
- Latent bug fix: `rebaseValueOffsets` now bigint-safe (also fixes silent breakage on `LargeUtf8` / `LargeBinary` sliced writes)
- "Get" path consolidation: `getList` / `getLargeList` merged with `bigIntToNumber` boundary coercion
- Test generator robustness: shared `generateListLike`; `createVariableWidthOffsets64` truncates min / max on entry so fractional stride doesn't `RangeError` in `BigInt()`
- Test coverage: `LargeListBuilder` entry in `builder-tests.ts`; `visitLargeList` in `BasicVisitor` / `FeatureVisitor` and both describe matrices in `visitor-tests.ts`
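The test-generator item above relies on the fact that `BigInt()` accepts only integral numbers: `BigInt(2.5)` throws a `RangeError`. A minimal sketch of the truncate-before-convert idea follows. `toBigIntOffset` and `buildOffsets64` are hypothetical names and do not reproduce the body of `createVariableWidthOffsets64`.

```typescript
// BigInt() rejects fractional numbers, so truncate before converting.
// Hypothetical helper mirroring the "truncate on entry" idea; not the test generator's code.
function toBigIntOffset(value: number): bigint {
    return BigInt(Math.trunc(value));
}

// Build monotonically increasing 64-bit offsets from a possibly fractional stride.
function buildOffsets64(length: number, stride: number): BigInt64Array {
    const offsets = new BigInt64Array(length + 1); // offsets[0] stays 0
    for (let i = 1; i <= length; i++) {
        offsets[i] = toBigIntOffset(i * stride);
    }
    return offsets;
}

// A stride of 10 / 4 = 2.5 would make a bare BigInt(i * stride) throw for odd i.
const strideOffsets = buildOffsets64(3, 10 / 4);
console.log(Array.from(strideOffsets).map((o) => o.toString())); // "0", "2", "5", "7"
```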
GeorgeLeePatterson
commented
May 21, 2026
GeorgeLeePatterson
commented
May 21, 2026
I noticed test coverage is missing. The tests missing are:
- slicing
- IPC/JSON
- overflow tests
Also, the LargeList class has no public-facing JSDoc on overflow semantics; it would be helpful to provide some.

> and call out the parts that genuinely differ

Sure, I'll add a reference, but what is the purpose of inspecting #325 to compile the list of differences from it? In general, its scope went beyond LargeList, which is why I did not consider it as a base for the new PR. Are you implying that it may have some functionality not accounted for in this PR that would be nice to have?

> Happy to see it land finally

Not yet, fingers crossed :)
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Karakatiza666
commented
May 21, 2026
@GeorgeLeePatterson , thanks for the pointer!
Unless you mean something else, slicing is covered by validateVector, run from test/unit/generated-data-tests.ts.
Addressed other points.
Also, I noticed that I had missed a case for typeFromJSON in src/ipc/metadata/json.ts, and one in src/visitor/typector.ts. The only relevant thing in #325 that was missing here was the change in typector.ts. Otherwise this PR looks like a superset of #325 in terms of LargeList implementation; this PR does not include extra LargeListView tests or other *ListView additions.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
If the Arrow spec's LargeList offset type is int64_t, then it should be in the JS impl too. Clamping offsets to 53 bits isn't LargeList, it's a KindaLargeList.
There are multiple 64-bit types to which the same tradeoff is applied. I'll comment on the details in a bit, but preliminarily: adding true 64-bit indexing is a much larger scope and would apply to more than just LargeList.
Yes, if it was less work we would have already done it. Correctness is not the place to compromise when there is a spec and an ecosystem of other implementations we need to interoperate with.
I either wasn't involved or skimmed reviewing the LargeUtf8 and LargeBinary PRs, but I would have said the same thing there. Luckily most of the helper methods that need to be reimplemented in terms of bigint are straightforward and can be reused for LargeUtf8/LargeBinary.
@trxcllnt
trxcllnt
left a comment
Need to use BigInt offsets everywhere, don't convert the BigInts to Numbers.
@trxcllnt I appreciate you raising this as I take the spec compliance point seriously, but I want to lay out why I think the full 64-bit indexing is a much larger conversation that goes way beyond the scope of this PR, and is not a prerequisite for landing it.
TLDR: The Number precision is not the bottleneck for what can be represented using the Large* types, the current project architecture is.
After analyzing the problem it seems to me you underestimate the amount of combined design, discussion and implementation effort needed to implement full 64-bit indexing.
The constraint isn't the offset arithmetic - BigInt64Array already stores offsets at full 64-bit precision in this PR, and the helpers around bigint are indeed straightforward.
The constraint is on the child elements buffer - the contiguous buffer holding the actual list elements that the offsets point into. All rows of a list-like column in one Arrow batch share a single such buffer, and that buffer is a JavaScript ArrayBuffer, which every major engine caps at roughly 2^32 bytes. Since offset values are positions into this buffer, the maximum offset value is bounded by the buffer's element count (~2^32 / sizeof(element)) - many orders of magnitude below 2^53, let alone 2^63. So the int64 offset width is headroom the JS runtime physically can't fill in a single contiguous child buffer.
To fully support LargeList rows beyond that cap, the child can't be a single Data's elements buffer - it has to be chunked across multiple Data objects behind a Vector-style rope. That's a change to a load-bearing invariant (Data.children[0] is a single contiguous Data) that the entire visitor framework relies on, and it would also need to flow into IPC read/write semantics, since the wire writer today assumes one contiguous child per batch.
The same reasoning applies to LargeUtf8 and LargeBinary: their backing values buffer is a Uint8Array, which has the same per-buffer ceiling. So the 53-bit narrowing isn't a quirk introduced by this PR - it's the existing project-wide policy for every Large* type, and matching it is what I meant by "feature parity with List".
At the same time the same buffer-size constraint also bounds plain List. The spec allows List up to 2^31 child elements, but JS's per-ArrayBuffer cap means a List<Float64> tops out around 2^29 elements - a quarter of the spec ceiling. So the property "JS implementation matches the spec's offset range" isn't currently true for List either, and bringing it up to that bar means the same chunked-children redesign, applied across List/LargeList/LargeUtf8/LargeBinary together. That's the design discussion I'd want to open as a separate issue rather than fold it in a wiring PR.
Unfortunately, designing and shepherding through a chunked-children rework - across all four types, across the visitor framework, the Builder layer, and the IPC reader/writer - is beyond what I can take on, either as a standalone PR or, especially, as part of this PR. I'm comfortable leaving this PR open and rebasing it once the community settles on a design for compliant >2^32-byte child storage that applies uniformly to the Large* family.
Otherwise, I maintain that the current PR stands on its own:
- it's fully featured;
- it introduces no new constraints or regressions and follows the existing implementation and API surface conventions;
- it is very useful to the community to close the feature gap despite not being fully compliant on the maximum element count - today the library throws when it encounters a LargeList.
P.S. conforming producers should normalize the offsets in a batch, so in practice as long as the batch spans no more elements than can fit in the child buffer the current 2^53 arithmetic is unused headroom.
v8 supports ArrayBuffers larger than 4GiB:
```shell
# Allocate a 64GiB ArrayBuffer
$ { node -p 'var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024' & }; pid=$!; sleep 1; ps -p $pid -o vsz,rss,cmd; wait $pid
[1] 1431480
     VSZ   RSS CMD
68109056 41444 node -p var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024
67108864
[1]+  Done                    node -p 'var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024'
```
And it's not unusual for other language implementations to create RecordBatches larger than 4GiB (e.g. by using table.combine_chunks() from Python).
That said, it's unlikely anyone needs to address 8192 TiB of memory. I agree with your point that 53-bit ints are fine for indices, thanks for walking me through that 😅.
> JS's per-ArrayBuffer cap means a `List<Float64>` tops out around 2^29 elements - a quarter of the spec ceiling
Could you explain this more? List's offsets are Int32Array, which means its child should be able to support up to 2^31 individual elements. How did you arrive at 2^29 for the max child vector length?
edit:
I think I understand the confusion. You're saying if ArrayBuffer size was limited to 4GiB, then you could only represent 2^(32-4) 32-bit integers. Even if that were true, that doesn't mean the List child length is constrained, it would mean the number of lists that the ListVector represented would be constrained.
The child could have 2GiB of uint8_t values, and the ListVector offsets buffer could contain ((2^28)+1) offsets, but the offset values themselves are still in the range 0..2^31. This is true in all other Arrow implementations as well.
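The distinction drawn above, between how many lists a column holds and how many child elements its offsets address, can be illustrated with a small sketch. `describeListLayout` and `ListLayout` are hypothetical names used only for this illustration.

```typescript
interface ListLayout {
    listCount: number;   // number of lists = offsets.length - 1
    childLength: number; // child elements addressed = last offset - first offset
}

// Hypothetical helper: summarize what an Int32 offsets buffer describes.
function describeListLayout(offsets: Int32Array): ListLayout {
    if (offsets.length === 0) {
        return { listCount: 0, childLength: 0 };
    }
    return {
        listCount: offsets.length - 1,
        childLength: offsets[offsets.length - 1] - offsets[0],
    };
}

// Two layouts over the same six child elements:
const manyShortLists = describeListLayout(new Int32Array([0, 1, 2, 3, 4, 5, 6]));
const oneLongList = describeListLayout(new Int32Array([0, 6]));
console.log(manyShortLists, oneLongList);
```

The size of the offsets buffer grows with the number of lists, while the offset values are bounded by the child length, which for `Int32Array` offsets cannot exceed the int32 range regardless of how many lists there are.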
Karakatiza666
commented
May 25, 2026
Fair point — my List<Float64> analogy conflated two different things, and your clarification helps. Compliance is about correctly interpreting offsets the wire format presents, not about the implementation realizing enough child storage to exercise the full offset range. Under that definition, List is conforming today (int32 offset values all fit in Number), and LargeList/LargeUtf8/LargeBinary aren't, because bigIntToNumber throws on offset values > 2^53.
One thing I wasn't sure about in your reply: when you said "that doesn't mean the List child length is constrained, it would mean the number of lists that the ListVector represented would be constrained" I couldn't tell whether you were pointing at a mechanism in the codebase that sidesteps the single-ArrayBuffer limit on child storage — I didn't find one, so I focused on the wire-format parsing compliance angle, which is what a new commit I pushed addresses.
The commit: VectorLoader now rebases offsets to 0 on load for LargeList, LargeUtf8, and LargeBinary. After rebasing, in-memory offsets are always bounded by the child buffer's element count (which fits in Number for anything the runtime can allocate), so downstream narrowing succeeds for any spec-conforming wire input — including sliced views with absolute, non-rebased offsets. Inputs whose offsets imply a child buffer larger than JS's ArrayBuffer cap now fail honestly at child-buffer allocation in readData, rather than later at offset narrowing — a cleaner failure mode that's a property of the JS runtime, not of the implementation.
Please let me know if this commit addresses your primary concern.
The remaining ceiling — JS's per-ArrayBuffer cap of ~2^32 bytes on a single contiguous child buffer — is allocation capacity, not interpretation, and it applies uniformly across List/LargeList/LargeUtf8/LargeBinary. Lifting it would mean a chunked-children redesign (child as Vector rather than single Data<U>), which is a substantial design change. As I previously expressed, I think that's a separate, more ambitious effort and deserves its own issue.
I don't understand this comment and implementation. If the producer already rebased the offsets buffer values relative to the slice offset, the first offset value should always be zero. If it isn't, it seems that's the real bug?
trxcllnt
commented
May 26, 2026
> I couldn't tell whether you were pointing at a mechanism in the codebase that sidesteps the single-ArrayBuffer limit on child storage
> ...
> The remaining ceiling - JS's per-ArrayBuffer cap of ~2^32 bytes on a single contiguous child buffer - is allocation capacity, not interpretation, and it applies uniformly across List/LargeList/LargeUtf8/LargeBinary. Lifting it would mean a chunked-children redesign (child as Vector rather than single Data), which is a substantial design change. As I previously expressed, I think that's a separate, more ambitious effort and deserves its own issue.
As demonstrated by the example I pasted above, there is no uniform 4GiB cap on ArrayBuffer size in JS. The only limit to List child size is that the value offsets type is Int32, thus the child can only contain 2^31 elements. This means the ListVector could represent a column of entirely single-element lists, but there could only be 2^31 of them due to the offset type being Int32, not any limit in ArrayBuffer allocation size.
> I focused on the wire-format parsing compliance angle, which is what a new commit I pushed addresses
I don't think this commit is right. The reader should be zero-copy, and even if it was the correct thing to do, rebasing offsets when reading is not zero-copy.
11dc172 to
1e0c320
Compare
Karakatiza666
commented
May 31, 2026
You're right, I've pushed the updated commit. This version keeps the reader zero-copy for everything under 2^53 and only falls back to a copy when a rebase is necessary:
- Final offset ≤ 2^53 (common): offsets returned as a zero-copy BigInt64Array view over the wire bytes — no allocation, no copy. Narrowing to number stays lazy (at element access) and is always lossless.
- Final offset > 2^53, span ≤ 2^53 (slice with absolute offsets): rebased to 0 in bigint precision into a fresh buffer — the only path that copies — so the values stay narrowable.
- Final offset > 2^53, span > 2^53: unreachable in practice — needs a child buffer > 2^53 elements present in the message, far past any engine's allocation limit.
- bigIntToNumber 2^53 guard: after the rebase, never tripped by well-formed wire data; only reachable via hand-built in-memory Data (makeData, not rebased by design) — a loud failure for corrupt input.
- Offset < 2^53 but past the child buffer: clamps silently in subarray/slice (pre-existing - built in methods behavior, unchanged).
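The first three cases in the list above can be sketched as follows. This is an illustration of the described strategy under stated assumptions, not the code in the commit: the name `readLargeOffsetsSketch`, its signature, and the choice to throw in the "unreachable" case are all hypothetical.

```typescript
const MAX_SAFE = BigInt(Number.MAX_SAFE_INTEGER);

// Hypothetical sketch of the strategy in the list above; name and shape are assumptions.
function readLargeOffsetsSketch(wire: BigInt64Array): BigInt64Array {
    if (wire.length === 0) {
        return wire;
    }
    const first = wire[0];
    const last = wire[wire.length - 1];
    if (last <= MAX_SAFE) {
        return wire; // common case: same view, no allocation, no copy
    }
    if (last - first > MAX_SAFE) {
        // Described above as unreachable in practice; treated as an error in this sketch.
        throw new RangeError('Offset span exceeds the safe integer range');
    }
    // Slice with absolute offsets: rebase to 0 in bigint precision (the only path that copies).
    const rebased = new BigInt64Array(wire.length);
    for (let i = 0; i < wire.length; i++) {
        rebased[i] = wire[i] - first;
    }
    return rebased;
}

// Offsets starting at 2^60 span only 7 elements, so they rebase to 0, 3, 7.
const base = BigInt(1) << BigInt(60);
const absoluteOffsets = new BigInt64Array([base, base + BigInt(3), base + BigInt(7)]);
const rebasedOffsets = readLargeOffsetsSketch(absoluteOffsets);
console.log(Array.from(rebasedOffsets).map((o) => o.toString()));
```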
Karakatiza666
commented
Jun 3, 2026
@trxcllnt I'd appreciate it if you helped me carry this PR through review!
kou
commented
Jun 4, 2026
Could you check CI failures?
1e0c320 to
09fbbb1
Compare
Karakatiza666
commented
Jun 4, 2026
I resolved the issue: I needed to avoid running the new test under the UMD build. The test is still executed in other builds.
LargeList/LargeUtf8/LargeBinary offsets are read as a zero-copy BigInt64Array view over the wire bytes and narrowed to a JS number lazily, at element access. When the final offset exceeds Number.MAX_SAFE_INTEGER — a slice serialized with absolute, non-rebased offsets — readLargeOffsets rebases to 0 in bigint precision so the values stay narrowable. This is the only path that copies; offsets within the safe-integer range are passed through untouched. The bigIntToNumber guard remains for offsets > 2^53 that survive to access (only reachable via hand-built in-memory Data), failing loudly on corrupt input rather than truncating.
09fbbb1 to
7b0727d
Compare
Are there any language implementations that actually produce an IPC stream with, "a sliced array serialized with absolute (non-rebased) offsets, whose values can exceed Number.MAX_SAFE_INTEGER even when the referenced span is small?" IIUC rebasing valueOffsets to zero is part of IPC writers across the board.
Claude said the same thing; it sounded like it's more of a concern with manually, in-memory generated data. This fallback sounded to me like a low-cost defensive measure; LMK if you'd rather have it fail loudly to avoid adding the fallback logic.
The format spec for variable-sized layouts states:
> Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.
While this sounds permissive, I don't know of any language implementations that don't strictly enforce this.
There are many possible ways to construct non-spec-conforming IPC streams, but it either isn't practical to attempt to catch them all, or we assume the user knows and has a very good reason for doing what they're doing.
For example, we also don't validate whether the offsets buffer for List, Utf8, or Binary vectors start with offset 0, because we assume the IPC stream is from a conforming implementation.
There are a few instances where I've taken advantage of ambiguities or UB in the spec to achieve specific results when working with certain technologies. For example, I've previously implemented a custom in-memory layout with large contiguous buffers on GPUs that included the IPC metadata for reading the contiguous buffers in chunks. I intentionally didn't rebase valueOffsets in that implementation, because the consumer mapped the valueOffsets to VBOs for GPU shaders. While this is technically UB in the spec, the CPU IPC streaming code shouldn't prohibit doing this.
I think we should remove this function and revert to calling readOffsets() like before.
Makes sense. I reverted to readOffsets() and removed readLargeOffsets. No widening of readOffsets was needed: it returns the raw offset bytes and makeData already picks toBigInt64Array for the Large* types, so they still get true int64 offsets.
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
Thank you for helping to land this PR!
This PR was co-authored with Claude Code.
Summary
This PR builds on an unresolved #299 to implement full support for the `LargeList` data type in Apache Arrow JavaScript bindings. `LargeList` uses 64-bit offsets (`BigInt64Array`) instead of 32-bit offsets, enabling list values larger than 2GB.

Where possible, the code size was reduced by distilling helpers used in both `List` and `LargeList`.

Related Issues

Closes #70

Implementation Details

Core Type System

- `Type.LargeList = 21` enum value
- `LargeList<T>` class with `BigInt64Array` offset support
- `DataType.isLargeList()` type guard
- `LargeListDataProps` interface and `MakeDataVisitor.visitLargeList` (widens 32-bit offsets via `toBigInt64Array`)
- `LargeList` and `LargeListBuilder` mapped into `TypeToDataType`, `TypeToBuilder`, and `DataTypeToBuilder` in `interfaces.ts`

Visitor Pattern Implementation

Wired `visitLargeList()` across every visitor, factoring shared helpers where the offset width was the only difference:

- `GetVisitor` / `SetVisitor`: merged `getList` / `setList` into single helpers using `bigIntToNumber` at the offset boundary, so one implementation covers both List and LargeList
- `IteratorVisitor`, `IndexOfVisitor`: register `visitLargeList` (the generic implementations are offset-width agnostic)
- `TypeComparator`: widened compareList to `List | LargeList` (structural comparison only)
- `VectorAssembler`: generalized `assembleListVector` to coerce begin/end via `bigIntToNumber`; registers `visitLargeList`
- `VectorLoader`: `visitLargeList` mirrors `visitList`; base `readOffsets` already honors `OffsetArrayType` (`BigInt64Array`)
- `JSONVectorAssembler`: emits `OFFSET` via `bigNumsToStrings`, matching the `LargeUtf8` / `LargeBinary` pattern
- `TypeAssembler` / `JSONTypeAssembler`: `FlatBuffers` + JSON type serialization

IPC Support

- `ipc/metadata/message.ts`: `decodeFieldType` handles `Type.LargeList`

Latent Bug Fix

- `util/buffer.ts`: `rebaseValueOffsets` now coerces its number offset to `BigInt` when the offsets array is `BigInt64Array`. Previously a non-zero offset on a 64-bit offsets array would `TypeError` on bigint += number; the fix is required for `LargeList` IPC writes on sliced data, and also fixes the same latent issue for `LargeUtf8` / `LargeBinary`.

Builders

- `src/builder/largelist.ts` (`LargeListBuilder`), mirroring `ListBuilder` with `BigInt()` for offset accumulation and `Number()` coercion when passing the start index to `child.set`
- `VariableWidthBuilder` bound widened to include `LargeList` in `builder.ts`
- `GetBuilderCtor.visitLargeList` returns `LargeListBuilder`

Testing

- `test/generate-test-data.ts`: `generateListLike` helper used by both `generateList` (Int32) and `generateLargeList` (BigInt64)
- `createVariableWidthOffsets64` truncates `min` / `max` at entry so fractional stride from `childVec.length / (length - nullCount)` doesn't `RangeError` in `BigInt()`
- `test/unit/generated-data-tests.ts`: `LargeList` added to the matrix
- `test/unit/builders/builder-tests.ts`: `LargeListBuilder` entry added alongside `ListBuilder` / `FixedSizeListBuilder` / `MapBuilder`
- `test/unit/visitor-tests.ts`: `visitLargeList` added to `BasicVisitor` / `FeatureVisitor` and to both describe matrices

Public API

- `LargeList` and `LargeListBuilder` exported from `src/Arrow.ts` and `src/Arrow.dom.ts`

Test Plan

All existing tests continue to pass, plus the `LargeList` path is exercised by:

- get/set/iterator/indexOf/slice/concat/IPC round-trip
- `BasicVisitor` + `FeatureVisitor`
- `JSONVectorAssembler` / `JSONVectorLoader`

All tests across 45 suites pass.

The tests were run with:

Checklist

- get/set/iterator/indexOf/TypeComparator/VectorAssembler/VectorLoader/JSONVectorAssembler/TypeAssembler/JSONTypeAssembler
- `LargeListBuilder` added and wired through `GetBuilderCtor` + `interfaces.ts`
- `rebaseValueOffsets` bigint bug fixed

Notes

- Full `LargeList` support: IPC read/write (binary + JSON form), in-memory access and mutation, type comparison, and construction via `LargeListBuilder`, parallel to the existing `List` type, just with 64-bit offsets.
- Offsets stay 64-bit (`BigInt64Array` end-to-end). The only narrowing happens at JS-runtime boundaries where `Data.slice` accepts number, identical to the `LargeUtf8` / `LargeBinary` policy upstream
- Helpers are shared between `List` / `LargeList` only where the offset width was the sole difference and `bigIntToNumber` coercion at the boundary made the merge non-confusing; `LargeListBuilder` stays separate because the `BigInt()` / `Number()` coercions in `_flushPending` would obscure a merged version