feat: Add LargeList support#438

Merged

kou merged 4 commits into

apache:main from

Karakatiza666:main

Jun 5, 2026


Conversation

@Karakatiza666

@Karakatiza666 Karakatiza666 commented May 19, 2026 (edited)

Contributor

This PR was co-authored with Claude Code.

Summary

This PR builds on the unresolved #299 to implement full support for the LargeList data type in the Apache Arrow JavaScript bindings. LargeList uses 64-bit offsets (BigInt64Array) instead of 32-bit offsets, enabling list values larger than 2GB.

Where possible, code size was reduced by factoring out helpers shared by List and LargeList.
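
To make the offset-width difference concrete, here is a minimal sketch using plain typed arrays rather than the Arrow JS API. The names `child`, `row32`, and `row64` are hypothetical and exist only for this illustration.

```typescript
// Illustrative only: plain typed arrays, not the Arrow JS classes.
// Three list rows [[1, 2], [], [3, 4, 5]] stored as one flat child array.
const child = new Float64Array([1, 2, 3, 4, 5]);

// List layout: (rows + 1) offsets, 32-bit signed.
const offsets32 = new Int32Array([0, 2, 2, 5]);

// LargeList layout: the same offsets, 64-bit signed.
const offsets64 = BigInt64Array.from(offsets32, (o) => BigInt(o));

function row32(i: number): Float64Array {
  return child.subarray(offsets32[i], offsets32[i + 1]);
}

function row64(i: number): Float64Array {
  // subarray() takes numbers, so the bigint offsets are narrowed here.
  return child.subarray(Number(offsets64[i]), Number(offsets64[i + 1]));
}

// A value one past the Int32 maximum wraps in a 32-bit array
// but is stored exactly in a 64-bit one.
const big = 2n ** 31n;
const wrapped = new Int32Array([Number(big)])[0];
const kept = new BigInt64Array([big])[0];

console.log(row32(2), row64(2), wrapped, kept);
```

Both accessors return the same rows; only the storage width of the offsets differs.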

Related Issues

Closes #70

Implementation Details

Core Type System

Added Type.LargeList = 21 enum value
Implemented LargeList<T> class with BigInt64Array offset support
Added DataType.isLargeList() type guard
Added LargeListDataProps interface and MakeDataVisitor.visitLargeList (widens 32-bit offsets via toBigInt64Array)
Mapped LargeList and LargeListBuilder into TypeToDataType, TypeToBuilder, and DataTypeToBuilder in interfaces.ts

Visitor Pattern Implementation

Wired visitLargeList() across every visitor, factoring shared helpers where the offset width was the only difference:

GetVisitor / SetVisitor: merged getList / setList into single helpers using bigIntToNumber at the offset boundary — one implementation covers both List and LargeList
IteratorVisitor, IndexOfVisitor: register visitLargeList (the generic implementations are offset-width agnostic)
TypeComparator: widened compareList to List | LargeList (structural comparison only)
VectorAssembler: generalized assembleListVector to coerce begin/end via bigIntToNumber; registers visitLargeList
VectorLoader: visitLargeList mirrors visitList; base readOffsets already honors OffsetArrayType (BigInt64Array)
JSONVectorAssembler: emits OFFSET via bigNumsToStrings, matching the LargeUtf8 / LargeBinary pattern
TypeAssembler / JSONTypeAssembler: FlatBuffers + JSON type serialization
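
As a rough illustration of how one helper can serve both offset widths, the sketch below narrows offsets at the boundary and fails loudly outside the safe integer range. `toSafeNumber` and `getListRange` are hypothetical stand-ins written for this example, not the helpers in the PR.

```typescript
type Offsets = Int32Array | BigInt64Array;

const MAX_SAFE = BigInt(Number.MAX_SAFE_INTEGER);
const MIN_SAFE = BigInt(Number.MIN_SAFE_INTEGER);

// Hypothetical narrowing helper: numbers pass through, bigints are
// converted only if the conversion is lossless.
function toSafeNumber(value: number | bigint): number {
  if (typeof value === "number") return value;
  if (value > MAX_SAFE || value < MIN_SAFE) {
    throw new RangeError(`${value} is outside the safe integer range`);
  }
  return Number(value);
}

// One implementation for List (Int32Array) and LargeList (BigInt64Array):
// the offset width only matters at the point where offsets become indices.
function getListRange(offsets: Offsets, index: number): [number, number] {
  return [toSafeNumber(offsets[index]), toSafeNumber(offsets[index + 1])];
}

console.log(getListRange(new Int32Array([0, 2, 5]), 1));
console.log(getListRange(new BigInt64Array([0n, 2n, 5n]), 1));
```

The returned pair can be handed to any API that takes `number` indices, such as `subarray` or a slice on the child data.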

IPC Support

ipc/metadata/message.ts: decodeFieldType handles Type.LargeList
Read and write paths both round-trip via the assembler/loader registrations above
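
For the JSON form, 64-bit offsets have to travel as strings because `JSON.stringify` rejects bigint values. The sketch below shows that round trip with two hypothetical helpers, `offsetsToJSON` and `offsetsFromJSON`; they are not the library's `bigNumsToStrings`.

```typescript
// Hypothetical helpers for illustration only.
function offsetsToJSON(offsets: BigInt64Array): string[] {
  return Array.from(offsets, (o) => o.toString());
}

function offsetsFromJSON(strings: readonly string[]): BigInt64Array {
  return BigInt64Array.from(strings, (s) => BigInt(s));
}

// The last offset is deliberately above 2^53 to show that strings stay exact.
const original = new BigInt64Array([0n, 3n, 2n ** 53n + 1n]);
const encoded = JSON.stringify({ OFFSET: offsetsToJSON(original) });
const decoded = offsetsFromJSON((JSON.parse(encoded) as { OFFSET: string[] }).OFFSET);

// Passing raw bigints to JSON.stringify throws a TypeError.
let rawBigIntFails = false;
try {
  JSON.stringify({ OFFSET: [1n] });
} catch (e) {
  rawBigIntFails = e instanceof TypeError;
}

console.log(encoded, decoded, rawBigIntFails);
```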

Latent Bug Fix

util/buffer.ts: rebaseValueOffsets now coerces its number offset to BigInt when the offsets array is BigInt64Array. Previously a non-zero offset on a 64-bit offsets array would TypeError on bigint += number — required for LargeList IPC writes on sliced data, and also fixes the same latent issue for LargeUtf8 / LargeBinary.
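
The failure mode is easy to reproduce outside the library. The following is a minimal sketch, assuming a simplified signature; `rebaseOffsets` is a hypothetical stand-in and does not mirror the actual `rebaseValueOffsets` parameters.

```typescript
type Offsets = Int32Array | BigInt64Array;

// Hypothetical helper: returns a copy of `offsets` shifted by `delta`.
function rebaseOffsets(offsets: Offsets, delta: number): Offsets {
  if (offsets instanceof BigInt64Array) {
    const d = BigInt(delta); // coerce once; bigint + number would throw
    return offsets.map((o) => o + d);
  }
  return offsets.map((o) => o + delta);
}

// Mixing bigint and number in arithmetic is a runtime TypeError.
let mixedArithmeticFails = false;
try {
  const a: any = 10n; // `any` only to get the invalid expression past the type checker
  console.log(a + 1);
} catch (e) {
  mixedArithmeticFails = e instanceof TypeError;
}

console.log(rebaseOffsets(new Int32Array([0, 2, 5]), 3));
console.log(rebaseOffsets(new BigInt64Array([0n, 2n, 5n]), 3));
console.log(mixedArithmeticFails);
```

Branching on the array type keeps the 32-bit path untouched and confines the coercion to the 64-bit path.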

Builders

New src/builder/largelist.ts (LargeListBuilder), mirroring ListBuilder with BigInt() for offset accumulation and Number() coercion when passing the start index to child.set
Widened VariableWidthBuilder bound to include LargeList in builder.ts
GetBuilderCtor.visitLargeList returns LargeListBuilder
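
A toy version of the accumulation pattern described above may help: offsets grow in bigint precision, and the child cursor is narrowed to `number` only where an index is needed. `TinyLargeListBuilder` is a hypothetical class for illustration and does not reproduce `LargeListBuilder`.

```typescript
// Hypothetical mini-builder; not the class added in this PR.
class TinyLargeListBuilder<T> {
  private readonly offsets: bigint[] = [0n];
  private readonly child: T[] = [];

  append(values: readonly T[]): this {
    const last = this.offsets[this.offsets.length - 1];
    // Number() narrowing: the child array is indexed by number.
    const start = Number(last);
    for (let i = 0; i < values.length; i++) {
      this.child[start + i] = values[i];
    }
    // BigInt() widening: the running offset stays in 64-bit precision.
    this.offsets.push(last + BigInt(values.length));
    return this;
  }

  finish(): { offsets: BigInt64Array; child: T[] } {
    return { offsets: BigInt64Array.from(this.offsets), child: [...this.child] };
  }
}

const built = new TinyLargeListBuilder<number>()
  .append([1, 2])
  .append([])
  .append([3])
  .finish();

console.log(built.offsets, built.child);
```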

Testing

test/generate-test-data.ts:
- Factored a shared generateListLike helper used by both generateList (Int32) and generateLargeList (BigInt64)
- Added createVariableWidthOffsets64; truncates min / max at entry so fractional stride from childVec.length / (length - nullCount) doesn't RangeError in BigInt()
test/unit/generated-data-tests.ts: LargeList added to the matrix
test/unit/builders/builder-tests.ts: LargeListBuilder entry added alongside ListBuilder / FixedSizeListBuilder / MapBuilder
test/unit/visitor-tests.ts: visitLargeList added to BasicVisitor / FeatureVisitor and to both describe matrices
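
The RangeError mentioned for the offset generator comes from `BigInt()` refusing non-integer numbers. Below is a small sketch of the truncate-first approach; `makeOffsets64` is a hypothetical generator, only loosely modeled on the description above, and assumes `rows` is greater than zero.

```typescript
// Hypothetical generator: evenly strided 64-bit offsets for `rows` lists.
function makeOffsets64(rows: number, childLength: number): BigInt64Array {
  // childLength / rows may be fractional; truncate before calling BigInt().
  const stride = BigInt(Math.trunc(childLength / rows));
  const offsets = new BigInt64Array(rows + 1);
  for (let i = 1; i <= rows; i++) {
    offsets[i] = offsets[i - 1] + stride;
  }
  return offsets;
}

// BigInt() throws a RangeError for a fractional number such as 10 / 4.
let fractionalThrows = false;
try {
  BigInt(10 / 4);
} catch (e) {
  fractionalThrows = e instanceof RangeError;
}

console.log(makeOffsets64(4, 10), fractionalThrows);
```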

Public API

Exported LargeList and LargeListBuilder from src/Arrow.ts and src/Arrow.dom.ts

Test Plan

All existing tests continue to pass, plus the LargeList path is exercised by:

✅ Generated-data matrix: get / set / iterator / indexOf / slice / concat / IPC round-trip
✅ Builder matrix: no-nulls / with-nulls / length=518
✅ Visitor dispatch (BasicVisitor + FeatureVisitor)
✅ IPC stream round-trip (16 IPC suites green, including JSON form via JSONVectorAssembler / JSONVectorLoader)

All tests across 45 suites pass.

The tests were run with:

npx jest --config jestconfigs/jest.src.config.js

Checklist

Implementation follows existing code patterns
All visitor methods implemented (get / set / iterator / indexOf / TypeComparator / VectorAssembler / VectorLoader / JSONVectorAssembler / TypeAssembler / JSONTypeAssembler)
IPC serialization/deserialization support added (binary + JSON form)
LargeListBuilder added and wired through GetBuilderCtor + interfaces.ts
Latent rebaseValueOffsets bigint bug fixed
Comprehensive tests added using existing test framework
All tests passing
Public API exports added
No breaking changes

Notes

This implementation provides full LargeList support: IPC read/write (binary + JSON form), in-memory access and mutation, type comparison, and construction via LargeListBuilder — parallel to the existing List type, just with 64-bit offsets.
Storage and wire format are honest 64-bit (BigInt64Array end-to-end). The only narrowing happens at JS-runtime boundaries where Data.slice accepts number — identical to the LargeUtf8 / LargeBinary policy upstream
Helpers were merged across List/LargeList only where the offset width was the sole difference and bigIntToNumber coercion at the boundary made the merge non-confusing; LargeListBuilder stays separate because the BigInt() / Number() coercions in _flushPending would obscure a merged version
Another relevant PR contains a subset of the changes here but has a different scope (it includes changes relevant to BinaryView, Utf8View, ListView, and LargeListView): feat: Add LargeList type support #325

@Karakatiza666 Karakatiza666 mentioned this pull request

May 19, 2026

Switch ad-hoc queries to arrow_ipc format feldera/feldera#4240

Merged

@Karakatiza666

Karakatiza666 commented May 19, 2026 (edited)

Contributor Author

As for the ASF Generative Tooling Guidance:

Anthropic's Commercial Terms still state:

Anthropic agrees that Customer (a) retains all rights to its Inputs, and (b) owns its Outputs.

So, I can confirm that:

The terms and conditions of the generative AI tool do not place any restrictions on use of the output that would be inconsistent with the Open Source Definition.
The output is not copyrightable subject matter (and would not be even if produced by a human).
No third party materials are included in the output.

@kou kou requested a review from Copilot

May 20, 2026 01:25

Copilot started reviewing on behalf of kou

May 20, 2026 01:25


@kou

kou commented May 20, 2026


Member

@kylebarron @supermacro @pmaciolek @GeorgeLeePatterson Could you review this?
(You're in related issue/PR: #70 #299)

Copilot AI reviewed

May 20, 2026

Copilot AI left a comment

Pull request overview

This PR adds end-to-end support for the Arrow LargeList data type (64-bit offsets via BigInt64Array) to the JavaScript bindings, including IPC round-tripping, visitor dispatch, builders, and test coverage.

Changes:

Introduces Type.LargeList and LargeList<T> with 64-bit offset handling, plus DataType.isLargeList() and makeData() support.
Wires visitLargeList() across core visitors (get/set/iterator/indexOf, assemblers/loaders, type comparator, JSON + FlatBuffers type/vec assembly) and IPC field-type decoding.
Adds LargeListBuilder and expands tests to include LargeList in generated-data, builder, and visitor matrices.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.


File	Description
`src/type.ts`	Adds `LargeList<T>` type and `DataType.isLargeList()` guard.
`src/enum.ts`	Adds `Type.LargeList = 21`.
`src/data.ts`	Adds `LargeListDataProps` + `MakeDataVisitor.visitLargeList` (BigInt64 offsets).
`src/visitor.ts`	Adds `visitLargeList` dispatch and dtype inference support.
`src/ipc/metadata/message.ts`	Decodes `LargeList` field types from IPC metadata.
`src/visitor/*`	Implements and registers `visitLargeList` across visitors (loader/assembler/get/set/etc.).
`src/util/buffer.ts`	Fixes `rebaseValueOffsets` to work with `BigInt64Array`.
`src/builder/largelist.ts`	Introduces `LargeListBuilder`.
`src/builder.ts` / `src/interfaces.ts` / `src/visitor/builderctor.ts`	Wires `LargeListBuilder` into builder/type mappings and ctor selection.
`src/Arrow.ts` / `src/Arrow.dom.ts`	Exports `LargeList` and `LargeListBuilder` as public API.
`test/generate-test-data.ts` + unit tests	Adds LargeList generators and includes LargeList in test matrices.


src/builder/largelist.ts

Comment on lines +45 to +50

    const v = value as T['TValue'];
    const n = v.length;
    const start = Number(offsets.set(index, BigInt(n)).buffer[index]);
    for (let i = -1; ++i < n;) {
        child.set(start + i, v[i]);
    }

@Karakatiza666 Karakatiza666 May 20, 2026

Contributor Author

Good catch, fixed

@Karakatiza666 @claude


 feat: Add LargeList support for JavaScript bindings

050a2ec

Implement full support for the LargeList data type, which uses 64-bit
offsets (BigInt64Array) instead of 32-bit offsets, enabling list values
larger than 2GB.
Type and data:
- Add LargeList type class with BigInt64Array offset support,
 DataType.isLargeList guard, and Type.LargeList enum entry
- Add MakeDataVisitor.visitLargeList and LargeListDataProps overload in
 data.ts (widens 32-bit offsets via toBigInt64Array)
- Map LargeList through TypeToDataType, TypeToBuilder, and
 DataTypeToBuilder in interfaces.ts
Visitors (read/write/compare/mutate):
- visitor.ts: base visitLargeList + Type.LargeList dispatch in
 getVisitFnByTypeId and inferDType
- get.ts: merge getList/getLargeList into a single helper using
 bigIntToNumber at the offset boundary (works for both Int32Array and
 BigInt64Array offsets)
- set.ts: same merge for setList; register visitLargeList
- iterator.ts, indexof.ts: register visitLargeList (vectorIterator /
 indexOfValue work unchanged)
- typecomparator.ts: widen compareList to List | LargeList and register
 for visitLargeList (structural comparison is offset-width agnostic)
- typeassembler.ts, jsontypeassembler.ts: emit LargeList flatbuffer
 node and JSON name
- vectorloader.ts: visitLargeList mirrors visitList; base readOffsets
 honors OffsetArrayType (BigInt64Array for LargeList)
- vectorassembler.ts: generalize assembleListVector to cover LargeList
 via bigIntToNumber, register visitLargeList
- jsonvectorassembler.ts: visitLargeList emits OFFSET via
 bigNumsToStrings, matching LargeUtf8 / LargeBinary
- ipc/metadata/message.ts: decodeFieldType handles Type.LargeList
Builders:
- New src/builder/largelist.ts (LargeListBuilder), mirroring ListBuilder
 with BigInt() for offset accumulation and Number() coercion when
 passing the start index to child.set
- Widen VariableWidthBuilder bound to include LargeList in builder.ts
- builderctor.ts: GetBuilderCtor.visitLargeList returns LargeListBuilder
- Export LargeListBuilder from Arrow.ts and Arrow.dom.ts
Latent bug fix:
- util/buffer.ts: rebaseValueOffsets now coerces its number offset to
 BigInt when the offsets array is BigInt64Array. Previously a non-zero
 offset on a 64-bit offsets array would TypeError on bigint += number;
 fix is required for LargeList IPC writes with non-zero slice offsets
 and also fixes the same latent issue on LargeUtf8 / LargeBinary
Tests:
- generate-test-data.ts: factor a shared generateListLike helper used
 by both generateList (Int32Array offsets) and generateLargeList
 (BigInt64Array offsets); truncate min/max in
 createVariableWidthOffsets64 so fractional stride values don't
 RangeError in BigInt()
- generated-data-tests.ts: LargeList case added to the matrix
- builders/builder-tests.ts: LargeListBuilder entry added alongside
 ListBuilder / FixedSizeListBuilder / MapBuilder
- visitor-tests.ts: visitLargeList added to BasicVisitor / FeatureVisitor
 plus describe entries; fix missing comma in the import list that
 would have broken compilation
API surface:
- Export LargeList and LargeListBuilder from Arrow.ts and Arrow.dom.ts
The implementation follows existing code patterns. All tests pass.
Closes apache#70
Co-Authored-By: Claude Code <noreply@anthropic.com>

@Karakatiza666 Karakatiza666 force-pushed the main branch from 6246493 to 050a2ec Compare

May 20, 2026 06:58

domoritz

domoritz reviewed

May 20, 2026

@domoritz domoritz left a comment

Member

Could you compare with #299 and explain what's different?

@kou kou requested a review from Copilot

May 21, 2026 00:50

Copilot started reviewing on behalf of kou

May 21, 2026 00:50


Copilot AI reviewed

May 21, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

src/visitor/jsonvectorassembler.ts

    }
    public visitLargeList<T extends LargeList>(data: Data<T>) {
        return {
            'OFFSET': [...bigNumsToStrings(data.valueOffsets, 2)],

@Karakatiza666 Karakatiza666 May 21, 2026

Contributor Author

This is a real, but pre-existing bug. I will open a separate PR for it. I am not doing the suggested workaround here.

@Karakatiza666 Karakatiza666 May 21, 2026

Contributor Author

This bug is unrelated to this PR and does not block it; it also affects multiple other types.

@Karakatiza666 Karakatiza666 May 21, 2026

Contributor Author

Tracked in #439

@Karakatiza666

Karakatiza666 commented May 21, 2026 (edited)

Contributor Author

Could you compare with #299 and explain what's different?

I started this PR by repeating the original #299 verbatim, and then built on it to reach full parity with every feature available for the List type, while taking the feedback on the original PR into account.

The scope of #299 - a LargeList type stub, not even quite an MVP: you could manually build a LargeList column and read individual values out of one already in memory. Anything beyond that - saving it to an Arrow file, loading it in IPC binary or JSON form, comparing two LargeList schemas, modifying values, or constructing one through the normal Builder API - was either silently broken or missing entirely.

The scope of this PR - LargeList as a fully-featured column type, on par with List. You can build one, read and write values, compare schemas, iterate, slice, serialize it to the Arrow IPC binary or JSON form, and round-trip it back - all with safe handling of 64-bit offsets and clear errors (rather than silent corruption) if a value would exceed JavaScript's safe integer range. Existing List, LargeUtf8, and LargeBinary types also get a latent slicing-on-write bug fixed as a side effect.

Added on top of the original PR

IPC write path — VectorAssembler.visitLargeList + generalized assembleListVector with bigIntToNumber coercion
IPC read path — VectorLoader.visitLargeList
JSON form — JSONVectorAssembler.visitLargeList (OFFSET via bigNumsToStrings)
Type comparison — TypeComparator.visitLargeList (widened compareList to List | LargeList)
Mutation — SetVisitor.visitLargeList (merged into setList via bigIntToNumber)
Iterator interface declaration — completes IteratorVisitor typing
LargeListBuilder — new src/builder/largelist.ts with BigInt() accumulation and bigIntToNumber for safe child-cursor narrowing
Builder plumbing — widened VariableWidthBuilder bound; GetBuilderCtor.visitLargeList; LargeList / LargeListBuilder mapped into TypeToDataType, TypeToBuilder, DataTypeToBuilder
Public API — exported LargeListBuilder from Arrow.ts / Arrow.dom.ts
Latent bug fix — rebaseValueOffsets now bigint-safe (also fixes silent breakage on LargeUtf8 / LargeBinary sliced writes)
"Get" path consolidation — getList / getLargeList merged with bigIntToNumber boundary coercion
Test generator robustness — shared generateListLike; createVariableWidthOffsets64 truncates min / max on entry so fractional stride doesn't RangeError in BigInt()
Test coverage — LargeListBuilder entry in builder-tests.ts; visitLargeList in BasicVisitor / FeatureVisitor and both describe matrices in visitor-tests.ts

@GeorgeLeePatterson

GeorgeLeePatterson commented May 21, 2026


Contributor

It looks like a lot of this PR's code overlaps with #325, a PR I created back in November. Happy to see it land finally, additional work was needed.

Could you add a reference to #325 in the PR description / commit, and call out the parts that genuinely differ?

@GeorgeLeePatterson

GeorgeLeePatterson commented May 21, 2026


Contributor

I noticed test coverage is missing. The tests missing are:

slicing
IPC/JSON
overflow tests

Also, the LargeList class has no public-facing JSDoc on overflow semantics; it would be helpful to provide this.

@Karakatiza666

Karakatiza666 commented May 21, 2026 (edited)

Contributor Author

and call out the parts that genuinely differ

Sure, I'll add a reference, but what is the purpose of going through #325 to compile a list of differences from it? Its scope went beyond LargeList, which is why I did not consider it as a base for the new PR. Are you suggesting it may have some nice-to-have functionality that this PR does not account for?

Happy to see it land finally

Not yet, fingers crossed :)

@Karakatiza666


 Review fixes

199b996

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>

@Karakatiza666

Karakatiza666 commented May 21, 2026


Contributor Author

@GeorgeLeePatterson , thanks for the pointer!
Unless you mean something else, slicing is covered by validateVector, run from test/unit/generated-data-tests.ts.
Addressed other points.

Also, I noticed that I had missed a case for typeFromJSON in src/ipc/metadata/json.ts, and one in src/visitor/typector.ts. The only relevant thing in #325 that was missing here was the change in typector.ts. Otherwise this PR looks like a superset of #325 in terms of the LargeList implementation; this PR does not include the extra LargeListView tests or other *ListView additions.

@kou kou requested a review from Copilot

May 21, 2026 20:42

@kou kou changed the title from "feat: Add LargeList support for JavaScript bindings" to "feat: Add LargeList support"

May 21, 2026

Copilot started reviewing on behalf of kou

May 21, 2026 20:42


@kou kou requested a review from trxcllnt

May 21, 2026 20:43

Copilot AI reviewed

May 21, 2026

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

trxcllnt

trxcllnt reviewed

May 22, 2026


src/visitor/vectorassembler.ts

Comment on lines +242 to +243

    const begin = bigIntToNumber(valueOffsets[0]);
    const end = bigIntToNumber(valueOffsets[length]);

@trxcllnt trxcllnt May 22, 2026 (edited)

Contributor

If the Arrow spec's LargeList offset type is int64_t, then it should be in the JS impl too. Clamping offsets to 53 bits isn't LargeList, it's a KindaLargeList.

@Karakatiza666 Karakatiza666 May 22, 2026

Contributor Author

The same tradeoff is already applied to multiple 64-bit-based types. I'll comment on the details in a bit, but preliminarily: adding true 64-bit indexing is a much larger scope, and would apply to more than just LargeList.

@trxcllnt trxcllnt May 22, 2026

Contributor

Yes, if it was less work we would have already done it. Correctness is not the place to compromise when there is a spec and an ecosystem of other implementations we need to interoperate with.

I either wasn't involved or skimmed reviewing the LargeUtf8 and LargeBinary PRs, but I would have said the same thing there. Luckily most of the helper methods that need to be reimplemented in terms of bigint are straightforward and can be reused for LargeUtf8/LargeBinary.

trxcllnt

trxcllnt requested changes

May 22, 2026

@trxcllnt trxcllnt left a comment

Contributor

Need to use BigInt offsets everywhere, don't convert the BigInts to Numbers.

@Karakatiza666

Karakatiza666 commented May 22, 2026 (edited)

Contributor Author

@trxcllnt I appreciate you raising this as I take the spec compliance point seriously, but I want to lay out why I think the full 64-bit indexing is a much larger conversation that goes way beyond the scope of this PR, and is not a prerequisite for landing it.

TLDR: The Number precision is not the bottleneck for what can be represented using the Large* types, the current project architecture is.

After analyzing the problem it seems to me you underestimate the amount of combined design, discussion and implementation effort needed to implement full 64-bit indexing.
The constraint isn't the offset arithmetic - BigInt64Array already stores offsets at full 64-bit precision in this PR, and the helpers around bigint are indeed straightforward.

The constraint is on the child elements buffer - the contiguous buffer holding the actual list elements that the offsets point into. All rows of a list-like column in one Arrow batch share a single such buffer, and that buffer is a JavaScript ArrayBuffer, which every major engine caps at roughly 2^32 bytes. Since offset values are positions into this buffer, the maximum offset value is bounded by the buffer's element count (~2^32 / sizeof(element)) - many orders of magnitude below 2^53, let alone 2^63. So the int64 offset width is headroom the JS runtime physically can't fill in a single contiguous child buffer.

To fully support LargeList rows beyond that cap, the child can't be a single Data's elements buffer - it has to be chunked across multiple Data objects behind a Vector-style rope. That's a change to a load-bearing invariant (Data.children[0] is a single contiguous Data) that the entire visitor framework relies on, and it would also need to flow into IPC read/write semantics, since the wire writer today assumes one contiguous child per batch.

The same reasoning applies to LargeUtf8 and LargeBinary: their backing values buffer is a Uint8Array, which has the same per-buffer ceiling. So the 53-bit narrowing isn't a quirk introduced by this PR - it's the existing project-wide policy for every Large* type, and matching it is what I meant by "feature parity with List".

At the same time, the same buffer-size constraint also bounds plain List. The spec allows List up to 2^31 child elements, but JS's per-ArrayBuffer cap means a List<Float64> tops out around 2^29 elements - a quarter of the spec ceiling. So the property "JS implementation matches the spec's offset range" isn't currently true for List either, and bringing it up to that bar means the same chunked-children redesign, applied across List/LargeList/LargeUtf8/LargeBinary together. That's the design discussion I'd want to open as a separate issue rather than fold it into a wiring PR.

Unfortunately, designing and shepherding through a chunked-children rework - across all four types, across the visitor framework, the Builder layer, and the IPC reader/writer - is beyond what I can take on, whether as a standalone PR or, especially, as part of this PR. I'm comfortable leaving this PR open and rebasing it once the community settles on a design for compliant >2^32-byte child storage that applies uniformly to the Large* family.

Otherwise, I maintain that the current PR stands on its own:

it's fully featured;
it introduces no new constraints or regressions and follows the existing implementation and API surface conventions;
it is very useful to the community to close the feature gap despite not being fully compliant on the maximum element count - today the library throws when it encounters a LargeList.

P.S. conforming producers should normalize the offsets in a batch, so in practice as long as the batch spans no more elements than can fit in the child buffer the current 2^53 arithmetic is unused headroom.

@trxcllnt

trxcllnt commented May 22, 2026 (edited)

Contributor

v8 supports ArrayBuffers larger than 4GiB:

# Allocate a 64GiB ArrayBuffer
$ { node -p 'var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024' & }; pid=$!; sleep 1; ps -p $pid -o vsz,rss,cmd; wait $pid
[1] 1431480
 VSZ RSS CMD
68109056 41444 node -p var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024
67108864
[1]+ Done node -p 'var a = new ArrayBuffer(2**36); setTimeout(() => {}, 2000); a.byteLength / 1024'

And it's not unusual for other language implementations to create RecordBatches larger than 4GiB (e.g. by using table.combine_chunks() from Python).

That said, it's unlikely anyone needs to address 8192 TiB of memory. I agree with your point that 53-bit ints are fine for indices, thanks for walking me through that 😅.
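
The figures in this exchange can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch; the variable names are made up for the example.

```typescript
const KiB = 2 ** 10;
const GiB = 2 ** 30;
const TiB = 2 ** 40;

// The ArrayBuffer allocated in the shell session above: 2^36 bytes.
const demoBytes = 2 ** 36;
const demoGiB = demoBytes / GiB; // size in GiB
const demoKiB = demoBytes / KiB; // the value the node command prints

// Largest integer a JS number represents exactly is 2^53 - 1,
// so byte positions up to 2^53 cover this many TiB:
const safeRangeTiB = (Number.MAX_SAFE_INTEGER + 1) / TiB;

// For comparison, the int64 offset type in the Arrow spec tops out at 2^63 - 1.
const int64Max = 2n ** 63n - 1n;
const headroomBits = 63 - 53; // bits of int64 range a safe integer cannot express

console.log({ demoGiB, demoKiB, safeRangeTiB, int64Max, headroomBits });
```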

JS's per-ArrayBuffer cap means a List tops out around 2^29 elements - a quarter of the spec ceiling

Could you explain this more? List's offsets are Int32Array, which means its child should be able to support up to 2^31 individual elements. How did you arrive at 2^29 for the max child vector length?

edit:

I think I understand the confusion. You're saying if ArrayBuffer size was limited to 4GiB, then you could only represent 2^(32-4) 32-bit integers. Even if that were true, that doesn't mean the List child length is constrained, it would mean the number of lists that the ListVector represented would be constrained.

The child could have 2GiB of uint8_t values, and the ListVector offsets buffer could contain (2^28)+1 offsets, but the offset values themselves are still in the range 0..2^31. This is true in all other Arrow implementations as well.

@Karakatiza666

Karakatiza666 commented May 25, 2026


Contributor Author

Fair point — my List<Float64> analogy conflated two different things, and your clarification helps. Compliance is about correctly interpreting offsets the wire format presents, not about the implementation realizing enough child storage to exercise the full offset range. Under that definition, List is conforming today (int32 offset values all fit in Number), and LargeList/LargeUtf8/LargeBinary aren't, because bigIntToNumber throws on offset values > 2^53.

One thing I wasn't sure about in your reply: when you said "that doesn't mean the List child length is constrained, it would mean the number of lists that the ListVector represented would be constrained" I couldn't tell whether you were pointing at a mechanism in the codebase that sidesteps the single-ArrayBuffer limit on child storage — I didn't find one, so I focused on the wire-format parsing compliance angle, which is what a new commit I pushed addresses.

The commit: VectorLoader now rebases offsets to 0 on load for LargeList, LargeUtf8, and LargeBinary. After rebasing, in-memory offsets are always bounded by the child buffer's element count (which fits in Number for anything the runtime can allocate), so downstream narrowing succeeds for any spec-conforming wire input — including sliced views with absolute, non-rebased offsets. Inputs whose offsets imply a child buffer larger than JS's ArrayBuffer cap now fail honestly at child-buffer allocation in readData, rather than later at offset narrowing — a cleaner failure mode that's a property of the JS runtime, not of the implementation.

Please let me know if this commit addresses your primary concern.

The remaining ceiling — JS's per-ArrayBuffer cap of ~2^32 bytes on a single contiguous child buffer — is allocation capacity, not interpretation, and it applies uniformly across List/LargeList/LargeUtf8/LargeBinary. Lifting it would mean a chunked-children redesign (child as Vector rather than single Data<U>), which is a substantial design change. As I previously expressed, I think that's a separate, more ambitious effort and deserves its own issue.

@Karakatiza666 Karakatiza666 requested a review from trxcllnt

May 25, 2026 20:57

trxcllnt

trxcllnt reviewed

May 26, 2026


src/visitor/vectorloader.ts Outdated

trxcllnt

trxcllnt reviewed

May 26, 2026


src/visitor/vectorloader.ts Outdated

Comment on lines +173 to +178

    // Rebases int64 offsets to start at 0 so downstream `bigIntToNumber` narrowing
    // always succeeds for spec-conforming wire input. Conforming producers may
    // pre-shift offsets on sliced views, putting absolute values past 2^53; after
    // rebasing, in-memory offsets are bounded by the child buffer's element count
    // (which the JS runtime can allocate anyway). Genuinely too-large wire inputs
    // fail honestly at child-buffer allocation rather than at offset narrowing.

@trxcllnt trxcllnt May 26, 2026

Contributor

I don't understand this comment and implementation. If the producer already rebased the offsets buffer values relative to the slice offset, the first offset value should always be zero. If it isn't, it seems that's the real bug?

@trxcllnt

trxcllnt commented May 26, 2026


Contributor

I couldn't tell whether you were pointing at a mechanism in the codebase that sidesteps the single-ArrayBuffer limit on child storage
...
The remaining ceiling — JS's per-ArrayBuffer cap of ~2^32 bytes on a single contiguous child buffer — is allocation capacity, not interpretation, and it applies uniformly across List/LargeList/LargeUtf8/LargeBinary. Lifting it would mean a chunked-children redesign (child as Vector rather than single Data), which is a substantial design change. As I previously expressed, I think that's a separate, more ambitious effort and deserves its own issue.

As demonstrated by the example I pasted above, there is no uniform 4GiB cap on ArrayBuffer size in JS. The only limit to List child size is that the value offsets type is Int32, thus the child can only contain 2^31 elements. This means the ListVector could represent a column of entirely single-element lists, but there could only be 2^31 of them due to the offset type being Int32, not any limit in ArrayBuffer allocation size.

> I focused on the wire-format parsing compliance angle, which is what a new commit I pushed addresses

I don't think this commit is right. The reader should be zero-copy, and even if it was the correct thing to do, rebasing offsets when reading is not zero-copy.
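
To make the offset-width limit raised in this comment concrete, here is a minimal sketch. The helper names are hypothetical and not part of Arrow JS; the sketch only assumes that List stores its final offset in an Int32Array and LargeList in a BigInt64Array, as described in the PR summary.

```typescript
// Hypothetical helpers (not part of Arrow JS): store a child length as the
// final offset of a list layout and read it back.
function lastOffsetInt32(childLength: number): number {
  const offsets = new Int32Array(2); // [0, childLength]
  offsets[1] = childLength;          // wraps past 2^31 - 1 (ToInt32 conversion)
  return offsets[1];
}

function lastOffsetInt64(childLength: number): bigint {
  const offsets = new BigInt64Array(2);
  offsets[1] = BigInt(childLength);  // int64 holds any safe-integer length
  return offsets[1];
}

const INT32_MAX = 2 ** 31 - 1;
console.log(lastOffsetInt32(INT32_MAX));     // largest child length Int32 offsets can address
console.log(lastOffsetInt32(INT32_MAX + 1)); // negative: the offset overflowed
console.log(lastOffsetInt64(INT32_MAX + 1)); // still exact with 64-bit offsets
```

The ceiling in the first case comes from the offset type alone, independent of how large an ArrayBuffer the runtime can allocate for the child values.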

@Karakatiza666 Karakatiza666 force-pushed the main branch from 11dc172 to 1e0c320

May 31, 2026 18:35

Karakatiza666 commented May 31, 2026

Contributor Author

You're right, I've pushed the updated commit. This version keeps the reader zero-copy for every offset buffer that stays under 2^53 and copies only when the fallback rebase is necessary:

Final offset ≤ 2^53 (common): offsets returned as a zero-copy BigInt64Array view over the wire bytes — no allocation, no copy. Narrowing to number stays lazy (at element access) and is always lossless.
Final offset > 2^53, span ≤ 2^53 (slice with absolute offsets): rebased to 0 in bigint precision into a fresh buffer — the only path that copies — so the values stay narrowable.
Final offset > 2^53, span > 2^53: unreachable in practice — needs a child buffer > 2^53 elements present in the message, far past any engine's allocation limit.
bigIntToNumber 2^53 guard: after the rebase, never tripped by well-formed wire data; only reachable via hand-built in-memory Data (makeData, not rebased by design) — a loud failure for corrupt input.
Offset < 2^53 but past the child buffer: clamps silently in subarray/slice (pre-existing behavior of the built-in methods, unchanged).

@Karakatiza666 Karakatiza666 requested a review from trxcllnt

June 3, 2026 06:18

Karakatiza666 commented Jun 3, 2026

Contributor Author

@trxcllnt I'd appreciate it if you helped me carry this PR through review!

kou commented Jun 4, 2026

Member

Could you check CI failures?

@Karakatiza666 Karakatiza666 force-pushed the main branch from 1e0c320 to 09fbbb1

June 4, 2026 05:10

Karakatiza666 commented Jun 4, 2026

Contributor Author

I resolved the issue: I needed to avoid running the new test under the UMD build. The test is still executed in the other builds.

@Karakatiza666


 fix: Conditionally rebase int64 offsets on load for Large* types

7b0727d

LargeList/LargeUtf8/LargeBinary offsets are read as a zero-copy
BigInt64Array view over the wire bytes and narrowed to a JS number
lazily, at element access. When the final offset exceeds
Number.MAX_SAFE_INTEGER — a slice serialized with absolute, non-rebased
offsets — readLargeOffsets rebases to 0 in bigint precision so the
values stay narrowable. This is the only path that copies; offsets
within the safe-integer range are passed through untouched.
The bigIntToNumber guard remains for offsets > 2^53 that survive to
access (only reachable via hand-built in-memory Data), failing loudly
on corrupt input rather than truncating.
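
The lazy narrowing with a loud guard that this commit message describes can be sketched as follows. This is an illustration under assumptions: `narrowOffset` and `listBounds` are hypothetical names, and the real `bigIntToNumber` in Arrow JS may differ in signature and error type.

```typescript
const MAX_SAFE = BigInt(Number.MAX_SAFE_INTEGER); // 2^53 - 1

// Hypothetical stand-in for a guarded narrowing helper. Without the guard,
// Number(value) would silently round offsets beyond the safe-integer range.
function narrowOffset(value: bigint): number {
  if (value > MAX_SAFE || value < -MAX_SAFE) {
    throw new TypeError(`offset ${value} cannot be represented exactly as a number`);
  }
  return Number(value);
}

// Narrowing happens only at element access: the int64 offsets stay untouched
// in their buffer, and just the two bounds of the requested list are converted.
function listBounds(offsets: BigInt64Array, index: number): [number, number] {
  return [narrowOffset(offsets[index]), narrowOffset(offsets[index + 1])];
}

const offsets = BigInt64Array.of(BigInt(0), BigInt(3), BigInt(7));
console.log(listBounds(offsets, 1)); // bounds of the second list: 3 and 7
```

Keeping the guard at the access site means that out-of-range offsets fail with an error instead of indexing the wrong child elements.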

@Karakatiza666 Karakatiza666 force-pushed the main branch from 09fbbb1 to 7b0727d

June 4, 2026 09:29

trxcllnt reviewed Jun 4, 2026

src/visitor/vectorloader.ts Outdated

Comment on lines +173 to +188

    // Large* types carry int64 offsets. Downstream code narrows each offset to a JS
    // number when indexing buffers, which is lossless for any offset that indexes a
    // buffer the runtime can actually allocate, so the common case is returned as a
    // zero-copy view over the wire bytes, untouched. The exception is a sliced array
    // serialized with absolute (non-rebased) offsets, whose values can exceed
    // Number.MAX_SAFE_INTEGER even when the referenced span is small; only then do we
    // rebase to 0 (the one case that requires a copy) so the offsets stay narrowable.
    protected readLargeOffsets<T extends DataType>(type: T, buffer?: BufferRegion) {
        const offsets = this.readOffsets(type, buffer);
        const wide: BigInt64Array = toBigInt64Array(offsets);
        if (wide.length === 0 || wide.at(-1)! <= BigInt(Number.MAX_SAFE_INTEGER)) {
            return offsets;
        }
        const base = wide[0];
        return wide.map((value) => value - base);
    }

@trxcllnt trxcllnt Jun 4, 2026

Contributor

Are there any language implementations that actually produce an IPC stream with "a sliced array serialized with absolute (non-rebased) offsets, whose values can exceed Number.MAX_SAFE_INTEGER even when the referenced span is small"? IIUC, rebasing valueOffsets to zero is part of IPC writers across the board.

@Karakatiza666 Karakatiza666 Jun 4, 2026

Contributor Author

Claude said the same thing; it sounded like this is more of a concern with manually generated in-memory data. This fallback sounded to me like a low-cost defensive measure; LMK if you'd rather have it fail loudly to avoid adding the fallback logic.

@trxcllnt trxcllnt Jun 4, 2026 (edited)

Contributor

The format spec for variable-sized layouts states:

> Generally the first slot in the offsets array is 0, and the last slot is the length of the values array. When serializing this layout, we recommend normalizing the offsets to start at 0.

While this sounds permissive, I don't know of any language implementations that don't strictly enforce this.

There are many possible ways to construct non-spec-conforming IPC streams, but it either isn't practical to attempt to catch them all, or we assume the user knows and has a very good reason for doing what they're doing.

For example, we also don't validate whether the offsets buffer for List, Utf8, or Binary vectors starts with offset 0, because we assume the IPC stream is from a conforming implementation.

There are a few instances where I've taken advantage of ambiguities or UB in the spec to achieve specific results when working with certain technologies. For example, I've previously implemented a custom in-memory layout with large contiguous buffers on GPUs that included the IPC metadata for reading the contiguous buffers in chunks. I intentionally didn't rebase valueOffsets in that implementation, because the consumer mapped the valueOffsets to VBOs for GPU shaders. While this is technically UB in the spec, the CPU IPC streaming code shouldn't prohibit doing this.

I think we should remove this function and revert to calling readOffsets() like before.
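
The spec recommendation quoted in this comment (normalize offsets to start at 0 when serializing) is a writer-side step. A minimal sketch of what that normalization could look like for a sliced list follows; `normalizeSliceOffsets` is a hypothetical helper for illustration, not the Arrow JS IPC writer.

```typescript
// Hypothetical writer-side helper: compute the offsets and child range for
// rows [start, start + length) of a list array with Int32 offsets.
function normalizeSliceOffsets(offsets: Int32Array, start: number, length: number) {
  const view = offsets.subarray(start, start + length + 1); // zero-copy window
  const base = view[0];
  return {
    offsets: view.map((o) => o - base), // fresh copy whose first offset is 0
    valuesBegin: base,                  // child range the slice refers to
    valuesEnd: view[view.length - 1],
  };
}

// Three lists with child ranges [0, 2), [2, 5), [5, 9); keep the last two.
const sliced = normalizeSliceOffsets(Int32Array.of(0, 2, 5, 9), 1, 2);
console.log(Array.from(sliced.offsets)); // rebased offsets 0, 3, 7
console.log(sliced.valuesBegin, sliced.valuesEnd); // child range 2 to 9
```

When the writer does this, a reader can treat the offsets buffer as a plain view over the wire bytes, which is why the reader in this PR does no rebasing of its own.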

@Karakatiza666 Karakatiza666 Jun 4, 2026 (edited)

Contributor Author

Makes sense. I reverted to readOffsets() and removed readLargeOffsets. No widening of readOffsets was needed: it returns the raw offset bytes and makeData already picks toBigInt64Array for the Large* types, so they still get true int64 offsets.
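
As a rough illustration of why no widening is needed, raw offset bytes can be reinterpreted as int64 values without copying. `viewAsBigInt64` below is a hypothetical stand-in; the real `toBigInt64Array` in Arrow JS may accept more input types and handle alignment differently.

```typescript
// Hypothetical stand-in for a bytes-to-int64 reinterpretation.
function viewAsBigInt64(bytes: Uint8Array): BigInt64Array {
  // Shares the underlying ArrayBuffer, so nothing is copied. byteOffset must
  // be a multiple of 8, otherwise the constructor throws a RangeError.
  return new BigInt64Array(bytes.buffer, bytes.byteOffset, Math.floor(bytes.byteLength / 8));
}

// Stand-in for offset bytes read from an IPC message body.
const source = BigInt64Array.of(BigInt(0), BigInt(3), BigInt(7));
const wireBytes = new Uint8Array(source.buffer);

const wideOffsets = viewAsBigInt64(wireBytes);
console.log(wideOffsets.buffer === wireBytes.buffer); // true: same storage, zero-copy
console.log(wideOffsets.length);                      // 3 int64 offsets from 24 bytes
```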

@Karakatiza666


 Replace 'readLargeOffsets' with standard 'readOffsets'

99f4ea8

Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>

trxcllnt approved these changes Jun 5, 2026

@kou kou merged commit c7accb1 into apache:main

Jun 5, 2026

14 checks passed

Karakatiza666 commented Jun 5, 2026 (edited)

Contributor Author

Thank you for helping to land this PR!

@Karakatiza666 Karakatiza666 mentioned this pull request

Jun 5, 2026

feat: Add LargeList support for JavaScript bindings #299

Closed

7 tasks

Labels

None yet

6 participants

@Karakatiza666 @kou @GeorgeLeePatterson @trxcllnt @domoritz
