-
Notifications
You must be signed in to change notification settings - Fork 23
fix: throw on schema mismatch in RecordBatchStreamWriter.write()#389
fix: throw on schema mismatch in RecordBatchStreamWriter.write() #389rustyconover wants to merge 3 commits into
Conversation
rustyconover
commented
Mar 4, 2026
@raulcd could you please trigger the CI on this too? Thanks.
raulcd
commented
Mar 4, 2026
sure thing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a data-loss bug in RecordBatchStreamWriter.write() where schema mismatches with autoDestroy: true (the default) previously caused silent data loss — the writer would close and discard the batch without any error. The fix throws a descriptive Error instead, and also adds a JSDoc comment for the autoDestroy option.
Changes:
- Modified
write()inRecordBatchWriterto throw a descriptiveErroron schema mismatch whenautoDestroy: true, instead of silently dropping the batch - Added a JSDoc description for the
autoDestroyoption inRecordBatchStreamWriterOptions - Added a new test file with 4 tests covering the schema mismatch behaviors
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/ipc/writer.ts |
Core fix: throws Error on schema mismatch when autoDestroy=true; added autoDestroy JSDoc |
test/unit/ipc/writer/schema-mismatch-tests.ts |
New test file covering schema mismatch throw, multi-stream output, closed writer behavior, and error message content |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right solution. The call to this.close() closes the underlying sink. Without it, the underlying sink is left open, and data loss may occur for the already-written batches.
This seems like something that needs to be handled by the user checking whether the writer is closed before writing the next batch:
for (const batch in batches) { writer.write(batch); if writer.closed { break; } }
Generally users shouldn't be in a scenario where they're writing batches with heterogeneous schemas without knowing it, but if they are, they should validate that before writing (like above).
Previously, writing a RecordBatch with a mismatched schema to a writer with autoDestroy=true (the default) would silently close the writer and drop the batch, causing data loss with no error or warning. Now throws an Error with both the expected and received schemas, making the mismatch immediately visible to the caller. Closes apache#388 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
8c1462d to
9f1b73e
Compare
rustyconover
commented
Mar 5, 2026
I think my original scope was too big so I'm going to scope this down to just the core issue that took a few hours of debugging.
So based on your feedback I:
- Removed redundant tests — dropped the
autoDestroy=falsemulti-stream test (already covered bystream-writer-tests.ts,streams-dom-tests.ts, andstreams-node-tests.ts) and the closed-writer test (pre-existing behavior, not introduced by this PR) - Added JSDoc for the new throwing behavior on
autoDestroy - Switched to static imports for consistency with other test files
The PR is now scoped to just the core fix: when autoDestroy=true (the default), writing a batch with a mismatched schema throws an error instead of silently closing the writer and dropping the batch.
Why throw instead of silently closing: The previous behavior called return this.close() when it detected a schema mismatch with autoDestroy=true. From the caller's perspective, writer.write(batch) returned successfully — there was no indication that the batch was dropped and the writer was closed. This made schema mismatches extremely difficult to debug, since data silently disappeared with no error, no warning, and no way to distinguish a successful write from a dropped one. Throwing makes the problem immediately visible at the point where it occurs.
It may be that the user should check if they are trying to write mismatched schemas, but if you use PyArrow or the other Arrow implementations they raise an exception when you try to write a record batch that doesn't match the schema. I'm trying to promote consistent behavior.
rustyconover
commented
Mar 5, 2026
@raulcd can you trigger CI on this one again too? Thank you!
Co-authored-by: Paul Taylor <178183+trxcllnt@users.noreply.github.com>
Co-authored-by: Paul Taylor <178183+trxcllnt@users.noreply.github.com>
trxcllnt
commented
Mar 5, 2026
This PR should also update the node Transform stream's _write() implementation to try/catch the error and pass it to the callback:
_write(x: any, _: string, cb: CB) { const writer = this._writer; let err = []; try { writer?.write(x); } catch (e) { err.push(e); } cb && cb.apply(undefined, err); return true; }
I can't recall at the moment what DOM streams do if write() throws, but it's also worth considering the impact of this change to this line.
Trying to think through other implications of throwing here.
Summary
RecordBatchStreamWriter.write()previously silently closed the writer and dropped the batch when a schema mismatch was detected withautoDestroy: true(the default), causing data loss with no errorErrorwith both expected and received schemas, making mismatches immediately visibleautoDestroyoption explaining its behaviorTest plan
schema-mismatch-tests.tswith 4 tests covering:autoDestroy=true: throws on schema mismatchautoDestroy=false: resets schema and writes multi-stream output (existing behavior preserved)npx jest --testPathPattern='schema-mismatch')Closes #388
🤖 Generated with Claude Code