Fix UTF-8 character corruption at 8KB buffer boundaries in socket communication #1461
Open
+81
−25
yaa
commented
Jul 20, 2025
I also encountered this issue around the same time and had been working on a fix locally.
This fix appears to have a smaller impact compared to mine. What do you think?
diff --git a/src/plugin.js b/src/plugin.js
index 71d2030..276551c 100644
--- a/src/plugin.js
+++ b/src/plugin.js
@@ -157,6 +157,7 @@ async function parse(parser, source, opts) {
   return new Promise((resolve, reject) => {
     const socket = new net.Socket();
+    socket.setEncoding('utf-8');
     let chunks = "";
     socket.on("error", (error) => {
@@ -164,7 +165,7 @@ async function parse(parser, source, opts) {
     });
     socket.on("data", (data) => {
-      chunks += data.toString("utf-8");
+      chunks += data;
     });
     socket.on("end", () => {
Member
kddnewton
commented
Jul 30, 2025
@yaa I think that solution makes sense. According to https://nodejs.org/api/stream.html#readablesetencodingencoding that seems like it will fix the issue.
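For readers following along, here is a minimal sketch of the boundary-aware decoding that `setEncoding` relies on, using Node's `StringDecoder` directly rather than a real socket. The sample string and the split offset are assumptions chosen for illustration; they are not taken from the plugin.

```javascript
const { StringDecoder } = require("string_decoder");

// Illustrative input: three Japanese characters, 3 bytes each in UTF-8.
const bytes = Buffer.from("日本語", "utf-8");

// Simulate two "data" events that split the second character after its first byte.
const parts = [bytes.subarray(0, 4), bytes.subarray(4)];

const decoder = new StringDecoder("utf-8");
let chunks = "";
for (const part of parts) {
  // write() returns only complete characters and holds back any trailing partial bytes.
  chunks += decoder.write(part);
}
// end() flushes whatever is still buffered.
chunks += decoder.end();

console.log(chunks === "日本語");
```

With `socket.setEncoding('utf-8')`, the `"data"` handler receives strings that have already been decoded this way, which is why `chunks += data` no longer needs a per-chunk `toString` call.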
When formatting Ruby code containing multibyte UTF-8 characters (emojis, Japanese characters,
etc.), the plugin corrupts any such character that straddles an 8192-byte (8 KB) chunk
boundary in the data stream between the Node.js plugin and the Ruby server.
This issue likely originated from commit bd96faf (July 8, 2023) when the socket reading logic was
changed to fix JSON parsing for large data. The change may have inadvertently introduced a UTF-8
boundary issue where multibyte characters could be split across chunk boundaries.
Reproduction
The emoji gets corrupted because it starts at byte 8189 and is split across the 8KB boundary.
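The mechanism can be sketched without a socket. The payload below is a hypothetical stand-in for the formatted output (8189 ASCII bytes followed by one 4-byte emoji), and the split at byte 8192 is simulated by hand. The PR's actual reproduction case is not shown here.

```javascript
// Hypothetical payload: the emoji starts at byte offset 8189 and is 4 bytes long.
const payload = Buffer.from("a".repeat(8189) + "😀", "utf-8");

// Simulate the stream delivering the first 8192 bytes and then the remainder.
const first = payload.subarray(0, 8192); // ends three bytes into the emoji
const second = payload.subarray(8192); // the emoji's final byte

// Decoding each chunk separately, as data.toString("utf-8") does per "data" event.
const perChunk = first.toString("utf-8") + second.toString("utf-8");

// Decoding once, after the raw bytes have been joined.
const joined = Buffer.concat([first, second]).toString("utf-8");

const corrupted = perChunk.includes("\uFFFD"); // replacement characters appear
const intact = joined.endsWith("😀"); // the emoji survives

console.log({ corrupted, intact });
```

Neither chunk holds a complete byte sequence for the emoji, so decoding them separately produces U+FFFD replacement characters in its place.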
Solution
Implemented a length-prefixed protocol for socket communication:
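The framing details are not reproduced on this page, so the following is only a sketch of what a length-prefixed reader might look like. The 4-byte big-endian header, `encodeFrame`, and `FrameReader` are all assumptions made for illustration and are not the PR's actual wire format or code.

```javascript
// Hypothetical framing: a 4-byte big-endian byte length, then the UTF-8 payload.
function encodeFrame(text) {
  const body = Buffer.from(text, "utf-8");
  const header = Buffer.alloc(4);
  header.writeUInt32BE(body.length, 0);
  return Buffer.concat([header, body]);
}

// Accumulates raw bytes and decodes only once the whole payload has arrived.
class FrameReader {
  constructor() {
    this.buffered = Buffer.alloc(0);
  }

  // Returns the decoded string when the frame is complete, otherwise null.
  push(chunk) {
    this.buffered = Buffer.concat([this.buffered, chunk]);
    if (this.buffered.length < 4) return null;
    const length = this.buffered.readUInt32BE(0);
    if (this.buffered.length < 4 + length) return null;
    return this.buffered.subarray(4, 4 + length).toString("utf-8");
  }
}

// Simulate a frame arriving in two pieces, split inside a multibyte character.
const text = "a".repeat(8189) + "😀";
const frame = encodeFrame(text);
const reader = new FrameReader();
const early = reader.push(frame.subarray(0, 8192)); // incomplete, so null
const message = reader.push(frame.subarray(8192)); // complete, so decoded text

console.log(early === null, message === text);
```

Because the reader works on bytes until the declared length is reached, a character split across chunks is never decoded in isolation.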
Testing
Added comprehensive test coverage:
Impact