-
Notifications
You must be signed in to change notification settings - Fork 892
fix: binary file check #377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@NicolasIRAGNE i've fix encoding issue on windows, we now use utf-16 le instead of utf-16.
I let you decide what we need to do from now.
I'll take a look whenever I have a bit of time, I'm not sure what this does but it seems better than initial check
But I am curious as to what the actual error is. Do we read too many characters? What happens if the file is just a huge numbers of ascii chars?
Also, #375 did this fix the problem as well?
But I am curious as to what the actual error is. Do we read too many characters? What happens if the file is just a huge numbers of ascii chars?
The current problem:
- we read a chunk of a file
- we wrongly determine binaries using a too much simple method
- this leads to include bytes within the final output context
- the tiktoken package, which uses rust, panic when unwrapping parsing result
Also, #375 did this fix the problem as well?
Yes. This is the main purpose of this with another one : The context should now ignore a lot more (if it's not all) binaries, so the context should be a lot more usable for LLMs on various repositories.
Uh oh!
There was an error while loading. Please reload this page.
Closes #375
Topic
tiktoken crash sometimes with binary files, after diging i've found that we did some check to ignore binary files but not strong enough.