Description
If I run cz changelog and the commit messages contain Unicode characters like 🤦🏻♂️ (whose first two code points encode to the eight-byte UTF-8 sequence \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb), then I get the following traceback:
Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>
The result of chardet.detect() here (line 26 in 2ff9f15) is:
{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}
That is an odd encoding prediction, and with low confidence. The code uses it anyway, picks the wrong codec, and decoding the bytes then fails. Using decode("utf-8") works fine. It looks like issue chardet/chardet#148 is related to this.
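To illustrate the failure without involving chardet at all, here is a minimal sketch that decodes the eight bytes quoted above with both codecs. The helper name try_decode is made up for this example and is not part of commitizen.

```python
# The first two code points of the emoji from the report: U+1F926 and U+1F3FB.
raw = "\U0001F926\U0001F3FB".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec rejected.
        return f"failed at byte {data[exc.start]:#x} (offset {exc.start})"


print(raw)                        # the raw UTF-8 bytes
print(try_decode(raw, "utf-8"))   # decodes cleanly
print(try_decode(raw, "cp1254"))  # cp1254 has no mapping for 0x8f
```

The cp1254 attempt stops at the same byte value (0x8f) that the traceback reports, which suggests the bytes themselves are fine and only the codec choice is wrong.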
I think the fix would be something like this to replace these lines of code:
stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector's prediction.
    # Consider checking confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector's prediction.
    # Consider checking confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
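The proposal leaves two cases open: the detector may return None as the encoding, and a low-confidence guess may still fail to decode. One possible way to cover both, and to avoid duplicating the logic for stdout and stderr, is a small helper. This is only a sketch: the name decode_output, the detect parameter, and the 0.9 threshold are assumptions made for illustration, not existing commitizen code.

```python
from typing import Callable, Optional


def decode_output(
    raw: bytes,
    detect: Optional[Callable[[bytes], dict]] = None,  # e.g. chardet.detect
    min_confidence: float = 0.9,  # arbitrary threshold, tune as needed
) -> str:
    """Decode subprocess output, preferring UTF-8 over a detector's guess."""
    try:
        return raw.decode("utf-8")  # Try this one first.
    except UnicodeDecodeError:
        pass

    if detect is not None:
        guess = detect(raw)
        encoding = guess.get("encoding")
        confidence = guess.get("confidence") or 0.0
        # Only trust the guess if it names an encoding with enough confidence.
        if encoding and confidence >= min_confidence:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong or unknown codec, fall through

    # Last resort: never raise, mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

With chardet installed, the two try/except blocks would then reduce to stdout_s = decode_output(stdout, detect=chardet.detect) and the same call for stderr. Whether silently replacing undecodable bytes is acceptable for a changelog is a design choice the maintainers would need to make.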
Steps to reproduce
Well, I suppose you can add a few commits to a local branch and go crazy with lots of text and funky Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.
Current behavior
cz throws an exception.
Desired behavior
cz creates a changelog.
Screenshots
No response
Environment
> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin