UnicodeDecodeError if commit messages contain Unicode characters #544

Closed
@jenstroeger

Description

If I run

cz changelog

and the commit messages contain Unicode characters like 🤦🏻‍♂️ (whose first two code points encode to the eight-byte UTF-8 sequence \xf0\x9f\xa4\xa6 \xf0\x9f\x8f\xbb), then I get the following traceback:

Traceback (most recent call last):
  File "/.../.venv/bin/cz", line 8, in <module>
    sys.exit(main())
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cli.py", line 389, in main
    args.func(conf, vars(args))()
  File "/.../.venv/lib/python3.10/site-packages/commitizen/commands/changelog.py", line 143, in __call__
    commits = git.get_commits(
  File "/.../.venv/lib/python3.10/site-packages/commitizen/git.py", line 98, in get_commits
    c = cmd.run(command)
  File "/.../.venv/lib/python3.10/site-packages/commitizen/cmd.py", line 32, in run
    stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/encodings/cp1254.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 1689: character maps to <undefined>

The result of chardet.detect() here

stdout.decode(chardet.detect(stdout)["encoding"] or "utf-8"),

is:

{'encoding': 'Windows-1254', 'confidence': 0.6864215607255395, 'language': 'Turkish'}

That is a surprising encoding prediction, and a low-confidence one. It selects the wrong codec, and decoding the bytes then fails. Using decode("utf-8") works fine. Issue chardet/chardet#148 looks related.
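The failure can be reproduced without chardet or git. The sketch below is only an illustration, not code from commitizen: it takes the eight bytes quoted above and decodes them once as UTF-8 and once as cp1254 (the Python codec that 'Windows-1254' resolves to in the traceback). The variable names are made up for the example.

```python
# The eight bytes quoted in this issue: U+1F926 (face palm) followed by
# U+1F3FB (skin-tone modifier), both encoded as UTF-8.
data = b"\xf0\x9f\xa4\xa6\xf0\x9f\x8f\xbb"

# Decoding as UTF-8 succeeds and yields the two code points.
text = data.decode("utf-8")

# Decoding as cp1254 fails, because byte 0x8f has no mapping in that code page.
try:
    data.decode("cp1254")
    cp1254_error = None
except UnicodeDecodeError as exc:
    cp1254_error = exc

print(repr(text))
print(cp1254_error)
```

The byte that cp1254 rejects is the same 0x8f reported in the traceback, which suggests the detector's guess, not the commit data, is what breaks decoding.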

I think the fix would be something like the following, replacing these lines of code:

stdout, stderr = process.communicate()
return_code = process.returncode
try:
    stdout_s = stdout.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stdout)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stdout_s = stdout.decode(result["encoding"])
try:
    stderr_s = stderr.decode("utf-8")  # Try this one first.
except UnicodeDecodeError:
    result = chardet.detect(stderr)  # Final result of the UniversalDetector’s prediction.
    # Consider checking confidence value of the result?
    stderr_s = stderr.decode(result["encoding"])
return Command(stdout_s, stderr_s, stdout, stderr, return_code)
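The two try/except blocks above are identical, and the open question about checking the confidence value could be handled in one place. The following is a rough sketch under several assumptions, not the project's actual fix: decode_output, its detect parameter, and the 0.9 threshold are all hypothetical names and values chosen for illustration. The detector is passed in as a callable so the example runs without chardet installed; in cmd.py one would presumably pass chardet.detect, which returns a dict with "encoding" and "confidence" keys as shown earlier.

```python
from typing import Callable, Optional

# Hypothetical type alias: any callable shaped like chardet.detect.
Detector = Callable[[bytes], dict]


def decode_output(
    data: bytes,
    detect: Optional[Detector] = None,
    min_confidence: float = 0.9,  # assumed threshold, not taken from the issue
) -> str:
    """Decode subprocess output, trying UTF-8 before any detector guess."""
    try:
        return data.decode("utf-8")  # Try this one first.
    except UnicodeDecodeError:
        pass

    if detect is not None:
        result = detect(data)
        encoding = result.get("encoding")  # may be None if detection failed
        confidence = result.get("confidence") or 0.0
        if encoding and confidence >= min_confidence:
            try:
                return data.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong guess or unknown codec name; fall through

    # Last resort: keep going with replacement characters instead of raising.
    return data.decode("utf-8", errors="replace")
```

With such a helper, the proposed change would reduce to something like stdout_s = decode_output(stdout, chardet.detect) and the same for stderr. Whether replacing undecodable bytes is preferable to raising is a design choice the maintainers would have to make.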

Steps to reproduce

Add a few commits to a local branch whose messages contain plenty of text and unusual Unicode characters (emojis with skin tones, flags, etc.), and then attempt to create a changelog.

Current behavior

cz throws an exception.

Desired behavior

cz creates a changelog.

Screenshots

No response

Environment

> cz version
2.29.3
> python --version
Python 3.10.5
> uname -a
Darwin pooh 18.7.0 Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64 x86_64 i386 Darwin
