1235 – std.string.tolower() fails on certain utf8 characters

Summary: std.string.tolower() fails on certain utf8 characters
Status: RESOLVED FIXED
Alias: None
Product: D
Classification: Unclassified
Component: phobos
Version: D2
Hardware: All All
Importance: P2 minor
Assignee: Walter Bright
Reported: 2007-05-15 19:08 UTC by Charles Gordon
Modified: 2015-06-09 05:15 UTC
Description Charles Gordon 2007-05-15 19:08:32 UTC
import std.string;
int main(char[][] args)
{
    printf("tolower(\"\\u0130e\") -> \"%.*s\"\n", tolower("\u0130e"));
    return 0;
}
produces incorrect output:
tolower("\u0130e") -> "i e"
The bug comes from erroneous code in phobos/std/string.d, line 843:
    if (r.length != i + j)
        r = r[0 .. i + j];
Turkish dotted capital I (U+0130) is correctly converted to ASCII i (U+0069), but the converted character does not use the same number of bytes as the original: U+0130 takes two bytes in UTF-8, while i takes one. The code above is therefore incorrect. As far as I understand the implementation, it could be removed completely.
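To illustrate the byte-count mismatch, here is a minimal sketch in current D. It is not the Phobos implementation: lowerOne and lowerSketch are hypothetical names, and the mapping is deliberately limited to ASCII A-Z and U+0130. The sketch builds the result by appending each converted code point, so the output length can differ from the input length.

```d
import std.utf : encode;

// Hypothetical, deliberately tiny mapping (an assumption for this sketch):
// handles ASCII A-Z and U+0130 only, and leaves everything else unchanged.
dchar lowerOne(dchar c)
{
    if (c >= 'A' && c <= 'Z')
        return cast(dchar)(c + ('a' - 'A'));
    if (c == '\u0130')
        return 'i';
    return c;
}

// Builds the result by appending each converted code point, re-encoded
// as UTF-8, instead of assuming input and output lengths match.
string lowerSketch(string s)
{
    char[] r;
    foreach (dchar c; s)            // decodes the UTF-8 input per code point
    {
        char[4] buf;
        immutable n = encode(buf, lowerOne(c));
        r ~= buf[0 .. n];           // append exactly the bytes that were produced
    }
    return r.idup;
}

void main()
{
    assert("\u0130e".length == 3);              // 2 bytes for U+0130, 1 for 'e'
    assert(lowerSketch("\u0130e") == "ie");
    assert(lowerSketch("\u0130e").length == 2); // one byte shorter than the input
}
```

Because the result is rebuilt from the bytes actually produced, no truncation to the input offset is needed afterwards.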
A similar issue is present in toupper(), with the additional twist that conversion to uppercase should not be special-cased for the ASCII subset in the Turkish locale.
Additionally, the non-ASCII code path is triggered by if (c >= 0x7F) where it should be if (c > 0x7F), since 0x7F (DEL) is itself an ASCII character.
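The boundary can be checked with a small sketch. isNonAscii is a hypothetical helper written for this illustration, not a Phobos function, and it assumes the test is applied to a single UTF-8 code unit.

```d
// Hypothetical helper, not a Phobos function: reports whether a UTF-8
// code unit falls outside the 7-bit ASCII range (0x00 to 0x7F inclusive).
bool isNonAscii(char c)
{
    return c > 0x7F;
}

void main()
{
    assert(!isNonAscii('\x7F'));          // DEL is still ASCII
    assert(isNonAscii(cast(char) 0x80));  // first value outside ASCII
    assert(isNonAscii("\u0130"[0]));      // lead byte 0xC4 of U+0130
    assert(!isNonAscii('e'));
}
```

With >= in place of >, the first assertion would fail, because DEL would be routed to the non-ASCII path.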
Comment 1 Walter Bright 2007-06-28 22:57:41 UTC
I agree, with the exception that for UTF characters, there is no such thing as a locale. So toupper("i") cannot be made to return \u0130.
Comment 2 Walter Bright 2007-07-01 14:03:43 UTC
Fixed in DMD 1.018 and DMD 2.002.

