I wrote this function last year to convert ISO-8859-1 text to UTF-8 in place and just found it again. It takes a text buffer and its size, then converts to UTF-8 if there's enough space.
What should be changed to improve quality?
int iso88951_to_utf8(unsigned char *content, size_t max_size)
{
    unsigned char *copy;
    size_t conversion_count; //number of chars to convert / bytes to add
    copy = content;
    conversion_count = 0;
    //first run to see if there's enough space for the new bytes
    while(*content)
    {
        if(*content >= 0x80)
        {
            ++conversion_count;
        }
        ++content;
    }
    if(content - copy + conversion_count >= max_size)
    {
        return ERROR;
    }
    while(content >= copy && conversion_count)
    {
        //repositioning current characters to make room for new bytes
        if(*content < 0x80)
        {
            *(content + conversion_count) = *content;
        }
        else
        {
            *(content + conversion_count) = 0x80 | (*content & 0x3f); //last byte
            *(content + --conversion_count) = 0xc0 | *content >> 6; //first byte
        }
        --content;
    }
    return SUCCESS;
}
The character set is named ISO-8859-1, not ISO-8895-1. Rename your function accordingly.
Change the return value to be more informative:
- Return 0 on success.
- If max_size is too small, return the minimum value of max_size that would be sufficient to accommodate the output (including the trailing \0).
I would also change the parameter to take a char *, which is more natural for a string.
I think that the implementation could look tidier if you dealt with pointers instead of offsets.
It would be nice if you NUL-terminated the result, so that the caller does not have to zero out the entire buffer before calling this function.
size_t iso8859_1_to_utf8(char *content, size_t max_size)
{
    char *src, *dst;

    // first run to see if there's enough space for the new bytes
    for (src = dst = content; *src; src++, dst++)
    {
        if (*src & 0x80)
        {
            // If the high bit is set in the ISO-8859-1 representation, then
            // the UTF-8 representation requires two bytes (one more than usual).
            ++dst;
        }
    }
    // src is at the input terminator, dst is where the output terminator goes
    if ((size_t)(dst - content) + 1 > max_size)
    {
        // Inform caller of the space required
        return (size_t)(dst - content) + 1;
    }
    *dst = '\0';  // NUL-terminate; dst is the last byte of the required space
    // second run, from last byte to first, so unread input is never overwritten
    while (dst > src)
    {
        if (*src & 0x80)
        {
            *dst-- = 0x80 | (*src & 0x3f);                     // trailing byte
            *dst-- = 0xc0 | (*((unsigned char *)src--) >> 6);  // leading byte
        }
        else
        {
            *dst-- = *src--;
        }
    }
    return 0;  // SUCCESS
}
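To show what this return convention means for the caller, here is a minimal sketch. The names latin1_to_utf8_inplace and latin1_to_utf8_grow are hypothetical and not from the post, the converter is restated with indices only so that the block compiles on its own, and the caller is assumed to pass a buffer obtained from malloc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compact in-place converter following the convention above:
 * returns 0 on success, otherwise the buffer size in bytes
 * (terminator included) that the caller must provide. */
size_t latin1_to_utf8_inplace(char *buf, size_t buf_size)
{
    unsigned char *p = (unsigned char *)buf;
    size_t in_len = 0, out_len = 0;

    /* Pass 1: measure. Bytes with the high bit set need two output bytes. */
    for (; p[in_len] != 0; ++in_len)
        out_len += (p[in_len] & 0x80) ? 2 : 1;

    if (out_len + 1 > buf_size)
        return out_len + 1;

    /* Pass 2: move bytes from the end (terminator first) so that no
     * unread input is overwritten; stop once nothing needs to shift. */
    size_t si = in_len + 1, di = out_len + 1;
    while (di > si) {
        unsigned char ch = p[--si];
        if (ch & 0x80) {
            p[--di] = (unsigned char)(0x80 | (ch & 0x3f)); /* trailing byte */
            p[--di] = (unsigned char)(0xc0 | (ch >> 6));   /* leading byte */
        } else {
            p[--di] = ch;
        }
    }
    return 0;
}

/* Hypothetical caller for a heap-allocated buffer: retries once with the
 * reported size. Returns NULL if realloc fails (heap_buf is then untouched
 * and still owned by the caller). */
char *latin1_to_utf8_grow(char *heap_buf, size_t buf_size)
{
    size_t needed = latin1_to_utf8_inplace(heap_buf, buf_size);
    if (needed == 0)
        return heap_buf;

    char *bigger = realloc(heap_buf, needed);
    if (bigger == NULL)
        return NULL;

    size_t second_try = latin1_to_utf8_inplace(bigger, needed);
    assert(second_try == 0); /* the reported size is always sufficient */
    (void)second_try;
    return bigger;
}
```

The retry only works when the caller owns a resizable buffer; with a fixed array the caller can do nothing but report the required size.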
How usable is the function? It relies on the content string occupying a buffer large enough to be extended. And if you take the suggestion from @200_success that on error the function returns the minimum size necessary, the caller then has the added complexity of handling that error by allocating a buffer, freeing it later, and keeping a note of whether the buffer was allocated.
Although I dislike dynamic allocation, I think this is a case where it makes sense always to allocate a new string in the function.
Here is a version that allocates space:
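#include <stdlib.h>
#include <string.h>

char* iso8859_1_to_utf8(const char *str)
{
    // worst case: every input byte becomes two output bytes, plus the terminator
    char *utf8 = malloc(1 + (2 * strlen(str)));
    if (utf8) {
        char *c = utf8;
        for (; *str; ++str) {
            if ((*str & 0x80) == 0) {
                // ASCII byte: copied unchanged
                *c++ = *str;
            } else {
                // 0x80..0xFF: two-byte sequence 110000xx 10xxxxxx
                *c++ = (char) (0xc0 | ((unsigned char) *str >> 6));
                *c++ = (char) (0x80 | (*str & 0x3f));
            }
        }
        *c++ = '\0';
    }
    return utf8;
}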
You could add a realloc call at the end to trim the excess space if you thought it necessary (I'm not sure that it is, but it might depend upon the application).
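A minimal sketch of that trimming step, assuming the same worst-case allocation strategy. The function name latin1_to_utf8_alloc is made up for illustration, and falling back to the untrimmed block when realloc fails is one possible choice, not the only one.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocates the worst case (two bytes per input byte plus the terminator),
 * converts, then shrinks the block to the bytes actually used.
 * Returns NULL if the initial allocation fails; the caller frees the result. */
char *latin1_to_utf8_alloc(const char *latin1)
{
    assert(latin1 != NULL);

    char *utf8 = malloc(2 * strlen(latin1) + 1);
    if (utf8 == NULL)
        return NULL;

    char *out = utf8;
    for (const unsigned char *in = (const unsigned char *)latin1; *in; ++in) {
        if (*in < 0x80) {
            *out++ = (char)*in;                     /* ASCII: unchanged */
        } else {
            *out++ = (char)(0xc0 | (*in >> 6));     /* leading byte */
            *out++ = (char)(0x80 | (*in & 0x3f));   /* trailing byte */
        }
    }
    *out++ = '\0';

    /* Trim the excess. If realloc fails, the original block is still
     * valid, so it is returned untrimmed. */
    size_t used = (size_t)(out - utf8);
    char *trimmed = realloc(utf8, used);
    return trimmed != NULL ? trimmed : utf8;
}
```

For short strings the saving is a handful of bytes, so whether the extra realloc call pays off depends on how long the strings are and how long they live.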
- the cast on the second statement in the else block is unnecessary, isn't it? Also, I am not a fan of magic numbers without names and bit twiddling without comments. (Tim Seguine, Feb 4, 2014 at 17:00)
- There is an int to char conversion that causes a warning from clang with -Wsign-conversion. I just added the cast to keep that quiet :-) (William Morris, Feb 4, 2014 at 17:16)
- Swap *c++ = *str; with *c++ = (char) (0xc0 | (unsigned ... & 0x3f)); (chux, Feb 7, 2014 at 23:10)
I'm not sure why you have content >= copy in your second while loop. I would hope that while(conversion_count) should be sufficient.
Your while loops could be for loops.
More comments would make it easier to read:
//first run to see how many extra bytes we'll need
//convert bytes from last to first to avoid altering not-yet-converted bytes
- I'd appreciate a link to the section of an ISO-8859-1 specification that states what bit-twiddling is needed (I can see what the code does in the final loop, but I have not seen the specification of what it's supposed to do, so I haven't verified it).
Minor quibbles.
- I would prefer to see the variables initialized where they are defined.
- Use macros instead of the hard-coded values 0x80 and 0x3F. For someone who is not familiar with the ins and outs of UTF-8 or ISO-8859-1, naming them something like MASK_END or UPPER_VALUE makes for easier understanding.
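As a rough illustration of the macro suggestion, here is a sketch that encodes a single byte. All macro names and the function latin1_byte_to_utf8 are invented for this example. It relies on two facts: ISO-8859-1 byte values equal the Unicode code points U+0000 to U+00FF, and UTF-8 encodes code points U+0080 to U+07FF as the two bytes 110xxxxx 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative names, not taken from the original code. */
#define LATIN1_NON_ASCII_BIT 0x80u /* set for bytes 0x80..0xFF */
#define UTF8_LEAD_2BYTE      0xC0u /* 110xxxxx: first byte of a two-byte sequence */
#define UTF8_CONTINUATION    0x80u /* 10xxxxxx: second byte */
#define UTF8_PAYLOAD_MASK    0x3Fu /* low six bits carried by the second byte */
#define UTF8_PAYLOAD_BITS    6     /* number of bits in that payload */

/* Encodes one ISO-8859-1 byte as UTF-8 into out (room for 2 bytes).
 * Returns the number of bytes written: 1 for ASCII, 2 otherwise. */
size_t latin1_byte_to_utf8(unsigned char byte, unsigned char *out)
{
    assert(out != NULL);

    if ((byte & LATIN1_NON_ASCII_BIT) == 0) {
        out[0] = byte;
        return 1;
    }
    /* The top two bits go into the leading byte, the low six into the second. */
    out[0] = (unsigned char)(UTF8_LEAD_2BYTE | (byte >> UTF8_PAYLOAD_BITS));
    out[1] = (unsigned char)(UTF8_CONTINUATION | (byte & UTF8_PAYLOAD_MASK));
    return 2;
}
```

For example, 0xE9 (é) is 11 101001 in binary: the top two bits give the leading byte 110 00011 = 0xC3, and the low six bits give the second byte 10 101001 = 0xA9.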