Function to convert ISO-8859-1 to UTF-8

Question 1

I wrote this function last year to convert between the two encodings and just found it. It takes a text buffer and its size, then converts to UTF-8 if there's enough space.

What should be changed to improve quality?

int iso88951_to_utf8(unsigned char *content, size_t max_size)
{
    unsigned char *copy;
    size_t conversion_count; //number of chars to convert / bytes to add
    copy = content;
    conversion_count = 0;
    //first run to see if there's enough space for the new bytes
    while(*content)
    {
        if(*content >= 0x80)
        {
            ++conversion_count;
        }
        ++content;
    }
    if(content - copy + conversion_count >= max_size)
    {
        return ERROR;
    }
    while(content >= copy && conversion_count)
    {
        //repositioning current characters to make room for new bytes
        if(*content < 0x80)
        {
            *(content + conversion_count) = *content;
        }
        else
        {
            *(content + conversion_count) = 0x80 | (*content & 0x3f); //last byte
            *(content + --conversion_count) = 0xc0 | *content >> 6; //first byte
        }
        --content;
    }
    return SUCCESS;
}

Question 2

The character set is named ISO-8859-1, not ISO-8895-1. Rename your function accordingly.

Change the return value to be more informative:

Return 0 on success.
If max_size is too small, return the minimum value of max_size that would be sufficient to accommodate the output (including the trailing \0).

I would also change the parameter to take a plain char *, which is more natural for a string.

I think that the implementation could look tidier if you dealt with pointers instead of offsets.

It would be nice if you NUL-terminated the result, so that the caller does not have to zero out the entire buffer before calling this function.

size_t iso8859_1_to_utf8(char *content, size_t max_size)
{
    char *src, *dst;

    // First pass: find the end of the input (src) and of the output (dst)
    for (src = dst = content; *src; src++, dst++)
    {
        if (*src & 0x80)
        {
            // If the high bit is set in the ISO-8859-1 representation, then
            // the UTF-8 representation requires two bytes (one more than usual).
            ++dst;
        }
    }
    if ((size_t)(dst - content) + 1 > max_size)
    {
        // Inform caller of the space required
        return (size_t)(dst - content) + 1;
    }
    // dst is where the output terminator belongs. Writing at dst + 1 would
    // overrun the buffer when max_size is exactly the required size.
    *dst = '\0';
    // Second pass: convert from last to first so unread input is never overwritten
    while (dst > src)
    {
        if (*src & 0x80)
        {
            *dst-- = 0x80 | (*src & 0x3f);                     // trailing byte
            *dst-- = 0xc0 | (*((unsigned char *)src--) >> 6);  // leading byte
        }
        else
        {
            *dst-- = *src--;
        }
    }
    return 0; // SUCCESS
}

Question 3

How usable is the function? It relies on the content string occupying a buffer large enough to be extended. And if you take the suggestion from @200_success that the function return the minimum size necessary on error, the caller takes on extra complexity: it has to handle that error by allocating a bigger buffer, free that buffer later, and keep track of whether the buffer was allocated in the first place.

Although I dislike dynamic allocation, I think this is a case where it makes sense always to allocate a new string in the function.

Here is a version that allocates space:

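#include <stdlib.h>
#include <string.h>

char *iso8859_1_to_utf8(const char *str)
{
    // Worst case: every input byte becomes two bytes, plus the trailing NUL
    char *utf8 = malloc(1 + (2 * strlen(str)));
    if (utf8) {
        char *c = utf8;
        for (; *str; ++str) {
            if (!(*str & 0x80)) {
                *c++ = *str;  // ASCII byte: copied unchanged
            } else {
                // Go through unsigned char so a negative char is not sign-extended
                *c++ = (char) (0xc0 | ((unsigned char) *str >> 6));
                *c++ = (char) (0x80 | (*str & 0x3f));
            }
        }
        *c++ = '\0';
    }
    return utf8;  // NULL if malloc failed; the caller frees it otherwise
}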

You could add a realloc call at the end to trim the excess space if you thought it necessary (I'm not sure that it is, but it might depend upon the application).
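
If you did want the trim, a minimal sketch could look like the following. The helper name shrink_to_fit is made up for illustration and is not part of the code above. It assumes the string came from malloc and is NUL-terminated.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical helper: shrink a malloc'd, NUL-terminated string so the
// allocation is just large enough for its contents.
// If realloc fails, the original buffer is still valid, so return it as is.
char *shrink_to_fit(char *s)
{
    if (s == NULL) {
        return NULL;
    }
    char *trimmed = realloc(s, strlen(s) + 1);
    return trimmed ? trimmed : s;
}
```

The caller would use it as utf8 = shrink_to_fit(utf8); and still frees the result as before.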

Question 4

The cast on the second statement in the else block is unnecessary, isn't it? Also, I am not a fan of magic numbers without names and bit twiddling without comments.

Question 5

There is an int to char conversion that causes a warning from clang with -Wsign-conversion. I just added the cast to keep that quiet :-)

Question 6

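Check which branch handles which case: *c++ = *str; must run for plain ASCII bytes, and the pair *c++ = (char) (0xc0 | ...); *c++ = (char) (0x80 | ...); must run only for bytes with the high bit set. If bytes with the high bit set end up in the single-byte copy, swap the two bodies.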

Question 7

I'm not sure why you have content >= copy in your second while loop. I would hope that while(conversion_count) should be sufficient.

Your while loops could be for loops.

More comments would make it easier to read:

//first run to see how many extra bytes we'll need
//convert bytes from last to first to avoid altering not-yet-converted bytes

I'd appreciate a link to the section of a specification that states what bit-twiddling is needed to get from ISO-8859-1 to UTF-8. I can see what the code does in the final loop, but I have not seen the specification of what it's supposed to do, so I haven't verified it.
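
On the bit-twiddling: as far as I can tell the layout comes from the UTF-8 definition (RFC 3629) rather than from ISO-8859-1 itself. ISO-8859-1 bytes map one-to-one onto Unicode code points U+0000 to U+00FF, and UTF-8 encodes code points U+0080 to U+07FF as two bytes of the form 110xxxxx 10xxxxxx. Here is a minimal sketch of that rule for a single byte. The helper name latin1_byte_to_utf8 is hypothetical and does not appear in the posted code.

```c
#include <stddef.h>

// Hypothetical helper: encode one ISO-8859-1 byte as UTF-8 into out[],
// returning the number of bytes written (1 or 2).
size_t latin1_byte_to_utf8(unsigned char in, unsigned char out[2])
{
    if (in < 0x80) {
        out[0] = in;                               // 0xxxxxxx: ASCII is unchanged
        return 1;
    }
    out[0] = (unsigned char)(0xc0 | (in >> 6));    // 110000xx: top two bits of the byte
    out[1] = (unsigned char)(0x80 | (in & 0x3f));  // 10xxxxxx: low six bits of the byte
    return 2;
}
```

For example, é is 0xE9 in ISO-8859-1: the top two bits give 0xC3 and the low six bits give 0xA9, which is the UTF-8 sequence C3 A9.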

Question 8

Minor quibbles.

I would prefer to see the variables assigned to when defined.
Use a macro instead of the hard-coded values 0x80 and 0x3F. For someone who is not familiar with the ins and outs of UTF-8 or ISO-8859-1, naming them something like MASK_END or UPPER_VALUE makes the code easier to understand.
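
To illustrate both quibbles, here is a rough sketch with named constants and variables initialized where they are defined. All of the macro names and the function name latin1_to_utf8_named are invented for this example. It writes into a separate output buffer, which is assumed to hold at least 2 * strlen(src) + 1 bytes.

```c
#include <stddef.h>

// Illustrative names only; none of these appear in the posted code.
#define LATIN1_HIGH_BIT   0x80  // set on every non-ASCII ISO-8859-1 byte
#define UTF8_LEAD_2       0xc0  // 110xxxxx: lead byte of a two-byte sequence
#define UTF8_CONT         0x80  // 10xxxxxx: continuation byte marker
#define UTF8_PAYLOAD_MASK 0x3f  // low six payload bits of a continuation byte
#define UTF8_PAYLOAD_BITS 6     // payload bits carried by a continuation byte

// Convert src into dst (assumed large enough, see above).
// Returns the number of bytes written, excluding the trailing NUL.
size_t latin1_to_utf8_named(const char *src, char *dst)
{
    const unsigned char *in = (const unsigned char *)src;
    char *out = dst;

    for (; *in; ++in) {
        if (*in & LATIN1_HIGH_BIT) {
            *out++ = (char)(UTF8_LEAD_2 | (*in >> UTF8_PAYLOAD_BITS));
            *out++ = (char)(UTF8_CONT | (*in & UTF8_PAYLOAD_MASK));
        } else {
            *out++ = (char)*in;
        }
    }
    *out = '\0';
    return (size_t)(out - dst);
}
```

Whether the names really read better than the raw masks is a matter of taste, but at least each constant gets one commented definition instead of appearing unexplained at every use.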

Accepted answer (Question 2 above): 200_success · 2014-02-03 22:37:10Z

