I have two functions, a collection of possible error codes, and a unit-testing framework.
The parsing of a character into its unary prefix and payload is handled by a few named functions and macros nlz
, nlo
, lsb
and msb
for counting leading zeros or ones and generating masks of ones on the right or left. The nlz
function could be re-written for speed or (maybe) for clarity; the current version is optimized for size to help abstract it away from the code that uses it.
I want to make these conversion functions really nice so I can use them as the basis of a language interpreter that's Unicode-capable.
It isn't factored into a library/header yet, but that's the obvious next step (and would complicate the presentation, I think).
minunit.h:
/* file: minunit.h
cf.http://www.jera.com/techinfo/jtns/jtn002.html */
#define mu_assert(message, test) do { if (!(test)) return message; } while (0)
#define mu_run_test(test) do { char *message = test(); tests_run++; \
if (message) return message; } while (0)
extern int tests_run;
io.c:
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include<stdlib.h>
#include <stdio.h>
//#include <sys/bitops.h> // ilog2
#include <math.h> // log2
/*
<-------- adapters ("apps-"hungarian naming)
utf8 utf8(ucs4...)
ucs4 ucs4(utf8...)
*/
enum errinfo {
no_error = 0,
invalid_encoding = 1,
invalid_extended_encoding = 2,
buffer_alloc_fail = 4,
bad_following_character = 8,
over_length_encoding = 16,
code_point_out_of_range = 32,
};
int_least32_t *ucs4(char *str, int n, enum errinfo *errinfo);
char *utf8(int_least32_t *ar, int n, int *an, enum errinfo *errinfo);
/* number of leading zeros of byte-sized value */
static int nlz(uint_least32_t x){ return 7 - (x? floor(log2(x)): -1); }
/* number of leading ones of byte-sized value */
#define nlo(x) nlz(0xFF^(x))
/* generate unsigned mask of x lsb ones */
#define lsb(x) ((1U<<(x))-1)
/* generate byte mask of x msb ones */
#define msb(x) (0xFF^lsb(8-(x)))
int_least32_t *ucs4(char *str, int n, enum errinfo *errinfo){
unsigned char *p=str;
int_least32_t *u,*buf;
uint_least32_t x;
int pre;
int i,j;
if (errinfo) *errinfo=0;
buf=u=malloc(n*sizeof*u);
if (!buf)
if (errinfo) *errinfo |= buffer_alloc_fail;
if (buf)
for (i=0; i<n && *p; i++){
switch(pre=nlo(x=*p++)){
case 0: break;
case 1: if (errinfo) *errinfo |= invalid_encoding;
x=0xFFFD;
break;
case 2:
case 3:
case 4: x&=lsb(8-pre);
for (j=1; j<pre; j++){
if (nlo(*p)!=1)
if (errinfo) *errinfo |= bad_following_character;
x=(x<<6) | (*p++&lsb(6));
}
break;
default: if (errinfo) *errinfo |= invalid_extended_encoding;
x=0xFFFD;
break;
}
if (x < ((int[]){0,0,0x80,0x800,0x10000})[pre])
if (errinfo) *errinfo |= over_length_encoding;
*u++=x;
}
return buf;
}
char *utf8(int_least32_t *ar, int n, int *an, enum errinfo *errinfo){
int i;
int_least32_t x;
char *p,*buf=p=malloc((n+1)*4);
if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;
if (buf) {
for (i=0; i<n; i++){
x=ar[i];
if (x <= lsb(7))
*p++=x;
else if (x <= lsb(11))
*p++=msb(2)| (x>>6),
*p++=msb(1)| (x & lsb(6));
else if (x <= lsb(16))
*p++=msb(3)| (x>>12),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else if (x <= 0x10FFFF)
*p++=msb(4)| (x>>18),
*p++=msb(1)| ((x>>12) & lsb(6)),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else
if (errinfo) *errinfo |= code_point_out_of_range;
}
*p++=0;
}
return buf;
}
#include "minunit.h"
int tests_run = 0;
#define test_case(c) if(c)return #c;
static char *test_nlz(){
//int i;for(i=0;i<256;i++)printf("%d <%x>, nlz %d, nlz~ %d\n",i,i,nlz(i),nlz(i^0xFF));
test_case(nlz(0)!=8)
test_case(nlz(1)!=7)
test_case(nlz(2)!=6)
test_case(nlz(4)!=5)
test_case(nlz(8)!=4)
test_case(nlz(16)!=3)
test_case(nlz(32)!=2)
test_case(nlz(64)!=1)
test_case(nlz(128)!=0)
//test_case(2!="baloney")
return 0;
}
static char *test_utf8(){
test_case(strcmp("abc",utf8((int[]){97,98,99},3,NULL,NULL)))
return 0;
}
static char *test_ucs4(){
test_case(memcmp((int[]){97,98,99},ucs4("abc",3,NULL),3*sizeof(int)))
return 0;
}
static char *test_transit(){
test_case(strcmp("abc",utf8(ucs4("abc",3,NULL),3,NULL,NULL)))
test_case(memcmp((int[]){97,98,99},ucs4(utf8((int[]){97,98,99},3,NULL,NULL),3,NULL),3*sizeof(int)))
return 0;
}
static char *all_tests(){
mu_run_test(test_nlz);
mu_run_test(test_utf8);
mu_run_test(test_ucs4);
mu_run_test(test_transit);
return 0;
}
int main() {
char *result=all_tests();
if (result != 0) {
printf("%s\n",result);
} else {
printf("ALL TESTS PASSED\n");
}
printf("Tests run: %d\n", tests_run);
return result != 0;
return 0;
}
Note this code has greatly benefited from a previous review in comp.lang.c which inspired a similar review around the same time.
-
\$\begingroup\$ stackoverflow.com/questions/38688417/… \$\endgroup\$Brent– Brent2016年08月05日 22:57:38 +00:00Commented Aug 5, 2016 at 22:57
1 Answer 1
Bug - encoding negative values
During encoding to UTF-8, your input array is defined as a signed 32 bit type. If you have negative input values (which should be illegal), your code will go into the x < lsb(7)
case instead of catching the illegal values.
Bug - read past end of buffer
In ucs4()
, your code could read past the end of the input buffer if the last encoding was incorrect. For example, if the last byte says that it is a 4 byte encoding, your code will read an additional 3 bytes, 2 of which will be past the null string terminator.
Minor bug - undefined shift
On a target platform with 16-bit ints, lsb(16)
is undefined because 1u << 16
is undefined.
int32_least_t -> int32_t
Why use int32_least_t
? Your code operates on UCS-4 which are 4 byte values. So you should use int32_t
to use exactly 4 byte codes instead of using "4 bytes or more".
Unused argument to utf8()
Your function utf8()
has an argument an
but it is never used. What is it for? Also, the input buffers to both your conversion functions should be marked const
.
nlz()
I know you said you prefer this to be short but using floating point to compute this pains me. If you are using gnu, you can use __builtin_clz
which is just as short. If you aren't using gnu, using your own handwritten function would end up being smaller than pulling in log2()
from a math library.
Also, this function should take a uint8_t
argument because it only works for one byte values.
Reduce indentation
Often, there are error cases in your code that you check for with an if
statement. Instead of nesting your code one level deeper per error case, you can often return from the error case instead to prevent unnecessary indentation. For example, in your code:
if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail; if (buf) { // Rest of function indented.
to this:
if (!buf) {
if (errinfo)
*errinfo |= buffer_alloc_fail;
return NULL;
}
// Rest of function not indented
Checking for the pointer repeatedly
In ucs4()
, you do this a lot:
if (errinfo) *errinfo |= bad_following_character;
Rather than check if the pointer is non-NULL every time, just use a local to hold the error and set the output value at the end:
{
enum errinfo error = 0;
// ...
if (nlo(*p)!=1)
error |= bad_following_character;
// ...
if (errinfo)
*errinfo = error;
return buf;
}
Code style
There's a lot that I don't like about your code style. First of all, I feel like you sacrifice clarity for brevity a lot. For example, here:
// Nested assignments? switch(pre=nlo(x=*p++)){
and here:
// Massively long and indecipherable line: test_case(memcmp((int[]){97,98,99},ucs4(utf8((int[]) {97,98,99},3,NULL,NULL),3,NULL),3*sizeof(int)))
Also, you use nonstandard style, like here with the comma operator:
// Comma operator instead of brace? else if (x <= lsb(11)) *p++=msb(2)| (x>>6), *p++=msb(1)| (x & lsb(6));
and here with the inline table:
// Inline table? if (x < ((int[]){0,0,0x80,0x800,0x10000})[pre])
Futhermore, you use short cryptic names that might mean the right thing to you, but to others might mean something else. For example, lsb(n)
to you means "a mask of n least significant bits". To me, it means "the least significant byte of n". Plus, it's strange that lsb
works with int sized values, but msb
only works with 8-bit values. On the other hand, you did comment these macros, so at least there was some explanation.
Summary
Other than the minor bugs pointed out above, the program seems to work fairly well. Mainly, the problems I have with the code are with its readability and clarity rather than its functionality. Your testing is minimal, and you should write tests that cover all the edge cases and error conditions.
-
\$\begingroup\$ I've been trying to think of better names than
lsb
andmsb
. What do you think oflo
andhi
, ormask
andhimask
? \$\endgroup\$luser droog– luser droog2015年08月03日 19:10:48 +00:00Commented Aug 3, 2015 at 19:10 -
\$\begingroup\$ @luserdroog I would prefer
lomask
andhimask
. \$\endgroup\$JS1– JS12015年08月03日 19:11:39 +00:00Commented Aug 3, 2015 at 19:11 -
\$\begingroup\$ Excellent ideas. Thank you. A new language is begun. Wrt brevity v. readability, the target audience for this project is J programmers, specifically J golfers, so that's my excuse. But I didn't want to apologize upfront, potentially skewing the results. Certain things like the inline table, I plan to keep the same. But I've adopted (or very carefully considered) each of your suggestions. \$\endgroup\$luser droog– luser droog2015年08月03日 22:56:33 +00:00Commented Aug 3, 2015 at 22:56