I wrote a custom URL encoding function and I'd like to run it past a few other experienced C developers. I have tested it on a few strings, and it has worked on all of them.
This is to be run on iOS devices, so memory and processor use are potentially a small concern.
Do you see any potential problems - UB, wrong return, excessive memory or processor usage, difficulty reading?
Any suggestions for improvement?
NSString *urlEncode(NSString *orig)
{
static const BOOL safe[0x100] = {
['a'...'z'] = YES,
['A'...'Z'] = YES,
['0'...'9'] = YES,
['-'] = YES,
['_'] = YES,
['.'] = YES,
['~'] = YES
};
const char *digits = "0123456789ABCDEF";
const char *cstr = orig.UTF8String;
int clen = orig.length;
int nlen = 3*clen;
char newstr[nlen + 1];
const char *o = cstr + clen;
char *p = newstr + nlen;
*p = 0;
while(o > cstr) {
unsigned char c = *--o;
if(safe[c]) {
*--p = c;
} else {
*--p = digits[c&0xf];
*--p = digits[c>>4];
*--p = '%';
}
}
NSString *str = [NSString stringWithUTF8String:p];
return str;
}
Yes, newStr
could get long, but given that a URL> 2kB is unreliable, I think it's unlikely to be a problem in reality.
1 Answer 1
First of all, URL encoding is more nuanced than you might think. Be cautious when applying encoding.
As for your code, the only blatant bug I see is that it can segfault when encoding a non-ASCII string. The -length
method returns the number of Unicode characters in the string. Since you are working one char
at a time, you need the number of bytes instead.
NSUInteger clen = [orig lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
NSUInteger nlen = 3 * clen;
It would be better to change the method signature to:
NSString *urlEncode(const NSString *orig)
I think that using a for
loop would improve readability. Currently, you declare/initialize *o
, test it, and decrement it on three separate lines. A for
loop would provide an instantly recognizable structure to make it obvious how those three lines of code are related:
for (const char *o = cstr + clen - 1; o >= cstr; o--) {
unsigned char c = *o;
if (safe[c]) {
*--p = c;
} else {
*--p = digits[c & 0x0f];
*--p = digits[c >> 4];
*--p = '%';
}
}
The null char
would be better written as '0円'
.
Personally, I would choose to iterate forward rather than backward through the string, as it simplifies the loop by a few instructions, and is easier to understand. It also gives you the option of calling -maximumLengthOfBytesUsingEncoding
(which runs in O(1) time) instead of -lengthOfBytesUsingEncoding
.
unsigned char c;
char *p = newstr;
for (const char *o = cstr; (c = *o); o++) {
...
}
NSString
'sstringByAddingPercentEscapesUsingEncoding
? Or, if you need to expand on what is escaped (as you likely would to implement a true URL encoding), you should be able to useCFURLCreateStringByAddingPercentEscapes
(example of custom escaping: stackoverflow.com/a/8086845/567864) \$\endgroup\$stringByAddingPercentEscapesUsingEncoding
doesn't convert everything it should. Sure, I could useCFURLCreateStringByAddingPercentEscapes
(and have for other projects), but I thought I'd roll my own this time. \$\endgroup\$