URL percent encoding function

Question 1

I wrote a custom URL encoding function and I'd like to run it past a few other experienced C developers. I have tested it on a few strings, and it has worked on all of them.

This is to be run on iOS devices, so memory and processor use are potentially a small concern.

Do you see any potential problems - UB, wrong return, excessive memory or processor usage, difficulty reading?

Any suggestions for improvement?

NSString *urlEncode(NSString *orig)
{
 static const BOOL safe[0x100] = {
 ['a'...'z'] = YES,
 ['A'...'Z'] = YES,
 ['0'...'9'] = YES,
 ['-'] = YES,
 ['_'] = YES,
 ['.'] = YES,
 ['~'] = YES
 };
 const char *digits = "0123456789ABCDEF";
 const char *cstr = orig.UTF8String;
 int clen = orig.length;
 int nlen = 3*clen;
 char newstr[nlen + 1];
 const char *o = cstr + clen;
 char *p = newstr + nlen;
 *p = 0;
 while(o > cstr) {
 unsigned char c = *--o;
 if(safe[c]) {
 *--p = c;
 } else {
 *--p = digits[c&0xf];
 *--p = digits[c>>4];
 *--p = '%';
 }
 }
 NSString *str = [NSString stringWithUTF8String:p];
 return str;
}

Yes, newStr could get long, but given that a URL> 2kB is unreliable, I think it's unlikely to be a problem in reality.

Question 2

Seems like you should be able to just use NSString's stringByAddingPercentEscapesUsingEncoding? Or, if you need to expand on what is escaped (as you likely would to implement a true URL encoding), you should be able to use CFURLCreateStringByAddingPercentEscapes (example of custom escaping: stackoverflow.com/a/8086845/567864)

Question 3

stringByAddingPercentEscapesUsingEncoding doesn't convert everything it should. Sure, I could use CFURLCreateStringByAddingPercentEscapes (and have for other projects), but I thought I'd roll my own this time.

Question 4

First of all, URL encoding is more nuanced than you might think. Be cautious when applying encoding.

As for your code, the only blatant bug I see is that it can segfault when encoding a non-ASCII string. The -length method returns the number of Unicode characters in the string. Since you are working one char at a time, you need the number of bytes instead.

NSUInteger clen = [orig lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
NSUInteger nlen = 3 * clen;

It would be better to change the method signature to:

NSString *urlEncode(const NSString *orig)

I think that using a for loop would improve readability. Currently, you declare/initialize *o, test it, and decrement it on three separate lines. A for loop would provide an instantly recognizable structure to make it obvious how those three lines of code are related:

for (const char *o = cstr + clen - 1; o >= cstr; o--) {
 unsigned char c = *o;
 if (safe[c]) {
 *--p = c;
 } else {
 *--p = digits[c & 0x0f];
 *--p = digits[c >> 4];
 *--p = '%';
 }
}

The null char would be better written as '0円'.

Personally, I would choose to iterate forward rather than backward through the string, as it simplifies the loop by a few instructions, and is easier to understand. It also gives you the option of calling -maximumLengthOfBytesUsingEncoding (which runs in O(1) time) instead of -lengthOfBytesUsingEncoding.

unsigned char c;
char *p = newstr;
for (const char *o = cstr; (c = *o); o++) {
 ...
}

200_success 200_success 146k22 gold badges190 silver badges479 bronze badges · Answer 1 · 2013-08-18 06:55:35Z

First of all, URL encoding is more nuanced than you might think. Be cautious when applying encoding.

As for your code, the only blatant bug I see is that it can segfault when encoding a non-ASCII string. The -length method returns the number of Unicode characters in the string. Since you are working one char at a time, you need the number of bytes instead.

NSUInteger clen = [orig lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
NSUInteger nlen = 3 * clen;

It would be better to change the method signature to:

NSString *urlEncode(const NSString *orig)

I think that using a for loop would improve readability. Currently, you declare/initialize *o, test it, and decrement it on three separate lines. A for loop would provide an instantly recognizable structure to make it obvious how those three lines of code are related:

for (const char *o = cstr + clen - 1; o >= cstr; o--) {
 unsigned char c = *o;
 if (safe[c]) {
 *--p = c;
 } else {
 *--p = digits[c & 0x0f];
 *--p = digits[c >> 4];
 *--p = '%';
 }
}

The null char would be better written as '0円'.

Personally, I would choose to iterate forward rather than backward through the string, as it simplifies the loop by a few instructions, and is easier to understand. It also gives you the option of calling -maximumLengthOfBytesUsingEncoding (which runs in O(1) time) instead of -lengthOfBytesUsingEncoding.

unsigned char c;
char *p = newstr;
for (const char *o = cstr; (c = *o); o++) {
 ...
}

Stack Exchange Network

URL percent encoding function

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

URL percent encoding function

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions