I have the following mockup that I need some feedback on how I might improve the url_decode
function to make it faster and use less memory. There are a couple of requirements. I have to use the StringArg
and StringReturn
structs. I also need to completely decode the string. I am aware of double encoded attacks, but this is for data mining and I need to get the string fully decoded, hence the while loop.
Please have a look at function StringReturn* url_decode(char* line)
and suggest anything that I can do to improve the speed or reduce the memory used.
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <ctype.h>
struct StringArg
{
int length;
char* data;
StringArg(const char* inData)
{
length = strlen(inData);
data = new char[length+1];
strcpy(data, inData);
}
};
struct StringReturn
{
int size;
char* data;
StringReturn(int inSize)
{
size = inSize;
data = new char[size+1];
}
};
StringArg* stringArg(char* line)
{
return new StringArg(line);
}
StringReturn* stringReturnInfo(int size)
{
return new StringReturn(size);
}
StringReturn* url_decode(char* line)
{
StringArg *str;
str = stringArg(line);
int lengths = str->length;
char* urlData = new char[lengths+1];
char* fmt = new char[4];
memcpy(urlData, str->data, str->length);
urlData[lengths] = 0;
bool bFoundChar = false;
int j = 0;
do
{
j = 0;
bFoundChar = false;
for (int i = 0; i < lengths; i++)
{
char ch = urlData[i];
if (ch == '+')
{
urlData[j++] = ' ';
}
else if (ch == '%' && i+2 < lengths && isxdigit(urlData[i+1]) && isxdigit(urlData[i+2]))
{
bFoundChar = true;
fmt = new char[4];
fmt[0] = urlData[++i];
fmt[1] = urlData[++i];
fmt[2] = 0;
urlData[j++] = (char)strtol(fmt, NULL, 16);
}
else {
urlData[j++] = ch;
}
}
lengths = j;
}while (bFoundChar);
StringReturn *ret = stringReturnInfo(j);
ret->size = j;
memcpy(ret->data, urlData, j);
delete urlData;
delete fmt;
return ret;
}
void test_decode(char* line)
{
StringReturn *str = url_decode(line);
printf(str->data);
}
int main(int argc, const char * argv[])
{
test_decode("bob.com?url=http%3A%2F%2Fsteve.com%3Furl%3Dhttp%253A%252F%252Fjohn.com");
return 0;
}
1 Answer 1
This is doomed to failure:
struct StringArg
{
int length;
char* data;
StringArg(const char* inData)
{
length = strlen(inData);
data = new char[length+1];
strcpy(data, inData);
}
};
This can only lead to disaster. It is unlikely it is more efficient than std::string and is much more likely to lead to memory leaks.
Use std::string it will work is correct and has been optimized very heavily you can not beat it using your own hand written one. This is a premature optimization.
Same Again:
struct StringReturn
{
int size;
char* data;
StringReturn(int inSize)
{
size = inSize;
data = new char[size+1];
}
};
Don't pass pointers around:
StringArg* stringArg(char* line)
{
return new StringArg(line);
}
Use std::unique_ptr to encapsulate the pointer object so that it does correct memory management. If you are doing things correctly the resulting code optimizes to a no-op and if you don't do things correctly it saves your ass. In fact I would just throw away your string classes and use std::string. The copy construction from a return is nearly always elided by NRVO and thus is much more efficient than you seem to think.
Same Again
StringReturn* stringReturnInfo(int size)
{
return new StringReturn(size);
}
Manually copying a C-string is very painful. Use std::string
int lengths = str->length;
char* urlData = new char[lengths+1];
char* fmt = new char[4];
memcpy(urlData, str->data, str->length);
urlData[lengths] = 0;
// Or you can do:
std::string urlData = line;
Also your code is not exception safe.
Please use RIAA to make sure that there are no leaks (or use std::string that does this already).
The definition of a URL can be found here: https://www.rfc-editor.org/rfc/rfc3986
Pretending the '+' is a space is only valid in the path section of a URI and not valid in the query or fragment parts of the URI thus scanning the whole string for '+' is not correct.
I also need to completely decode the string. I am aware of double encoded attacks
The specification does not allow for double (or nested) encoding. If it decodes to another '%' then that is the value. You don't try and re-interpret that value.