Parsing HTTP Headers in C++

Question 1

I am building a web server from scratch and trying to make certain tasks faster by embedding C code into the mix for performance. Specifically I'm worried about how the std::string class with the .find() and other functions compare to straight pointer arithmetic.

#include <iostream>
#include <map>
#include <string>
std::map<std::string, std::string> http_request;
void parse_header( void * );
int main()
{
 char * msg= "GET / HTTP/1.1\r\n"
 "Host: 192.241.213.46:6880\r\n"
 "Upgrade-Insecure-Requests: 1\r\n"
 "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
 "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
 "Accept-Language: en-us\r\n"
 "Accept-Encoding: gzip, deflate\r\n"
 "Connection: keep-alive\r\n\r\n";
 parse_header( msg );
}
void parse_header( void *msg )
{
 char *head = (char *) msg;
 char *mid;
 char *tail = head;
 if( sizeof( msg ) == 0 )
 {
 return;
 }
 // Find request type
 while( *head++ != ' ');
 http_request[ "Type" ] = std::string( ( char * ) msg ).substr( 0 , ( head - 1) - tail );
 // Find path
 tail = head;
 while( *head++ != ' ');
 http_request[ "Path" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );
 // Find HTTP version
 tail = head;
 while( *head++ != '\r');
 http_request[ "Version" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );
 // Map all headers from a key to a value
 while( true )
 {
 tail = head + 1;
 while( *head++ != '\r' );
 mid = strstr( tail, ":" ); 
 // Look for the failed strstr 
 if( tail > mid )
 break;
 http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );
 }
 // Determine if successful
 std::cout << http_request[ "Host" ] << std::endl; 
 std::cout << http_request[ "Upgrade-Insecure-Requests" ] << std::endl; 
 std::cout << http_request[ "Accept" ] << std::endl; 
 std::cout << http_request[ "User-Agent" ] << std::endl; 
 std::cout << http_request[ "Accept-Language" ] << std::endl; 
 std::cout << http_request[ "Accept-Encoding" ] << std::endl; 
 std::cout << http_request[ "Connection" ] << std::endl; 
}

Data comes in via the linux socket.send() function, so it is a void * type. I directly send that data to a parse_header function to create a std::map for easy access. Is it necessary to be this low-level, or can this speed be obtained by the STL?

Note: I killed OO design for a minimal example. The end-goal has an HTTP class.

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics, I'd like it to be as conforming as possible while also being fast and lightweight. This seems to be the purest form of the cluge of C/C++.

Question 2

There is something wrong: sizeof(msg) == 0, I don't think this is what you meant.

Question 3

You are right. sizeof the pointer was not what I was meaning! Thanks!

Question 4

This is a false worry. The string implementation is efficient.

Question 5

IF you are worried abut speed use nginx

Question 6

@Loki I use nginx for the webserver, this code is for handling API requests, sorry for the confusion. I benchmarked the code against Jorge Omar Medra's example and my code comes out roughly ~20us or so faster( vs ~50us avg), which is most likely due to less function calls. I guess the question now is: the c++ code provided is more maintainable because everyone hates pointers, but is the difference in speed worth it if I'm the only programmer? I was thinking if you were getting thousands of requests a second 30us is more beneficial than 50us.

Question 7

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics

I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 is always to "make it work"; you don't get to "make it fast, make it cheap" until after you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)

So: sizeof(msg)==0 is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter to get the message size. (void *msg, size_t msg_size) would do fine. In C++17 the new party line is probably to use a std::string_view.

while( *head++ != ' '); is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:

while (head != msg_end && *head == ' ') ++head;

The repeated subexpression std::string( ( char * ) msg ) is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++) and because you're casting away const. /me checks the function signature again — No, you're not casting away const. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature (const void *msg, size_t msg_size). Or even better: (const char *msg, const char *msg_end).

Iostreams are slow; don't use std::cout << ... if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing function return the map of headers.

Also, your variable names seem confused. I'd try to make each extracted string run from head to tail, since that's how we usually talk about things in English. So:

void parse_header(const char *msg, const char *msg_end)
{
 const char *head = msg;
 const char *tail = msg;
 // Find request type
 while (tail != msg_end && *tail != ' ') ++tail;
 http_request["Type"] = std::string(head, tail);
 // Find path
 while (tail != msg_end && *tail == ' ') ++tail;
 head = tail;
 while (tail != msg_end && *tail != ' ') ++tail;
 http_request["Path"] = std::string(head, tail);
 // Find HTTP version
 while (tail != msg_end && *tail == ' ') ++tail;
 head = tail;
 while (tail != msg_end && *tail != '\r') ++tail;
 http_request["Version"] = std::string(head, tail);
 if (tail != msg_end) ++tail; // skip '\r'
 // TODO: what about the trailing '\n'?
 // Map all headers from a key to a value
 head = tail;
 while (head != msg_end && *head != '\r') {
 while (tail != msg_end && *tail != '\r') ++tail;
 const char *colon = memchr(head, tail, ':');
 if (colon == NULL) {
 // TODO: malformed headers, what should happen?
 break;
 }
 const char *value = colon+1;
 while (value != tail && *value == ' ') ++value;
 http_request[ std::string(head, colon) ] = std::string(value, tail);
 head = tail+1;
 // TODO: what about the trailing '\n'?
 }
 return http_request;
}

Completely untested, mind you; but the most important things about the above code are that - it uses the appropriate std::string constructor instead of repeatedly allocating a gigantic string and then substr'ing parts of it; - it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.

Another thing to try if you have lots of headers, or if you're planning to do lots of lookups in the resulting map, would be to try using std::unordered_map instead of std::map. Less pointer-chasing often means faster code.

Question 8

Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array?

Question 9

If you're interested, I also benchmarked the two which can be found here pastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is.

Question 10

@ShawnicHedgehog: I think you're still misunderstanding what sizeof actually does. It is not a synonym for strlen.

Question 11

I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it.

Question 12

Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for.

Question 13

First of all, The Http server send all the headers Line by Line. In your sample i see you created a dummy sample with the next buffer:

char * msg= "GET / HTTP/1.1\r\n"
 "Host: 192.241.213.46:6880\r\n"
 "Upgrade-Insecure-Requests: 1\r\n"
 "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
 "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
 "Accept-Language: en-us\r\n"
 "Accept-Encoding: gzip, deflate\r\n"
 "Connection: keep-alive\r\n\r\n";

I suppose that you first recive all the header before analyzing it with your method void parse_header( void *msg )... and that is the reason that the task to get each Key-Value is so hard because you are using parsers like this:

http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );

What will I do to improve your code:

I will keep the map<string,string> to store all headers.
Read the header one line by time and, before to continue with the other line, analyze it.
I don't see any issue with the code to parse the first but I will modify it to analyze is a single line and simplify the parsers.

If you read and analyze one line by line, the code to map all the keys with values could be like this:

void parseFirstLine(string line)
{
 string key = "";
 string value = "";
 int position, lpost;
 // Find request type
 position = line.find(' ');
 http_request[ "Type" ] = line.substr(0, position);
 position++; //Skip character ' '
 // Find path
 position = line.find(' ', lpost);
 http_request[ "Path" ] = line.substr(lpost, (position-lpost));
 position++; //Skip character ' '
 // Find HTTP version
 http_request[ "Version" ] = line.substr(position);
}
void parseHeader(string line)
{
 string key = "";
 string value = "";
 if(data.size() == 0) return;
 int posFirst = line.find(":",0); //Look for separator ':'
 key = line.substr(0, posFirst);
 value = line.substr(posFirst + 1);
 http_request[key] = value;
}

(*) This code could have some bugs because I'm writing this code without any developer environment to review this.

Is your decision where you get each line to analyze, at the moment you get it from socket reading or after finishing of read all the header.

Question 14

I think you misunderstood.. I am the Http server; this is server-side code. Data is received using the socket recv function man7.org/linux/man-pages/man2/recv.2.html. This function requires a data buffer and a known data buffer length. To my knowledge it would be incredibly more difficult to read 1 line at a time since you have to assign a length, and you don't know the length because the first line is variable, and the specific headers required change if authorization is necessary.

Question 15

Yes, @Shawnic-Hedgehog, i understood you and it is the same case for both sides. The server always recives a Http request which the header are segment by lines followed by the contest, the response has the same structure. I have some sample to read the Request line by line. I have a sample of how to read it, but use the lib curl: github.com/jorgemedra/HTTPSClient/blob/master/HTTPSClient/…. My post is in spanish.

Quuxplusone Quuxplusone 19.6k2 gold badges44 silver badges91 bronze badges · Accepted Answer · 2017-03-06 20:03:48Z

Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics

I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 is always to "make it work"; you don't get to "make it fast, make it cheap" until after you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)

So: sizeof(msg)==0 is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter to get the message size. (void *msg, size_t msg_size) would do fine. In C++17 the new party line is probably to use a std::string_view.

while( *head++ != ' '); is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:

while (head != msg_end && *head == ' ') ++head;

The repeated subexpression std::string( ( char * ) msg ) is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++) and because you're casting away const. /me checks the function signature again — No, you're not casting away const. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature (const void *msg, size_t msg_size). Or even better: (const char *msg, const char *msg_end).

Iostreams are slow; don't use std::cout << ... if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing function return the map of headers.

Also, your variable names seem confused. I'd try to make each extracted string run from head to tail, since that's how we usually talk about things in English. So:

void parse_header(const char *msg, const char *msg_end)
{
 const char *head = msg;
 const char *tail = msg;
 // Find request type
 while (tail != msg_end && *tail != ' ') ++tail;
 http_request["Type"] = std::string(head, tail);
 // Find path
 while (tail != msg_end && *tail == ' ') ++tail;
 head = tail;
 while (tail != msg_end && *tail != ' ') ++tail;
 http_request["Path"] = std::string(head, tail);
 // Find HTTP version
 while (tail != msg_end && *tail == ' ') ++tail;
 head = tail;
 while (tail != msg_end && *tail != '\r') ++tail;
 http_request["Version"] = std::string(head, tail);
 if (tail != msg_end) ++tail; // skip '\r'
 // TODO: what about the trailing '\n'?
 // Map all headers from a key to a value
 head = tail;
 while (head != msg_end && *head != '\r') {
 while (tail != msg_end && *tail != '\r') ++tail;
 const char *colon = memchr(head, tail, ':');
 if (colon == NULL) {
 // TODO: malformed headers, what should happen?
 break;
 }
 const char *value = colon+1;
 while (value != tail && *value == ' ') ++value;
 http_request[ std::string(head, colon) ] = std::string(value, tail);
 head = tail+1;
 // TODO: what about the trailing '\n'?
 }
 return http_request;
}

Completely untested, mind you; but the most important things about the above code are that - it uses the appropriate std::string constructor instead of repeatedly allocating a gigantic string and then substr'ing parts of it; - it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.

Another thing to try if you have lots of headers, or if you're planning to do lots of lookups in the resulting map, would be to try using std::unordered_map instead of std::map. Less pointer-chasing often means faster code.

Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array?
If you're interested, I also benchmarked the two which can be found here pastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is.
@ShawnicHedgehog: I think you're still misunderstanding what sizeof actually does. It is not a synonym for strlen.
I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it.
Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for.

Stack Exchange Network

Parsing HTTP Headers in C++

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parsing HTTP Headers in C++

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions