I am building a web server from scratch and trying to make certain tasks faster by embedding C code into the mix for performance. Specifically I'm worried about how the std::string
class with the .find()
and other functions compare to straight pointer arithmetic.
#include <iostream>
#include <map>
#include <string>
std::map<std::string, std::string> http_request;
void parse_header( void * );
int main()
{
char * msg= "GET / HTTP/1.1\r\n"
"Host: 192.241.213.46:6880\r\n"
"Upgrade-Insecure-Requests: 1\r\n"
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
"Accept-Language: en-us\r\n"
"Accept-Encoding: gzip, deflate\r\n"
"Connection: keep-alive\r\n\r\n";
parse_header( msg );
}
void parse_header( void *msg )
{
char *head = (char *) msg;
char *mid;
char *tail = head;
if( sizeof( msg ) == 0 )
{
return;
}
// Find request type
while( *head++ != ' ');
http_request[ "Type" ] = std::string( ( char * ) msg ).substr( 0 , ( head - 1) - tail );
// Find path
tail = head;
while( *head++ != ' ');
http_request[ "Path" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );
// Find HTTP version
tail = head;
while( *head++ != '\r');
http_request[ "Version" ] = std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( head - 1) - tail );
// Map all headers from a key to a value
while( true )
{
tail = head + 1;
while( *head++ != '\r' );
mid = strstr( tail, ":" );
// Look for the failed strstr
if( tail > mid )
break;
http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );
}
// Determine if successful
std::cout << http_request[ "Host" ] << std::endl;
std::cout << http_request[ "Upgrade-Insecure-Requests" ] << std::endl;
std::cout << http_request[ "Accept" ] << std::endl;
std::cout << http_request[ "User-Agent" ] << std::endl;
std::cout << http_request[ "Accept-Language" ] << std::endl;
std::cout << http_request[ "Accept-Encoding" ] << std::endl;
std::cout << http_request[ "Connection" ] << std::endl;
}
Data comes in via the linux socket.send()
function, so it is a void *
type. I directly send that data to a parse_header
function to create a std::map
for easy access. Is it necessary to be this low-level, or can this speed be obtained by the STL?
Note: I killed OO design for a minimal example. The end-goal has an HTTP class.
Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics, I'd like it to be as conforming as possible while also being fast and lightweight. This seems to be the purest form of the cluge of C/C++.
2 Answers 2
Note++: I'd prefer to have this code review pertain to performance rather than c++ semantics
I get it, I really do (he says unconvincingly); but I also really think that you're going about this backwards. Step 1 is always to "make it work"; you don't get to "make it fast, make it cheap" until after you get it working. So you should momentarily forget about performance and just work on fixing the bugs and infelicities; and then if it turns out to be slow, you find the bottlenecks and improve them. (I'm not going as far as to say you need a profiler; you can find bottlenecks with eyeballs and intuition too. But there's no point in micro-polishing code until the bugs are out.)
So: sizeof(msg)==0
is never true. It looks like you're wanting to check the size of the message itself (like, number of bytes in the message); so, you need to add a function parameter to get the message size. (void *msg, size_t msg_size)
would do fine. In C++17 the new party line is probably to use a std::string_view
.
while( *head++ != ' ');
is a guaranteed segfault, as soon as I throw a malformed request at your server. Don't even give me the opportunity to segfault your code. Say what you mean:
while (head != msg_end && *head == ' ') ++head;
The repeated subexpression std::string( ( char * ) msg )
is a red flag: both because you're creating a string over and over (which does heap allocation, which is slow) and because you're using a C-style cast (type-casts are always red flags in both C and C++) and because you're casting away const
. /me checks the function signature again — No, you're not casting away const
. But you should be! A parsing function shouldn't be allowed to mutate the message buffer. Make that signature (const void *msg, size_t msg_size)
. Or even better: (const char *msg, const char *msg_end)
.
Iostreams are slow; don't use std::cout << ...
if you care about micro-optimizing performance. — But in fact you shouldn't be printing output from within your parsing function anyway! Let's just have the parsing function return the map of headers.
Also, your variable names seem confused. I'd try to make each extracted string run from head
to tail
, since that's how we usually talk about things in English. So:
void parse_header(const char *msg, const char *msg_end)
{
const char *head = msg;
const char *tail = msg;
// Find request type
while (tail != msg_end && *tail != ' ') ++tail;
http_request["Type"] = std::string(head, tail);
// Find path
while (tail != msg_end && *tail == ' ') ++tail;
head = tail;
while (tail != msg_end && *tail != ' ') ++tail;
http_request["Path"] = std::string(head, tail);
// Find HTTP version
while (tail != msg_end && *tail == ' ') ++tail;
head = tail;
while (tail != msg_end && *tail != '\r') ++tail;
http_request["Version"] = std::string(head, tail);
if (tail != msg_end) ++tail; // skip '\r'
// TODO: what about the trailing '\n'?
// Map all headers from a key to a value
head = tail;
while (head != msg_end && *head != '\r') {
while (tail != msg_end && *tail != '\r') ++tail;
const char *colon = memchr(head, tail, ':');
if (colon == NULL) {
// TODO: malformed headers, what should happen?
break;
}
const char *value = colon+1;
while (value != tail && *value == ' ') ++value;
http_request[ std::string(head, colon) ] = std::string(value, tail);
head = tail+1;
// TODO: what about the trailing '\n'?
}
return http_request;
}
Completely untested, mind you; but the most important things about the above code are that
- it uses the appropriate std::string
constructor instead of repeatedly allocating a gigantic string and then substr
'ing parts of it;
- it follows your original code (good idea!) in repeating the same basic idiom over and over, mechanically, so that if there ends up being a bug in the logic it's easy to see all the places to apply the fix.
Another thing to try if you have lots of headers, or if you're planning to do lots of lookups in the resulting map, would be to try using std::unordered_map
instead of std::map
. Less pointer-chasing often means faster code.
-
\$\begingroup\$ Thanks for the reminder about seg faults. I was assuming too much in my code. I also had no idea about that particular std::string constructor, so yes, that definitely simplified the code! I'm also guessing you'd rather message_end rather than message_length because you'd have to have more math to see if the pointer is in the right position, correct (i.e., tail - msg < msg_end - msg)? Finally, is &msg[ sizeof( msg ) - 1] the cheapest way to get the end of an array? \$\endgroup\$Shawnic Hedgehog– Shawnic Hedgehog2017年03月06日 21:10:17 +00:00Commented Mar 6, 2017 at 21:10
-
\$\begingroup\$ If you're interested, I also benchmarked the two which can be found here pastebin.com/bLBzapgi. This solution is approximately 3x faster than the original.. really goes to show how expensive the heap is. \$\endgroup\$Shawnic Hedgehog– Shawnic Hedgehog2017年03月06日 22:09:30 +00:00Commented Mar 6, 2017 at 22:09
-
\$\begingroup\$ @ShawnicHedgehog: I think you're still misunderstanding what
sizeof
actually does. It is not a synonym forstrlen
. \$\endgroup\$Quuxplusone– Quuxplusone2017年03月07日 00:38:04 +00:00Commented Mar 7, 2017 at 0:38 -
1\$\begingroup\$ I am aware that sizeof returns the amount of bytes. I'm misusing sizeof in the OP with a pointer. On my machine a character is 1 byte. If I wanted it to be perfect I would do sizeof( data ) / sizeof ( data[0] ) which would find the number of elements in an array. It works in this example because I'm specifying const char array[] instead of const char *array. I thought we are not able to update the OP's code, otherwise I would've fixed it. \$\endgroup\$Shawnic Hedgehog– Shawnic Hedgehog2017年03月07日 00:41:35 +00:00Commented Mar 7, 2017 at 0:41
-
1\$\begingroup\$ Didn't mean to sound torte, hard to set a tone over the internet. Thanks for the advice, strchr seems to be what I'm looking for. \$\endgroup\$Shawnic Hedgehog– Shawnic Hedgehog2017年03月07日 01:11:06 +00:00Commented Mar 7, 2017 at 1:11
First of all, The Http server send all the headers Line by Line. In your sample i see you created a dummy sample with the next buffer:
char * msg= "GET / HTTP/1.1\r\n"
"Host: 192.241.213.46:6880\r\n"
"Upgrade-Insecure-Requests: 1\r\n"
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8\r\n"
"Accept-Language: en-us\r\n"
"Accept-Encoding: gzip, deflate\r\n"
"Connection: keep-alive\r\n\r\n";
I suppose that you first recive all the header before analyzing it with your method void parse_header( void *msg )
... and that is the reason that the task to get each Key-Value is so hard because you are using parsers like this:
http_request[ std::string( ( char * ) msg ).substr( tail - ( char *)msg , ( mid ) - tail ) ] = std::string( ( char * ) msg ).substr( mid + 2 - ( char *) msg , ( head - 3 ) - mid );
What will I do to improve your code:
I will keep the
map<string,string>
to store all headers.Read the header one line by time and, before to continue with the other line, analyze it.
I don't see any issue with the code to parse the first but I will modify it to analyze is a single line and simplify the parsers.
If you read and analyze one line by line, the code to map all the keys with values could be like this:
void parseFirstLine(string line)
{
string key = "";
string value = "";
int position, lpost;
// Find request type
position = line.find(' ');
http_request[ "Type" ] = line.substr(0, position);
position++; //Skip character ' '
// Find path
position = line.find(' ', lpost);
http_request[ "Path" ] = line.substr(lpost, (position-lpost));
position++; //Skip character ' '
// Find HTTP version
http_request[ "Version" ] = line.substr(position);
}
void parseHeader(string line)
{
string key = "";
string value = "";
if(data.size() == 0) return;
int posFirst = line.find(":",0); //Look for separator ':'
key = line.substr(0, posFirst);
value = line.substr(posFirst + 1);
http_request[key] = value;
}
(*) This code could have some bugs because I'm writing this code without any developer environment to review this.
Is your decision where you get each line to analyze, at the moment you get it from socket reading or after finishing of read all the header.
-
\$\begingroup\$ I think you misunderstood.. I am the Http server; this is server-side code. Data is received using the socket recv function man7.org/linux/man-pages/man2/recv.2.html. This function requires a data buffer and a known data buffer length. To my knowledge it would be incredibly more difficult to read 1 line at a time since you have to assign a length, and you don't know the length because the first line is variable, and the specific headers required change if authorization is necessary. \$\endgroup\$Shawnic Hedgehog– Shawnic Hedgehog2017年03月06日 17:39:26 +00:00Commented Mar 6, 2017 at 17:39
-
\$\begingroup\$ Yes, @Shawnic-Hedgehog, i understood you and it is the same case for both sides. The server always recives a Http request which the header are segment by lines followed by the contest, the response has the same structure. I have some sample to read the Request line by line. I have a sample of how to read it, but use the lib curl: github.com/jorgemedra/HTTPSClient/blob/master/HTTPSClient/…. My post is in spanish. \$\endgroup\$Jorge Omar Medra– Jorge Omar Medra2017年03月06日 18:14:23 +00:00Commented Mar 6, 2017 at 18:14
sizeof(msg) == 0
, I don't think this is what you meant. \$\endgroup\$nginx
\$\endgroup\$