I have a program which splits a url into host and path:
int main(int argc, char** argv){
std::vector<std::string> url = parseUrl(argv[1]);
memset(&hints, 0, sizeof hints);
hints.ai_family = AF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
int MAXDATASIZE = 1024;
std::string request = "GET " +url.at(1) + " HTTP/1.1\r\n\r\n";
const char* charRequest = request.c_str();
char buf[MAXDATASIZE];
if((status = getaddrinfo(url.at(0).c_str(), "80", &hints, &servinfo)) != 0){
std::cout << "getaddrinfo: " << gai_strerror(status) << "\n";
return 2;
} else {
std::cout << "Server struct created successfully\n";
}
if((sock = socket(servinfo->ai_family, servinfo->ai_socktype, servinfo->ai_protocol)) == -1){
std::cout << "Error creating socket\n";
return 2;
} else {
std::cout << "Successfully created socket\n";
}
if(connect(sock, servinfo->ai_addr, servinfo->ai_addrlen) == -1){
std::cout << "Error connecting to host " << url.at(0) << " at port " << "80" << "\n";
return 2;
} else {
std::cout << "Connected successfully to " << url.at(0) << "at port " << "80" << "\n";
}
if((send(sock, charRequest, strlen(charRequest), 0)) == -1){
std::cout << "Error communicating with website\n";
return 2;
} else {
std::cout << "Sent request successfully to " << url.at(0) << " at port " << "80" << "\n";
}
if(recv(sock, buf, MAXDATASIZE, 0) == -1){
std::cout << "Error recciving data from " << url.at(0) << " at port " << "80" << "\n";
} else {
std::cout << "Recieved data successfully from " << url.at(0) << " at port " << "80" << "\n";
}
std::ofstream file;
file.open(argv[2]);
file << buf;
file.close();
close(sock);
freeaddrinfo(servinfo);
}
// Returns vector with host as first argument and path as second
std::vector<std::string> parseUrl(std::string url){
std::vector<std::string> ret;
size_t dotPos = url.find('.');
size_t slashPos = url.find('/', dotPos);
std::string host = url.substr(0, slashPos);
std::string path = url.substr(slashPos, url.length());
ret.push_back(host);
ret.push_back(path);
return ret;
}
Does anyone have suggestions for improvement. Especially about the clunky vector return?
1 Answer 1
Rather than using a std::vector
and using the indices to specify the host and path, we can create a little struct and refer to the members by name:
struct URL
{
std::string host;
std::string path;
};
So we can refer to url.host
instead of url.at(0)
.
As pointed out in the comments, the parsing method could be more accurate and comprehensive. It's fine to only support a small subset of the url syntax, but we should at least check that we got what we expected in the string.
In this case, I think we should:
- Throw an error if we don't find a dot.
- Either throw an error if we don't find a slash, OR set the path part of the url to
"/"
to ensure that ourGET
request is valid.
int MAXDATASIZE = 1024;
...
char buf[MAXDATASIZE];
This doesn't compile for me. MAXDATASIZE
must be a constant.
We can remove the need for a separate constant by using std::array
:
std::array<char, 1024> buf;
// now we can use buf.size() :)
Even for something small like this, it's probably worth writing our own socket
class. This provides some cleanup safety (RAII), as well as making the code in main()
a lot simpler and easier to read by removing some of details of the network code. e.g. something like the following:
struct socket
{
public:
socket();
explicit socket(struct addrinfo const& address);
socket(socket const&) = delete;
socket& operator=(socket const&) = delete;
socket(socket&& other);
socket& operator=(socket&& other);
~socket();
std::size_t send(char const* data, std::size_t data_size);
std::size_t receive(char* buffer, std::size_t buffer_size);
bool is_open() const;
void close();
private:
SOCKET m_handle;
};
We could also move the address lookup out of main()
into a separate function, and keep the addrinfo
in a unique pointer with a custom deleter that calls freeaddrinfo()
. Something like:
std::unique_ptr<addrinfo, std::function<void(addrinfo*)>> lookup_address(std::string const& node, std::string const& service, int ai_family, int ai_socktype, int ai_protocol, int ai_flags = 0)
{
auto ai_hints = addrinfo();
// ... set up hints!
auto output = (addrinfo*) nullptr;
auto result = getaddrinfo(node.c_str(), service.c_str(), &ai_hints, &output);
// ... check errors!
return { output, [] (addrinfo* ai) { freeaddrinfo(ai); } };
}
So we might end up a main like:
int main(int argc, char** argv)
{
if (argc != 3)
throw std::runtime_error("Unexpected argument count!");
auto url = parse_url(argv[1]);
auto address = lookup_address(url.host, "80", AF_UNSPEC, SOCK_STREAM, 0, 0);
if (!address)
throw std::runtime_error("Failed address lookup!");
auto s = socket(*address);
if (!s.is_open())
throw std::runtime_error("Failed to connect!");
auto request = "GET " + url.path + " HTTP/1.1\r\n\r\n";
auto sent = s.send(request.c_str(), request.size() + 1);
if (sent != request.size() + 1)
throw std::runtime_error("Failed to send request!");
auto buffer = std::array<char, 1024>();
auto received = s.receive(buffer.data(), buffer.size());
if (received == 0)
throw std::runtime_error("No data received!");
std::ofstream(argv[2]).write(buffer.data(), received);
}
(code not compiled or tested).
moved
back avoiding any copies. \$\endgroup\$dot
in it. See: en.wikipedia.org/wiki/URL. This image I find particularly helpful: en.wikipedia.org/wiki/File:URI_syntax_diagram.svg but the full definition is here: tools.ietf.org/html/rfc3986 \$\endgroup\$schema
,user
,host
,port
,path
,query
, fragment` (removing the dividers). Not in the wild a lot of URL are mall formed most browser take this into account. The worst offenders is not having the?
before the query. \$\endgroup\$