This is designed for a high performance complex log analyzer. Very simple idea: read a file line-by-line as fast as possible.
I would appreciate any hints what should/could be improved in this code.
FastLineReader.h
/* Copyright (c) 2015 Simon Toth [email protected]
* Lincensed under the MIT license: http://opensource.org/licenses/MIT
*/
#ifndef FASTLINEREADER_H
#define FASTLINEREADER_H
// STD C++
#include <iosfwd>
/** Quick line-by-line parser of text files for POSIX/Linux
*
* This function provides a fast line parser with a callback model.
*
* @param filename file to be parsed
* @param callback function that will be called for each line
* @returns 0 on success, -1 if file could not be opened
**/
int fastLineParser(const char * const filename, void (*callback)(const char * const, const char * const));
#endif // FASTLINEREADER_H
FastLineReader.cpp
#include "FastLineReader.h"
// POSIX
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
// C++ STD
#include <iostream>
#include <cstring>
using namespace std;
int fastLineParser(const char * const filename, void (*callback)(const char * const, const char * const))
{
int fd = open(filename, O_RDONLY); // open file
if (fd == -1)
{
cerr << "Could not open \"" << filename << "\" for reading (" << strerror(errno) << ")." << endl;
return -1;
}
struct stat fs;
if (fstat(fd, &fs) == -1)
{
cerr << "Could not stat \"" << filename << "\" for reading (" << strerror(errno) << ")." << endl;
close(fd);
return -1;
}
posix_fadvise(fd,0,0,1); // announce the desire to sequentialy read this file
// silent error handling - weak error
char *buf = static_cast<char*>(mmap(0, static_cast<size_t>(fs.st_size), PROT_READ, MAP_SHARED, fd, 0));
if (buf == MAP_FAILED)
{
cerr << "Could not memory map file \"" << filename << "\" (" << strerror(errno) << ")." << endl;
close(fd);
return -1;
}
char *buff_end = buf + fs.st_size;
char *begin = buf, *end = NULL;
// search for newline in the remainder in the file
while ((end = static_cast<char*>(memchr(begin,'\n',static_cast<size_t>(buff_end-begin)))) != NULL)
{
callback(begin,end);
if (end != buff_end)
begin = end+1;
else
break;
}
// enable if you are working with malformed text files, proper text file needs to end with a newline
#ifdef MALFORMED_TEXFILE
callback(begin,buff_end);
#endif
munmap(buf, static_cast<size_t>(fs.st_size));
// silent error handling - weak error
close(fd);
return 0;
}
2 Answers 2
posix_fadvise
conveniently provides aPOSIX_FADV_SEQUENTIAL
macro. Use it instead of a magic1
and a comment.The client doesn't know in advance whether the file is malformed or not. Better detect a malformed text file in run time:
if (begin != buf_end)
A Bugs section of
posix_fadvise
man page says thatIn kernels before 2.6.6, if len was specified as 0, then this was interpreted literally as "zero bytes", rather than as meaning "all bytes through to the end of the file".
Since you already know the file size, better be safe and call it with
fs.st_size
instead of0
.Move
#include <iosfwd>
to thecpp
file. The client code doesn't need it.fastLineParser
can be used in C code; just declare it asextern "C"
I see no reason to use C++ here at all. However, if you do so, do not
use namespace std
.Finally, do you have any evidence that this is faster than
fgets
?
-
\$\begingroup\$ Can you explain what is the problem with
using namespace std
? \$\endgroup\$Šimon Tóth– Šimon Tóth2015年03月31日 20:56:54 +00:00Commented Mar 31, 2015 at 20:56 -
\$\begingroup\$ @Let_Me_Be The point of a namespace to isolate names defined by namespace from names defined in your program. By
using namespace std
you create a maintenance problem (collision in the best case, erroneous behaviour in the worst). See discussion here: stackoverflow.com/questions/1452721/… \$\endgroup\$vnp– vnp2015年03月31日 21:09:04 +00:00Commented Mar 31, 2015 at 21:09 -
1\$\begingroup\$ Wait what? How can using namespace create a collision? I'm not defining a new namespace, the function itself still exists inside the global namespace (which it, I admit, probably shouldn't) so that isn't affected and it definitely shouldn't affect the way standard library is linked. \$\endgroup\$Šimon Tóth– Šimon Tóth2015年03月31日 21:38:09 +00:00Commented Mar 31, 2015 at 21:38
-
\$\begingroup\$ Oh, OK i finally got what the link is trying to say. Theoretically there could be a name collision between the POSIX headers and STD C++ library headers I'm using. \$\endgroup\$Šimon Tóth– Šimon Tóth2015年03月31日 21:50:58 +00:00Commented Mar 31, 2015 at 21:50
I see repeated code for the error reporting. Since this is supposed to be a library (I don't see main()
anywhere), the caller may not want the errors to go to cerr
anyway. Why not pass in an error-handling callback function?
Also it looks like you're not completely const
-correct. The char *
pointers ( buff_end
, buf
, begin
, and end
) should all be const
right? And the static_cast<>
also?
reinterpret_cast
to cast fromvoid*
tochar*
.static_cast
is for things like integers and polymorphic types. \$\endgroup\$static_cast
when casting fromvoid*
static_cast
is just as valid (arguably more valid depending on how you read the standard), and it's not vulnerable to future surprises of changing a type somewhere and then attempting an insane cast. \$\endgroup\$