I wrote a function in C that reads an entire line from a stream into a dynamically allocated buffer. What would you suggest to improve its performance? Also, are there any errors that could occur that are not handled?
This is the code:
#define ERRALLOC(ptr) \
    if (ptr == NULL) { \
        printf("Unable to allocate memory for %s", #ptr); \
        exit(1); \
    }

#define PACK_SIZE 512

size_t readLine(FILE *istream, char **line_buffer, size_t *bufsize) {
    size_t line_size = 0;
    int eol_flag = 0;
    char temp_buffer[PACK_SIZE];
    while (fgets(temp_buffer, PACK_SIZE, istream) != NULL) {
        size_t temp_len = strlen(temp_buffer);
        line_size += temp_len;
        if (line_size+1 > *bufsize) {
            *bufsize = line_size+1;
            *line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));
            ERRALLOC(*line_buffer)
        }
        if (temp_buffer[ temp_len - 1 ] == '\n') {
            temp_buffer[ temp_len - 1 ] = '0円';
            eol_flag = 1;
        }
        strcpy(*line_buffer + (line_size-temp_len), temp_buffer);
        if (eol_flag) break;
    }
    return line_size;
}
3 Answers
Overall, it's a lot of effort (perhaps not in the right place) for one line. Textual data typically has lines that are very short, on the order of 80 bytes. If you're so deeply concerned about performance, you need to back out of this function and look at the I/O pattern of the calling program for more context. You'll very likely find that it's more efficient to have a buffer much larger than your current PACK_SIZE, perhaps 4 MB. Read the file into it one such chunk at a time with fread, not fgets; then iterate through to find and split on newlines. Roughly,
1. Allocate a large buffer
2. fread as much as possible
3. memchr your way to a newline
4. Set that newline to '0円'
5. Consume the line
6. Repeat 3-5 until you're at the end of the buffer
7. If there is a partial line segment at the end, then memcpy the segment to the beginning of the buffer and plan for the next fread to use a buffer write position that starts after the end of the segment
8. Repeat until end of file
The principle to heed here is that I/O (particularly disk I/O) is expensive, and frequent calls to the I/O layer are expensive even when cached at the operating system level. Doing as much in RAM as possible is strongly preferred.
If you want a convenience interface, then consider accepting a FILE*
and a callback function, making the buffer locally rather than accepting it, and invoking the callback for each line when found.
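A minimal sketch of that combination, assuming a fixed chunk size and a caller-supplied callback (the names for_each_line, line_cb, and CHUNK_SIZE are illustrative, not part of the original answer):

#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE (4u * 1024 * 1024)   /* illustrative; tune for your workload */

typedef void (*line_cb)(const char *line, size_t len, void *ctx);

/* Read istream in large chunks, split on '\n', and hand each line to cb.
   Note: a single line longer than CHUNK_SIZE is not handled by this sketch. */
static void for_each_line(FILE *istream, line_cb cb, void *ctx)
{
    static char buf[CHUNK_SIZE + 1];
    size_t carry = 0;   /* bytes of a partial line carried over from the previous chunk */

    for (;;) {
        size_t got = fread(buf + carry, 1, CHUNK_SIZE - carry, istream);
        if (got == 0) {
            if (carry > 0) {            /* final line without a trailing newline */
                buf[carry] = '0円';
                cb(buf, carry, ctx);
            }
            break;
        }

        char *cursor = buf;
        size_t remaining = carry + got;
        char *nl;
        while ((nl = memchr(cursor, '\n', remaining)) != NULL) {
            *nl = '0円';                            /* terminate the line in place */
            cb(cursor, (size_t)(nl - cursor), ctx);
            remaining -= (size_t)(nl - cursor) + 1;
            cursor = nl + 1;
        }

        memmove(buf, cursor, remaining);   /* keep the partial line at the front */
        carry = remaining;
    }
}

A caller would then invoke something like for_each_line(fp, handle_line, NULL) and do all per-line work inside the callback.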
- This approach is problematic for user I/O with stdin, as function return is usually expected after each '\n'. OP's code is usable with both stream and file lines. The expense of file I/O is lower with SSDs than hard disks, negating some of this approach's benefit. – chux, Jan 20, 2021 at 5:25
- Still, direct use of fgets() has its corner-case drawbacks too, so good to see alternatives. – chux, Jan 20, 2021 at 5:29
#define ERRALLOC(ptr) \
    if (ptr == NULL) { \
        printf("Unable to allocate memory for %s", #ptr); \
        exit(1); \
    }
That's a poor way to handle allocation failure, and makes your function pretty much unusable in any realistic program context. It's very rude to simply exit() from within a library function, and it's unhelpful to write to stdout rather than stderr, which is designed for such messages. And why are you not writing a complete line, with \n at the end?
*line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));
Don't do this; when realloc() fails it returns NULL, and assigning that NULL over *line_buffer loses the only pointer to the still-allocated memory, which is then leaked. The standard pattern is
void *new_mem = realloc(buffer, newsize);
if (!new_mem) {
    return buffer;
    /* or free(buffer); return new_mem; */
}
buffer = new_mem;
Note also that sizeof (char) is 1 by definition, so multiplying by it has no effect.
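Applied to readLine(), that pattern might look like this sketch (returning 0 on failure is an assumption; use whatever error convention the rest of your code follows):

if (line_size + 1 > *bufsize) {
    char *new_buf = realloc(*line_buffer, line_size + 1);
    if (new_buf == NULL) {
        /* *line_buffer is still valid and still owned by the caller */
        return 0;
    }
    *line_buffer = new_buf;
    *bufsize = line_size + 1;
}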
What does line_size represent?
String length (like strlen()), string size, or what?
With input "abc\n", *line_buffer is "abc" and line_size is 4.
With input "abc" followed by end-of-file, *line_buffer is "abc" and line_size is 3.
Input error not handled
When fgets() returns NULL due to an input error, consider also returning 0 regardless of the number of characters read.
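One way to detect that case (a sketch; returning 0 on error follows the suggestion above):

while (fgets(temp_buffer, PACK_SIZE, istream) != NULL) {
    /* ... loop body as before, ending with `if (eol_flag) break;` ... */
}
if (!eol_flag && ferror(istream)) {
    return 0;  /* fgets() stopped because of an input error, not end of line/file */
}
return line_size;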
Reading a null character
Code has undefined behavior (UB) when the first character read is a null character: strlen(temp_buffer) is then 0, so temp_buffer[ temp_len - 1 ] becomes temp_buffer[SIZE_MAX].
When a null character is read anywhere, size_t temp_len = strlen(temp_buffer); fails to correctly report the length of data read.
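A minimal guard against the out-of-bounds index (it does not fix the mis-reported length, which needs a different reading strategy than fgets() plus strlen()):

size_t temp_len = strlen(temp_buffer);
if (temp_len > 0 && temp_buffer[temp_len - 1] == '\n') {  /* skip the test when nothing was read */
    temp_buffer[temp_len - 1] = '0円';
    eol_flag = 1;
}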
Minor simplification
strcpy(*line_buffer + (line_size-temp_len), temp_buffer); could happen before the if (temp_buffer[ ... block and simplify the break condition.
Yet, I would think line_size += temp_len; should be reduced by 1 whenever the if (temp_buffer[ ... block replaces the '\n'.
I'd expect memcpy() to be a tad faster here than strcpy(), but fgets() is the tall pole in the tent here.
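A sketch of that reordering (reducing line_size when the newline is stripped is my reading of the point above, not code from the answer; the buffer-growth step is elided):

size_t temp_len = strlen(temp_buffer);
/* ... grow *line_buffer so it can hold line_size + temp_len + 1 bytes ... */
memcpy(*line_buffer + line_size, temp_buffer, temp_len + 1);  /* copies the '0円' too */
line_size += temp_len;
if (temp_len > 0 && (*line_buffer)[line_size - 1] == '\n') {
    (*line_buffer)[--line_size] = '0円';  /* strip the newline and shrink the count */
    break;                                /* end of line reached: no eol_flag needed */
}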
Optimization
As long as *bufsize is not tiny, I'd consider doing the first read directly into *line_buffer. Since the code is re-using that buffer, at some point it is likely big enough for most subsequent reads.
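A sketch of that first-read fast path (the minimum-size check, the INT_MAX bound, and keeping the OP's convention of counting the stripped '\n' in line_size are my assumptions; the answer does not spell them out):

size_t line_size = 0;
if (*line_buffer != NULL && *bufsize >= 2 && *bufsize <= INT_MAX) {  /* INT_MAX from <limits.h> */
    /* First read goes straight into the caller's buffer, skipping one copy. */
    if (fgets(*line_buffer, (int)*bufsize, istream) == NULL)
        return 0;
    line_size = strlen(*line_buffer);
    if (line_size > 0 && (*line_buffer)[line_size - 1] == '\n') {
        (*line_buffer)[line_size - 1] = '0円';  /* match existing behavior: strip '\n' but keep it counted */
        return line_size;
    }
}
/* Otherwise fall through to the existing temp_buffer loop for the rest of the line. */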
Sizing by reference rather than type
Consider the easier to review and maintain:
//*line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));
*line_buffer = realloc(*line_buffer, *bufsize * sizeof **line_buffer);
Missing example
Proper usage of readLine()
lacks a guiding example. Perhaps:
char *line_buffer = NULL;
size_t bufsize = 0;
size_t sz;
while ((sz = readLine(stdin, &line_buffer, &bufsize)) > 0) {
    printf("Sz:%zu <%s>\n", sz, line_buffer);
}
free(line_buffer);
Naming consistency
3 styles in one line. Consider sticking to one style.
readLine(FILE *istream, char **line_buffer, size_t *bufsize)
readLine: camel case
line_buffer: `_` separation
bufsize: packed
Research restrict
This provides some self-documentation and hints for compiler optimization.
///size_t readLine(FILE *istream, char **line_buffer, size_t *bufsize) {
size_t readLine(FILE * restrict istream, char ** restrict line_buffer,
size_t * restrict bufsize) {
Missing #include <>s
At least <stdio.h>, <string.h>, <stdlib.h>.
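That is, at the top of the file:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>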
Good uses of size_t
Good formatting
Style
ERRALLOC(*line_buffer) looks wrong without a trailing ;.
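One common way to make the macro require (and tolerate) a trailing ; and behave correctly inside if/else is the do { ... } while (0) idiom; a sketch, which also writes a complete line to stderr per the earlier critique:

#define ERRALLOC(ptr) \
    do { \
        if ((ptr) == NULL) { \
            fprintf(stderr, "Unable to allocate memory for %s\n", #ptr); \
            exit(1); \
        } \
    } while (0)

With this form, ERRALLOC(*line_buffer); reads naturally with the semicolon.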