3
\$\begingroup\$

I wrote a function in C that reads an entire line from a stream into a dynamically allocated buffer. What would you suggest I should do to improve its performance? Also, are there any errors that could occur and are not handled?

This is the code:

#define ERRALLOC(ptr) \
if (ptr == NULL) { \
 printf("Unable to allocate memory for %s", #ptr); \
 exit(1); \
}
#define PACK_SIZE 512
size_t readLine(FILE *istream, char **line_buffer, size_t *bufsize) {
 size_t line_size = 0;
 int eol_flag = 0;
 char temp_buffer[PACK_SIZE];
 while (fgets(temp_buffer, PACK_SIZE, istream) != NULL) {
 size_t temp_len = strlen(temp_buffer);
 line_size += temp_len;
 if (line_size+1 > *bufsize) {
 *bufsize = line_size+1;
 *line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));
 ERRALLOC(*line_buffer)
 }
 if (temp_buffer[ temp_len - 1 ] == '\n') {
 temp_buffer[ temp_len - 1 ] = '0円';
 eol_flag = 1;
 }
 strcpy(*line_buffer + (line_size-temp_len), temp_buffer);
 if (eol_flag) break;
 }
 return line_size;
}
asked Jan 18, 2021 at 17:31
\$\endgroup\$

3 Answers 3

2
\$\begingroup\$

Overall, it's a lot of effort (perhaps not in the right place) for one line. Textual data typically has lines that are very short, on the order of 80 bytes. If you're so deeply concerned about performance, you need to back out of this function and look at the I/O pattern of the calling program for more context. You'll very likely find that it's more efficient to have a buffer much larger than your current PACK_SIZE, perhaps 4MB. Read the file into it one such chunk at a time with fread, not fgets; then iterate through to find and split on newlines. Roughly,

  1. Allocate a large buffer
  2. fread as much as possible
  3. memchr your way to a newline
  4. Set that newline to '0円'
  5. Consume the line
  6. Repeat 3-5 until you're at the end of the buffer
  7. If there is a partial line segment at the end, then memcpy the segment to the beginning of the buffer and plan for the next fread to use a buffer write position that starts after the end of the segment
  8. Repeat until end of file

The principle to heed here is that I/O (particularly disk I/O) is expensive, and frequent calls to the I/O layer are expensive even when cached at the operating system level. Doing as much in RAM as possible is strongly preferred.

If you want a convenience interface, then consider accepting a FILE* and a callback function, making the buffer locally rather than accepting it, and invoking the callback for each line when found.

answered Jan 18, 2021 at 19:41
\$\endgroup\$
2
  • 1
    \$\begingroup\$ This approach is problematic for user I/O with stdin as function return is usually expected after each '\n'. OP's code is usable with both stream and file lines. The expense of file I/O is less with SS drives than hard disks negating some of this approaches benefit. \$\endgroup\$ Commented Jan 20, 2021 at 5:25
  • \$\begingroup\$ Still, direct use of fgets() has its corner case drawbacks too - so good to see alternatives. \$\endgroup\$ Commented Jan 20, 2021 at 5:29
2
\$\begingroup\$
#define ERRALLOC(ptr) \
if (ptr == NULL) { \
 printf("Unable to allocate memory for %s", #ptr); \
 exit(1); \
}

That's a poor way to handle allocation failure, and makes your function pretty much unusable in any realistic program context. It's very rude to simply exit() from within a library function, and it's unhelpful to write to stdout, rather than stderr that's designed for such messages. And why are you not writing a complete line, with \n at the end?

 *line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));

Don't do this; when allocation fails then the pointed-to memory is still allocated, but inaccessible. The standard pattern is

void *new_mem = realloc(buffer, newsize);
if (!new_mem) {
 return buffer;
 /* or free(buffer); return new_mem; */
}
buffer = new_mem;

Note also that sizeof (char) is 1 by definition, so multiplying by it has no effect.

answered Jan 19, 2021 at 13:47
\$\endgroup\$
1
\$\begingroup\$

What does line_size represent?

String length (like strlen()), string size or what?

With input "abc\n", *line_buffer is "abc" and line_size is 4.

With input "abc" End-of-file, *line_buffer is "abc" and line_size is 3.

Input error - non handling

When fgets() returns NULL due to an input error, consider also returning 0 regardless of the number of characters read.

Reading a null character

Code has undefined behavior (UB) when the first character read is a null character.

temp_buffer[ temp_len - 1 ] becomes temp_buffer[SIZE_MAX]

When a null character is read anywhere, size_t temp_len = strlen(temp_buffer); fails to correctly report the length of data read.

Minor simplification

strcpy(*line_buffer + (line_size-temp_len), temp_buffer); could happen before if (temp_buffer[ ... and simplify the break condition.

Yet, I would think the line_size += temp_len; should reduce by what happens in if (temp_buffer[ ....

I'd expect memcpy() a tad faster here than strcpy(), but fgets() is the tall pole in the tent here.

Optimization

As long as *bufsize is not tiny, I'd consider doing the first read directly into *line_buffer. Since code is re-using that buffer, at some point, it is likely big enough for most following reads.

Sizing by reference rather than type

Consider the easier to review and maintain:

//*line_buffer = realloc(*line_buffer, *bufsize * sizeof(char));
*line_buffer = realloc(*line_buffer, *bufsize * sizeof **line_buffer);

Missing example

Proper usage of readLine() lacks a guiding example. Perhaps:

char *line_buffer = NULL;
size_t bufsize = 0;
size_ sz;
while ((sz = readLine(stdin, &line_buffer, &bufsize)) > 0) {
 printf("Sz:%zu <%s>\n", sz, line_buffer);
}
free(line_buffer);

Naming consistency

3 styles in one line. Consider any 1 style.

readLine(FILE *istream, char **line_buffer, size_t *bufsize
readLine: camel case
line_buffer: `_` separation
bufsize: packed

Research restrict

This provide some self documentation and hints for compiler optimization.

///size_t readLine(FILE *istream, char **line_buffer, size_t *bufsize) {
size_t readLine(FILE * restrict istream, char ** restrict line_buffer, 
 size_t * restrict bufsize) {

Missing #include <>s

At least stdio.h, string.h, strlib.h

Good uses of size_t

Good formating

Style

ERRALLOC(*line_buffer) looks wrong without a trialing ;.

answered Jan 20, 2021 at 5:18
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.