How to refactor this (sort of) getline implementation?

Question 1

Goals:

Create a function char *get_next_line(int fd) such as:

Repeated calls (e.g., using a loop) to your get_next_line() function should let you read the text file pointed to by the file descriptor, one line at a time.
Your function should return the line that was read. If there is nothing else to read or if an error occurred, it should return NULL.
Make sure that your function works as expected both when reading a file and when reading from the standard input.
Please note that the returned line should include the terminating \n character, except if the end of file was reached and does not end with a \n character.
Because you will have to read files in get_next_line(), add this option to your compiler call: -D BUFFER_SIZE=n It will define the buffer size for read(). The buffer size value will be modified in order to test your code.
We consider that get_next_line() has an undefined behavior if the file pointed to by the file descriptor changed since the last call whereas read() didn’t reach the end of file.
We also consider that get_next_line() has an undefined behavior when reading a binary file. However, you can implement a logical way to handle this behavior if you want to.

Forbidden:

lseek() is forbidden.
Global variables are forbidden.
Any function other than read(), malloc() and free() is forbidden.

Note: I will use functions other than these 3 in my code, the idea here is to implement these functions ourselves as an exercise to understand how libc works. For simplicity sake, I am not going to show how these functions are implemented because they are quite trivial to implement, except maybe for realloc(), which I will show at the end.

What I am looking for:

Is my code correct?
How would you refactor this code into a bunch of small functions? In order to validate this assignment I need to split the logic into multiple helper functions less than 25 LOC each. For me this is the hardest part, I find it very annoying and I don't know what is a good way to refactor the logic without breaking my program or just making a very hard to read mess.
Any alternative ways to solve this problem? What would you have done instead?

Try not to take too much into account the style of my code, I am using this style guide from my school, it's not the prettiest thing but I have to. I am only posting this for people who are curious about the weird stylistic choices here, I don't recommend reading through this document unless you really care about that stuff.

Here is my code.

#ifndef BUFFER_SIZE
# define BUFFER_SIZE 4096
#endif
#include <unistd.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
typedef struct s_buffer
{
 char data[BUFFER_SIZE];
 char *beg;
 ptrdiff_t len;
} t_buffer;
ssize_t read_data(int fd, t_buffer *buf)
{
 buf->beg = buf->data;
 buf->len = read(fd, buf->beg, BUFFER_SIZE);
 return (buf->len);
}
char *get_next_line(int fd)
{
 static t_buffer buf;
 char *pos;
 ptrdiff_t chunk_len;
 char *line = NULL;
 ptrdiff_t line_len = 0;
 while (1)
 {
 // Either we consumed all bytes or we are on the first gnl call
 if (buf.len == 0)
 {
 // Either we reached EOF or there was a read error
 if (read_data(fd, &buf) <= 0)
 {
 // In case the very last line didn't contain a "\n" we return it
 if (line_len > 0)
 {
 line[line_len] = '0円';
 return (line);
 }
 free(line);
 return (NULL);
 }
 }
 // Find the newline character in the buffer
 pos = memchr(buf.beg, '\n', buf.len);
 // Newline found, happy path
 if (pos)
 {
 chunk_len = (pos + 1) - buf.beg;
 // Potentially reallocate the line if necessary
 char *tmp = realloc(line, line_len + chunk_len + 1);
 if (tmp == NULL)
 {
 free(line);
 return (NULL);
 }
 line = tmp;
 memcpy(line + line_len, buf.beg, chunk_len);
 line[line_len + chunk_len] = '0円';
 // Update processed data for the next gnl call
 buf.beg += chunk_len;
 buf.len -= chunk_len;
 return (line);
 }
 // No newline found, save all remaining data before next read
 chunk_len = buf.len;
 char *tmp = realloc(line, line_len + chunk_len + 1);
 if (tmp == NULL)
 {
 free(line);
 return (NULL);
 }
 line = tmp;
 memcpy(line + line_len, buf.beg, chunk_len);
 line_len += chunk_len;
 // Advance for the next read
 buf.beg += chunk_len;
 buf.len -= chunk_len;
 }
}

Here is a very basic test case.

#include <fcntl.h>
#include <stdio.h>
int main(void)
{
 int fd = open("test.txt", O_RDONLY);
 while (true)
 {
 char *line = get_next_line(fd);
 if (line == NULL)
 {
 break ;
 }
 printf("%s", line);
 free(line);
 }
 return (0);
}

And my implementation of realloc().

void *ft_realloc(void *ptr, size_t old_size, size_t new_size)
{
 void *new;
 if (ptr == NULL)
 {
 return (malloc(new_size));
 }
 else if (new_size == 0)
 {
 free(ptr);
 return (NULL);
 }
 else if (new_size <= old_size)
 {
 return (ptr);
 }
 new = malloc(new_size);
 if (new == NULL)
 {
 return (NULL);
 }
 ft_memcpy(new, ptr, old_size);
 free(ptr);
 return (new);
}

Question 2

Two tiny comments and an opinion: 1) Unnecessary use of if/else in 3rd function. if with return (or other flow control) does not need else. 2) return is NOT a function and use of parentheses is distracting. Opinion: That "style guide" is crap. You've written "must be used", but I'd suggest you avoid considering it "good advice" for the real world... (PS: new is a C++ keyword. No problem here, but VERY generic and uninformative.)

Question 3

PS: main() has transgressed several of the "style guide" prohibitions... You ask about 'factoring' the first function, so you are aware that it, too, breaks the rules... Code must work correctly. Arbitrary "style" constraints do NOT make algorithmic expression correct or comprehensible. Focus on "substance", not "style", please...

Question 4

I agree 100%, trust me, I hate this style guide more that anyone else and I even feel like I am writing way worse code because of it.

Question 5

@ismbks Re: style issues. Best to use your group's (i.e. prof's) style guide - even when it is faulty. Of course, when faulty, try to get the group to amend it. Failing that, either live with it (for now) or update your CV.

Question 6

Is my code correct?

[Added 2024 Nov 20]
static t_buffer buf is a significant problem

OP has "We consider that get_next_line() has an undefined behavior if the file pointed to by the file descriptor changed since the last call whereas read() didn’t reach the end of file." yet that precludes general use of get_next_line().

As part of function get_next_line(), this static object contains the remnant of prior reads. Yet correctly using that data depends on a consistent int fd.

Example: calling get_next_line(fd_file_a) and then get_next_line(fd_file_b) would mix input of 2 different files!

get_next_line() deserves a re-write that does not use static. Given that read() pays no special attention to '\n', I see little choice but to read 1 character at a time and incur a performance hit.

Review cases when reading a line ends.

Zero to several characters are read followed by a '\n' and returns line. Code appears OK.
No characters read and end-of-file occurs and returns NULL. Code appears OK.
More than 1 character read and end-of-file occurs and returns line. Code appears OK.
No characters read and input error occurs and returns NULL. Code appears OK. Yet now how does the calling code distinguish from reading the end-of-file?
More than 1 character read and input error occurs and returns line. Using fgets() as a model, this is incorrect. NULL should be returned.
Allocation fails and returns NULL. This is reasonable, yet calling code may have trouble distinguishing this from other NULL returns. Obliges calling code to to clear errno if errno state is relied on.

Lacking documentation

char *get_next_line(int fd), at least deserve an explanation as to its goals. Further, consider this in a corresponding .h file. A user of get_next_line() should not need to wade through the source code to get a top level view of the functionality.

Good use of mem...() functions rather the str...() ones.

That better handles the case when a null character is read.

Some ambiguity when a null character is read.

Perhaps consider conveying to the caller the length read.

Help caller to quickly lop off a trailing '\n'.
Distinguishes correctly a read null character from the appended one.

Test case improvement

Expand test case to show how the various cases are detected: Normal with or without '\n, no read with end-of-file, input error, allocation failure.

Geometric vs linear

Re-allocation calls incur O(n*n) time versus O(n*log(n)). line_len + chunk_len + 1 --> line_len*2 + 1 or the like. Robust code would add overflow protection.

Alternate

Somehow return the pointer and length read, using a negative length to indicate error (and not end-of-file).

Simplification

 if (new == NULL)
 {
 return (NULL);
 }
 ft_memcpy(new, ptr, old_size);

to

 if (new)
 {
 ft_memcpy(new, ptr, old_size);
 }

Reduced size call with ft_realloc()

IMO, this should attempt memory usage reduction.

Consider scaling to BUFSIZ

#include <stdio.h>
#define BUFFER_SIZE (BUFSIZ * 2)

Reducing to 25 LOC per function

First define/automate your LOC assessment method. Is if (ptr == NULL) { return malloc(new_size)); } 4 lines or what? There exist various algorithms.

Use an auto-formatter. IMO, there is tell-table manual formatting going on. That is unproductive.

2nd, get functionality fully determined.

Soap-box

I am not a fan of unlimited allocations and prefer defensive programming. I'd rather see a sane upper bound passed in, which could be SIZE_MAX for extreme cases. A sane, yet large limit is case dependent and the calling code has the best idea of what is sane. Unlimited allocations enables nefarious actors and delay error detection of foolish input.

Example, for a person name, a somewhat sane limit like 1000 is reasonable versus 2³² - 1.

This may be outside OP's current design limitations.

Question 7

Here's something (not) completely different: :-)

Question 8

Thanks a lot for going through my code, I seriously can't express how valuable it is to get feedback.

Question 9

About the 25 LOC, any line break is considered to be a LOC. So, the example you showed is technically 1 LOC but formatting expressions like this is against the rules. Anyways, I think this is the least interesting part of the project. It's always tricky when I ask a review for school work because there are many arbitrary rules that don't make sense for programmers on StackExchange, which is understandable. Good catch on inefficient reallocation, it is was a point I was also thinking about changing.

Question 10

I agree that the function prototype is limiting my ability to handle error returns as we can't differentiate between NULL for being done with reading the file and NULL for a read or allocation error. What would be a better prototype for this function in your opinion? char *get_next_line(int fd) was imposed as per the subject but beyond the scope of this subject I think it's interesting to talk about the flaws of this design and how to improve it.

Question 11

@ismbks The best prototype lives somewhere down the rabbit hole. Maybe more on that later.

chux chux 36.2k2 gold badges43 silver badges96 bronze badges · Accepted Answer · 2024-11-19 22:55:43Z

Is my code correct?

[Added 2024 Nov 20]
static t_buffer buf is a significant problem

OP has "We consider that get_next_line() has an undefined behavior if the file pointed to by the file descriptor changed since the last call whereas read() didn’t reach the end of file." yet that precludes general use of get_next_line().

As part of function get_next_line(), this static object contains the remnant of prior reads. Yet correctly using that data depends on a consistent int fd.

Example: calling get_next_line(fd_file_a) and then get_next_line(fd_file_b) would mix input of 2 different files!

get_next_line() deserves a re-write that does not use static. Given that read() pays no special attention to '\n', I see little choice but to read 1 character at a time and incur a performance hit.

Review cases when reading a line ends.

Zero to several characters are read followed by a '\n' and returns line. Code appears OK.
No characters read and end-of-file occurs and returns NULL. Code appears OK.
More than 1 character read and end-of-file occurs and returns line. Code appears OK.
No characters read and input error occurs and returns NULL. Code appears OK. Yet now how does the calling code distinguish from reading the end-of-file?
More than 1 character read and input error occurs and returns line. Using fgets() as a model, this is incorrect. NULL should be returned.
Allocation fails and returns NULL. This is reasonable, yet calling code may have trouble distinguishing this from other NULL returns. Obliges calling code to to clear errno if errno state is relied on.

Lacking documentation

char *get_next_line(int fd), at least deserve an explanation as to its goals. Further, consider this in a corresponding .h file. A user of get_next_line() should not need to wade through the source code to get a top level view of the functionality.

Good use of mem...() functions rather the str...() ones.

That better handles the case when a null character is read.

Some ambiguity when a null character is read.

Perhaps consider conveying to the caller the length read.

Help caller to quickly lop off a trailing '\n'.
Distinguishes correctly a read null character from the appended one.

Test case improvement

Expand test case to show how the various cases are detected: Normal with or without '\n, no read with end-of-file, input error, allocation failure.

Geometric vs linear

Re-allocation calls incur O(n*n) time versus O(n*log(n)). line_len + chunk_len + 1 --> line_len*2 + 1 or the like. Robust code would add overflow protection.

Alternate

Somehow return the pointer and length read, using a negative length to indicate error (and not end-of-file).

Simplification

 if (new == NULL)
 {
 return (NULL);
 }
 ft_memcpy(new, ptr, old_size);

to

 if (new)
 {
 ft_memcpy(new, ptr, old_size);
 }

Reduced size call with ft_realloc()

IMO, this should attempt memory usage reduction.

Consider scaling to BUFSIZ

#include <stdio.h>
#define BUFFER_SIZE (BUFSIZ * 2)

Reducing to 25 LOC per function

First define/automate your LOC assessment method. Is if (ptr == NULL) { return malloc(new_size)); } 4 lines or what? There exist various algorithms.

Use an auto-formatter. IMO, there is tell-table manual formatting going on. That is unproductive.

2nd, get functionality fully determined.

Soap-box

I am not a fan of unlimited allocations and prefer defensive programming. I'd rather see a sane upper bound passed in, which could be SIZE_MAX for extreme cases. A sane, yet large limit is case dependent and the calling code has the best idea of what is sane. Unlimited allocations enables nefarious actors and delay error detection of foolish input.

Example, for a person name, a somewhat sane limit like 1000 is reasonable versus 2³² - 1.

This may be outside OP's current design limitations.

Thanks a lot for going through my code, I seriously can't express how valuable it is to get feedback.
About the 25 LOC, any line break is considered to be a LOC. So, the example you showed is technically 1 LOC but formatting expressions like this is against the rules. Anyways, I think this is the least interesting part of the project. It's always tricky when I ask a review for school work because there are many arbitrary rules that don't make sense for programmers on StackExchange, which is understandable. Good catch on inefficient reallocation, it is was a point I was also thinking about changing.
I agree that the function prototype is limiting my ability to handle error returns as we can't differentiate between NULL for being done with reading the file and NULL for a read or allocation error. What would be a better prototype for this function in your opinion? char *get_next_line(int fd) was imposed as per the subject but beyond the scope of this subject I think it's interesting to talk about the flaws of this design and how to improve it.
@ismbks The best prototype lives somewhere down the rabbit hole. Maybe more on that later.

Stack Exchange Network

How to refactor this (sort of) getline implementation?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to refactor this (sort of) getline implementation?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions