Goals:
Create a function char *get_next_line(int fd)
such as:
- Repeated calls (e.g., using a loop) to your
get_next_line()
function should let you read the text file pointed to by the file descriptor, one line at a time. - Your function should return the line that was read.
If there is nothing else to read or if an error occurred, it should return
NULL
. - Make sure that your function works as expected both when reading a file and when reading from the standard input.
- Please note that the returned line should include the terminating
\n
character, except if the end of file was reached and does not end with a\n
character. - Because you will have to read files in
get_next_line()
, add this option to your compiler call:-D BUFFER_SIZE=n
It will define the buffer size forread()
. The buffer size value will be modified in order to test your code. - We consider that
get_next_line()
has an undefined behavior if the file pointed to by the file descriptor changed since the last call whereasread()
didn’t reach the end of file. - We also consider that
get_next_line()
has an undefined behavior when reading a binary file. However, you can implement a logical way to handle this behavior if you want to.
Forbidden:
lseek()
is forbidden.- Global variables are forbidden.
- Any function other than
read()
,malloc()
andfree()
is forbidden.
Note: I will use functions other than these 3 in my code, the idea here is to implement these functions ourselves as an exercise to understand how libc works. For simplicity sake, I am not going to show how these functions are implemented because they are quite trivial to implement, except maybe for realloc()
, which I will show at the end.
What I am looking for:
- Is my code correct?
- How would you refactor this code into a bunch of small functions? In order to validate this assignment I need to split the logic into multiple helper functions less than 25 LOC each. For me this is the hardest part, I find it very annoying and I don't know what is a good way to refactor the logic without breaking my program or just making a very hard to read mess.
- Any alternative ways to solve this problem? What would you have done instead?
Try not to take too much into account the style of my code, I am using this style guide from my school, it's not the prettiest thing but I have to. I am only posting this for people who are curious about the weird stylistic choices here, I don't recommend reading through this document unless you really care about that stuff.
Here is my code.
#ifndef BUFFER_SIZE
# define BUFFER_SIZE 4096
#endif
#include <unistd.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
typedef struct s_buffer
{
char data[BUFFER_SIZE];
char *beg;
ptrdiff_t len;
} t_buffer;
ssize_t read_data(int fd, t_buffer *buf)
{
buf->beg = buf->data;
buf->len = read(fd, buf->beg, BUFFER_SIZE);
return (buf->len);
}
char *get_next_line(int fd)
{
static t_buffer buf;
char *pos;
ptrdiff_t chunk_len;
char *line = NULL;
ptrdiff_t line_len = 0;
while (1)
{
// Either we consumed all bytes or we are on the first gnl call
if (buf.len == 0)
{
// Either we reached EOF or there was a read error
if (read_data(fd, &buf) <= 0)
{
// In case the very last line didn't contain a "\n" we return it
if (line_len > 0)
{
line[line_len] = '0円';
return (line);
}
free(line);
return (NULL);
}
}
// Find the newline character in the buffer
pos = memchr(buf.beg, '\n', buf.len);
// Newline found, happy path
if (pos)
{
chunk_len = (pos + 1) - buf.beg;
// Potentially reallocate the line if necessary
char *tmp = realloc(line, line_len + chunk_len + 1);
if (tmp == NULL)
{
free(line);
return (NULL);
}
line = tmp;
memcpy(line + line_len, buf.beg, chunk_len);
line[line_len + chunk_len] = '0円';
// Update processed data for the next gnl call
buf.beg += chunk_len;
buf.len -= chunk_len;
return (line);
}
// No newline found, save all remaining data before next read
chunk_len = buf.len;
char *tmp = realloc(line, line_len + chunk_len + 1);
if (tmp == NULL)
{
free(line);
return (NULL);
}
line = tmp;
memcpy(line + line_len, buf.beg, chunk_len);
line_len += chunk_len;
// Advance for the next read
buf.beg += chunk_len;
buf.len -= chunk_len;
}
}
Here is a very basic test case.
#include <fcntl.h>
#include <stdio.h>
int main(void)
{
int fd = open("test.txt", O_RDONLY);
while (true)
{
char *line = get_next_line(fd);
if (line == NULL)
{
break ;
}
printf("%s", line);
free(line);
}
return (0);
}
And my implementation of realloc()
.
void *ft_realloc(void *ptr, size_t old_size, size_t new_size)
{
void *new;
if (ptr == NULL)
{
return (malloc(new_size));
}
else if (new_size == 0)
{
free(ptr);
return (NULL);
}
else if (new_size <= old_size)
{
return (ptr);
}
new = malloc(new_size);
if (new == NULL)
{
return (NULL);
}
ft_memcpy(new, ptr, old_size);
free(ptr);
return (new);
}
1 Answer 1
Is my code correct?
[Added 2024 Nov 20]
static t_buffer buf
is a significant problem
OP has "We consider that get_next_line()
has an undefined behavior if the file pointed to by the file descriptor changed since the last call whereas read() didn’t reach the end of file." yet that precludes general use of get_next_line()
.
As part of function get_next_line()
, this static
object contains the remnant of prior reads. Yet correctly using that data depends on a consistent int fd
.
Example: calling get_next_line(fd_file_a)
and then get_next_line(fd_file_b)
would mix input of 2 different files!
get_next_line()
deserves a re-write that does not use static
. Given that read()
pays no special attention to '\n'
, I see little choice but to read 1 character at a time and incur a performance hit.
Review cases when reading a line ends.
Zero to several characters are read followed by a
'\n'
and returnsline
. Code appears OK.No characters read and end-of-file occurs and returns
NULL
. Code appears OK.More than 1 character read and end-of-file occurs and returns
line
. Code appears OK.No characters read and input error occurs and returns
NULL
. Code appears OK. Yet now how does the calling code distinguish from reading the end-of-file?More than 1 character read and input error occurs and returns
line
. Usingfgets()
as a model, this is incorrect.NULL
should be returned.Allocation fails and returns
NULL
. This is reasonable, yet calling code may have trouble distinguishing this from otherNULL
returns. Obliges calling code to to clearerrno
iferrno
state is relied on.
Lacking documentation
char *get_next_line(int fd)
, at least deserve an explanation as to its goals. Further, consider this in a corresponding .h file. A user of get_next_line()
should not need to wade through the source code to get a top level view of the functionality.
Good use of mem...()
functions rather the str...()
ones.
That better handles the case when a null character is read.
Some ambiguity when a null character is read.
Perhaps consider conveying to the caller the length read.
Help caller to quickly lop off a trailing
'\n'
.Distinguishes correctly a read null character from the appended one.
Test case improvement
Expand test case to show how the various cases are detected: Normal with or without '\n
, no read with end-of-file, input error, allocation failure.
Geometric vs linear
Re-allocation calls incur O(n*n)
time versus O(n*log(n))
. line_len + chunk_len + 1
--> line_len*2 + 1
or the like. Robust code would add overflow protection.
Alternate
Somehow return the pointer and length read, using a negative length to indicate error (and not end-of-file).
Simplification
if (new == NULL)
{
return (NULL);
}
ft_memcpy(new, ptr, old_size);
to
if (new)
{
ft_memcpy(new, ptr, old_size);
}
Reduced size call with ft_realloc()
IMO, this should attempt memory usage reduction.
Consider scaling to BUFSIZ
#include <stdio.h>
#define BUFFER_SIZE (BUFSIZ * 2)
Reducing to 25 LOC per function
First define/automate your LOC
assessment method. Is if (ptr == NULL) { return malloc(new_size)); }
4 lines or what? There exist various algorithms.
Use an auto-formatter. IMO, there is tell-table manual formatting going on. That is unproductive.
2nd, get functionality fully determined.
I am not a fan of unlimited allocations and prefer defensive programming. I'd rather see a sane upper bound passed in, which could be SIZE_MAX
for extreme cases. A sane, yet large limit is case dependent and the calling code has the best idea of what is sane. Unlimited allocations enables nefarious actors and delay error detection of foolish input.
Example, for a person name, a somewhat sane limit like 1000 is reasonable versus 232 - 1.
This may be outside OP's current design limitations.
-
\$\begingroup\$ Here's something (not) completely different: :-) \$\endgroup\$user272752– user2727522024年11月20日 00:28:59 +00:00Commented Nov 20, 2024 at 0:28
-
\$\begingroup\$ Thanks a lot for going through my code, I seriously can't express how valuable it is to get feedback. \$\endgroup\$ismbks– ismbks2024年11月20日 13:33:56 +00:00Commented Nov 20, 2024 at 13:33
-
\$\begingroup\$ About the 25 LOC, any line break is considered to be a LOC. So, the example you showed is technically 1 LOC but formatting expressions like this is against the rules. Anyways, I think this is the least interesting part of the project. It's always tricky when I ask a review for school work because there are many arbitrary rules that don't make sense for programmers on StackExchange, which is understandable. Good catch on inefficient reallocation, it is was a point I was also thinking about changing. \$\endgroup\$ismbks– ismbks2024年11月20日 13:51:37 +00:00Commented Nov 20, 2024 at 13:51
-
\$\begingroup\$ I agree that the function prototype is limiting my ability to handle error returns as we can't differentiate between NULL for being done with reading the file and NULL for a read or allocation error. What would be a better prototype for this function in your opinion?
char *get_next_line(int fd)
was imposed as per the subject but beyond the scope of this subject I think it's interesting to talk about the flaws of this design and how to improve it. \$\endgroup\$ismbks– ismbks2024年11月20日 13:59:18 +00:00Commented Nov 20, 2024 at 13:59 -
\$\begingroup\$ @ismbks The best prototype lives somewhere down the rabbit hole. Maybe more on that later. \$\endgroup\$chux– chux2024年11月20日 15:40:19 +00:00Commented Nov 20, 2024 at 15:40
if/else
in 3rd function.if
withreturn
(or other flow control) does not needelse
. 2)return
is NOT a function and use of parentheses is distracting. Opinion: That "style guide" is crap. You've written "must be used", but I'd suggest you avoid considering it "good advice" for the real world... (PS:new
is a C++ keyword. No problem here, but VERY generic and uninformative.) \$\endgroup\$main()
has transgressed several of the "style guide" prohibitions... You ask about 'factoring' the first function, so you are aware that it, too, breaks the rules... Code must work correctly. Arbitrary "style" constraints do NOT make algorithmic expression correct or comprehensible. Focus on "substance", not "style", please... \$\endgroup\$