I am trying to write a single function that will read a certain number of bytes from a C file, or the entire file, based on the arguments. I want it to be memory-safe of course, and deal with NULL
terminators and the like. Not only that, but it should be efficient.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *read(FILE *file, long bytes) {
/* This function takes a FILE pointer <file> and reads <bytes> bytes from it. These
* bytes are returned as a char* back to the calling function and should be freed
* since they were malloc'd. This function was an attempt to recreate the
* .read( [bytes] ) method on python file objects */
/**********************************
Set bytes to 0 to read entire file
**********************************/
long fileSize = 0; // File size to zero before byte check
if (bytes == 0) {
fseek(file, 0L, SEEK_END); // seek to end
fileSize = ftell(file); // find size of file by checking end
rewind(file); // back to the beginning for reading
} else {
fileSize = ++bytes; // extra byte for NULL
}
char buffer[16]; // We will read the file 16 bytes at a time
char *file_out = malloc(fileSize + 1);
if (file_out == NULL) {
fputs("Could not allocate enough memory", stderr);
exit(EXIT_FAILURE);
}
while (fgets(buffer, 16, file) != NULL) {
strncat(file_out, buffer, 16);
}
return file_out;
}
Is this efficient enough for large files and fringe cases?
5 Answers 5
In addition to the other answers, you're allocating memory using malloc
inside your function. This forces the caller to use free
to delete the memory block when the caller may want to use a different allocator for memory management. I'd recommend changing your function's signature like this:
size_t read(FILE *file, long bytes, char *buffer, int bufferSize) { ... }
where buffer
is the pointer to the target buffer and bufferSize
is the number of bytes in that buffer.
This way you don't have to deal with allocating the block that you use to read into - the caller can fully manage it, using any memory-management functions. This frees up your return type from being a pointer (since the caller already knows where data will go) and now you can easily return number of bytes read or -1
on error.
-
\$\begingroup\$ YES that's what I was looking for I couldn't think of it. Thank you \$\endgroup\$Tom– Tom2015年12月02日 19:50:49 +00:00Commented Dec 2, 2015 at 19:50
-
\$\begingroup\$ I've gone and done some things does this look better. \$\endgroup\$Tom– Tom2015年12月02日 19:57:27 +00:00Commented Dec 2, 2015 at 19:57
-
\$\begingroup\$ Why have two arguments
bytes
andbufferSize
? You will eventually read at mostbytes
and at mostbufferSize
bytes, you might just collapse both into one. \$\endgroup\$Dani– Dani2015年12月02日 23:03:18 +00:00Commented Dec 2, 2015 at 23:03 -
\$\begingroup\$ @Darkhogg if
bytes
is set to0
(to read entire file) the caller needs to allocate enough bytes (without knowing the file size ahead of time) and the function needs to be able to verify that the allocated buffer is big enough to hold all data. If the buffer is too small, the function can return an error code that says so. \$\endgroup\$xxbbcc– xxbbcc2015年12月02日 23:20:44 +00:00Commented Dec 2, 2015 at 23:20 -
1\$\begingroup\$ This would leave the function as nothing but an inefficient wrapper around
fread(3)
and require the caller to do half of the work before each call. Returning allocated memory is idiomatic C; just be sure the function's documentation points out that this is happening. \$\endgroup\$Blrfl– Blrfl2015年12月03日 12:21:14 +00:00Commented Dec 3, 2015 at 12:21
Instead of fgets
, you can use fread
to read directly into the file_out
:
int totalRead = 0;
int size;
while( fileSize > total_read && (size = fread(file_out + totalRead, 1, fileSize - totalRead, file) != 0) {
totalRead+=size;
}
if(totalRead < fileSize){
//couldn't read entire file
}
file_out[total_read] = '0円';
return file_out;
This avoids the \$O(n^2)\$ complexity of doing strncat
on a growing string.
Doing an exit from within a helper function is not recommended because the calling code cannot recover from it. Instead, return a special value that specifies something went wrong.
Also, make up your mind about the naming convention of camelCase vs. snake_case.
-
\$\begingroup\$ This solves the biggest problem with the original code and ought to be the accepted answer. Using
fgets
when you want to read arbitrary binary data is absurd. \$\endgroup\$Matti Virkkunen– Matti Virkkunen2015年12月03日 00:47:06 +00:00Commented Dec 3, 2015 at 0:47
If you'll be following the C way of doing things with read()
, then you should probably have it return the number of bytes successfully read, preferably of type ssize_t
. This would allow -1
to be returned if the read failed. It may also be useful to have it set errno
accordingly, which would allow the user to print an appropriate error message via perror()
.
The user may expect the function to return something instead of just having the program (specifically the current process) exit. It may work for you, but it could still be a burden for others who are using it.
-
\$\begingroup\$ How do I get the text to update if I don't return though \$\endgroup\$Tom– Tom2015年12月02日 19:46:26 +00:00Commented Dec 2, 2015 at 19:46
-
\$\begingroup\$ @Tom: Same thing that that linked function does: pass the buffer as an argument. \$\endgroup\$Jamal– Jamal2015年12月02日 19:50:10 +00:00Commented Dec 2, 2015 at 19:50
In my opinion, your function is not really ok - it's not efficient, and there's a few flaws. It's not bad, but could do with some, uh... finishing touches.
You don't check the <file>
variable for being NULL
on entry. I personally like to ensure anything coming in is sanity-checked before use. Somewhat akin to Eiffel's "Preconditions".
If the malloc()
fails the function just exits the whole executable. It would be better to just return NULL
. It should be up to the core-body of the programme to decide on low-memory handling.
Ratchet Freak's post above covers the major problem with the strncat()
and \$O(N^2)\$ time requirements. I want to stress that this is really important - what he means is you have to re-process everything already read each time the next block of bytes is added to the end. Without significantly modifying your function, maintaining a pointer to the end of the string would be enough to fix this issue. For small files, this re-processing makes no significant difference - but what about when that file is 30 gigabytes.
Minor issues:
- You're reading a tiny amount of the file at once - 16 bytes. Why not 512 bytes? 8k? etc.
- You allocate +1 bytes (presumably for the
0円
), twice. (What if the caller is reading a binary file?) - If the caller asks to read 500 bytes from the file, but it only has 5 bytes, the function still allocates 501 bytes.
- What happens when the size of the file is more than the size of a signed long? Using
unsigned long
helps (and you can't have a negatively-sized file). - There is no way of signalling an I/O error. But for a function like this, a success OR failure is not too bad.
What you are trying to achieve with the file-reading loop in this function is already covered by the C function fread()
.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*
* Read <bytes> of data from the given file, or the entire
* content if <bytes> is zero. Returns a pointer to a zero-terminated
* string on success, NULL otherwise.
*/
char *gulpFile(FILE *file, unsigned long bytes)
{
char *file_out = NULL;
unsigned long file_size = 0; // File size to zero before byte check
if (file != NULL)
{
// We need to know the size of the file
fseek(file, 0L, SEEK_END);
file_size = ftell(file);
rewind(file);
if (bytes == 0)
{
file_size = bytes;
}
else if (bytes < file_size)
{
file_size = bytes;
}
file_out = (char *)malloc(1 + file_size);
if (file_out != NULL)
{
file_out[file_size] = '0円'; // terminated.
size_t bytes_read = fread(file_out, 1, file_size, file);
if (bytes_read != file_size)
{
// Some kind of I/O error
free(file_out);
file_out = NULL;
}
}
}
return file_out;
}
This code need not work, per the C standard:
fseek(file, 0L, SEEK_END); // seek to end
fileSize = ftell(file); // find size of file by checking end
rewind(file); // back to the beginning for reading
Per 7.21.9.2 The fseek
function:
A binary stream need not meaningfully support
fseek
calls with awhence
value ofSEEK_END
.
fseek()
to SEEK_END
on a binary stream isn't guaranteed to work.
And, per 7.21.9.4 The ftell
function:
For a text stream, its file position indicator contains unspecified information, usable by the
fseek
function for returning the file position indicator for the stream to its position at the time of theftell
call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.
ftell()
on a text stream doesn't return the number of characters you need to read.
So fseek( ..., SEEK_END );
on a binary stream isn't guaranteed to work, and ftell()
on a text stream doesn't tell you how many bytes you need to read.
Any claims to the contrary usually rely on POSIX-specified behavior, which means you can measure the size of a file directly using one of the POSIX-specified stat
family of library calls.
Also, the fseek()
call will discard any buffered data that the calling code might have written to the file, and will also modify the current file position. Something like this will do what you want per your question, without such side effects (note the use of an integral file descriptor instead of a FILE *
- you can get the file descriptor to the underlying file for a FILE *
using fileno( FILE * )
):
char *readEntireFile( int fd )
{
struct stat sb;
int rc = fstat( fd, &sb );
if ( rc < 0 )
{
return( NULL );
}
// use calloc() so the virtual pages are created before
// trying to put data into them (really fast devices can
// deliver data faster than VM managers might be able to
// create virtual pages and map them to physical RAM)
//
// and calloc() NUL-terminates the data...
char *data = calloc( 1, sb.st_size + 1 );
if ( data == NULL )
{
return( NULL );
}
// use pread() so we don't need to know nor change the
// current offset
ssize_t bytes_read = pread( fd, data, sb.st_size, ( off_t ) 0 );
// error checking here
return( data );
}
As others have noted, don't call your function read()
. That name is already used.
I've also made no effort to address the issues others have raised about conveying the number of bytes actually read.
fread()
. \$\endgroup\$read
; if it does, it is a lower-level operation thanfread
. Because this function is widely known and used, naming your functionread
is likely to confuse people who need to read your code in the future. \$\endgroup\$