Memory-safe file-reading in C

Question 1

I am trying to write a single function that will read a certain number of bytes from a C file, or the entire file, based on the arguments. I want it to be memory-safe of course, and deal with NULL terminators and the like. Not only that, but it should be efficient.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *read(FILE *file, long bytes) {
 /* This function takes a FILE pointer <file> and reads <bytes> bytes from it. These
 * bytes are returned as a char* back to the calling function and should be freed
 * since they were malloc'd. This function was an attempt to recreate the
 * .read( [bytes] ) method on python file objects */
 /**********************************
 Set bytes to 0 to read entire file
 **********************************/
 long fileSize = 0; // File size to zero before byte check
 if (bytes == 0) {
 fseek(file, 0L, SEEK_END); // seek to end
 fileSize = ftell(file); // find size of file by checking end
 rewind(file); // back to the beginning for reading
 } else {
 fileSize = ++bytes; // extra byte for NULL 
 }
 char buffer[16]; // We will read the file 16 bytes at a time
 char *file_out = malloc(fileSize + 1);
 if (file_out == NULL) {
 fputs("Could not allocate enough memory", stderr);
 exit(EXIT_FAILURE);
 }
 while (fgets(buffer, 16, file) != NULL) {
 strncat(file_out, buffer, 16);
 }
 return file_out;
}

Is this efficient enough for large files and fringe cases?

Question 2

Why reading in chunks of 16 bytes, if you already allocated the memory for the whole file size? You just could read it in in one go using fread().

Question 3

Because I don't actually know what I'm doing but that's a good suggestion I see what you mean

Question 4

Tangential warning: It is very likely that your C library already has a function named read; if it does, it is a lower-level operation than fread. Because this function is widely known and used, naming your function read is likely to confuse people who need to read your code in the future.

Question 5

In addition to the other answers, you're allocating memory using malloc inside your function. This forces the caller to use free to delete the memory block when the caller may want to use a different allocator for memory management. I'd recommend changing your function's signature like this:

size_t read(FILE *file, long bytes, char *buffer, int bufferSize) { ... }

where buffer is the pointer to the target buffer and bufferSize is the number of bytes in that buffer.

This way you don't have to deal with allocating the block that you use to read into - the caller can fully manage it, using any memory-management functions. This frees up your return type from being a pointer (since the caller already knows where data will go) and now you can easily return number of bytes read or -1 on error.

Question 6

YES that's what I was looking for I couldn't think of it. Thank you

Question 7

I've gone and done some things does this look better.

Question 8

Why have two arguments bytes and bufferSize? You will eventually read at most bytes and at most bufferSize bytes, you might just collapse both into one.

Question 9

@Darkhogg if bytes is set to 0 (to read entire file) the caller needs to allocate enough bytes (without knowing the file size ahead of time) and the function needs to be able to verify that the allocated buffer is big enough to hold all data. If the buffer is too small, the function can return an error code that says so.

Question 10

This would leave the function as nothing but an inefficient wrapper around fread(3) and require the caller to do half of the work before each call. Returning allocated memory is idiomatic C; just be sure the function's documentation points out that this is happening.

Question 11

Instead of fgets, you can use fread to read directly into the file_out:

int totalRead = 0;
int size;
while( fileSize > total_read && (size = fread(file_out + totalRead, 1, fileSize - totalRead, file) != 0) {
 totalRead+=size;
}
if(totalRead < fileSize){
 //couldn't read entire file
}
file_out[total_read] = '0円';
return file_out;

This avoids the \$O(n^2)\$ complexity of doing strncat on a growing string.

Doing an exit from within a helper function is not recommended because the calling code cannot recover from it. Instead, return a special value that specifies something went wrong.

Also, make up your mind about the naming convention of camelCase vs. snake_case.

Question 12

This solves the biggest problem with the original code and ought to be the accepted answer. Using fgets when you want to read arbitrary binary data is absurd.

Question 13

If you'll be following the C way of doing things with read(), then you should probably have it return the number of bytes successfully read, preferably of type ssize_t. This would allow -1 to be returned if the read failed. It may also be useful to have it set errno accordingly, which would allow the user to print an appropriate error message via perror().

The user may expect the function to return something instead of just having the program (specifically the current process) exit. It may work for you, but it could still be a burden for others who are using it.

Question 14

How do I get the text to update if I don't return though

Question 15

@Tom: Same thing that that linked function does: pass the buffer as an argument.

Question 16

In my opinion, your function is not really ok - it's not efficient, and there's a few flaws. It's not bad, but could do with some, uh... finishing touches.

You don't check the <file> variable for being NULL on entry. I personally like to ensure anything coming in is sanity-checked before use. Somewhat akin to Eiffel's "Preconditions".

If the malloc() fails the function just exits the whole executable. It would be better to just return NULL. It should be up to the core-body of the programme to decide on low-memory handling.

Ratchet Freak's post above covers the major problem with the strncat() and \$O(N^2)\$ time requirements. I want to stress that this is really important - what he means is you have to re-process everything already read each time the next block of bytes is added to the end. Without significantly modifying your function, maintaining a pointer to the end of the string would be enough to fix this issue. For small files, this re-processing makes no significant difference - but what about when that file is 30 gigabytes.

Minor issues:

You're reading a tiny amount of the file at once - 16 bytes. Why not 512 bytes? 8k? etc.
You allocate +1 bytes (presumably for the 0円), twice. (What if the caller is reading a binary file?)
If the caller asks to read 500 bytes from the file, but it only has 5 bytes, the function still allocates 501 bytes.
What happens when the size of the file is more than the size of a signed long? Using unsigned long helps (and you can't have a negatively-sized file).
There is no way of signalling an I/O error. But for a function like this, a success OR failure is not too bad.

What you are trying to achieve with the file-reading loop in this function is already covered by the C function fread().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/*
 * Read <bytes> of data from the given file, or the entire
 * content if <bytes> is zero. Returns a pointer to a zero-terminated
 * string on success, NULL otherwise.
 */
char *gulpFile(FILE *file, unsigned long bytes) 
{
 char *file_out = NULL;
 unsigned long file_size = 0; // File size to zero before byte check
 if (file != NULL)
 {
 // We need to know the size of the file
 fseek(file, 0L, SEEK_END); 
 file_size = ftell(file); 
 rewind(file); 
 if (bytes == 0)
 {
 file_size = bytes; 
 }
 else if (bytes < file_size)
 {
 file_size = bytes;
 }
 file_out = (char *)malloc(1 + file_size);
 if (file_out != NULL)
 {
 file_out[file_size] = '0円'; // terminated.
 size_t bytes_read = fread(file_out, 1, file_size, file);
 if (bytes_read != file_size)
 {
 // Some kind of I/O error 
 free(file_out);
 file_out = NULL;
 }
 }
 }
 return file_out;
}

Question 17

This code need not work, per the C standard:

 fseek(file, 0L, SEEK_END); // seek to end
 fileSize = ftell(file); // find size of file by checking end
 rewind(file); // back to the beginning for reading

Per 7.21.9.2 The fseek function:

A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END.

fseek() to SEEK_END on a binary stream isn't guaranteed to work.

And, per 7.21.9.4 The ftell function:

For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.

ftell() on a text stream doesn't return the number of characters you need to read.

So fseek( ..., SEEK_END ); on a binary stream isn't guaranteed to work, and ftell() on a text stream doesn't tell you how many bytes you need to read.

Any claims to the contrary usually rely on POSIX-specified behavior, which means you can measure the size of a file directly using one of the POSIX-specified stat family of library calls.

Also, the fseek() call will discard any buffered data that the calling code might have written to the file, and will also modify the current file position. Something like this will do what you want per your question, without such side effects (note the use of an integral file descriptor instead of a FILE * - you can get the file descriptor to the underlying file for a FILE * using fileno( FILE * )):

char *readEntireFile( int fd )
{
 struct stat sb;
 int rc = fstat( fd, &sb );
 if ( rc < 0 )
 {
 return( NULL );
 }
 // use calloc() so the virtual pages are created before
 // trying to put data into them (really fast devices can
 // deliver data faster than VM managers might be able to
 // create virtual pages and map them to physical RAM)
 //
 // and calloc() NUL-terminates the data...
 char *data = calloc( 1, sb.st_size + 1 );
 if ( data == NULL )
 {
 return( NULL );
 }
 // use pread() so we don't need to know nor change the
 // current offset
 ssize_t bytes_read = pread( fd, data, sb.st_size, ( off_t ) 0 );
 // error checking here
 return( data );
}

As others have noted, don't call your function read(). That name is already used.

I've also made no effort to address the issues others have raised about conveying the number of bytes actually read.

xxbbcc xxbbcc 2063 silver badges6 bronze badges · Accepted Answer · 2015-12-02 19:49:04Z

9

\$\begingroup\$

In addition to the other answers, you're allocating memory using malloc inside your function. This forces the caller to use free to delete the memory block when the caller may want to use a different allocator for memory management. I'd recommend changing your function's signature like this:

size_t read(FILE *file, long bytes, char *buffer, int bufferSize) { ... }

where buffer is the pointer to the target buffer and bufferSize is the number of bytes in that buffer.

This way you don't have to deal with allocating the block that you use to read into - the caller can fully manage it, using any memory-management functions. This frees up your return type from being a pointer (since the caller already knows where data will go) and now you can easily return number of bytes read or -1 on error.

Share

edited Dec 3, 2015 at 1:10

Jamal's user avatar

Jamal

35.2k13 gold badges134 silver badges238 bronze badges

answered Dec 2, 2015 at 19:49

xxbbcc's user avatar

xxbbcc xxbbcc

2063 silver badges6 bronze badges

\$\endgroup\$

6

\$\begingroup\$ YES that's what I was looking for I couldn't think of it. Thank you \$\endgroup\$

Tom
– Tom

2015年12月02日 19:50:49 +00:00
Commented Dec 2, 2015 at 19:50
\$\begingroup\$ I've gone and done some things does this look better. \$\endgroup\$

Tom
– Tom

2015年12月02日 19:57:27 +00:00
Commented Dec 2, 2015 at 19:57
\$\begingroup\$ Why have two arguments bytes and bufferSize? You will eventually read at most bytes and at most bufferSize bytes, you might just collapse both into one. \$\endgroup\$

Dani
– Dani

2015年12月02日 23:03:18 +00:00
Commented Dec 2, 2015 at 23:03
\$\begingroup\$ @Darkhogg if bytes is set to 0 (to read entire file) the caller needs to allocate enough bytes (without knowing the file size ahead of time) and the function needs to be able to verify that the allocated buffer is big enough to hold all data. If the buffer is too small, the function can return an error code that says so. \$\endgroup\$

xxbbcc
– xxbbcc

2015年12月02日 23:20:44 +00:00
Commented Dec 2, 2015 at 23:20
1

\$\begingroup\$ This would leave the function as nothing but an inefficient wrapper around fread(3) and require the caller to do half of the work before each call. Returning allocated memory is idiomatic C; just be sure the function's documentation points out that this is happening. \$\endgroup\$

Blrfl
– Blrfl

2015年12月03日 12:21:14 +00:00
Commented Dec 3, 2015 at 12:21

| Show 1 more comment

Stack Exchange Network

Memory-safe file-reading in C

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Memory-safe file-reading in C

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions