I've recently finished a small project intended to better understand C. The program will output the ten largest files found (recursively in the event of subdirectories) in a specified directory.
I've just improved my code to use dynamic memory allocation instead of allocating (statically) a massive array. Please critique not only my use of dynamic memory allocation, but any additional anomalies or smells in terms of C conventions. The program compiles, runs, and outputs identical findings compared to the same program I implemented using Rust.
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>
#include "fs_info.h"
/*
A struct to contain the name of a filesystem entry and its size in bytes.
An array of this type will be used to catalog all filesystem entries for
the directory specified as command line argument.
*/
struct FS_Info {
char name[1024];
int size;
};
// initial size of array used to contain filesystem entries
int N = 100;
// number of largest entries to output
const int num_entries = 10;
// global pointer to current FS_Info array
// will be updated as the array is resized using dynamic memory allocation
struct FS_Info *curr_fs_info_ptr;
// determine the size of the info struct, used for qsort
const int info_size = sizeof(struct FS_Info);
// used to sort fs_entries array descending in terms of entry size
int compare(const void *a, const void *b)
{
struct FS_Info *entryA = (struct FS_Info *)a;
struct FS_Info *entryB = (struct FS_Info *)b;
return (entryB->size - entryA->size);
}
void get_size(char *path)
{
static int items_added = 0;
struct stat st;
if (stat(path, &st) == 0)
{
if (items_added < N) // if array capacity will not be exceeded
{
strcpy(curr_fs_info_ptr[items_added].name, path);
curr_fs_info_ptr[items_added].size = st.st_size;
items_added++;
}
else // double the size of the containing array
{
// puts("Re-allocating array to fit additional fs_entries");
N *= 2;
struct FS_Info *resized_fs_entries = realloc(curr_fs_info_ptr, N * sizeof(*curr_fs_info_ptr));
if (resized_fs_entries)
{
curr_fs_info_ptr = resized_fs_entries;
strcpy(curr_fs_info_ptr[items_added].name, path);
curr_fs_info_ptr[items_added].size = st.st_size;
items_added++;
}
else
{
puts("An error occurred when attempting to resize the array!");
exit(-1);
}
}
}
else
{
printf("Error getting stat for entry %s: %d\n", path, stat(path, &st));
}
}
void walk(const char *currDir)
{
DIR *dir = opendir(currDir);
struct dirent *entry;
if (dir == NULL) // directory could not be opened
{
return;
}
while ((entry = readdir(dir)) != NULL)
{
if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
// if directory is current dir or parent dir
{
continue;
}
char path_to_entry[1024];
snprintf(path_to_entry, sizeof(path_to_entry), "%s/%s", currDir, entry->d_name);
// use path_to_entry to call stats on the entry
get_size(path_to_entry); // fs_entries will only be used on first call
if (entry->d_type == DT_DIR)
{
// recursively visit subdirectories
walk(path_to_entry);
}
}
closedir(dir);
}
int main(int argc, char *argv[])
{
if (argc != 2)
{
printf("Usage: %s <target directory>\n", argv[0]);
}
const char *target_dir = argv[1];
printf("Finding %d largest files in: %s\n", num_entries, target_dir);
/* Create a pointer to the start of the FS_Info array and set the global
variable curr_fs_info_ptr to this memory address
*/
struct FS_Info *fs_entries = malloc(N * sizeof(struct FS_Info));
curr_fs_info_ptr = fs_entries;
// recursively visit all entries in the specified directory
walk(target_dir);
// sort the entries descending by file size
qsort(curr_fs_info_ptr, N, info_size, compare);
// output ten largest files found
for (int i = 0; i < num_entries; i++)
{
printf("%s\t%d\n", curr_fs_info_ptr[i].name, curr_fs_info_ptr[i].size);
}
free(curr_fs_info_ptr); // good practice, but program is over and OS will reclaim anyways
return 0;
}
-
\$\begingroup\$ Please do not edit the question, especially the code, after an answer has been posted. Changing the question may cause answer invalidation. Everyone needs to be able to see what the reviewer was referring to. What to do after the question has been answered. \$\endgroup\$pacmaninbw– pacmaninbw ♦2022年12月26日 14:30:28 +00:00Commented Dec 26, 2022 at 14:30
-
1\$\begingroup\$ @pacmaninbw, thanks for bringing this to my attention, and my sincere apologies. I had placed the struct definition in a header file when working on this originally, and missed the include statement when posting. Thanks for taking the time to provide such an instructive review. \$\endgroup\$John Harrington– John Harrington2022年12月26日 20:51:48 +00:00Commented Dec 26, 2022 at 20:51
5 Answers 5
General Observations
Good job checking all possible compiler errors and warnings.
It appears that the contents of fs_info.h
are already included, so that include should be removed. In the future include those include files that you have written.
The code isn't completely portable because the header file dirent.h
does not exist on Windows 10 systems using Visual Studio 2022. It does compile fine when the include for fs_info.h
is removed on Linux systems (Ubuntu in my case).
Rather than return 0;
at the end of the program the use of EXIT_SUCCESS
will make the code more readable, in the examples below I use EXIT_FAILURE
in the cases where the program should exit due to a problem. These symbolic constants are defined in stdlib.h
for the C programming language and cstdlib
for C++.
Error Handling
There is insufficient error handling, at least in the case where the number of arguments is checked. The program should exit at this point if there is an error, due to the lack of error handling when a user doesn't provide a directory to search the code crashes.
Errors should be reported to stderr
rather than stdout
so instead of using printf()
errors should be printed using fprintf(stderr, ERROR_MESSAGE);
if (argc != 2)
{
printf("Usage: %s <target directory>\n", argv[0]);
return EXIT_FAILURE;
}
Test for Possible Memory Allocation Errors
In modern high-level languages such as C++, memory allocation errors throw an exception that the programmer can catch. This is not the case in the C programming language. While it is rare in modern computers because there is so much memory, memory allocation can fail, especially if the code is working in a limited memory application such as embedded control systems. In the C programming language when memory allocation fails, the functions malloc()
, calloc()
and realloc()
return NULL
. Referencing any memory address through a NULL
pointer results in undefined behavior (UB).
Possible unknown behavior in this case can be a memory page error (in Unix this would be call Segmentation Violation), corrupted data in the program and in very old computers it could even cause the computer to reboot (corruption of the stack pointer).
To prevent this undefined behavior a best practice is to always follow the memory allocation statement with a test that the pointer that was returned is not null.
/* Create a pointer to the start of the FS_Info array and set the global
variable curr_fs_info_ptr to this memory address
*/
struct FS_Info* fs_entries = malloc(N * sizeof(struct FS_Info));
if (!fs_entries)
{
fprintf(stderr, "Malloc of fs_entries failed in main\n");
return EXIT_FAILURE;
}
Note Personally I would have used calloc( size_t num, size_t size )
rather than malloc()
in the memory allocation above. One of the benefits of calloc()
is that it initializes the memory allocated to null values.
Convention When Using Memory Allocation in C
When using malloc()
, calloc()
or realloc()
in C a common convention is to use sizeof *PTR
rather than sizeof(PTR_TYPE)
; this makes the code easier to maintain and less error prone, since less editing is required if the type of the pointer changes.
struct FS_Info* fs_entries = malloc(N * sizeof(*fs_entries));
if (!fs_entries)
{
fprintf(stderr, "Malloc of fs_entries failed in main\n");
return EXIT_FAILURE;
}
In the code there is a constant called info_size
for qsort()
; it might be better to use
qsort(curr_fs_info_ptr, N, sizeof(*curr_fs_info_ptr), compare);
Avoid Global Variables
It is very difficult to read, write, debug and maintain programs that use global variables. Global variables can be modified by any function within the program and therefore require each function to be examined before making changes in the code. In C and C++ global variables impact the namespace and they can cause linking errors if they are defined in multiple files. The answers to this Stack Overflow question provide a fuller explanation.
Variable Names
There must be a more descriptive variable name than N
for the number of items in the array.
The function name get_size()
is misleading, since there is a call to realloc()
in this function. It isn't just getting the size; it is changing the size. Which size is it supposed to get, the size of the file or the size of the allocated array?
-
2\$\begingroup\$ Er,
calloc()
initialises with zero bytes, which are not necessarily "null values" for the relevant type... \$\endgroup\$Toby Speight– Toby Speight2022年12月26日 11:54:25 +00:00Commented Dec 26, 2022 at 11:54
Your implementation of dynamic memory allocation looks good. Well organized and easy to follow. There are a few things I noticed:
- You should check for errors for
opendir
andreaddir
. These functions can fail, you should check and handle the error appropriatly - The same you should to when using
malloc
this will returnNULL
if it was not successfull - You should replace
stat
withfstat
to get the file information.fstat
is faster because it avoid the overhead ofstat
which includes opening and closing a file- If you want to read in files bigger than the maximum of the
st_size
field of thestat
struct which is about 9200 petabytes you need to usefopen
andfread
.
- If you want to read in files bigger than the maximum of the
- The variable
N
should be used as a parameter to theget_size
function instead of beeing used as a global variable, to make the function more modular and easier to reuse - You could also use
qsort_r
instead ofqsort
for the array ofFS_Info
structs.qsort_r
allows to pass a comparison function as a parameter, so you don't need to rely on a global function likecompare
. This would make it more convenient if you need to sort multiple arrays with different comparison functions. - It is also a good practice to use
size_t
for array indices and array sizes, because it is an unsigned integer type which is guaranteed to represent the size of any object in memory. - It would also be better if you define
FS_Info
as a type withtypedef
like this
typedef struct {
char name[1024];
int size;
} FS_Info;
- You could also use
perror
instead ofputs
. This will print the error messages corresponding to the value oferrno
. This could be more informative. - You could also use
strncpy
instead ofstrcpy
in theget_size
function to prevent buffer overflows. If you leave your code like that is not neccesary because you are already usingsnprintf
to preventpath_to_entry
to overflow so therefore the string at thepath
pointer can't be longer than thename
buffer in theFS_Info
struct. It is still a good practice to usestrncpy
strncpy(curr_fs_info_ptr[items_added].name, path, sizeof(curr_fs_info_ptr[items_added].name) - 1);
- Since C strings are null-terminated you should substract 1 from the size of the destination buffer to make sure there is enough room for the null terminator at the end.
char path_to_entry[1024];
snprintf(path_to_entry, sizeof(path_to_entry) - 1, "%s/%s", currDir, entry->d_name);
path_to_entry[sizeof(path_to_entry) - 1] = '0円';
-
\$\begingroup\$
strncpy()
costs the O(sizeof ,name[1024]
). Instead considerstrncat()
,snprintf()
. \$\endgroup\$chux– chux2022年12月27日 20:55:51 +00:00Commented Dec 27, 2022 at 20:55
When I compile this program (after removing the #include
of the missing and unnecessary header), I get a few warnings about unsafe numeric conversions:
gcc-12 -std=c17 -fPIC -gdwarf-4 -g -Wall -Wextra -Wwrite-strings -Wno-parentheses -Wpedantic -Warray-bounds -Wconversion -Wstrict-prototypes -fanalyzer -D_DEFAULT_SOURCE 282156.c -o 282156
282156.c: In function ‘get_size’:
282156.c:50:50: warning: conversion from ‘__off_t’ {aka ‘long int’} to ‘int’ may change value [-Wconversion]
50 | curr_fs_info_ptr[items_added].size = st.st_size;
| ^~
282156.c:59:78: warning: conversion to ‘long unsigned int’ from ‘int’ may change the sign of the result [-Wsign-conversion]
59 | struct FS_Info *resized_fs_entries = realloc(curr_fs_info_ptr, N * sizeof(*curr_fs_info_ptr));
| ^
282156.c:65:54: warning: conversion from ‘__off_t’ {aka ‘long int’} to ‘int’ may change value [-Wconversion]
65 | curr_fs_info_ptr[items_added].size = st.st_size;
| ^~
282156.c: In function ‘main’:
282156.c:129:43: warning: conversion to ‘long unsigned int’ from ‘int’ may change the sign of the result [-Wsign-conversion]
129 | struct FS_Info *fs_entries = malloc(N * sizeof(struct FS_Info));
| ^
282156.c:136:29: warning: conversion to ‘size_t’ {aka ‘long unsigned int’} from ‘int’ may change the sign of the result [-Wsign-conversion]
136 | qsort(curr_fs_info_ptr, N, info_size, compare);
| ^
282156.c:136:32: warning: conversion to ‘size_t’ {aka ‘long unsigned int’} from ‘int’ may change the sign of the result [-Wsign-conversion]
136 | qsort(curr_fs_info_ptr, N, info_size, compare);
| ^~~~~~~~~
These should be easy enough to fix.
Let's have a look at the comparison function:
int compare(const void *a, const void *b) { struct FS_Info *entryA = (struct FS_Info *)a; struct FS_Info *entryB = (struct FS_Info *)b; return (entryB->size - entryA->size); }
Although our program is a single translation unit, it's good to get into the habit of declaring them with internal linkage (using the static
keyword) if they are not intended to be shared with other TUs.
Don't cast away const
like that. In fact, it's safer to just use implicit conversion from const void*
which will fail to compile if you omit const
.
We do nothing to avoid the subtraction overflowing, which causes Undefined Behaviour. The standard idiom to avoid that is to subtract the results of two comparisons:
static int compare(const void *pa, const void *pb)
{
const struct FS_Info *a = pa;
const struct FS_Info *b = pb;
return (b->size > a->size) - (a->size > b->size);
}
Usability could be better. For example, when realloc()
fails, we do this:
puts("An error occurred when attempting to resize the array!"); exit(-1);
But "resizing the array" is meaningless to the user, who knows nothing about the implementation. Think about what they need to know. And program exit codes are normally small positive integers; on Linux systems at least, that -1
will end up converted to a larger positive number.
I'd replace that code with
fputs("error: Out of memory\n", stderr);
exit(EXIT_FAILURE);
Another usability improvement would be to default the program argument to use the current directory if none was specified, rather than printing to standard output and then proceeding to dereference a null pointer.
I don't like the magic numbers scattered through the program, especially char name[1024]
and char path_to_entry[1024]
. On many systems, that's insufficient for a path name, and I don't see any guard to prevent buffer overrun.
Sorting the full list of files requires enough memory to hold FS_Info
structures for all these files simultaneously, but we can avoid this unlimited reallocation. There's a quick fix: where we currently reallocate, we could sort the entries and discard all but the top ten, leaving empty space to read in another batch. A more involved, but better, implementation is to stop using qsort()
and instead implement a heap, retaining only the top ten items at all times.
-
\$\begingroup\$ Thanks for your feedback. I've replaced
char path_to_entry[1024]
withchar path_to_entry[PATH_MAX]
as this seems like a platform independent way of ensuring a given path will have sufficient space. I'm also usingsnprintf()
to read a path name as it guards against buffer overflow. Next, I will implement a heap. \$\endgroup\$John Harrington– John Harrington2022年12月27日 00:05:03 +00:00Commented Dec 27, 2022 at 0:05 -
\$\begingroup\$ Paths can still exceed
PATH_MAX
, so you can't forego the checking! \$\endgroup\$Toby Speight– Toby Speight2022年12月27日 11:10:44 +00:00Commented Dec 27, 2022 at 11:10
Why are you building a giant list of file sizes and then sorting them? Why should this program chew up memory like that? You want the 10 largest files, you need an array that will hold the 10 largest file sizes and pointers to their names. No more.
Compare the size of the current file to the size of the smallest of the 10 and if it is less, drop it and move on. If it is larger than that file, compare it to the next smallest and keep walking up the list. Once you determine it is going to be one of the top 10, you can use bsearch to figure out where in the top 10 it goes as efficiently as possible.
Malloc isn't free, realloc less so, Allocate N slots for the top N largest files on the stack with alloca and don't use any more memory than that. You only need enough memory for each of the 10 files you will print + the current filename you are working on. Each of the filenames can reuse a buffer unless the size of the name is longer than the previous name.
-
\$\begingroup\$ Just noticed that @Toby flagged the storage problem I just pointed out. Oops \$\endgroup\$boatcoder– boatcoder2022年12月26日 13:38:21 +00:00Commented Dec 26, 2022 at 13:38
-
1\$\begingroup\$ One - this was a good learning experience for the poster. Two using your logic why even
malloc()
the memory, an array declared in the code would be good enough. You might want to explain why memory allocation is expensive (occasional system calls). Plus one because it is a decent answer. You might want to read codereview.meta.stackexchange.com/questions/5777/… \$\endgroup\$2022年12月26日 14:41:46 +00:00Commented Dec 26, 2022 at 14:41
Please critique not only my use of dynamic memory allocation,
No need for any dynamic memory allocation.
Top level code only needs
#deifne TOP_N 10
struct FS_Info Top[TOP_N];
char path_to_entry[1024];
Magic numbers
Why 1024
? Typical file systems offset some MAX_PATH_NAME
or the like.
Error Checking
Code like snprintf(path_to_entry, sizeof(path_to_entry), "%s/%s", currDir, entry->d_name);
does not detect an insufficient buffer.
Weak file size
File sizes can readily exceed INT_MAX
. Use the same type in struct stat st
or a wider type than int
.
// int size;
long long size;