I'm writing a little library for some undocumented file formats. The original program uses a path string to find files in an archive and this function tries to do the same.
The archive contains a magic string, an offset into the file list, the data in the files and the file list itself. A entry in the filelist consists of a path (256 bytes long, 0-padded) and three numbers, the second one being the offset from the archive start to the entry file. The entries are sorted alphabetically respecting case; here are some mockup paths to demonstrate the ordering:
Entry/Graphics/Sprite.cgf
Entry/Sound/Ding.wav
Quit/Graphics/exit.cgf
_RAW8123
Here are the relevant utility functions and defines. These are defined in Utils.h
and Utils.c
, I've put them here together for context.
#define BYTESZE sizeof(char)
typedef unsigned int uint;
typedef unsigned char byte;
uint readU4L(FILE* file) {
byte buf[4];
fread(buf, BYTESZE, 4, file);
return buf[3] << 24 | buf[2] << 16 | buf[1] << 8 | buf[0];
}
Finally, here's the "library":
// ====== BIGfile.h
#define BIG_MAGIC_LEN 7
#define BIG_SEEK_SUCC 1
#define BIG_SEEK_FAIL 0
typedef struct BIGFile {
char magic[BIG_MAGIC_LEN];
uint offset;
FILE* read;
} BIGFile;
BIGFile* load_BIGFile(char* fpath);
int seekToEntry(BIGFile* bigfile, char* path);
// ====== BIGFile.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "Utils.h"
#include "BIGFile.h"
BIGFile* load_BIGFile(char* fpath) {
BIGFile* bigfile = malloc(sizeof(BIGFile));
fopen_s(&(bigfile->read), fpath, "rb");
if (!bigfile->read) {
printf("Can't open file.\n");
return 0;
}
fread(bigfile->magic, BYTESZE, BIG_MAGIC_LEN, bigfile->read);
bigfile->offset = readU4L(bigfile->read);
return bigfile;
}
int seekToEntry(BIGFile* bigfile, char* path) {
fseek(bigfile->read, bigfile->offset, SEEK_SET);
byte buf[256];
int pos = 0;
// while not EOF, read string part
while (fread(buf, BYTESZE, 256, bigfile->read)) {
// seek over entry data
fseek(bigfile->read, 12, SEEK_CUR);
// loop until given path != read path
while (path[pos]==buf[pos]) {
pos++;
// path has ended, so buf == path
if (!path[pos]) {
// seek back into entry data
fseek(bigfile->read, -8, SEEK_CUR);
// read unsigned little endian 4-byte int
uint res = readU4L(bigfile->read);
// seek there and return success
fseek(bigfile->read, res, SEEK_SET);
//printf("found at %d\n", res);
return BIG_SEEK_SUCC;
}
}
// path char is smaller than buf char
// we seeked past where the entry could bem return failure
if (path[pos]<buf[pos]) {
return BIG_SEEK_FAIL;
}
}
// EOF, return failure
return BIG_SEEK_FAIL;
}
I'd like some feedback on general style, as well as on the seekToEntry
function, perhaps this can be improved somehow? Thanks in advance!
-
\$\begingroup\$ After implementing the suggested changes, how would I go about getting the new version reviewed? Do I post a new question or edit this one? \$\endgroup\$mindoverflow– mindoverflow2021年03月02日 18:07:40 +00:00Commented Mar 2, 2021 at 18:07
2 Answers 2
Overall, your coding style is pretty clean, but here are some improvement suggestions to make the code more robust and portable, as well as safer:
Use fixed sized integers
In C, sizeof(char)
is guaranteed to be 1 by the standard, but sizeof(unsigned int)
is not guaranteed to be 4.
Fortunately, C99 introduces fixed size integers such as uint32_t
that is guaranteed to be 32 bit (4 bytes) defined in stdint.h.
Since you are using non standard fopen_s
I assume you are not required to stick to ANSI C.
Add error checking
In the code you posted, you never check the return value of your file read operations or memory allocation operations.
You also never check the sanity of the data you read.
For a library, robust error handling is important, so you should always check if read operations success, and if data such as entry sizes or offsets are reasonable - match expected size / do not go past end of file.
On error, you should return one of several error codes you define for you library and list in documentation.
Reduce disk operations
You will probably not feel this in your testing, both due to modern hardware, small file size, and multiple caches on different levels, but your seekToEntry
function is inefficient due to many back and forward disk operations (both seek and read).
There are two approaches here, which you can choose depending on how you want to trade memory for speed, and how would your library be used:
Preload the entire "catalog" of file entries in to array of structures in memory and search that when you need to find entry position in file.
Since your entries are guaranteed to be sorted, you may want to implement a more efficient search, maybe arrange them in a tree by prefix.Since every entry has fixed size, read at least the current entry completely in to a structure.
If you don't want to load them all to memory in advance, at least load all data (name, size, offset) for each entry as you iterate, and save your self a bunch ofseek
operations.
Either approach will also give you the opportunity to create a documented structure for "entry" of this archive file, which will be good for you and your library users.
Create a name space
Since this is a library, it may be used in a big project with other libraries or many functions defined in the project it self.
You want to avoid name collision for your functions, structures, and defines.
Though C does not support special syntax for namespces, one convention for libraries is to use a fixed and consistent prefix to names followed by an underscore.
If your library is for handling "Big File Archive" you could prefix all your names with BFA_ like so: BFA_loadFile
, BFA_seekToEntry
.
Avoid non-standard functions
While MS may claim fopen_s
is more secure, it is also non-standard, so if you or one of your library users ever switches compiler or OS, your library will no longer work.
You should use C standard fopen
instead and just add proper error checking as specified above.
Also, there is no reason to reimplement strncmp
as you do with your loops.
Unlike strcmp
which is not safe if the string is not guaranteed to be null terminated, strncmp
is perfectly safe for your use case where maximum length is known in advance.
Use consistent interface
It is hard to tell with only two functions, but you seem to have some difference in how your library functions work:
One returns an "object" (the BIGFile
structure), another returns error / success codes instead.
While this is technically valid approach, it would be more consistent and comfortable for the user to have a unified interface where all functions would return the same thing - error codes. If a function must also return a value for the user, such as a pointer to a structure, it should return it using one of the parameters.
Your open function could look like this:
/**
* Opens an archive file and loads its header
*
* @param fpath path to the archive file
* @param header place to return loaded header data if the file is successfully loaded (structure will be allocated by the function)
* @return success / error status code
*/
int BFA_loadFile(char *fpath, BFA_FileHeader **header) {
...
}
Here, I also included one common method of documenting functions in libraries which can be used with automated tools such as doxygen to produce documentation files in HTML and other formats.
-
\$\begingroup\$ Thanks for your time! I'm not sure what you mean with the fixed size ints, I did change the
sizeof(char)
to just1
though. I'm also not sure how I would go about reducing disk operations. Keeping a >1000 entry catalogue in memory seems a bit suboptimal, but how would I get the data from the 256+12 byte long blob? About the interface, that's consistent with the rest of the library. Is it acceptable forload_BIGFile
to set theerrno
or is it enough to just returnNULL
on failure? I could also add aBIGFile*
arg and return a status code, but that seems a bit weird to me. \$\endgroup\$mindoverflow– mindoverflow2021年03月02日 17:55:53 +00:00Commented Mar 2, 2021 at 17:55 -
\$\begingroup\$ @mindoverflow 1. In C,
int
is not guaranteed to be 4 bytes, butuint32_t
is guaranteed to be 32 bits which is 4 bytes. That is what I mean by "fixed size". 2. By modern standard 260K of memory (which is roughly what you need for 1000 entries) is nothing, and is much better than doing 2 - 3 seek operations for every entry. But, if it is still too much for you, at leas read the whole entry before checking, instead of reading the name than going back to read size / offset it it matches. Coding is always a trade off, and with today's hardware, trading RAM for speed can be good! \$\endgroup\$Lev M.– Lev M.2021年03月02日 19:11:58 +00:00Commented Mar 2, 2021 at 19:11 -
\$\begingroup\$ @mindoverflow 3. It is up to you how detailed the error reporting of your library should be. You can just return NULL on error, or you can choose to help the user of the library and try to give them more info on why it failed via
errno
. To decide, ask your self one simple question: Would knowing why the function failed help the library user to fix the issue? \$\endgroup\$Lev M.– Lev M.2021年03月02日 19:16:52 +00:00Commented Mar 2, 2021 at 19:16 -
\$\begingroup\$ Strictly speaking,
uint32_t
is only guaranteed to be 32 bits if it is defined. And it's certainly not guaranteed to be 4 bytes (= chars). \$\endgroup\$Toby Speight– Toby Speight2021年03月02日 20:01:45 +00:00Commented Mar 2, 2021 at 20:01 -
\$\begingroup\$ @TobySpeight in the OP's case, he is using these vars to read binary values form a file, so we are talking bytes as general term not chars (or some
BYTE
style def). And as far as I know, byte today is always defined as 8 bits. For 'chars', it seems they may be larger, but not less then 8 bits, and that is super rare: stackoverflow.com/questions/5516044/system-where-1-byte-8-bit stackoverflow.com/questions/2098149/… If I am missing something, I would be glad to read some reference on this (C and sizes can be problematic). \$\endgroup\$Lev M.– Lev M.2021年03月02日 20:42:01 +00:00Commented Mar 2, 2021 at 20:42
There's no include guards in the header. You'll want to fix that immediately.
#define BYTESZE sizeof(char)
That's a pretty pointless definition, since sizeof
yields the number of char
needed for its argument, meaning sizeof (char)
is 1 by definition.
uint readU4L(FILE* file) {
FILE
isn't defined - looks like you're missing an include of <stdio.h>
from this header.
fread(buf, BYTESZE, 4, file);
Why are we ignoring the return value from fread()
? If it's not successful, we'll have unitialised values in buf, meaning our program has undefined behaviour.
return buf[3] << 24 | buf[2] << 16 | buf[1] << 8 | buf[0];
There's a subtle type conversion problem here. The unsigned char
values from buf
get promoted to int
for the arithmetic, but then converted to unsigned int
for the return value. I think we actually want to treat them as unsigned values throughout, so we should convert them to unsigned int
before the arithmetic.
Note also that << 24
may be outside the valid range of int
. Perhaps consider using uint32_t
for this value? (Include <stdint.h>
to define it).
int seekToEntry(BIGFile* bigfile, char* path);
Do we need to be able to write to *path
? Judging from the name, that should be an input, and passed as const char*
.
#include <stdio.h> #include <stdlib.h> #include <string.h> #include "Utils.h" #include "BIGFile.h"
It's best to make "BIGFile.h"
the first thing that you include from BIGFile.c
- that helps spot problems such as the missing include noted earlier.
<string.h>
is included but not needed.
BIGFile* bigfile = malloc(sizeof(BIGFile));
What do we do if bigfile
is a null pointer? It seems we go ahead and use it anyway (more Undefined Behaviour).
A good habit is to apply sizeof
to the destination rather than its type; this aids readers when the declaration is far away. For example, in the following case it's much more obviously a correct match:
bigfile = malloc(sizeof *bigfile);
fopen_s(&(bigfile->read), fpath, "rb");
What's the point of using Annex K fopen_s()
function if we just ignore the return value? Also, I don't see where we check that __STDC_LIB_EXT1__
is defined nor where we set __STDC_WANT_LIB_EXT1__
to ensure that the function is provided. It seems we'd be better off using plain old fopen()
and checking for a null pointer in the normal way.
if (!bigfile->read) { printf("Can't open file.\n"); return 0; }
At last, we see some error checking. There's a couple of things wrong here:
- We're writing to
stdout
instead ofstderr
. - We neglected to
free()
the memory we (possibly) allocated withmalloc()
, giving a memory leak.
fread(bigfile->magic, BYTESZE, BIG_MAGIC_LEN, bigfile->read);
Again, there's no check that we successfully read as much as we wanted here. And shouldn't we be checking that the file magic matches the expected value, so we don't attempt to parse files not in our expected format?
fseek(bigfile->read, bigfile->offset, SEEK_SET);
fseek()
can fail too, you know - there's nothing to say at this stage that we're not reading from a pipe, for example.
byte buf[256]; while (fread(buf, BYTESZE, 256, bigfile->read)) {
Instead of writing that magic number twice (and risking missing one when modifying it), we can automatically use the available size: fread(buf, 1, sizeof buf, bigfile->read)
. Again, we're ignoring the vitally-important return value.
while (path[pos]==buf[pos]) { pos++;
It looks like we're doing some kind of strcmp()
here - why not use the standard function? It certainly looks wrong that we don't initialize pos
to 0 before starting this comparison - if that's really intended, it warrants a comment to warn readers.
-
\$\begingroup\$ Thanks! I changed
sizeof(char)
to1
shortly after posting. How would I do the conversion correctly? Is(uint) buf[3] << 24 | (uint) buf[2] << 16 | (uint) buf[1] << 8 | (uint) buf[0];
better? \$\endgroup\$mindoverflow– mindoverflow2021年03月02日 18:06:41 +00:00Commented Mar 2, 2021 at 18:06 -
\$\begingroup\$ That looks correct. There might be a more eloquent way to write it, but can't think immediately what that might be. \$\endgroup\$Toby Speight– Toby Speight2021年03月02日 19:59:47 +00:00Commented Mar 2, 2021 at 19:59