Simple library for an undocumented archive format

Question 1

I'm writing a little library for some undocumented file formats. The original program uses a path string to find files in an archive and this function tries to do the same.

The archive contains a magic string, an offset into the file list, the data in the files and the file list itself. A entry in the filelist consists of a path (256 bytes long, 0-padded) and three numbers, the second one being the offset from the archive start to the entry file. The entries are sorted alphabetically respecting case; here are some mockup paths to demonstrate the ordering:

Entry/Graphics/Sprite.cgf
Entry/Sound/Ding.wav
Quit/Graphics/exit.cgf
_RAW8123

Here are the relevant utility functions and defines. These are defined in Utils.h and Utils.c, I've put them here together for context.

#define BYTESZE sizeof(char)
typedef unsigned int uint;
typedef unsigned char byte;
uint readU4L(FILE* file) {
 byte buf[4];
 fread(buf, BYTESZE, 4, file);
 return buf[3] << 24 | buf[2] << 16 | buf[1] << 8 | buf[0];
}

Finally, here's the "library":

// ====== BIGfile.h
#define BIG_MAGIC_LEN 7
#define BIG_SEEK_SUCC 1
#define BIG_SEEK_FAIL 0
typedef struct BIGFile {
 char magic[BIG_MAGIC_LEN];
 uint offset;
 FILE* read;
} BIGFile;
BIGFile* load_BIGFile(char* fpath);
int seekToEntry(BIGFile* bigfile, char* path);

// ====== BIGFile.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "Utils.h"
#include "BIGFile.h"
BIGFile* load_BIGFile(char* fpath) {
 BIGFile* bigfile = malloc(sizeof(BIGFile));
 fopen_s(&(bigfile->read), fpath, "rb");
 if (!bigfile->read) {
 printf("Can't open file.\n");
 return 0;
 }
 fread(bigfile->magic, BYTESZE, BIG_MAGIC_LEN, bigfile->read);
 bigfile->offset = readU4L(bigfile->read);
 return bigfile;
}
int seekToEntry(BIGFile* bigfile, char* path) {
 fseek(bigfile->read, bigfile->offset, SEEK_SET);
 
 byte buf[256];
 int pos = 0;
 // while not EOF, read string part
 while (fread(buf, BYTESZE, 256, bigfile->read)) {
 
 // seek over entry data
 fseek(bigfile->read, 12, SEEK_CUR);
 
 // loop until given path != read path
 while (path[pos]==buf[pos]) {
 pos++;
 // path has ended, so buf == path
 if (!path[pos]) {
 // seek back into entry data
 fseek(bigfile->read, -8, SEEK_CUR);
 // read unsigned little endian 4-byte int
 uint res = readU4L(bigfile->read);
 // seek there and return success
 fseek(bigfile->read, res, SEEK_SET);
 //printf("found at %d\n", res);
 return BIG_SEEK_SUCC;
 }
 }
 // path char is smaller than buf char
 // we seeked past where the entry could bem return failure
 if (path[pos]<buf[pos]) {
 return BIG_SEEK_FAIL;
 }
 }
 // EOF, return failure
 return BIG_SEEK_FAIL;
}

I'd like some feedback on general style, as well as on the seekToEntry function, perhaps this can be improved somehow? Thanks in advance!

Question 2

After implementing the suggested changes, how would I go about getting the new version reviewed? Do I post a new question or edit this one?

Question 3

Overall, your coding style is pretty clean, but here are some improvement suggestions to make the code more robust and portable, as well as safer:

Use fixed sized integers

In C, sizeof(char) is guaranteed to be 1 by the standard, but sizeof(unsigned int) is not guaranteed to be 4.

Fortunately, C99 introduces fixed size integers such as uint32_t that is guaranteed to be 32 bit (4 bytes) defined in stdint.h.
Since you are using non standard fopen_s I assume you are not required to stick to ANSI C.

Add error checking

In the code you posted, you never check the return value of your file read operations or memory allocation operations.

You also never check the sanity of the data you read.

For a library, robust error handling is important, so you should always check if read operations success, and if data such as entry sizes or offsets are reasonable - match expected size / do not go past end of file.

On error, you should return one of several error codes you define for you library and list in documentation.

Reduce disk operations

You will probably not feel this in your testing, both due to modern hardware, small file size, and multiple caches on different levels, but your seekToEntry function is inefficient due to many back and forward disk operations (both seek and read).

There are two approaches here, which you can choose depending on how you want to trade memory for speed, and how would your library be used:

Preload the entire "catalog" of file entries in to array of structures in memory and search that when you need to find entry position in file.
Since your entries are guaranteed to be sorted, you may want to implement a more efficient search, maybe arrange them in a tree by prefix.
Since every entry has fixed size, read at least the current entry completely in to a structure.
If you don't want to load them all to memory in advance, at least load all data (name, size, offset) for each entry as you iterate, and save your self a bunch of seek operations.

Either approach will also give you the opportunity to create a documented structure for "entry" of this archive file, which will be good for you and your library users.

Create a name space

Since this is a library, it may be used in a big project with other libraries or many functions defined in the project it self.

You want to avoid name collision for your functions, structures, and defines.

Though C does not support special syntax for namespces, one convention for libraries is to use a fixed and consistent prefix to names followed by an underscore.

If your library is for handling "Big File Archive" you could prefix all your names with BFA_ like so: BFA_loadFile, BFA_seekToEntry.

Avoid non-standard functions

While MS may claim fopen_s is more secure, it is also non-standard, so if you or one of your library users ever switches compiler or OS, your library will no longer work.

You should use C standard fopen instead and just add proper error checking as specified above.

Also, there is no reason to reimplement strncmp as you do with your loops.
Unlike strcmp which is not safe if the string is not guaranteed to be null terminated, strncmp is perfectly safe for your use case where maximum length is known in advance.

Use consistent interface

It is hard to tell with only two functions, but you seem to have some difference in how your library functions work:

One returns an "object" (the BIGFile structure), another returns error / success codes instead.

While this is technically valid approach, it would be more consistent and comfortable for the user to have a unified interface where all functions would return the same thing - error codes. If a function must also return a value for the user, such as a pointer to a structure, it should return it using one of the parameters.

Your open function could look like this:

/**
 * Opens an archive file and loads its header
 *
 * @param fpath path to the archive file
 * @param header place to return loaded header data if the file is successfully loaded (structure will be allocated by the function)
 * @return success / error status code
 */
int BFA_loadFile(char *fpath, BFA_FileHeader **header) {
 ...
}

Here, I also included one common method of documenting functions in libraries which can be used with automated tools such as doxygen to produce documentation files in HTML and other formats.

Question 4

Thanks for your time! I'm not sure what you mean with the fixed size ints, I did change the sizeof(char) to just 1 though. I'm also not sure how I would go about reducing disk operations. Keeping a >1000 entry catalogue in memory seems a bit suboptimal, but how would I get the data from the 256+12 byte long blob? About the interface, that's consistent with the rest of the library. Is it acceptable for load_BIGFile to set the errno or is it enough to just return NULL on failure? I could also add a BIGFile* arg and return a status code, but that seems a bit weird to me.

Question 5

@mindoverflow 1. In C, int is not guaranteed to be 4 bytes, but uint32_t is guaranteed to be 32 bits which is 4 bytes. That is what I mean by "fixed size". 2. By modern standard 260K of memory (which is roughly what you need for 1000 entries) is nothing, and is much better than doing 2 - 3 seek operations for every entry. But, if it is still too much for you, at leas read the whole entry before checking, instead of reading the name than going back to read size / offset it it matches. Coding is always a trade off, and with today's hardware, trading RAM for speed can be good!

Question 6

@mindoverflow 3. It is up to you how detailed the error reporting of your library should be. You can just return NULL on error, or you can choose to help the user of the library and try to give them more info on why it failed via errno. To decide, ask your self one simple question: Would knowing why the function failed help the library user to fix the issue?

Question 7

Strictly speaking, uint32_t is only guaranteed to be 32 bits if it is defined. And it's certainly not guaranteed to be 4 bytes (= chars).

Question 8

@TobySpeight in the OP's case, he is using these vars to read binary values form a file, so we are talking bytes as general term not chars (or some BYTE style def). And as far as I know, byte today is always defined as 8 bits. For 'chars', it seems they may be larger, but not less then 8 bits, and that is super rare: stackoverflow.com/questions/5516044/system-where-1-byte-8-bit stackoverflow.com/questions/2098149/… If I am missing something, I would be glad to read some reference on this (C and sizes can be problematic).

Question 9

There's no include guards in the header. You'll want to fix that immediately.

#define BYTESZE sizeof(char)

That's a pretty pointless definition, since sizeof yields the number of char needed for its argument, meaning sizeof (char) is 1 by definition.

uint readU4L(FILE* file) {

FILE isn't defined - looks like you're missing an include of <stdio.h> from this header.

 fread(buf, BYTESZE, 4, file);

Why are we ignoring the return value from fread()? If it's not successful, we'll have unitialised values in buf, meaning our program has undefined behaviour.

 return buf[3] << 24 | buf[2] << 16 | buf[1] << 8 | buf[0];

There's a subtle type conversion problem here. The unsigned char values from buf get promoted to int for the arithmetic, but then converted to unsigned int for the return value. I think we actually want to treat them as unsigned values throughout, so we should convert them to unsigned int before the arithmetic.

Note also that << 24 may be outside the valid range of int. Perhaps consider using uint32_t for this value? (Include <stdint.h> to define it).

int seekToEntry(BIGFile* bigfile, char* path);

Do we need to be able to write to *path? Judging from the name, that should be an input, and passed as const char*.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "Utils.h"
#include "BIGFile.h"

It's best to make "BIGFile.h" the first thing that you include from BIGFile.c - that helps spot problems such as the missing include noted earlier.

<string.h> is included but not needed.

 BIGFile* bigfile = malloc(sizeof(BIGFile));

What do we do if bigfile is a null pointer? It seems we go ahead and use it anyway (more Undefined Behaviour).

A good habit is to apply sizeof to the destination rather than its type; this aids readers when the declaration is far away. For example, in the following case it's much more obviously a correct match:

bigfile = malloc(sizeof *bigfile);

fopen_s(&(bigfile->read), fpath, "rb");

What's the point of using Annex K fopen_s() function if we just ignore the return value? Also, I don't see where we check that __STDC_LIB_EXT1__ is defined nor where we set __STDC_WANT_LIB_EXT1__ to ensure that the function is provided. It seems we'd be better off using plain old fopen() and checking for a null pointer in the normal way.

if (!bigfile->read) {
 printf("Can't open file.\n");
 return 0;
}

At last, we see some error checking. There's a couple of things wrong here:

We're writing to stdout instead of stderr.
We neglected to free() the memory we (possibly) allocated with malloc(), giving a memory leak.

fread(bigfile->magic, BYTESZE, BIG_MAGIC_LEN, bigfile->read);

Again, there's no check that we successfully read as much as we wanted here. And shouldn't we be checking that the file magic matches the expected value, so we don't attempt to parse files not in our expected format?

fseek(bigfile->read, bigfile->offset, SEEK_SET);

fseek() can fail too, you know - there's nothing to say at this stage that we're not reading from a pipe, for example.

byte buf[256];
while (fread(buf, BYTESZE, 256, bigfile->read)) {

Instead of writing that magic number twice (and risking missing one when modifying it), we can automatically use the available size: fread(buf, 1, sizeof buf, bigfile->read). Again, we're ignoring the vitally-important return value.

 while (path[pos]==buf[pos]) {
 pos++;

It looks like we're doing some kind of strcmp() here - why not use the standard function? It certainly looks wrong that we don't initialize pos to 0 before starting this comparison - if that's really intended, it warrants a comment to warn readers.

Question 10

Thanks! I changed sizeof(char) to 1 shortly after posting. How would I do the conversion correctly? Is (uint) buf[3] << 24 | (uint) buf[2] << 16 | (uint) buf[1] << 8 | (uint) buf[0]; better?

Question 11

That looks correct. There might be a more eloquent way to write it, but can't think immediately what that might be.

Lev M. Lev M. 1,0816 silver badges8 bronze badges · Answer 1 · 2021-02-27 13:18:07Z

Overall, your coding style is pretty clean, but here are some improvement suggestions to make the code more robust and portable, as well as safer:

Use fixed sized integers

In C, sizeof(char) is guaranteed to be 1 by the standard, but sizeof(unsigned int) is not guaranteed to be 4.

Fortunately, C99 introduces fixed size integers such as uint32_t that is guaranteed to be 32 bit (4 bytes) defined in stdint.h.
Since you are using non standard fopen_s I assume you are not required to stick to ANSI C.

Add error checking

In the code you posted, you never check the return value of your file read operations or memory allocation operations.

You also never check the sanity of the data you read.

For a library, robust error handling is important, so you should always check if read operations success, and if data such as entry sizes or offsets are reasonable - match expected size / do not go past end of file.

On error, you should return one of several error codes you define for you library and list in documentation.

Reduce disk operations

You will probably not feel this in your testing, both due to modern hardware, small file size, and multiple caches on different levels, but your seekToEntry function is inefficient due to many back and forward disk operations (both seek and read).

There are two approaches here, which you can choose depending on how you want to trade memory for speed, and how would your library be used:

Preload the entire "catalog" of file entries in to array of structures in memory and search that when you need to find entry position in file.
Since your entries are guaranteed to be sorted, you may want to implement a more efficient search, maybe arrange them in a tree by prefix.
Since every entry has fixed size, read at least the current entry completely in to a structure.
If you don't want to load them all to memory in advance, at least load all data (name, size, offset) for each entry as you iterate, and save your self a bunch of seek operations.

Either approach will also give you the opportunity to create a documented structure for "entry" of this archive file, which will be good for you and your library users.

Create a name space

Since this is a library, it may be used in a big project with other libraries or many functions defined in the project it self.

You want to avoid name collision for your functions, structures, and defines.

Though C does not support special syntax for namespces, one convention for libraries is to use a fixed and consistent prefix to names followed by an underscore.

If your library is for handling "Big File Archive" you could prefix all your names with BFA_ like so: BFA_loadFile, BFA_seekToEntry.

Avoid non-standard functions

While MS may claim fopen_s is more secure, it is also non-standard, so if you or one of your library users ever switches compiler or OS, your library will no longer work.

You should use C standard fopen instead and just add proper error checking as specified above.

Also, there is no reason to reimplement strncmp as you do with your loops.
Unlike strcmp which is not safe if the string is not guaranteed to be null terminated, strncmp is perfectly safe for your use case where maximum length is known in advance.

Use consistent interface

It is hard to tell with only two functions, but you seem to have some difference in how your library functions work:

One returns an "object" (the BIGFile structure), another returns error / success codes instead.

While this is technically valid approach, it would be more consistent and comfortable for the user to have a unified interface where all functions would return the same thing - error codes. If a function must also return a value for the user, such as a pointer to a structure, it should return it using one of the parameters.

Your open function could look like this:

/**
 * Opens an archive file and loads its header
 *
 * @param fpath path to the archive file
 * @param header place to return loaded header data if the file is successfully loaded (structure will be allocated by the function)
 * @return success / error status code
 */
int BFA_loadFile(char *fpath, BFA_FileHeader **header) {
 ...
}

Here, I also included one common method of documenting functions in libraries which can be used with automated tools such as doxygen to produce documentation files in HTML and other formats.

Thanks for your time! I'm not sure what you mean with the fixed size ints, I did change the sizeof(char) to just 1 though. I'm also not sure how I would go about reducing disk operations. Keeping a >1000 entry catalogue in memory seems a bit suboptimal, but how would I get the data from the 256+12 byte long blob? About the interface, that's consistent with the rest of the library. Is it acceptable for load_BIGFile to set the errno or is it enough to just return NULL on failure? I could also add a BIGFile* arg and return a status code, but that seems a bit weird to me.
@mindoverflow 1. In C, int is not guaranteed to be 4 bytes, but uint32_t is guaranteed to be 32 bits which is 4 bytes. That is what I mean by "fixed size". 2. By modern standard 260K of memory (which is roughly what you need for 1000 entries) is nothing, and is much better than doing 2 - 3 seek operations for every entry. But, if it is still too much for you, at leas read the whole entry before checking, instead of reading the name than going back to read size / offset it it matches. Coding is always a trade off, and with today's hardware, trading RAM for speed can be good!
@mindoverflow 3. It is up to you how detailed the error reporting of your library should be. You can just return NULL on error, or you can choose to help the user of the library and try to give them more info on why it failed via errno. To decide, ask your self one simple question: Would knowing why the function failed help the library user to fix the issue?
Strictly speaking, uint32_t is only guaranteed to be 32 bits if it is defined. And it's certainly not guaranteed to be 4 bytes (= chars).
@TobySpeight in the OP's case, he is using these vars to read binary values form a file, so we are talking bytes as general term not chars (or some BYTE style def). And as far as I know, byte today is always defined as 8 bits. For 'chars', it seems they may be larger, but not less then 8 bits, and that is super rare: stackoverflow.com/questions/5516044/system-where-1-byte-8-bit stackoverflow.com/questions/2098149/… If I am missing something, I would be glad to read some reference on this (C and sizes can be problematic).

Toby Speight Toby Speight 87.9k14 gold badges104 silver badges325 bronze badges · Answer 2 · 2021-02-27 12:51:30Z

There's no include guards in the header. You'll want to fix that immediately.

#define BYTESZE sizeof(char)

That's a pretty pointless definition, since sizeof yields the number of char needed for its argument, meaning sizeof (char) is 1 by definition.

uint readU4L(FILE* file) {

FILE isn't defined - looks like you're missing an include of <stdio.h> from this header.

 fread(buf, BYTESZE, 4, file);

Why are we ignoring the return value from fread()? If it's not successful, we'll have unitialised values in buf, meaning our program has undefined behaviour.

 return buf[3] << 24 | buf[2] << 16 | buf[1] << 8 | buf[0];

There's a subtle type conversion problem here. The unsigned char values from buf get promoted to int for the arithmetic, but then converted to unsigned int for the return value. I think we actually want to treat them as unsigned values throughout, so we should convert them to unsigned int before the arithmetic.

Note also that << 24 may be outside the valid range of int. Perhaps consider using uint32_t for this value? (Include <stdint.h> to define it).

int seekToEntry(BIGFile* bigfile, char* path);

Do we need to be able to write to *path? Judging from the name, that should be an input, and passed as const char*.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "Utils.h"
#include "BIGFile.h"

It's best to make "BIGFile.h" the first thing that you include from BIGFile.c - that helps spot problems such as the missing include noted earlier.

<string.h> is included but not needed.

 BIGFile* bigfile = malloc(sizeof(BIGFile));

What do we do if bigfile is a null pointer? It seems we go ahead and use it anyway (more Undefined Behaviour).

A good habit is to apply sizeof to the destination rather than its type; this aids readers when the declaration is far away. For example, in the following case it's much more obviously a correct match:

bigfile = malloc(sizeof *bigfile);

fopen_s(&(bigfile->read), fpath, "rb");

What's the point of using Annex K fopen_s() function if we just ignore the return value? Also, I don't see where we check that __STDC_LIB_EXT1__ is defined nor where we set __STDC_WANT_LIB_EXT1__ to ensure that the function is provided. It seems we'd be better off using plain old fopen() and checking for a null pointer in the normal way.

if (!bigfile->read) {
 printf("Can't open file.\n");
 return 0;
}

At last, we see some error checking. There's a couple of things wrong here:

We're writing to stdout instead of stderr.
We neglected to free() the memory we (possibly) allocated with malloc(), giving a memory leak.

fread(bigfile->magic, BYTESZE, BIG_MAGIC_LEN, bigfile->read);

Again, there's no check that we successfully read as much as we wanted here. And shouldn't we be checking that the file magic matches the expected value, so we don't attempt to parse files not in our expected format?

fseek(bigfile->read, bigfile->offset, SEEK_SET);

fseek() can fail too, you know - there's nothing to say at this stage that we're not reading from a pipe, for example.

byte buf[256];
while (fread(buf, BYTESZE, 256, bigfile->read)) {

Instead of writing that magic number twice (and risking missing one when modifying it), we can automatically use the available size: fread(buf, 1, sizeof buf, bigfile->read). Again, we're ignoring the vitally-important return value.

 while (path[pos]==buf[pos]) {
 pos++;

It looks like we're doing some kind of strcmp() here - why not use the standard function? It certainly looks wrong that we don't initialize pos to 0 before starting this comparison - if that's really intended, it warrants a comment to warn readers.

Thanks! I changed sizeof(char) to 1 shortly after posting. How would I do the conversion correctly? Is (uint) buf[3] << 24 | (uint) buf[2] << 16 | (uint) buf[1] << 8 | (uint) buf[0]; better?
That looks correct. There might be a more eloquent way to write it, but can't think immediately what that might be.

Stack Exchange Network

Simple library for an undocumented archive format

2 Answers 2

Use fixed sized integers

Add error checking

Reduce disk operations

Create a name space

Avoid non-standard functions

Use consistent interface

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Simple library for an undocumented archive format

2 Answers 2

Use fixed sized integers

Add error checking

Reduce disk operations

Create a name space

Avoid non-standard functions

Use consistent interface

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions