Comparing two large binary files in C

Question 1

I'm trying to read two sufficiently large binary files, comparing them and printing the offset at which they differ. I'm using fread to read the binary files and memcmp to compare them.

#include<stdio.h>
#include<time.h>
clock_t start, end;
int main(int argc, char *argv[])
{
 double cpu_time_taken;
 FILE *fp1, *fp2;
 printf("\nArgument count: %d", argc);
 printf("\nFile 1 is: %s", argv[1]);
 printf("\nFile 2 is: %s\n", argv[2]);
 if (argc < 3)
 {
 printf("\nInsufficient Arguments: \n");
 printf("\nHelp:./executable <filename1> <filename2>\n");
 return 0;
 }
 else
 {
 fp1 = fopen(argv[1], "rb");
 if (fp1 == NULL)
 {
 printf("\nError in opening file %s", argv[1]);
 return 0;
 }
 fp2 = fopen(argv[2], "rb");
 if (fp2 == NULL)
 {
 printf("\nError in opening file %s", argv[2]);
 return 0;
 }
 if ((fp1 != NULL) && (fp2 != NULL))
 {
 start = clock();
 compare_two_binary_files(fp1, fp2);
 end = clock();
 cpu_time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
 printf("\nTime taken to compare: %f", cpu_time_taken*1000);
 }
 }
}
int compare_two_binary_files(FILE *fp1, FILE *fp2)
{
 char tmp1[16], tmp2[16];
 size_t bytes = 0, readsz = sizeof tmp1;
 int count = 0;
 while (!feof(fp1) || !feof(fp2)){
 fread (tmp1, sizeof *tmp1, readsz, fp1);
 fread (tmp2, sizeof *tmp2, readsz, fp2);
 count += 16;
 if(memcmp(tmp1, tmp2, readsz)){
 for(int i=0; i < readsz; i++){
 printf ("%d: 0x%02x ",i, tmp1[i]);
 }
 printf("\n%x", count);
 return 0;
 }
 }
}

Question 2

Why don't you mmap them: gnu.org/software/libc/manual/html_node/… ?

Question 3

Have you considered using a shell script to wrap the standard cmp command? If not, at least look at its source and use that as a basis for your version.

Question 4

Better printf ("%d: 0x%02x ", count+i, tmp1[i]); to have global offset of differences. With i you see bytes in current block. Also, use sizeof tmp1 instead of sizeof *tmp1 and count += sizeof tmp1.

Question 5

@N74 It is certainly worth considering mmap, but there are situations where mmap won't work. Not all operating systems support mmap, and on OS where it is supported it isn't all file systems which support it. Moreover mmap only works on regular files. And if you want to support files larger than your address space, the mmap solution isn't going to be much simpler.

Question 6

@N74 The most important reason is that there is no mmap in C.

Question 7

I'm trying to read two sufficiently large binary files, comparing them and printing the offset at which they differ.

compare_two_binary_files() is faulty for various reasons.

The offset in which they differ can readily exceed the int range. Recommend the widest integer available instead. File sizes are not limited by the integer types available, yet uintmax_t is a very reasonable assumption of being sufficient for a file size.
Checking feof() only ignores the possibility of a rare input error messing up the compare.
compare_two_binary_files() can return without a returning a value. A compiler warning should have occurred. Enable all warnings or use a better compiler.
Ignoring the return value of fread() is wrong.
if(memcmp(tmp1, tmp2, readsz)){ is questionable. The if() is true when the buffer differ.

Suggested alternative:

// Use something more generous than 16. Maybe even 4096 or 64k and allocate buffers
#define CMP_N 256
// Return value:
// 0: files compare equal in content and length, fp1 size saved as offset
// 1: files differ, fp1 longer, fp2 size saved as offset
// 2: files differ, fp2 longer, fp1 size saved as offset
// 3: files differ at offset
// -1: fp1 trouble reading. Unspecified data in offset
// -2: fp2 trouble reading. Unspecified data in offset
int compare_two_binary_files_alternate(FILE *fp1, FILE *fp2, uintmax_t *offset) {
 char tmp1[CMP_N], tmp2[CMP_N];
 size_t n1, n2;
 rewind(fp1); // start at beginning and clear error
 rewind(fp2);
 *offset = 0;
 do {
 n1 = fread(tmp1, sizeof *tmp1, sizeof tmp1 / sizeof *tmp1, fp1);
 if (n1 == 0 && ferror(fp1)) {
 return -1;
 }
 n2 = fread(tmp2, sizeof *tmp2, sizeof tmp2 / sizeof *tmp2, fp2);
 if (n2 == 0 && ferror(fp2)) {
 return -2;
 }
 size_t n_min = n1 < n2 ? n1 : n2;
 if (memcmp(tmp1, tmp2, n_min)) { // Quickly find if file contents differ ...
 for (size_t i = 0; i < n_min; i++) { // Slowly find where they differ
 if (tmp1[i] != tmp2[i]) {
 *offset += i;
 return 3;
 }
 }
 }
 *offset += n_min;
 if (n1 > n_min) {
 return 1;
 }
 if (n2 > n_min) {
 return 2;
 }
 } while (n1);
 return 0;
}

Question 8

Is off_t standard? It's intended for holding file offsets.

Question 9

@PeterCordes off_t is not part of the C standard. See stackoverflow.com/q/9073667/2410359 Perhaps you are thinking of fpos_t as used in fgetpos() and fsetpos()?

Question 10

And if you want to be a bit more efficient than "slowly" you could do a binary search for the difference. E.g. first check the first COMP_N / 2 bytes, if they differ check the first COMP_N / 4, etc. This will locate the difference in logarithmic rather than linear time which is useful especially with more generous block sizes like 4096 (on average requiring 6 steps instead of 2048) although I'm not sure what the effect of caching would do in the "slow" loop version.

Question 11

@CompuChip Your suggestion is still O(N) - yet perhaps with a linear improvement. N/2 1st iteration + N/4 2nd iteration + N/8 3rd + N/16 4th ... --> N. Given that files are large, efficiency of comparing the last block is likely minor. More significant speed improvements could be had with larger buffers, memory mapped I/O buffer, etc.

Question 12

fread (tmp1, sizeof *tmp1, readsz, fp1);
fread (tmp2, sizeof *tmp2, readsz, fp2);
count += 16;
if(memcmp(tmp1, tmp2, readsz)){
 ...
}

You are discarding the return values of the fread() calls, blindly assuming that they both successfully read 16 bytes.

It is unclear what the return value of compare_two_binary_files(..., ...) means. In fact, you sometimes don't return a value at all. Your compiler should have warned you about that problem.

File I/O should be done a block at a time: 512 bytes, 1024 bytes, or 2048 bytes would be a more reasonable chunk size than 16.

Technically, most of the time is spent waiting for I/O. cpu_time_taken is a misnomer: what you're measuring is called "wall clock time".

Question 13

compare_two_binary_files(..., ...) should just print the offset where the difference is found. It should not return any value, but my compiler didn't allow return so I just included an int return type. And could you elaborate on why cpu_time_taken is a misnomer and is there any right way I can measure the time taken to compare the binary files?

Question 14

If you have nothing to return, change the return type to void.

Question 15

4096 would be a good default size. 1 whole page on some common architectures like x86, and also 1 disk physical sector on modern rotational media, and many NAND flash devices. The kernel buffers I/O in normal hosted C implementations that run under a modern OS, but I wouldn't count on C stdio buffering input to do read() system calls in page-size chunks. (It might though. Or better, if stdio uses stat(2) to ask the OS for the optimal I/O block size.)

Question 16

If the files are "sufficiently large" - I was thinking Gbytes at a minimum, if not Tbytes - then the OP should look at the asynchronous I/O library.

Question 17

if ((fp1 != NULL) && (fp2 != NULL)) test is redundant. If one of them happened to be NULL, the program is already terminated.
Don't print error messages to the stdout. There is stderr for that purpose.
When printing to stdout, keep in mind that it is usually line buffered. The text stays in the internal buffer until a newline is printed. That's why it is important to print a newline after the message, not before.
printf("\nError in opening file %s", argv[1]); doesn't tell the most important part: why did fopen fail. Print strerror(errno) as well. Combining the above bullets,
```
 fprintf(stderr, "Error in opening file %s: %s\n", argv[1], strerror(errno));
```
You'd need to #include <errno.h> for that.

Question 18

"why it is important to print a newline after the message" is not necessarily sufficient. What are the rules of automatic flushing stdout buffer in C?. Consider fflush(stdout);.

Question 19

I like the idea of of informative error messages, yet printing a faulty file name is sometimes itself a problem. At a minimum, it is useful to include sententials to demarcate the iffy file name. file %s --> file \"%s\".

Question 20

@chux: that's part of why error messages should be printed on stderr, which is unbuffered by default. If stdout isn't going to a terminal, full buffering is normally fine. Unless you're writing a shell or something that prompts and waits for input, that someone might want to automate with pipes, you don't need fflush after printing a newline.

Question 21

@PeterCordes Agree about "is normally fine". Using fflush(stdout); always flushes regardless of implementation aspects of the character, line fully buffered mode or OS/shell properties. And it also does not oblige a change in functionality (adding a final '\n').

Question 22

@chux: My point was that for this case, it's clearly better to end the message with a newline and print it to stderr. Either of which would be sufficient for this case, unless abnormal termination that doesn't flush stdout is possible. (e.g. abort().) But both, as suggested by this answer, are better for formatting reasons as well as flushing correctness.

Question 23

Other answers focus nicely on the comparison function, so I'll just mention a couple things about the start of main.

What does this do when you run the program with no arguments?

printf("\nFile 1 is: %s", argv[1]);
printf("\nFile 2 is: %s\n", argv[2]);

You can't check argv[1] or argv[2] if there are not enough arguments. However, you do check that immediately after, so you can just cut-paste them a few lines down.

int main(int argc, char *argv[])
{
 double cpu_time_taken;
 FILE *fp1, *fp2;
 printf("\nArgument count: %d", argc);
 if (argc < 3)
 {
 printf("\nInsufficient Arguments: \n");
 printf("\nHelp:./executable <filename1> <filename2>\n");
 return 0;
 }
 else
 {
 printf("\nFile 1 is: %s", argv[1]);
 printf("\nFile 2 is: %s\n", argv[2]);
 fp1 = fopen(argv[1], "rb");
 //...
 }
}

You can simplify it further still - your else is unnecessary because the if terminates the program, so it won't run the content of the else regardless. I find it easier to read when there are fewer indentations, but maybe that's just me. Also, the if catches an error in running the program, and informs the user. If this is being called from a script, you'd want to return something other than 0, because 0 indicates success. So you can make one more simplification:

int main(int argc, char *argv[])
{
 double cpu_time_taken;
 FILE *fp1, *fp2;
 printf("\nArgument count: %d", argc);
 if (argc < 3)
 {
 printf("\nInsufficient Arguments: \n");
 printf("\nHelp:./executable <filename1> <filename2>\n");
 return 1;
 }
 printf("\nFile 1 is: %s", argv[1]);
 printf("\nFile 2 is: %s\n", argv[2]);
 fp1 = fopen(argv[1], "rb");
 //...
}

I echo the advice in other answers about putting the newline at the end of your print, rather than before.

chux chux 36.4k2 gold badges43 silver badges96 bronze badges · Accepted Answer · 2018-09-27 20:30:03Z

I'm trying to read two sufficiently large binary files, comparing them and printing the offset at which they differ.

compare_two_binary_files() is faulty for various reasons.

The offset in which they differ can readily exceed the int range. Recommend the widest integer available instead. File sizes are not limited by the integer types available, yet uintmax_t is a very reasonable assumption of being sufficient for a file size.
Checking feof() only ignores the possibility of a rare input error messing up the compare.
compare_two_binary_files() can return without a returning a value. A compiler warning should have occurred. Enable all warnings or use a better compiler.
Ignoring the return value of fread() is wrong.
if(memcmp(tmp1, tmp2, readsz)){ is questionable. The if() is true when the buffer differ.

Suggested alternative:

// Use something more generous than 16. Maybe even 4096 or 64k and allocate buffers
#define CMP_N 256
// Return value:
// 0: files compare equal in content and length, fp1 size saved as offset
// 1: files differ, fp1 longer, fp2 size saved as offset
// 2: files differ, fp2 longer, fp1 size saved as offset
// 3: files differ at offset
// -1: fp1 trouble reading. Unspecified data in offset
// -2: fp2 trouble reading. Unspecified data in offset
int compare_two_binary_files_alternate(FILE *fp1, FILE *fp2, uintmax_t *offset) {
 char tmp1[CMP_N], tmp2[CMP_N];
 size_t n1, n2;
 rewind(fp1); // start at beginning and clear error
 rewind(fp2);
 *offset = 0;
 do {
 n1 = fread(tmp1, sizeof *tmp1, sizeof tmp1 / sizeof *tmp1, fp1);
 if (n1 == 0 && ferror(fp1)) {
 return -1;
 }
 n2 = fread(tmp2, sizeof *tmp2, sizeof tmp2 / sizeof *tmp2, fp2);
 if (n2 == 0 && ferror(fp2)) {
 return -2;
 }
 size_t n_min = n1 < n2 ? n1 : n2;
 if (memcmp(tmp1, tmp2, n_min)) { // Quickly find if file contents differ ...
 for (size_t i = 0; i < n_min; i++) { // Slowly find where they differ
 if (tmp1[i] != tmp2[i]) {
 *offset += i;
 return 3;
 }
 }
 }
 *offset += n_min;
 if (n1 > n_min) {
 return 1;
 }
 if (n2 > n_min) {
 return 2;
 }
 } while (n1);
 return 0;
}

Is off_t standard? It's intended for holding file offsets.
@PeterCordes off_t is not part of the C standard. See stackoverflow.com/q/9073667/2410359 Perhaps you are thinking of fpos_t as used in fgetpos() and fsetpos()?
And if you want to be a bit more efficient than "slowly" you could do a binary search for the difference. E.g. first check the first COMP_N / 2 bytes, if they differ check the first COMP_N / 4, etc. This will locate the difference in logarithmic rather than linear time which is useful especially with more generous block sizes like 4096 (on average requiring 6 steps instead of 2048) although I'm not sure what the effect of caching would do in the "slow" loop version.
@CompuChip Your suggestion is still O(N) - yet perhaps with a linear improvement. N/2 1st iteration + N/4 2nd iteration + N/8 3rd + N/16 4th ... --> N. Given that files are large, efficiency of comparing the last block is likely minor. More significant speed improvements could be had with larger buffers, memory mapped I/O buffer, etc.

Stack Exchange Network

Comparing two large binary files in C

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Comparing two large binary files in C

4 Answers 4

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions