I'm trying to read two sufficiently large binary files, comparing them and printing the offset at which they differ. I'm using fread
to read the binary files and memcmp
to compare them.
#include<stdio.h>
#include<time.h>
clock_t start, end;
int main(int argc, char *argv[])
{
double cpu_time_taken;
FILE *fp1, *fp2;
printf("\nArgument count: %d", argc);
printf("\nFile 1 is: %s", argv[1]);
printf("\nFile 2 is: %s\n", argv[2]);
if (argc < 3)
{
printf("\nInsufficient Arguments: \n");
printf("\nHelp:./executable <filename1> <filename2>\n");
return 0;
}
else
{
fp1 = fopen(argv[1], "rb");
if (fp1 == NULL)
{
printf("\nError in opening file %s", argv[1]);
return 0;
}
fp2 = fopen(argv[2], "rb");
if (fp2 == NULL)
{
printf("\nError in opening file %s", argv[2]);
return 0;
}
if ((fp1 != NULL) && (fp2 != NULL))
{
start = clock();
compare_two_binary_files(fp1, fp2);
end = clock();
cpu_time_taken = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("\nTime taken to compare: %f", cpu_time_taken*1000);
}
}
}
int compare_two_binary_files(FILE *fp1, FILE *fp2)
{
char tmp1[16], tmp2[16];
size_t bytes = 0, readsz = sizeof tmp1;
int count = 0;
while (!feof(fp1) || !feof(fp2)){
fread (tmp1, sizeof *tmp1, readsz, fp1);
fread (tmp2, sizeof *tmp2, readsz, fp2);
count += 16;
if(memcmp(tmp1, tmp2, readsz)){
for(int i=0; i < readsz; i++){
printf ("%d: 0x%02x ",i, tmp1[i]);
}
printf("\n%x", count);
return 0;
}
}
}
4 Answers 4
I'm trying to read two sufficiently large binary files, comparing them and printing the offset at which they differ.
compare_two_binary_files()
is faulty for various reasons.
The offset in which they differ can readily exceed the
int
range. Recommend the widest integer available instead. File sizes are not limited by the integer types available, yetuintmax_t
is a very reasonable assumption of being sufficient for a file size.Checking
feof()
only ignores the possibility of a rare input error messing up the compare.compare_two_binary_files()
can return without a returning a value. A compiler warning should have occurred. Enable all warnings or use a better compiler.Ignoring the return value of
fread()
is wrong.if(memcmp(tmp1, tmp2, readsz)){
is questionable. Theif()
is true when the buffer differ.
Suggested alternative:
// Use something more generous than 16. Maybe even 4096 or 64k and allocate buffers
#define CMP_N 256
// Return value:
// 0: files compare equal in content and length, fp1 size saved as offset
// 1: files differ, fp1 longer, fp2 size saved as offset
// 2: files differ, fp2 longer, fp1 size saved as offset
// 3: files differ at offset
// -1: fp1 trouble reading. Unspecified data in offset
// -2: fp2 trouble reading. Unspecified data in offset
int compare_two_binary_files_alternate(FILE *fp1, FILE *fp2, uintmax_t *offset) {
char tmp1[CMP_N], tmp2[CMP_N];
size_t n1, n2;
rewind(fp1); // start at beginning and clear error
rewind(fp2);
*offset = 0;
do {
n1 = fread(tmp1, sizeof *tmp1, sizeof tmp1 / sizeof *tmp1, fp1);
if (n1 == 0 && ferror(fp1)) {
return -1;
}
n2 = fread(tmp2, sizeof *tmp2, sizeof tmp2 / sizeof *tmp2, fp2);
if (n2 == 0 && ferror(fp2)) {
return -2;
}
size_t n_min = n1 < n2 ? n1 : n2;
if (memcmp(tmp1, tmp2, n_min)) { // Quickly find if file contents differ ...
for (size_t i = 0; i < n_min; i++) { // Slowly find where they differ
if (tmp1[i] != tmp2[i]) {
*offset += i;
return 3;
}
}
}
*offset += n_min;
if (n1 > n_min) {
return 1;
}
if (n2 > n_min) {
return 2;
}
} while (n1);
return 0;
}
-
\$\begingroup\$ Is
off_t
standard? It's intended for holding file offsets. \$\endgroup\$Peter Cordes– Peter Cordes2018年09月28日 01:20:42 +00:00Commented Sep 28, 2018 at 1:20 -
3\$\begingroup\$ @PeterCordes
off_t
is not part of the C standard. See stackoverflow.com/q/9073667/2410359 Perhaps you are thinking offpos_t
as used infgetpos()
andfsetpos()
? \$\endgroup\$chux– chux2018年09月28日 01:57:39 +00:00Commented Sep 28, 2018 at 1:57 -
\$\begingroup\$ And if you want to be a bit more efficient than "slowly" you could do a binary search for the difference. E.g. first check the first
COMP_N / 2
bytes, if they differ check the firstCOMP_N / 4
, etc. This will locate the difference in logarithmic rather than linear time which is useful especially with more generous block sizes like 4096 (on average requiring 6 steps instead of 2048) although I'm not sure what the effect of caching would do in the "slow" loop version. \$\endgroup\$CompuChip– CompuChip2018年09月28日 20:07:29 +00:00Commented Sep 28, 2018 at 20:07 -
\$\begingroup\$ @CompuChip Your suggestion is still O(N) - yet perhaps with a linear improvement. N/2 1st iteration + N/4 2nd iteration + N/8 3rd + N/16 4th ... --> N. Given that files are large, efficiency of comparing the last block is likely minor. More significant speed improvements could be had with larger buffers, memory mapped I/O buffer, etc. \$\endgroup\$chux– chux2018年09月28日 20:40:39 +00:00Commented Sep 28, 2018 at 20:40
fread (tmp1, sizeof *tmp1, readsz, fp1); fread (tmp2, sizeof *tmp2, readsz, fp2); count += 16; if(memcmp(tmp1, tmp2, readsz)){ ... }
You are discarding the return values of the fread()
calls, blindly assuming that they both successfully read 16 bytes.
It is unclear what the return value of compare_two_binary_files(..., ...)
means. In fact, you sometimes don't return a value at all. Your compiler should have warned you about that problem.
File I/O should be done a block at a time: 512 bytes, 1024 bytes, or 2048 bytes would be a more reasonable chunk size than 16.
Technically, most of the time is spent waiting for I/O. cpu_time_taken
is a misnomer: what you're measuring is called "wall clock time".
-
\$\begingroup\$ compare_two_binary_files(..., ...) should just print the offset where the difference is found. It should not return any value, but my compiler didn't allow return so I just included an int return type. And could you elaborate on why cpu_time_taken is a misnomer and is there any right way I can measure the time taken to compare the binary files? \$\endgroup\$ampat– ampat2018年09月27日 23:12:11 +00:00Commented Sep 27, 2018 at 23:12
-
\$\begingroup\$ If you have nothing to return, change the return type to
void
. \$\endgroup\$200_success– 200_success2018年09月27日 23:14:18 +00:00Commented Sep 27, 2018 at 23:14 -
2\$\begingroup\$ 4096 would be a good default size. 1 whole page on some common architectures like x86, and also 1 disk physical sector on modern rotational media, and many NAND flash devices. The kernel buffers I/O in normal hosted C implementations that run under a modern OS, but I wouldn't count on C stdio buffering input to do
read()
system calls in page-size chunks. (It might though. Or better, if stdio usesstat(2)
to ask the OS for the optimal I/O block size.) \$\endgroup\$Peter Cordes– Peter Cordes2018年09月28日 01:19:15 +00:00Commented Sep 28, 2018 at 1:19 -
\$\begingroup\$ If the files are "sufficiently large" - I was thinking Gbytes at a minimum, if not Tbytes - then the OP should look at the asynchronous I/O library. \$\endgroup\$jamesqf– jamesqf2018年09月28日 04:39:58 +00:00Commented Sep 28, 2018 at 4:39
if ((fp1 != NULL) && (fp2 != NULL))
test is redundant. If one of them happened to beNULL
, the program is already terminated.Don't print error messages to the
stdout
. There isstderr
for that purpose.When printing to stdout, keep in mind that it is usually line buffered. The text stays in the internal buffer until a newline is printed. That's why it is important to print a newline after the message, not before.
printf("\nError in opening file %s", argv[1]);
doesn't tell the most important part: why didfopen
fail. Printstrerror(errno)
as well. Combining the above bullets,fprintf(stderr, "Error in opening file %s: %s\n", argv[1], strerror(errno));
You'd need to
#include <errno.h>
for that.
-
\$\begingroup\$ "why it is important to print a newline after the message" is not necessarily sufficient. What are the rules of automatic flushing stdout buffer in C?. Consider
fflush(stdout);
. \$\endgroup\$chux– chux2018年09月27日 19:50:02 +00:00Commented Sep 27, 2018 at 19:50 -
1\$\begingroup\$ I like the idea of of informative error messages, yet printing a faulty file name is sometimes itself a problem. At a minimum, it is useful to include sententials to demarcate the iffy file name.
file %s
-->file \"%s\"
. \$\endgroup\$chux– chux2018年09月27日 20:56:49 +00:00Commented Sep 27, 2018 at 20:56 -
\$\begingroup\$ @chux: that's part of why error messages should be printed on stderr, which is unbuffered by default. If
stdout
isn't going to a terminal, full buffering is normally fine. Unless you're writing a shell or something that prompts and waits for input, that someone might want to automate with pipes, you don't needfflush
after printing a newline. \$\endgroup\$Peter Cordes– Peter Cordes2018年09月28日 01:25:26 +00:00Commented Sep 28, 2018 at 1:25 -
\$\begingroup\$ @PeterCordes Agree about "is normally fine". Using
fflush(stdout);
always flushes regardless of implementation aspects of the character, line fully buffered mode or OS/shell properties. And it also does not oblige a change in functionality (adding a final'\n'
). \$\endgroup\$chux– chux2018年09月28日 02:05:44 +00:00Commented Sep 28, 2018 at 2:05 -
\$\begingroup\$ @chux: My point was that for this case, it's clearly better to end the message with a newline and print it to stderr. Either of which would be sufficient for this case, unless abnormal termination that doesn't flush stdout is possible. (e.g.
abort()
.) But both, as suggested by this answer, are better for formatting reasons as well as flushing correctness. \$\endgroup\$Peter Cordes– Peter Cordes2018年09月28日 02:08:31 +00:00Commented Sep 28, 2018 at 2:08
Other answers focus nicely on the comparison function, so I'll just mention a couple things about the start of main.
What does this do when you run the program with no arguments?
printf("\nFile 1 is: %s", argv[1]);
printf("\nFile 2 is: %s\n", argv[2]);
You can't check argv[1]
or argv[2]
if there are not enough arguments. However, you do check that immediately after, so you can just cut-paste them a few lines down.
int main(int argc, char *argv[])
{
double cpu_time_taken;
FILE *fp1, *fp2;
printf("\nArgument count: %d", argc);
if (argc < 3)
{
printf("\nInsufficient Arguments: \n");
printf("\nHelp:./executable <filename1> <filename2>\n");
return 0;
}
else
{
printf("\nFile 1 is: %s", argv[1]);
printf("\nFile 2 is: %s\n", argv[2]);
fp1 = fopen(argv[1], "rb");
//...
}
}
You can simplify it further still - your else
is unnecessary because the if
terminates the program, so it won't run the content of the else
regardless. I find it easier to read when there are fewer indentations, but maybe that's just me. Also, the if
catches an error in running the program, and informs the user. If this is being called from a script, you'd want to return something other than 0, because 0 indicates success. So you can make one more simplification:
int main(int argc, char *argv[])
{
double cpu_time_taken;
FILE *fp1, *fp2;
printf("\nArgument count: %d", argc);
if (argc < 3)
{
printf("\nInsufficient Arguments: \n");
printf("\nHelp:./executable <filename1> <filename2>\n");
return 1;
}
printf("\nFile 1 is: %s", argv[1]);
printf("\nFile 2 is: %s\n", argv[2]);
fp1 = fopen(argv[1], "rb");
//...
}
I echo the advice in other answers about putting the newline at the end of your print, rather than before.
cmp
command? If not, at least look at its source and use that as a basis for your version. \$\endgroup\$printf ("%d: 0x%02x ", count+i, tmp1[i]);
to have global offset of differences. Withi
you see bytes in current block. Also, usesizeof tmp1
instead ofsizeof *tmp1
andcount += sizeof tmp1
. \$\endgroup\$mmap
, but there are situations wheremmap
won't work. Not all operating systems supportmmap
, and on OS where it is supported it isn't all file systems which support it. Moreovermmap
only works on regular files. And if you want to support files larger than your address space, themmap
solution isn't going to be much simpler. \$\endgroup\$mmap
in C. \$\endgroup\$