Sort a binary file without loading it into memory or using a temporary file

Question 1

I had a silly assignment:

You are given a binary file FLOATS.DAT of unknown length, type float data has been written in it. Write a program which sorts the data in the file (the output is a sorted file FLOATS.DAT). You need to sort it without loading the file into memory, or by using a temporary file. Use the Shell sort algorithm.

Obviously by using a temp. file it would be easy, even more if you could load it into memory.

Here is my solution:

#include <stdio.h>
// sizeof(float) = 4 on my machine
int main(int argc, char **argv)
{
 size_t sajz = sizeof(float); // for convenience 
 int n; // total number of floats in the file
 FILE *dat;
 dat = fopen("FLOATS.DAT", "r+b");
 if (dat == NULL)
 {
 printf("Error while opening.\n");
 return 1;
 }
 // find the number of floats in the file
 fseek(dat, 0, SEEK_END);
 n = ftell(dat)/sajz;
 rewind(dat);
 float tjmh, tx;
 int i,j,h;
 // standard shell sort algorithm, the hard part is when I have to
 // read a position I first have to jump to it, then read it,
 // same when writing to a location
 for (h = n/2; h>0; h/=2)
 {
 for (i=h; i<n; i++)
 {
 j = i;
 fseek(dat, i*sajz, SEEK_SET);
 fread(&tx, sajz, 1, dat);
 rewind(dat);
 fseek(dat, (j-h)*sajz, SEEK_SET);
 fread(&tjmh, sajz, 1, dat);
 rewind(dat);
 for (; j>=h && tx < tjmh; j -= h)
 {
 fseek(dat, j*sajz, SEEK_SET);
 fwrite(&tjmh, sajz, 1, dat);
 rewind(dat);
 // this was tricky to figure out
 // I need the value of `j` after it has been
 // subtracted from h
 fseek(dat, (j - 2*h)*sajz, SEEK_SET);
 fread(&tjmh, sajz, 1, dat);
 rewind(dat);
 }
 fseek(dat, j*sajz, SEEK_SET);
 fwrite(&tx, sajz, 1, dat);
 rewind(dat);
 }
 }
 fclose(dat);
 return 0;
}

Can this be improved in any way?

Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0

int main()
{
 FILE *dat;
 dat = fopen("FLOATS.DAT", "wb");
 float a;
 if (dat)
 {
 a = 2.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 3.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 10.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 4.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 1.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 7.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 9.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 5.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 6.00;
 fwrite(&a, sizeof(float), 1, dat);
 a = 8.00;
 fwrite(&a, sizeof(float), 1, dat);
 fclose(dat);
 }
 return 0;
}

You can use an array here as well.

Question 2

Where did this assignment come from? Would mmap() be considered "loading the file into memory"?

Question 3

Of course, in reality if one is "given [implying by someone else] a binary file FLOATS.DAT of unknown length, type float data", one first has to determine whether it's possible to unambiguously determine the non-portable bit representation that was used to write it, otherwise it's of no use... and even if you can figure that out, your system might differ, meaning you'd have to convert to its representation, which isn't necessarily trivial. All of which supports your conclusion that - like so many others - it's a "silly assignment". ;)

Question 4

@underscore_d: in practice, the IEEE-754 floating point representation is near-universal. gcc defines macros for float endianness, but IDK if that's an issue on real CPUs the way integer endianness is.

Question 5

@underscore_d: I have written some code that transforms native floats to/from IEEE binary representation using standard functions. github.com/Loki-Astari/ThorsSerializer/blob/master/src/… (of course this is a noop if your system is already IEEE compliant). net2HostIEEE() and host2NetIEEE()

Question 6

@PeterCordes: Actually it's not as simple as that. Unfortunately lots of systems use 80 bit floats (because that is a standard size of FPU that is installed). If you use "double" on these systems this gets truncated down to the standard IEEE Double precision (ie 64 bit) value when saved to a memory location. But if you use a "long double" this does not get transformed into the IEEE Quad precision (ie 128 bit value) it is usually stored as 80 bit zero padded. This padded value is not IEEE Interchange format. Unless you start flipping some compiler floating point flags.

Question 7

Don't Repeat Yourself (DRY)

 fseek(dat, i*sajz, SEEK_SET);
 fread(&tx, sajz, 1, dat);
 rewind(dat);

You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the fwrite version). Make a function:

void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
 fseek(source, position * size, SEEK_SET);
 fread(datum, size, 1, source);
 rewind(source);
}

I made datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type.

Now you can rewrite the original section (and the other four) like

 read_at_position(dat, i, sajz, &tx);

If you're worried about the overhead, you could create a macro instead

#define READ_AT_POSITION(source, position, size, datum) \
 do {\
 fseek(source, (position) * (size), SEEK_SET);\
 fread(datum, size, 1, source);\
 rewind(source);\
 } while (0)

which you'd use as

 READ_AT_POSITION(dat, i, sajz, &tx);

And the compiler will turn it into your original version.

Note that a good compiler will probably do that with the function as well.

Better names

The only way that I know what sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz.

Similar problem with tjmh and tx. Neither name tells me anything about what they do.

Consider changing h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it?

Reusability

Rather than writing this directly into main consider making a function. E.g.

void sort_floats_in_file(FILE *source, size_t datum_size);

You could generalize this by passing in a comparison function.

void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));

However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.

Robustness

You say

 size_t sajz = sizeof(float); // for convenience

And later

 float tjmh, tx;

Consider flipping the order around and saying

 float tjmh, tx;
 size_t sajz = sizeof tjmh; // for convenience

Then if you change the data type of tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place.

Similarly,

 float a;
 if (dat)
 {
 a = 2.00;
 fwrite(&a, sizeof(float), 1, dat);

Consider rewriting this as

 if (dat)
 {
 test_write(2.00, dat);

with

void test_write(float a, FILE *dat)
{
 fwrite(&a, sizeof a, 1, dat);
}

Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.

Question 8

Hey, thanks for the write up! Robustnes, Reusability Since this was just a test assignment those factor were not crucial. But your points are still very valid. I especially like the part about the macro. Better names English is not my first language, so sajz kind of appropriate, I should have changed it before posting, sorry. Also, tj was j temporary and tjmh was j minus h temporary since I needed values at those positions (a comment would have been helpful I admit).

Question 9

Robustness No particular comment here, all great advice. Especially the sizeof(variable) instead of sizeof(float). I actually use this, but didn't use it here.

Question 10

Also, why did you use the name datum for the variable name?

Question 11

@smrdo_prdo datum is the singular form of data. It's just a moderately generic name for a single element. I prefer it when dealing with the value. I'd use element for a structural element, e.g. a node in a linked list.

Question 12

@user6245072 No. Most compilers actually ignore inline and make their own decisions about what to inline and what not. The inline keyword is used to indicate that identical function prototypes should be ignored rather than cause a compiler error. Even prior to that, compilers have always been able to choose to inline a function if their rules found it best. The only kind of function that it can't inline is one that is defined in another compilation unit.

Question 13

Actually loads file into memory

By virtue of using fread and fwrite, you are using buffered I/O and you are actually loading a large part of the file into memory. In fact, for your small test case, the entire file will be loaded into memory on the first fread. To avoid this, you either need to:

Use read() and write() instead.
Turn buffering off using setvbuf().

Be kind, rewind?

Your calls to rewind() are unnecessary because you always use fseek(..., SEEK_SET) which sets the file position from the start of the file. So you can just remove all your calls to rewind().

Question 14

There's a tight upper bound on how much total memory will be used at any one time by stdio buffers. So this is an in-place sort that uses O(1) space to sort an arbitrary sized file. That should satisfy the spirit of the assignment. If you want to avoid ending up with the whole file in memory (in the pagecache) you'd need to open(..., O_RDRW|O_DIRECT) to use direct I/O (problematic since IIRC you can only read / write whole pages), or use madvise(MADV_DONTNEED) to flush the kernel's disk cache. Upvoted for rewind, though.

Question 15

Unclear why suggest read() instead of fread() as post is not system specific and read() is not in the standard C library. Suppose read() is likely to exist.

Question 16

I figured out the part about rewind after I posted this.

Question 17

@smrdo_prdo: I wouldn't worry about it. Just let stdio use its small buffers. The point of the assignment is that the program doesn't need to load the whole data set into memory, not that it actively avoids doing so even when it's so small it happens by accident.

Question 18

@smrdo_prdo: 1. Yes, it's of course possible to disable input and output buffering for a stream, and JS1's answer says how. 3. Your program really does load the whole file into memory, but only for small inputs. IDK where you got the "no RAM" idea. 3. No, it's not possible to implement C on something that doesn't have RAM or something analogous. C requires being able to take the address of almost anything. (A CPU can have all the RAM it needs built-in, on the same silicon, so it doesn't need any external DRAM chips, but that's not the same thing).

Question 19

Additions to previous 2 fine answers

Check I/O results. Robust check for errors, especially I/O. Code does use if (dat == NULL), which is good. Consider checking other file I/O.

size_t sajz = sizeof(float); // for convenience 
int n; // total number of floats in the file
... 
// fseek(dat, 0, SEEK_END);
// n = ftell(dat)/sajz;
// rewind(dat);
if (fseek(dat, 0, SEEK_END)) Handle_Error();
long pos = ftell(dat);
if (pos == -1) Handle_Error();
n = pos/sajz;
rewind(dat); // No error return

Use long rather than int. Code is limited to files LONG_MAX or less in size as code uses long ftell(). Yet code mixes int with long with various calculations. This only makes a difference when 1) INT_MAX < LONG_MAX, and 2) the file is huge. Yet this is precisely the scenario to use an in-place sort: when the file is huge.
```
// int n; // total number of floats in the file
// ...
// int i,j,h;
long n; // total number of floats in the file
...
long i,j,h;
```
Hidden compare: tx < tjmh in for (; j>=h && tx < tjmh; j -= h). Suggest un-hiding it. The central computation of the entire code is this compare.
```
// for (; j>=h && tx < tjmh; j -= h)
for (; j>=h && cmp(tx, tjmh); j -= h) {
// or 
for (; j>=h; j -= h) {
 if (tx < tjmh) break;
```

Consider that code may want another compare. Example: With typical floating point, Not-a-number and -0.0 may occur. How will your code sort that, if < is not to coding goals? Perhaps a helper function.

int cmp_lt(float f1, float f2) {
 // first 3 if's return a result when f1, f1 are well ordered. (Neither NaN)
 if (f1 < f2) return 1;
 if (f1 > f2) return 0;
 if (f1 == f2) {
 if (f1 == 0.0f && (signbit(f1) != signbit(f2)) {
 // f1 and f2 are both some 0.0 (one + and one -)
 // let -0.0 be "less than" +0.0
 return signbit(f1) == 0;
 }
 // Consider then equal
 return 0;
 }
 // at least 1 of f1,f2 is a NaN
 return isnan(f2); // Consider f1 "less than" f2 when f2 is NaN
}
for (; j>=h && cmp_lt(tx, tjmh); j -= h) {

Question 20

I understand points 1 and 2. Can you explain what do you mean by hidden compare? For point 4 I have to read and learn a bit. Thanks!

Question 21

@smrdo_prdo The point of code is to sort a file. The important code to sorting is the compare. The code tx < tjmh should be highlighted or made more obvious. When looking at the code, it was "hidden" in that it takes some time to find.

Question 22

Ohhh, I see, got it. Never came across that before (nor would I think that by myself) . Thanks again!

mdfst13 22.4k6 gold badges34 silver badges70 bronze badges · Accepted Answer · 2016-07-19 03:02:38Z

Don't Repeat Yourself (DRY)

 fseek(dat, i*sajz, SEEK_SET);
 fread(&tx, sajz, 1, dat);
 rewind(dat);

You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the fwrite version). Make a function:

void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
 fseek(source, position * size, SEEK_SET);
 fread(datum, size, 1, source);
 rewind(source);
}

I made datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type.

Now you can rewrite the original section (and the other four) like

 read_at_position(dat, i, sajz, &tx);

If you're worried about the overhead, you could create a macro instead

#define READ_AT_POSITION(source, position, size, datum) \
 do {\
 fseek(source, (position) * (size), SEEK_SET);\
 fread(datum, size, 1, source);\
 rewind(source);\
 } while (0)

which you'd use as

 READ_AT_POSITION(dat, i, sajz, &tx);

And the compiler will turn it into your original version.

Note that a good compiler will probably do that with the function as well.

Better names

The only way that I know what sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz.

Similar problem with tjmh and tx. Neither name tells me anything about what they do.

Consider changing h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it?

Reusability

Rather than writing this directly into main consider making a function. E.g.

void sort_floats_in_file(FILE *source, size_t datum_size);

You could generalize this by passing in a comparison function.

void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));

However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.

Robustness

You say

 size_t sajz = sizeof(float); // for convenience

And later

 float tjmh, tx;

Consider flipping the order around and saying

 float tjmh, tx;
 size_t sajz = sizeof tjmh; // for convenience

Then if you change the data type of tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place.

Similarly,

 float a;
 if (dat)
 {
 a = 2.00;
 fwrite(&a, sizeof(float), 1, dat);

Consider rewriting this as

 if (dat)
 {
 test_write(2.00, dat);

with

void test_write(float a, FILE *dat)
{
 fwrite(&a, sizeof a, 1, dat);
}

Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.

Hey, thanks for the write up! Robustnes, Reusability Since this was just a test assignment those factor were not crucial. But your points are still very valid. I especially like the part about the macro. Better names English is not my first language, so sajz kind of appropriate, I should have changed it before posting, sorry. Also, tj was j temporary and tjmh was j minus h temporary since I needed values at those positions (a comment would have been helpful I admit).
Robustness No particular comment here, all great advice. Especially the sizeof(variable) instead of sizeof(float). I actually use this, but didn't use it here.
Also, why did you use the name datum for the variable name?
@smrdo_prdo datum is the singular form of data. It's just a moderately generic name for a single element. I prefer it when dealing with the value. I'd use element for a structural element, e.g. a node in a linked list.
@user6245072 No. Most compilers actually ignore inline and make their own decisions about what to inline and what not. The inline keyword is used to indicate that identical function prototypes should be ignored rather than cause a compiler error. Even prior to that, compilers have always been able to choose to inline a function if their rules found it best. The only kind of function that it can't inline is one that is defined in another compilation unit.

Stack Exchange Network

Sort a binary file without loading it into memory or using a temporary file

3 Answers 3

Don't Repeat Yourself (DRY)

Better names

Reusability

Robustness

Actually loads file into memory

Be kind, rewind?

You must log in to answer this question.

Hot Network Questions

Sort a binary file without loading it into memory or using a temporary file

3 Answers 3

Don't Repeat Yourself (DRY)

Better names

Reusability

Robustness

Actually loads file into memory

Be kind, rewind?

You must log in to answer this question.

Related

Hot Network Questions