I had a silly assignment:
You are given a binary file FLOATS.DAT of unknown length, type
floatdata has been written in it. Write a program which sorts the data in the file (the output is a sorted file FLOATS.DAT). You need to sort it without loading the file into memory, or by using a temporary file. Use the Shell sort algorithm.
Obviously by using a temp. file it would be easy, even more if you could load it into memory.
Here is my solution:
#include <stdio.h>
// sizeof(float) = 4 on my machine
int main(int argc, char **argv)
{
size_t sajz = sizeof(float); // for convenience
int n; // total number of floats in the file
FILE *dat;
dat = fopen("FLOATS.DAT", "r+b");
if (dat == NULL)
{
printf("Error while opening.\n");
return 1;
}
// find the number of floats in the file
fseek(dat, 0, SEEK_END);
n = ftell(dat)/sajz;
rewind(dat);
float tjmh, tx;
int i,j,h;
// standard shell sort algorithm, the hard part is when I have to
// read a position I first have to jump to it, then read it,
// same when writing to a location
for (h = n/2; h>0; h/=2)
{
for (i=h; i<n; i++)
{
j = i;
fseek(dat, i*sajz, SEEK_SET);
fread(&tx, sajz, 1, dat);
rewind(dat);
fseek(dat, (j-h)*sajz, SEEK_SET);
fread(&tjmh, sajz, 1, dat);
rewind(dat);
for (; j>=h && tx < tjmh; j -= h)
{
fseek(dat, j*sajz, SEEK_SET);
fwrite(&tjmh, sajz, 1, dat);
rewind(dat);
// this was tricky to figure out
// I need the value of `j` after it has been
// subtracted from h
fseek(dat, (j - 2*h)*sajz, SEEK_SET);
fread(&tjmh, sajz, 1, dat);
rewind(dat);
}
fseek(dat, j*sajz, SEEK_SET);
fwrite(&tx, sajz, 1, dat);
rewind(dat);
}
}
fclose(dat);
return 0;
}
Can this be improved in any way?
Code snippet to generate the file with numbers 2.0, 3.0, 10.0, 4.0, 1.0, 7.0, 9.0, 5.0, 6.0, 8.0
int main()
{
FILE *dat;
dat = fopen("FLOATS.DAT", "wb");
float a;
if (dat)
{
a = 2.00;
fwrite(&a, sizeof(float), 1, dat);
a = 3.00;
fwrite(&a, sizeof(float), 1, dat);
a = 10.00;
fwrite(&a, sizeof(float), 1, dat);
a = 4.00;
fwrite(&a, sizeof(float), 1, dat);
a = 1.00;
fwrite(&a, sizeof(float), 1, dat);
a = 7.00;
fwrite(&a, sizeof(float), 1, dat);
a = 9.00;
fwrite(&a, sizeof(float), 1, dat);
a = 5.00;
fwrite(&a, sizeof(float), 1, dat);
a = 6.00;
fwrite(&a, sizeof(float), 1, dat);
a = 8.00;
fwrite(&a, sizeof(float), 1, dat);
fclose(dat);
}
return 0;
}
You can use an array here as well.
3 Answers 3
Don't Repeat Yourself (DRY)
fseek(dat, i*sajz, SEEK_SET); fread(&tx, sajz, 1, dat); rewind(dat);
You do this three times. The general rule is first time write, second time copy, third time refactor into a common function. Note that if there is sufficient commonality, you may create the common function with only two uses (e.g. with the fwrite version). Make a function:
void read_at_position(FILE *source, size_t position, size_t size, void *datum)
{
fseek(source, position * size, SEEK_SET);
fread(datum, size, 1, source);
rewind(source);
}
I made datum a void * because that's what fread takes. This way you don't have to write a new function if you change the datum type.
Now you can rewrite the original section (and the other four) like
read_at_position(dat, i, sajz, &tx);
If you're worried about the overhead, you could create a macro instead
#define READ_AT_POSITION(source, position, size, datum) \
do {\
fseek(source, (position) * (size), SEEK_SET);\
fread(datum, size, 1, source);\
rewind(source);\
} while (0)
which you'd use as
READ_AT_POSITION(dat, i, sajz, &tx);
And the compiler will turn it into your original version.
Note that a good compiler will probably do that with the function as well.
Better names
The only way that I know what sajz and dat do is from looking at their declarations. Good names should balance being easy to write (e.g. size) with being descriptive (size_of_each_datum or size_of_float). Perhaps datum_size instead of sajz.
Similar problem with tjmh and tx. Neither name tells me anything about what they do.
Consider changing h to gap_size. Yes, h is a standard name in a shellsort (as per Wikipedia), but will most people remember that if they see it?
Reusability
Rather than writing this directly into main consider making a function. E.g.
void sort_floats_in_file(FILE *source, size_t datum_size);
You could generalize this by passing in a comparison function.
void sort_in_file(FILE *source, size_t datum_size, int (*compar)(const void *, const void*));
However, that can add unnecessary overhead in simple cases like this. Perhaps the compiler takes care of that. Perhaps not.
Robustness
You say
size_t sajz = sizeof(float); // for convenience
And later
float tjmh, tx;
Consider flipping the order around and saying
float tjmh, tx;
size_t sajz = sizeof tjmh; // for convenience
Then if you change the data type of tjmh and tx, sajz will change to match automatically. As is, it would be easy to forget and only change in the one place.
Similarly,
float a; if (dat) { a = 2.00; fwrite(&a, sizeof(float), 1, dat);
Consider rewriting this as
if (dat)
{
test_write(2.00, dat);
with
void test_write(float a, FILE *dat)
{
fwrite(&a, sizeof a, 1, dat);
}
Not only is this less typing, it avoids the problem of changing the type in one place and not the other. Admittedly there's some extra work to write the function, but the reduction per test case should cover it.
-
\$\begingroup\$ Hey, thanks for the write up! Robustnes, Reusability Since this was just a test assignment those factor were not crucial. But your points are still very valid. I especially like the part about the macro. Better names English is not my first language, so
sajzkind of appropriate, I should have changed it before posting, sorry. Also,tjwasj temporaryandtjmhwasj minus h temporarysince I needed values at those positions (a comment would have been helpful I admit). \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 12:49:55 +00:00Commented Jul 20, 2016 at 12:49 -
\$\begingroup\$ Robustness No particular comment here, all great advice. Especially the
sizeof(variable)instead ofsizeof(float). I actually use this, but didn't use it here. \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 12:52:38 +00:00Commented Jul 20, 2016 at 12:52 -
\$\begingroup\$ Also, why did you use the name
datumfor the variable name? \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 13:02:03 +00:00Commented Jul 20, 2016 at 13:02 -
1\$\begingroup\$ @smrdo_prdo
datumis the singular form ofdata. It's just a moderately generic name for a single element. I prefer it when dealing with the value. I'd useelementfor a structural element, e.g. a node in a linked list. \$\endgroup\$mdfst13– mdfst132016年07月20日 14:20:57 +00:00Commented Jul 20, 2016 at 14:20 -
1\$\begingroup\$ @user6245072 No. Most compilers actually ignore
inlineand make their own decisions about what to inline and what not. Theinlinekeyword is used to indicate that identical function prototypes should be ignored rather than cause a compiler error. Even prior to that, compilers have always been able to choose to inline a function if their rules found it best. The only kind of function that it can't inline is one that is defined in another compilation unit. \$\endgroup\$mdfst13– mdfst132016年07月20日 23:53:28 +00:00Commented Jul 20, 2016 at 23:53
Actually loads file into memory
By virtue of using fread and fwrite, you are using buffered I/O and you are actually loading a large part of the file into memory. In fact, for your small test case, the entire file will be loaded into memory on the first fread. To avoid this, you either need to:
- Use
read()andwrite()instead. - Turn buffering off using
setvbuf().
Be kind, rewind?
Your calls to rewind() are unnecessary because you always use fseek(..., SEEK_SET) which sets the file position from the start of the file. So you can just remove all your calls to rewind().
-
1\$\begingroup\$ There's a tight upper bound on how much total memory will be used at any one time by stdio buffers. So this is an in-place sort that uses O(1) space to sort an arbitrary sized file. That should satisfy the spirit of the assignment. If you want to avoid ending up with the whole file in memory (in the pagecache) you'd need to
open(..., O_RDRW|O_DIRECT)to use direct I/O (problematic since IIRC you can only read / write whole pages), or usemadvise(MADV_DONTNEED)to flush the kernel's disk cache. Upvoted for rewind, though. \$\endgroup\$Peter Cordes– Peter Cordes2016年07月19日 10:47:33 +00:00Commented Jul 19, 2016 at 10:47 -
3\$\begingroup\$ Unclear why suggest
read()instead offread()as post is not system specific andread()is not in the standard C library. Supposeread()is likely to exist. \$\endgroup\$chux– chux2016年07月19日 21:26:57 +00:00Commented Jul 19, 2016 at 21:26 -
\$\begingroup\$ I figured out the part about
rewindafter I posted this. \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 12:53:17 +00:00Commented Jul 20, 2016 at 12:53 -
1\$\begingroup\$ @smrdo_prdo: I wouldn't worry about it. Just let stdio use its small buffers. The point of the assignment is that the program doesn't need to load the whole data set into memory, not that it actively avoids doing so even when it's so small it happens by accident. \$\endgroup\$Peter Cordes– Peter Cordes2016年07月21日 01:38:11 +00:00Commented Jul 21, 2016 at 1:38
-
2\$\begingroup\$ @smrdo_prdo: 1. Yes, it's of course possible to disable input and output buffering for a stream, and JS1's answer says how. 3. Your program really does load the whole file into memory, but only for small inputs. IDK where you got the "no RAM" idea. 3. No, it's not possible to implement C on something that doesn't have RAM or something analogous. C requires being able to take the address of almost anything. (A CPU can have all the RAM it needs built-in, on the same silicon, so it doesn't need any external DRAM chips, but that's not the same thing). \$\endgroup\$Peter Cordes– Peter Cordes2016年07月21日 11:44:48 +00:00Commented Jul 21, 2016 at 11:44
Additions to previous 2 fine answers
Check I/O results. Robust check for errors, especially I/O. Code does use
if (dat == NULL), which is good. Consider checking other file I/O.size_t sajz = sizeof(float); // for convenience int n; // total number of floats in the file ... // fseek(dat, 0, SEEK_END); // n = ftell(dat)/sajz; // rewind(dat); if (fseek(dat, 0, SEEK_END)) Handle_Error(); long pos = ftell(dat); if (pos == -1) Handle_Error(); n = pos/sajz; rewind(dat); // No error returnUse
longrather thanint. Code is limited to filesLONG_MAXor less in size as code useslong ftell(). Yet code mixesintwithlongwith various calculations. This only makes a difference when 1)INT_MAX < LONG_MAX, and 2) the file is huge. Yet this is precisely the scenario to use an in-place sort: when the file is huge.// int n; // total number of floats in the file // ... // int i,j,h; long n; // total number of floats in the file ... long i,j,h;Hidden compare:
tx < tjmhinfor (; j>=h && tx < tjmh; j -= h). Suggest un-hiding it. The central computation of the entire code is this compare.// for (; j>=h && tx < tjmh; j -= h) for (; j>=h && cmp(tx, tjmh); j -= h) { // or for (; j>=h; j -= h) { if (tx < tjmh) break;Consider that code may want another compare. Example: With typical floating point, Not-a-number and -0.0 may occur. How will your code sort that, if
<is not to coding goals? Perhaps a helper function.int cmp_lt(float f1, float f2) { // first 3 if's return a result when f1, f1 are well ordered. (Neither NaN) if (f1 < f2) return 1; if (f1 > f2) return 0; if (f1 == f2) { if (f1 == 0.0f && (signbit(f1) != signbit(f2)) { // f1 and f2 are both some 0.0 (one + and one -) // let -0.0 be "less than" +0.0 return signbit(f1) == 0; } // Consider then equal return 0; } // at least 1 of f1,f2 is a NaN return isnan(f2); // Consider f1 "less than" f2 when f2 is NaN } for (; j>=h && cmp_lt(tx, tjmh); j -= h) {
-
\$\begingroup\$ I understand points 1 and 2. Can you explain what do you mean by hidden compare? For point 4 I have to read and learn a bit. Thanks! \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 13:01:01 +00:00Commented Jul 20, 2016 at 13:01
-
2\$\begingroup\$ @smrdo_prdo The point of code is to sort a file. The important code to sorting is the compare. The code
tx < tjmhshould be highlighted or made more obvious. When looking at the code, it was "hidden" in that it takes some time to find. \$\endgroup\$chux– chux2016年07月20日 13:05:46 +00:00Commented Jul 20, 2016 at 13:05 -
\$\begingroup\$ Ohhh, I see, got it. Never came across that before (nor would I think that by myself) . Thanks again! \$\endgroup\$smrdo_prdo– smrdo_prdo2016年07月20日 13:14:30 +00:00Commented Jul 20, 2016 at 13:14
mmap()be considered "loading the file into memory"? \$\endgroup\$floatdata", one first has to determine whether it's possible to unambiguously determine the non-portable bit representation that was used to write it, otherwise it's of no use... and even if you can figure that out, your system might differ, meaning you'd have to convert to its representation, which isn't necessarily trivial. All of which supports your conclusion that - like so many others - it's a "silly assignment". ;) \$\endgroup\$net2HostIEEE()andhost2NetIEEE()\$\endgroup\$