Best way to create "interruptable" executable

Question 1

My Unix executable performs a long computation, and I added an interrupt/resume functionality to it, as explained below.

At regular intervals, the program writes all relevant data found so far in a checkpoint file, which can then be used as a starting point for a "resume" operation.

To interrupt the program, I use Ctrl+C.
The only problem with this approach is that, if the interruption occurs while the program is writing to the file, I am left with a useless half-written file.

The only fix I could find so far is as follows:

Make the program write into two files, so that at restart time one of them will be readable.

Is there a cleaner, better way to create an "interruptable" Unix executable?

Question 2

The thing you're asking about is called "checkpointing" so adding that word somewhere should allow future users to identify your question as relevant more easily.

Question 3

@JamesYoungman Unfortunately, "checkpointing" is not a tag here, and I do not have enough reputation to create a new tag.

Question 4

Your main concern: "when the program is writing into the file, I am left with a useless half written file" can be addressed by writing a temporary file and then atomically replacing the target file with it. This ensures data at the expected location is always consistent.

Question 5

@phg I must confess that I do not really know what the "atomic" operations are in Unix. Could you clarify what you mean by "atomically replacing"?

Question 6

rcrowley.org/2010/01/06/things-unix-can-do-atomically.html

Question 7

It depends a bit on whether you care only about the program itself crashing, or about the whole system crashing.

In the first case, you could write the fresh data to a new file, and then rename that to the real name only after you're done writing. That way the file will contain either the previous or the new checkpoint data, but never only partial information. Partial writes should be rare in any case, provided the checkpointing code itself is unlikely to fail and the relevant signals are trapped so that the program saves a complete checkpoint before exiting. (In addition to SIGINT, I think you'd better catch SIGHUP and SIGTERM too.)
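The write-then-rename idea can be sketched in C roughly as follows. This is only an illustration, not code from the answer: the function names `save_checkpoint` and `write_all`, the `.tmp` suffix, and the 4096-byte path buffer are all assumptions made for the example. It relies on POSIX `rename()` replacing the destination atomically, so a reader of `path` sees either the old or the new checkpoint.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write len bytes from buf to fd, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write the data to "<path>.tmp", then rename it
 * over path. If the process dies mid-write, only the temporary file is
 * incomplete; path still holds the previous checkpoint. */
int save_checkpoint(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write_all(fd, (const char *)data, len) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* The atomic step: path now refers to the complete new file. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Note that this sketch only protects against the program being killed; it does not call `fsync`, so it makes no promise about what survives a power loss or kernel crash.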

If we consider the possibility of the whole system crashing, then I wouldn't trust only one checkpoint file. The data is not likely to actually be on the disk when the write system call returns. Instead, the OS and the disk itself are likely to cache the data and actually write it some time later. So leaving one or two previous checkpoints in place would work as a failsafe against that.

Question 8

The use of more than one file is a good idea. Whether the extra burden of writing and maintaining garbage collection is worthwhile depends on the value of the computation saved and the frequency of file-system-corrupting system crashes.

Question 9

Do you believe that sync and fsync aren't good enough to get the data written to the disk?

Question 10

@G-Man, well for one, I don't trust all drives to actually write even though they might have promised. At least without battery-backup for the cache (or the whole drive). Second, after the O_PONIES fiasco, I'm a bit disillusioned about crash-safety. But yeah, if you have battery backup, trust your OS and remember to fsync both the new file and the containing directory after renaming a new file into place, then yeah, you might be safe.
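For readers unfamiliar with the "fsync both the file and the directory" advice, here is a rough C sketch. The names `publish_checkpoint` and `fsync_path` are hypothetical, and the ordering used here (flush the file before the rename, flush the directory after it) is one common convention, assumed for the example rather than taken from the comment. It also assumes a system such as Linux where `fsync` works on a descriptor opened read-only, including a directory. As the comment says, none of this helps if the drive itself acknowledges writes it has not performed.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Ask the OS to flush one path (regular file or directory) to stable
 * storage. Returns 0 on success, -1 on error. */
static int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/* Hypothetical helper: move an already written temporary file into
 * place inside dir, flushing the data and the directory entry. */
int publish_checkpoint(const char *dir, const char *tmp_name,
                       const char *final_name)
{
    char src[4096], dst[4096];
    int a = snprintf(src, sizeof src, "%s/%s", dir, tmp_name);
    int b = snprintf(dst, sizeof dst, "%s/%s", dir, final_name);
    if (a < 0 || b < 0 || (size_t)a >= sizeof src || (size_t)b >= sizeof dst)
        return -1;

    if (fsync_path(src) != 0)      /* file contents reach the disk first */
        return -1;
    if (rename(src, dst) != 0)     /* atomic replacement of the old file */
        return -1;
    return fsync_path(dir);        /* then the updated directory entry */
}
```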

Question 11

In any case, since you need to keep the previous checkpoint in place while writing the current one (to have at least one copy remaining at all times), you might as well leave the previous file there until you're ready to start writing the next one.

Question 12

You can catch the SIGINT signal that is sent to the process when Ctrl-C is pressed by installing a signal handler. Then the process isn't killed immediately; the signal handler is called instead. In the signal handler you can then write the results to a file. This is the general idea; in practice you may have some finer details to take care of.

Question 13

You shouldn't do I/O in a signal handler. In general you shouldn't do anything beyond setting a flag.

Question 14

EJP is right: you should only, for example, set a global variable exiting to true. This flag should then be checked at regular intervals, and if it is true, the appropriate action should be taken to save the state to a disk file. The challenge here is that the program should check the flag at sufficiently regular intervals, so that you don't miss the boat if the system is being shut down.
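A minimal C sketch of the flag approach might look like this. The variable name `exiting` comes from the comment above; `install_handlers`, `on_signal`, and `run` are names invented for the example, and the loop body is a placeholder for one unit of the real computation.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* The only thing the handler touches. sig_atomic_t is the type the C
 * standard guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t exiting = 0;

static void on_signal(int signo)
{
    (void)signo;
    exiting = 1;   /* no I/O here, just record the request */
}

/* Install the same handler for SIGINT, SIGHUP and SIGTERM.
 * Returns 0 on success, -1 on error. */
int install_handlers(void)
{
    const int signals[] = { SIGINT, SIGHUP, SIGTERM };
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);

    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++) {
        if (sigaction(signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}

/* Run up to max_steps units of work, stopping early once a signal has
 * been seen. Returns the number of steps completed. */
long run(long max_steps)
{
    long step = 0;
    while (step < max_steps && !exiting) {
        /* one unit of the long computation would go here */
        step++;
    }
    /* Normal program context again: this is where the checkpoint
     * would be written before exiting. */
    return step;
}
```

How quickly the program reacts depends on how long one unit of work takes, since the flag is only examined between units.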

ilkkachu · Accepted Answer · 2016-10-16 18:35:53Z


