The job of my Unix executable is to perform a long computation, and I added an interrupt/resume feature to it, as explained below.
At regular intervals, the program writes all relevant data found so far in a checkpoint file, which can then be used as a starting point for a "resume" operation.
To interrupt the program, I use Ctrl+C.
The only problem with this approach is that, if the interruption occurs while the program is writing to the file, I am left with a useless half-written file.
The only fix I could find so far is as follows:
- make the program write into two files, so that at restart time one of them will be readable.
Is there a cleaner, better way to create an "interruptible" Unix executable?
- The thing you're asking about is called "checkpointing", so adding that word somewhere should allow future users to identify your question as relevant more easily. – James Youngman, Oct 17, 2016 at 8:28
- @JamesYoungman Unfortunately, "checkpointing" is not a tag here, and I do not have enough reputation to create a new tag. – Ewan Delanoy, Oct 17, 2016 at 8:33
- Your main concern, "when the program is writing into the file, I am left with a useless half written file", can be addressed by writing a temporary file and then atomically replacing the target file with it. This ensures data at the expected location is always consistent. – phg, Oct 17, 2016 at 8:52
- @phg I must confess that I do not really know what the "atomic" operations are in Unix. Could you clarify what you mean by "atomically replacing"? – Ewan Delanoy, Oct 17, 2016 at 11:27
- rcrowley.org/2010/01/06/things-unix-can-do-atomically.html – phg, Oct 17, 2016 at 11:50
2 Answers
It depends a bit on whether you care only about the program itself crashing, or about the whole system crashing.
In the first case, you could write the fresh data to a new file, and then rename that to the real name only after you're done writing. That way the file will contain either the previous or the new checkpoint data, but never only partial information. Partial writes should be rare in any case, assuming the checkpointing code itself is not likely to fail and the relevant signals are trapped so that the program saves a new checkpoint in full before exiting. (In addition to `SIGINT`, I think you'd better catch `SIGHUP` and `SIGTERM` too.)
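The write-then-rename idea can be sketched as follows. This is only a minimal illustration in Python, not the asker's actual program: the names `save_checkpoint` and `load_checkpoint`, the JSON format, and the `.checkpoint-` temporary prefix are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write state so that path holds the old or the new data, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target, because a
    # rename is only atomic within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
        # os.replace maps to rename(2) on POSIX: the name switches atomically.
        os.replace(tmp_path, path)
    except BaseException:
        # Interrupted or failed before the rename: drop the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def load_checkpoint(path):
    """Read back a checkpoint written by save_checkpoint."""
    with open(path) as f:
        return json.load(f)
```

If Ctrl+C arrives while the temporary file is being written, the previous checkpoint at `path` is left untouched, which is the property the answer describes.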
If we consider the possibility of the whole system crashing, then I wouldn't trust only one checkpoint file. The data is not likely to actually be on the disk when the write system call returns. Instead, the OS and the disk itself are likely to cache the data and actually write it some time later. So leaving one or two previous checkpoints in place would work as a failsafe against that.
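One possible way to keep a few previous checkpoints is to number them and fall back to an older one when the newest is unreadable. The sketch below is an assumption-laden illustration: the naming scheme (`base.00000007`), the retention count `KEEP`, the JSON format, and all function names are hypothetical choices, not something prescribed by the answer.

```python
import json
import os

KEEP = 3  # assumed number of checkpoint generations to retain


def checkpoint_name(base, generation):
    """Hypothetical naming scheme: base.00000000, base.00000001, ..."""
    return f"{base}.{generation:08d}"


def list_generations(base):
    """Return the generation numbers found on disk, oldest first."""
    directory = os.path.dirname(os.path.abspath(base))
    prefix = os.path.basename(base) + "."
    gens = []
    for name in os.listdir(directory):
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isdigit():
            gens.append(int(suffix))
    return sorted(gens)


def save_generation(state, base):
    """Write a new numbered checkpoint, then prune all but the newest KEEP."""
    gens = list_generations(base)
    new_gen = gens[-1] + 1 if gens else 0
    tmp_path = checkpoint_name(base, new_gen) + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, checkpoint_name(base, new_gen))
    for old in (gens + [new_gen])[:-KEEP]:
        os.unlink(checkpoint_name(base, old))


def load_latest(base):
    """Return the newest checkpoint that parses, or None if there is none."""
    for gen in reversed(list_generations(base)):
        try:
            with open(checkpoint_name(base, gen)) as f:
                return json.load(f)
        except (OSError, ValueError):
            continue  # damaged or truncated: try the previous generation
    return None
```

A real program would probably also want a checksum inside each file, since a truncated file is not the only way a crash can damage data.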
- The use of more than one file is a good idea. Whether the extra burden of writing and maintaining garbage collection is worthwhile depends on the value of the computation saved and the frequency of file-system-corrupting system crashes. – James Youngman, Oct 17, 2016 at 8:31
- Do you believe that `sync` and `fsync` aren't good enough to get the data written to the disk? – G-Man Says 'Reinstate Monica', Oct 20, 2016 at 5:51
- @G-Man, well for one, I don't trust all drives to actually write even though they might have promised. At least without battery-backup for the cache (or the whole drive). Second, after the O_PONIES fiasco, I'm a bit disillusioned about crash-safety. But yeah, if you have battery backup, trust your OS and remember to `fsync` both the new file and the containing directory after renaming a new file into place, then yeah, you might be safe. – ilkkachu, Oct 20, 2016 at 9:54
- In any case, since you need to keep the previous checkpoint in place while writing the current one (to have at least one copy remaining at all times), you might as well leave the previous file there until you're ready to start writing the next one. – ilkkachu, Oct 20, 2016 at 10:00
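The `fsync` advice from the comments (sync the new file, rename it, then sync the containing directory) might look roughly like this in Python. The function name `durable_replace` and the `.tmp` suffix are assumptions for the example, and, as noted above, none of this helps if the drive itself acknowledges writes it has not performed.

```python
import os


def durable_replace(data, path):
    """Replace path with data (bytes), asking the OS to push each step to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # assumed temporary name next to the target
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # flush Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write the file contents out
    os.replace(tmp_path, path)  # atomic rename into place
    # The rename is a change to the directory, so the directory is synced too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```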
You can use a signal handler to catch the `SIGINT` signal that is sent to the process when `Ctrl-C` is pressed. The process then isn't killed immediately; the signal handler is called instead, and in it you can write the results to a file. This is the general idea; in practice you may have some finer details to take care of.
- You shouldn't do I/O in a signal handler. In general you shouldn't do anything beyond setting a flag. – user207421, Oct 16, 2016 at 23:33
- EJP is right; you should only, for example, set a global variable `exiting` to `true`. This flag should then be checked at regular intervals, and if true, the appropriate action should be taken to save the state to a disk file. The challenge here is that the program should check the flag at sufficiently regular intervals, so that you don't miss the boat if the system is being shut down. – Johan Myréen, Oct 21, 2016 at 8:11
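The flag-based approach from these comments can be sketched as below, in Python for brevity. The handler only sets the global `exiting` flag; the main loop checks it after each unit of work and saves before returning. The names `request_exit`, `install_handlers`, and `run`, and the stand-in "work" of incrementing a counter, are assumptions for illustration only.

```python
import signal

exiting = False  # set by the handler, polled by the main loop


def request_exit(signum, frame):
    """Signal handler: no I/O here, only record that a stop was requested."""
    global exiting
    exiting = True


def install_handlers():
    """Route the usual termination signals to the flag-setting handler."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
        signal.signal(sig, request_exit)


def run(total_steps, save, start=0):
    """Do the work step by step; save and stop early once the flag is set.

    save is a caller-supplied function that writes the checkpoint.
    Returns the number of completed steps.
    """
    done = start
    for step in range(start, total_steps):
        done = step + 1  # stand-in for one unit of the real computation
        if exiting:
            save(done)   # checkpoint outside the handler, then stop
            return done
    save(done)
    return done
```

A program would call `install_handlers()` once at startup and then `run(...)`. Each unit of work has to stay short enough that the flag is polled well before a shutdown escalates to `SIGKILL`.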