I have a text editor in C does not give much thought to losing data whilst overwriting the original file to save new progress. To circumvent this problem, and make the process robust, I ended up with some simple algorithms to load and save the file.
For loading the file (assume file.txt
exists):
- Delete backup files.
- Write the new content to a temporary file
file.txt.bak
. - Rename the temporary file
file.txt.bak
tofile.txt.bak.bak
to test rename priveleges. - Rename the original file
file.txt
tofile.txt.bak.bak.bak
. - Rename
file.txt.bak.bak
tofile.txt
. - Remove
file.txt.bak.bak.bak
.
In psuedo-code (child of C and Python), it looks like this:
bool save_file(file.txt)
{
remove(file.bak);
if not copyfile(file.txt, file.bak):
return false; // No problem, original file (file.txt) intact.
// Test rename priveleges
if not rename(file.bak, file.bak.bak):
return false; // No problem, original file (file.txt) + backup (file.bak) intact.
if not rename(file.txt, file.bak.bak.bak):
return false; // No problem, original file (file.txt) + backup (file.bak.bak) intact.
if not rename(file.bak.bak, file.txt):
return false; // No problem, original file (file.bak.bak.bak) + backup (file.bak.bak) intact.
remove(file.bak.bak.bak); // original file now removed. No backups remaining.
return true;
}
For saving the file:
- Check whether
file.bak
orfile.bak.bak
exists. - If neither exist, we have nothing to do.
- If one does, we most likely have unsaved progress. So prompt the user and ask whether the file should be replaced.
- If both do, pick the file with the most recently updated
st_mtime
, delete the other backup, and prompt as was done in the step before. - If the user responds with a yes, rename the file and then proceed to read it normally.
- Else do nothing, and proceed to read the file normally.
In psuedo-code, this is what it'd look like:
bool load_file(file.txt)
{
// If the user deliberately removes the backups between these checks and the following
// operations and loses some data, it is not our fault.
ideal_backup = get_ideal_backup();
if ideal_backup:
should_rename = prompt("Found a backup for 'file.txt'. Do you want to replace 'file.txt' with 'file...' Y/N?");
if should_rename:
rename(ideal_backup, file.txt):
// Do the reading of file.txt et cetera now.
return true;
}
Code:
The above would have been quite time-consuming to implement directly in C, so I resorted to Python, and ended up with this:
#!/usr/bin/env python3
import os
import sys
import shutil
def get_ideal_backup(fname: str) -> str | None:
first_bac = f"{fname}.bak"
second_bac = f"{fname}.bak.bak"
file_bak_exists = os.path.isfile(first_bac)
file_bak_bak_exists = os.path.isfile(second_bac)
if not (file_bak_exists or file_bak_bak_exists):
return None
ideal_backup = None
if file_bak_exists:
if file_bak_bak_exists:
if os.stat(second_bac).st_mtime_ns > os.stat(first_bac).st_mtime_ns:
os.remove(first_bac)
ideal_backup = second_bac
else:
os.remove(second_bac)
ideal_backup = first_bac
else:
ideal_backup = first_bac
return second_bac if not ideal_backup else ideal_backup
def load_file(fname: str) -> bool:
ideal_backup = get_ideal_backup(fname)
if ideal_backup:
# Now we can prompt and ask if it should be renamed.
print(f"[INFO] - Ideal backup is {ideal_backup}.\n")
else:
print("No backup found.\n")
# We are free to read fname now
return True
def remove_backup_silently(fname: str) -> None:
try:
os.remove(fname)
except FileNotFoundError:
pass
def save_file(fname: str) -> bool:
# We do not care if the backups do not exist.
remove_backup_silently(f"{fname}.bak")
remove_backup_silently(f"{fname}.bak.bak")
remove_backup_silently(f"{fname}.bak.bak.bak")
shutil.copyfile(fname, f"{fname}.bak")
# Pedantic - test rename priveleges
os.rename(f"{fname}.bak", f"{fname}.bak.bak")
os.rename(fname, f"{fname}.bak.bak.bak")
os.rename(f"{fname}.bak.bak", fname)
os.remove(f"{fname}.bak.bak.bak")
return True
def main() -> None:
if len(sys.argv) != 2:
print("error: no filename specified.")
print("[INFO] - Filename: {sys.argv[1]}.")
load_file(sys.argv[1])
save_file(sys.argv[1])
if __name__ == "__main__":
main()
It does not actually prompt for replacing the original file with the backup found or attempt to read the file, as that was not the main focus of the code.
Review Request:
Comparing to blindly overwriting the original file, I find much improvement with the above code, but there's always root for improvement.
What problems do you see with the algorithm? How can it be made more robust? Assuming it can, would it be worth the added complexity?
The main focus of this review is on the algorithm, and not my Python skills, though you're more than welcome to teach me a new thing or two.
And if it is relevant, the text editor only cares about UNIX-like systems, not Windows or the like.
2 Answers 2
short names
Rather than having N occurences of ".bak", consider tacking on a single digit: .bakN
No need to disturb the appearance of a multi-column file listing
with very long names.
Besides, some filesystems, and APIs for accessing filesystems,
impose limits on length of name or length of full pathname.
For example a tar
variant might impose such limits.
double rename?
I have often enough extolled the virtues of
rsync,
which (roughly) writes to .123readme
before
doing atomic rename to readme
.
Partial / interrupted writes, and full filesystems,
are definite dangers we should worry about.
But permission?
I'm inclined to believe that permission dangers can be probed for prior to starting a new operation. Certainly emacs gives me a heads up when I try to edit /etc/hosts as a mere mortal. (What fools these mortals be!) I have never encountered an rsync situation where double rename would have helped. It's common enough that a non-admin edits a file having group "admin", and can only create a result file having group "staff", but that seems a fair compromise. (In particular, prefer atomic rename over a risk-laden rewrite of the file.)
historic archive
VAX/VMS would hang on to old copies of files.
If a given directory had a /PURGE policy of 3
,
we might see versions
- readme;6
- readme;7
- readme;8
Since you already have done the I/O writing work, and have allocated the disk blocks, you might consider hanging on to a week's worth, or some number of, old copies. This becomes especially critical if a megabyte file is seen to shrink more than a few percent -- maybe it was a whoops!, or it could have been deliberate.
One could use dotfiles to avoid cluttering ls -l
displays.
But it's probably better to rename to some .history/
folder on the same filesystem, perhaps with subfolders
mirroring the source pathname's structure.
You definitely would want an easy way to prune
this on demand, since it's the first place a
sysad would go upon seeing a FS that is short on free space.
Retaining a backup for more than 24 hours can interact nicely with daily backups, so that even a month from now some accidentally botched file could be recovered.
ignored errors
I'm not sure I agree with the comment here.
def remove_backup_silently(fname: str) -> None:
try:
os.remove(fname)
except FileNotFoundError:
pass
def save_file(fname: str) -> bool:
# We do not care if the backups do not exist.
remove_backup_silently(f"{fname}.bak")
remove_backup_silently(f"{fname}.bak.bak")
remove_backup_silently(f"{fname}.bak.bak.bak")
Imagine the various backups are root owned, and therefore we can't remove them. Implying that those names are also not available for subsequent operations.
The post-condition we wish to make True is:
"These three filenames shall be available for use once we
successfully return."
So by DbC I
think I'd rather see a fatal raise
if there's permission trouble.
That is, when remove_backup_silently()
is
given a Path fname
, I'd like to see it silently return
in the case where not fname.exists()
.
But if it does exist, we should commit to killing it.
We should either see remove()
succeed,
or we should let it signal fatal error,
so as not to lead the caller's logic astray.
atomic rename
It's worth highlighting that there is something really beautiful about
os.rename(old, new)
Either it works or it doesn't. But in both cases, we have something "good enough" left behind, something that file readers can happily consume. And there is never a race condition, a time window, where a reader might attempt an open() only to encounter FileNotFound.
IDK, maybe the OP code offers similar guarantees. But it is complex. And as I listened to the Swedish Chef happily murmurring "bak! bak! bak!", I sometimes lost track of the safety guarantees, of whether at all times one good result or another is ready for readers.
mixing C and python
There are a few bool
returning python functions
that don't make a ton of sense, but I understand
they were stand-ins for C functions.
The distance between C++ and python is much shorter,
since C++ can throw exceptions.
The general flow you were shooting for
is "keep going, try this, try that, until failure
and then immediately return".
Which involves lots of tedious if
's in C,
and is naturally handled by fatal exception
in other languages.
In DbC terms, if a function finds it cannot deliver on a promise, cannot satisfy a post-condition, then it is as though "this never happened!" and all bets are off. Which composes nicely with callers that are trying to deliver on their own set of promises.
unique name
Some applications use "well known" filenames
when manipulating file
, for example
- vi creates
file.swp
- emacs creates
file~
- the various
.bak
files of this question's OP code
In contrast, some applications roll a random number to generate a "new", uncontested filename, never seen before on the given filesystem:
- djb's maildir scheme
- rsync's random suffix, creating e.g. a temp
.file.SjDndj23
name
Similar to the use of a GUID, the random names are race-free.
So we enjoy strong guarantees around open(new, "w")
and os.rename(new, dest)
.
Whereas the "well known" names force us to ensure
they are available before attempting to use them.
I have seen $ emacs file
and $ sudo emacs file
warring with one another, causing permission headaches.
-
\$\begingroup\$ I was reading the documentation for
os.remove()
and did not see any exception for the case where the file existed but could not be removed. Is it thatos.remove()
always succeeds in removing the file, assuming it exists? (Ignoring the fact that Windows will raise an exception if the file is open , or how UNIX will treat its storage.) If we open the man page forremove(3)
[man7.org/linux/man-pages/man3/remove.3.html], we see thatremove(3)
can fail for any of the errors specified forunlink(2)
which are a LOT. \$\endgroup\$Madagascar– Madagascar2024年06月16日 14:13:13 +00:00Commented Jun 16, 2024 at 14:13 -
\$\begingroup\$ The objective is to check whether the file existed and
os.remove()
failed to remove it. In C, I would just check the return value ofremove()
. \$\endgroup\$Madagascar– Madagascar2024年06月16日 14:18:19 +00:00Commented Jun 16, 2024 at 14:18 -
\$\begingroup\$ Re: "So we enjoy strong guarantees around
open(new, "w")
andos.rename(new, dest)
Yes, I was thinking about that too. But then how would the editor know which file to look for when searching for a backup, or if this certain backup was made by the editor, and not by some other means? \$\endgroup\$Madagascar– Madagascar2024年06月17日 01:51:24 +00:00Commented Jun 17, 2024 at 1:51 -
1\$\begingroup\$ @Harith
os.remove
may fail with anOSError
in many ways. All functions in this module raise OSError (or subclasses thereof) in the case of invalid or inaccessible file names and paths, or other arguments that have the correct type, but are not accepted by the operating system. (Note at the very top of the page you linked). It cannot always succeed at removing something OS does not allow to delete. \$\endgroup\$STerliakov– STerliakov2024年06月19日 15:23:34 +00:00Commented Jun 19, 2024 at 15:23
if not rename(file.bak.bak, file.txt):
return false; // No problem, original file (file.bak.bak.bak) + backup (file.bak.bak) intact.
remove(file.bak.bak.bak); // original file now removed. No backups remaining.
Two cases have not been considered when loading the file:
Renaming
file.bak.bak
tofile.txt
did not succeed. We are now left withfile.bak.bak.bak
andfile.bak.bak
. Inload_file()
, when we find that we havefile.bak.bak.bak
andfile.bak.bak
, we need to choosefile.bak.bak
.Removing
file.bak.bak.bak
failed. We are now left with an old version offile.txt
. This can be trivially differentiated from the second case by checking iffile.txt
is present, because if so,file.bak.bak.bak
is just an old version that can now be deleted.
if ideal_backup:
should_rename = prompt("Found a backup for 'file.txt'. Do you want to replace 'file.txt' with 'file...' Y/N?");
if should_rename:
rename(ideal_backup, file.txt):
The backup might be deleted between prompting and renaming. There are similar time windows between testing for the existence of the backups and the subsequent operations performed on them, but it is not the editor's fault if the client messes around with the backups and ends up losing data.