I've been writing a python script to help me clear up some files that contain outdated code, while also serving as a way to learn python. These files have sections of code surrounded with tags on their own lines, I need it to remove both the tags and everything within. While this largely applies to XML documents I am also using this script for several other file types, so XML specific solutions aren't suitable.
I have a working script already but it doesn't look like the most elegant or efficient solution by far, so I would like to see how I can improve it.
Script is called as python cleanup.py startDeleteTag stopDeleteTag
where the last two arguments are the locations where the code should be deleted and stop being deleted respectfully.
import os, sys
def clearFile(path, beginDelete, stopDelete):
path = os.path.join(path, "fileName.xml")
input = open(path, "r")
lines = input.readlines()
input.close()
currentFlag = False
nextFlag = False
output = open(path, "w")
for line in lines:
if nextFlag == True:
nextFlag = False
deleteFlag = False
if beginDelete in line:
deleteFlag = True
elif stopDelete in line:
nextFlag = True
if deleteFlag == False:
output.write(line)
thisDirectory = os.getcwd()
for start, dirs, files in os.walk(thisDirectory):
for f in files:
if f == "fileName.xml":
clearFile(thisDirectory, sys.argv[1], sys.argv[2])
This is the part of the script that checks each line for the start of the section to be deleted (beginDelete
to stopDelete
). Because this would leave the end tag in the file I have to use a couple of booleans to specify when the current line needs to be removed but the next line shouldn't be. As an extra, is there a way to improve this to further check for tags that may be on the same line as body of the code? This isn't currently an issue but I'm curious how I would do it if I needed to in future.
As requested, and example file input would look like this:
<good>
keep this
</good>
<bad>
delete this
</bad>
<good>
keep this
<bad>
delete this
</bad>
</good>
And this would produce the output:
<good>
keep this
</good>
<good>
keep this
</good>
3 Answers 3
Program flow
You have a lot of booleans being thrown around and you could clear out a lot. if nextFlag == True
is the same as if nextFlag
, and the latter looks neater so stick with it.
For the main flow, you have it in a bit of a mess. You should have only one flag, which denotes lines should be deleted or skipped. Your two flags are entirely confusing the process. Think about it this way, you need a flag to indicate that a part should be skipped. You don't need two values indicating the same thing. So how does that work?
Well obviously you start with looking for beginDelete
. We need to first check if deleteFlag
is True
or False
. If it's True
we want it to remain as it is until stopDelete
is found and skip. So if deleteFlag
isn't True
, then we set it to be the result of beginDelete in line
. You can assign expressions to variables, even if it's a boolean expression. So on this basis we don't need another if
check, we can just set deleteFlag
to equal this result directly. Obviously this means the flag is now True
if we've found beginDelete
but False
otherwise.
for line in lines:
if not deleteFlag:
deleteFlag = beginDelete in line
Now, we also need to find stopDelete
to turn the flag False
when it's found. This only needs to run if deleteFlag
is True
, so it should be an else
condition:
for line in lines:
if not deleteFlag:
deleteFlag = beginDelete in line
else:
deleteFlag = stopDelete not in line
continue
Now you can see that deleteFlag
is being set based on whether or not stopDelete
has been found yet. We're using not
so that deleteFlag
remains True
until stopDelete
is found. Once that happens, deleteFlag
is once again False
. However, since you don't want to write the line where you've just found stopDelete
then you should use continue
to skip to the next iteration of the line. continue
is a keyword that tells a loop to immediately go to the next iteration, ignoring any remaining code in the loop's block.
Now, writing to the file is easy. You just want to write if not deleteFlag
, ie. if the flag isn't True
.
if not deleteFlag:
output.write(line)
Style Notes
Stick to 4 space indentation. It's the Python standard and far more readable for people. Speaking of standards, Python naming uses snake_case, so beginDelete
should be begin_delete
etc.
Your names could be better too. You're not really deleting anything. You're 'skipping' or 'ignoring'. You don't really need to include flag
in the name, using ignore
would be pretty clear. Also avoid using input
as that shadows the builtin Python method of the same name. Instead input_file
or in_file
. With such a brief usage, you can also use f
.
Speaking of files, you should use with
when opening files. It's a context manager that makes file manipulation safer. When using open
you can easily run into trouble if you don't close
the file. Sometimes that can happen due to errors, or if you forget to. Which it looks like you actually did with output
. Using with
means that a file is always automatically closed, even in the case of errors or exceptions, so it's a more reliable method. This is how you'd use it for lines
.
with open(path, "r") as f:
f.readlines()
The indentation signals how long to leave the file open, and then it's closed once you move out of that block. You should do the same for output
, also open
defaults to opening in "r"
mode so you don't need to specify the argument there.
Also in your final os.walk
loop you assign start
and dirs
but don't seem to use them. If you don't need those values, consider using the names _
and __
. That's a Python way of saying that variables are unused throwaways, so people reading your code know they don't matter.
-
\$\begingroup\$ Just to note, when comparing to singletons,
True
,False
,None
, you should useis
. Soif nextFlag == True
should beif nextFlag is True
. There is then the exception SuperBiasedMan points out, that Booleans, a sub-set of Singletons, should be compared implicitly. E.g.if nextFlag
. So when you compare to sayNone
you should useif myVar is None
. \$\endgroup\$2015年11月04日 18:47:16 +00:00Commented Nov 4, 2015 at 18:47 -
2\$\begingroup\$ You can also turn
deleteFlag
into an integer to account for nested flags, just in case. Better be safe than sorry. Something likedeleteFlag += int(beginDelete in line); if deleteFlag: deleteFlag -= int(stopDelete in line); else: output.write(line)
. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2015年11月04日 19:11:44 +00:00Commented Nov 4, 2015 at 19:11 -
\$\begingroup\$ @MathiasEttinger This is good advice, though you'd also want to adjust how you search for closing tags. If you're going to account for nesting you should also account for having the opening and closing tags on the same line. \$\endgroup\$SuperBiasedMan– SuperBiasedMan2015年11月04日 20:15:37 +00:00Commented Nov 4, 2015 at 20:15
-
\$\begingroup\$ Had that in mind too but not enought space inside that comment box to phrase it well. \$\endgroup\$301_Moved_Permanently– 301_Moved_Permanently2015年11月04日 20:24:06 +00:00Commented Nov 4, 2015 at 20:24
-
\$\begingroup\$ Thanks for the assistance, without knowing more than the very basics of python meant I couldn't find a way to also delete the ending tag without using multiple flags,
continue
is exactly what I need. I'm also implementing the other changes you recommended, again these largely seem to be slightly more advanced syntax that I didn't recognise. \$\endgroup\$Psycrow– Psycrow2015年11月05日 09:42:47 +00:00Commented Nov 5, 2015 at 9:42
Iterating
Flag variables make stylistically poor code: you test for some condition, set some flag, with the understanding that some other code will later check the flag and vary its behaviour as a result. Wherever possible, it would be better to just do what you need to do.
This is a particularly tricky loop to reformulate. Since you are trying to skip over some items as you iterate, I'd drop the for ... in ...
loop so that we can access the underlying iterator directly.
Decomposing the problem
Splitting off some of the work into a clearFile()
function is a good idea. However, clearFile()
is still quite complex: not only does it need to figure out what lines to include or exclude, it is also responsible for assembling the filename, opening the file for reading, and replacing the file's contents.
Just figuring out which lines to include or exclude is a complex enough problem that it deserves to be its own function. I'd make it a generator: the function yield
s lines to be included as it encounters them. The caller can decide what to do with those lines.
Another sign that the responsibilities are suboptimally divided is that "fileName.xml"
is hard-coded in two places.
Input / output
Instead of calling open()
twice on the file, just open()
it once in 'r+'
mode.
You can just use '.'
to refer to the current directory.
Style
PEP 8 specifies four spaces for indentation. Considering that whitespace is significant in Python, this is an important convention to follow.
PEP 8 also states that under_scores
are preferred.
clearFile()
suggests a more drastic operation than just redacting lines between markers.
In my personal opinion, the names beginDelete
and stopDelete
feel mismatched. more natural word pairs are begin–end and start–stop.
Suggested solution
import os
import sys
def filter_lines(f, start_delete, stop_delete):
"""
Given a file handle, generate all lines except those between the specified
text markers.
"""
lines = iter(f)
try:
while True:
line = next(lines)
if start_delete in line:
# Discard all lines up to and including the stop marker
while stop_delete not in line:
line = next(lines)
line = next(lines)
yield line
except StopIteration:
return
def redact_files(start_delete, stop_delete):
"""
Edit all files named "fileName.xml" under the current directory, stripping
out text between the specified text markers.
"""
for dirpath, dirnames, filenames in os.walk('.'):
if 'fileName.xml' in filenames:
with open(os.path.join(dirpath, 'fileName.xml'), 'r+') as f:
filtered = list(filter_lines(f, start_delete, stop_delete))
f.seek(0)
f.writelines(filtered)
f.truncate()
if __name__ == '__main__':
redact_files(sys.argv[1], sys.argv[2])
As @SuperBiasedMan pointed out, you only need one flag. The logic for this flag can be simplified to just this:
for line in lines:
if beginDelete in line:
deleteFlag = True
if stopDelete in line:
deleteFlag = False
(You can set and unset the flag without checking what its current state is.)
beginDelete
andstopDelete
(assuming'<bad>'
and'</bad>'
from your input file, but who knows). I would also bet thatfile
has nothing to do with the builtin, so what is it? \$\endgroup\$file
tooutout
, it's simply the file that the output is being written too. As stated in the explanation thebeginDelete
andstopDelete
are the tags that surround the lines that should be deleted (this tag is a variable that is specified when the script starts). \$\endgroup\$path
,beginDelete
,stopDelete
, andoutput
are parameters? Or is it just a part of your module, verbatim, and these variables are defined earlier? In short, as @SuperBiasedMan said, give us context. \$\endgroup\$clearFile
function that I feel could be greatly improved \$\endgroup\$output
withfile
in my last edit, corrected. I don't know what you mean about the code not executing, that part of the code is copy pasted (just with comments removed) and it works as intended, I just want a more concise way of writing it \$\endgroup\$