Python script to delete sections of text

Question 1

I've been writing a python script to help me clear up some files that contain outdated code, while also serving as a way to learn python. These files have sections of code surrounded with tags on their own lines, I need it to remove both the tags and everything within. While this largely applies to XML documents I am also using this script for several other file types, so XML specific solutions aren't suitable.

I have a working script already but it doesn't look like the most elegant or efficient solution by far, so I would like to see how I can improve it.

Script is called as python cleanup.py startDeleteTag stopDeleteTag where the last two arguments are the locations where the code should be deleted and stop being deleted respectfully.

import os, sys
def clearFile(path, beginDelete, stopDelete):
 path = os.path.join(path, "fileName.xml")
 input = open(path, "r")
 lines = input.readlines()
 input.close()
 currentFlag = False
 nextFlag = False
 output = open(path, "w")
 for line in lines:
 if nextFlag == True:
 nextFlag = False
 deleteFlag = False
 if beginDelete in line:
 deleteFlag = True
 elif stopDelete in line:
 nextFlag = True
 if deleteFlag == False:
 output.write(line)
thisDirectory = os.getcwd()
for start, dirs, files in os.walk(thisDirectory):
 for f in files:
 if f == "fileName.xml":
 clearFile(thisDirectory, sys.argv[1], sys.argv[2])

This is the part of the script that checks each line for the start of the section to be deleted (beginDelete to stopDelete). Because this would leave the end tag in the file I have to use a couple of booleans to specify when the current line needs to be removed but the next line shouldn't be. As an extra, is there a way to improve this to further check for tags that may be on the same line as body of the code? This isn't currently an issue but I'm curious how I would do it if I needed to in future.

As requested, and example file input would look like this:

<good>
 keep this
</good>
<bad>
 delete this
</bad>
<good>
 keep this
 <bad>
 delete this
 </bad>
</good>

And this would produce the output:

<good>
 keep this
</good>
<good>
 keep this
</good>

Question 2

This code is incomplete. What are beginDelete and stopDelete (assuming '<bad>' and '</bad>' from your input file, but who knows). I would also bet that file has nothing to do with the builtin, so what is it?

Question 3

I've changed file to outout, it's simply the file that the output is being written too. As stated in the explanation the beginDelete and stopDelete are the tags that surround the lines that should be deleted (this tag is a variable that is specified when the script starts).

Question 4

What I was asking was more how are they defined rather than what they are. Is your snipet of code the body of a function where path, beginDelete, stopDelete, and output are parameters? Or is it just a part of your module, verbatim, and these variables are defined earlier? In short, as @SuperBiasedMan said, give us context.

Question 5

I've added further context, hopefully this is enough to explain how the current script works. I'm more or less happy with the rest, it's the clearFile function that I feel could be greatly improved

Question 6

Sorry, overwrote output with file in my last edit, corrected. I don't know what you mean about the code not executing, that part of the code is copy pasted (just with comments removed) and it works as intended, I just want a more concise way of writing it

Question 7

Program flow

You have a lot of booleans being thrown around and you could clear out a lot. if nextFlag == True is the same as if nextFlag, and the latter looks neater so stick with it.

For the main flow, you have it in a bit of a mess. You should have only one flag, which denotes lines should be deleted or skipped. Your two flags are entirely confusing the process. Think about it this way, you need a flag to indicate that a part should be skipped. You don't need two values indicating the same thing. So how does that work?

Well obviously you start with looking for beginDelete. We need to first check if deleteFlag is True or False. If it's True we want it to remain as it is until stopDelete is found and skip. So if deleteFlag isn't True, then we set it to be the result of beginDelete in line. You can assign expressions to variables, even if it's a boolean expression. So on this basis we don't need another if check, we can just set deleteFlag to equal this result directly. Obviously this means the flag is now True if we've found beginDelete but False otherwise.

for line in lines:
 if not deleteFlag:
 deleteFlag = beginDelete in line

Now, we also need to find stopDelete to turn the flag False when it's found. This only needs to run if deleteFlag is True, so it should be an else condition:

for line in lines:
 if not deleteFlag:
 deleteFlag = beginDelete in line
 else:
 deleteFlag = stopDelete not in line
 continue

Now you can see that deleteFlag is being set based on whether or not stopDelete has been found yet. We're using not so that deleteFlag remains True until stopDelete is found. Once that happens, deleteFlag is once again False. However, since you don't want to write the line where you've just found stopDelete then you should use continue to skip to the next iteration of the line. continue is a keyword that tells a loop to immediately go to the next iteration, ignoring any remaining code in the loop's block.

Now, writing to the file is easy. You just want to write if not deleteFlag, ie. if the flag isn't True.

 if not deleteFlag:
 output.write(line)

Style Notes

Stick to 4 space indentation. It's the Python standard and far more readable for people. Speaking of standards, Python naming uses snake_case, so beginDelete should be begin_delete etc.

Your names could be better too. You're not really deleting anything. You're 'skipping' or 'ignoring'. You don't really need to include flag in the name, using ignore would be pretty clear. Also avoid using input as that shadows the builtin Python method of the same name. Instead input_file or in_file. With such a brief usage, you can also use f.

Speaking of files, you should use with when opening files. It's a context manager that makes file manipulation safer. When using open you can easily run into trouble if you don't close the file. Sometimes that can happen due to errors, or if you forget to. Which it looks like you actually did with output. Using with means that a file is always automatically closed, even in the case of errors or exceptions, so it's a more reliable method. This is how you'd use it for lines.

with open(path, "r") as f:
 f.readlines()

The indentation signals how long to leave the file open, and then it's closed once you move out of that block. You should do the same for output, also open defaults to opening in "r" mode so you don't need to specify the argument there.

Also in your final os.walk loop you assign start and dirs but don't seem to use them. If you don't need those values, consider using the names _ and __. That's a Python way of saying that variables are unused throwaways, so people reading your code know they don't matter.

Question 8

Just to note, when comparing to singletons, True, False, None, you should use is. So if nextFlag == True should be if nextFlag is True. There is then the exception SuperBiasedMan points out, that Booleans, a sub-set of Singletons, should be compared implicitly. E.g. if nextFlag. So when you compare to say None you should use if myVar is None.

Question 9

You can also turn deleteFlag into an integer to account for nested flags, just in case. Better be safe than sorry. Something like deleteFlag += int(beginDelete in line); if deleteFlag: deleteFlag -= int(stopDelete in line); else: output.write(line).

Question 10

@MathiasEttinger This is good advice, though you'd also want to adjust how you search for closing tags. If you're going to account for nesting you should also account for having the opening and closing tags on the same line.

Question 11

Had that in mind too but not enought space inside that comment box to phrase it well.

Question 12

Thanks for the assistance, without knowing more than the very basics of python meant I couldn't find a way to also delete the ending tag without using multiple flags, continue is exactly what I need. I'm also implementing the other changes you recommended, again these largely seem to be slightly more advanced syntax that I didn't recognise.

Question 13

Iterating

Flag variables make stylistically poor code: you test for some condition, set some flag, with the understanding that some other code will later check the flag and vary its behaviour as a result. Wherever possible, it would be better to just do what you need to do.

This is a particularly tricky loop to reformulate. Since you are trying to skip over some items as you iterate, I'd drop the for ... in ... loop so that we can access the underlying iterator directly.

Decomposing the problem

Splitting off some of the work into a clearFile() function is a good idea. However, clearFile() is still quite complex: not only does it need to figure out what lines to include or exclude, it is also responsible for assembling the filename, opening the file for reading, and replacing the file's contents.

Just figuring out which lines to include or exclude is a complex enough problem that it deserves to be its own function. I'd make it a generator: the function yields lines to be included as it encounters them. The caller can decide what to do with those lines.

Another sign that the responsibilities are suboptimally divided is that "fileName.xml" is hard-coded in two places.

Input / output

Instead of calling open() twice on the file, just open() it once in 'r+' mode.

You can just use '.' to refer to the current directory.

Style

PEP 8 specifies four spaces for indentation. Considering that whitespace is significant in Python, this is an important convention to follow.

PEP 8 also states that under_scores are preferred.

clearFile() suggests a more drastic operation than just redacting lines between markers.

In my personal opinion, the names beginDelete and stopDelete feel mismatched. more natural word pairs are begin–end and start–stop.

Program flow

You have a lot of booleans being thrown around and you could clear out a lot. if nextFlag == True is the same as if nextFlag, and the latter looks neater so stick with it.

For the main flow, you have it in a bit of a mess. You should have only one flag, which denotes lines should be deleted or skipped. Your two flags are entirely confusing the process. Think about it this way, you need a flag to indicate that a part should be skipped. You don't need two values indicating the same thing. So how does that work?

Well obviously you start with looking for beginDelete. We need to first check if deleteFlag is True or False. If it's True we want it to remain as it is until stopDelete is found and skip. So if deleteFlag isn't True, then we set it to be the result of beginDelete in line. You can assign expressions to variables, even if it's a boolean expression. So on this basis we don't need another if check, we can just set deleteFlag to equal this result directly. Obviously this means the flag is now True if we've found beginDelete but False otherwise.

for line in lines:
 if not deleteFlag:
 deleteFlag = beginDelete in line

Now, we also need to find stopDelete to turn the flag False when it's found. This only needs to run if deleteFlag is True, so it should be an else condition:

for line in lines:
 if not deleteFlag:
 deleteFlag = beginDelete in line
 else:
 deleteFlag = stopDelete not in line
 continue

Now you can see that deleteFlag is being set based on whether or not stopDelete has been found yet. We're using not so that deleteFlag remains True until stopDelete is found. Once that happens, deleteFlag is once again False. However, since you don't want to write the line where you've just found stopDelete then you should use continue to skip to the next iteration of the line. continue is a keyword that tells a loop to immediately go to the next iteration, ignoring any remaining code in the loop's block.

Now, writing to the file is easy. You just want to write if not deleteFlag, ie. if the flag isn't True.

 if not deleteFlag:
 output.write(line)

Style Notes

Stick to 4 space indentation. It's the Python standard and far more readable for people. Speaking of standards, Python naming uses snake_case, so beginDelete should be begin_delete etc.

Your names could be better too. You're not really deleting anything. You're 'skipping' or 'ignoring'. You don't really need to include flag in the name, using ignore would be pretty clear. Also avoid using input as that shadows the builtin Python method of the same name. Instead input_file or in_file. With such a brief usage, you can also use f.

Speaking of files, you should use with when opening files. It's a context manager that makes file manipulation safer. When using open you can easily run into trouble if you don't close the file. Sometimes that can happen due to errors, or if you forget to. Which it looks like you actually did with output. Using with means that a file is always automatically closed, even in the case of errors or exceptions, so it's a more reliable method. This is how you'd use it for lines.

with open(path, "r") as f:
 f.readlines()

The indentation signals how long to leave the file open, and then it's closed once you move out of that block. You should do the same for output, also open defaults to opening in "r" mode so you don't need to specify the argument there.

Also in your final os.walk loop you assign start and dirs but don't seem to use them. If you don't need those values, consider using the names _ and __. That's a Python way of saying that variables are unused throwaways, so people reading your code know they don't matter.

Just to note, when comparing to singletons, True, False, None, you should use is. So if nextFlag == True should be if nextFlag is True. There is then the exception SuperBiasedMan points out, that Booleans, a sub-set of Singletons, should be compared implicitly. E.g. if nextFlag. So when you compare to say None you should use if myVar is None.
You can also turn deleteFlag into an integer to account for nested flags, just in case. Better be safe than sorry. Something like deleteFlag += int(beginDelete in line); if deleteFlag: deleteFlag -= int(stopDelete in line); else: output.write(line).
@MathiasEttinger This is good advice, though you'd also want to adjust how you search for closing tags. If you're going to account for nesting you should also account for having the opening and closing tags on the same line.
Had that in mind too but not enought space inside that comment box to phrase it well.
Thanks for the assistance, without knowing more than the very basics of python meant I couldn't find a way to also delete the ending tag without using multiple flags, continue is exactly what I need. I'm also implementing the other changes you recommended, again these largely seem to be slightly more advanced syntax that I didn't recognise.

200_success 200_success 146k22 gold badges190 silver badges479 bronze badges · Answer 2 · 2015-11-04 19:54:17Z

Iterating

Flag variables make stylistically poor code: you test for some condition, set some flag, with the understanding that some other code will later check the flag and vary its behaviour as a result. Wherever possible, it would be better to just do what you need to do.

This is a particularly tricky loop to reformulate. Since you are trying to skip over some items as you iterate, I'd drop the for ... in ... loop so that we can access the underlying iterator directly.

Decomposing the problem

Splitting off some of the work into a clearFile() function is a good idea. However, clearFile() is still quite complex: not only does it need to figure out what lines to include or exclude, it is also responsible for assembling the filename, opening the file for reading, and replacing the file's contents.

Just figuring out which lines to include or exclude is a complex enough problem that it deserves to be its own function. I'd make it a generator: the function yields lines to be included as it encounters them. The caller can decide what to do with those lines.

Another sign that the responsibilities are suboptimally divided is that "fileName.xml" is hard-coded in two places.

Input / output

Instead of calling open() twice on the file, just open() it once in 'r+' mode.

You can just use '.' to refer to the current directory.

Style

PEP 8 specifies four spaces for indentation. Considering that whitespace is significant in Python, this is an important convention to follow.

PEP 8 also states that under_scores are preferred.

clearFile() suggests a more drastic operation than just redacting lines between markers.

In my personal opinion, the names beginDelete and stopDelete feel mismatched. more natural word pairs are begin–end and start–stop.

Stack Exchange Network

Python script to delete sections of text

3 Answers 3

Program flow

Style Notes

Iterating

Decomposing the problem

Input / output

Style

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python script to delete sections of text

3 Answers 3

Program flow

Style Notes

Iterating

Decomposing the problem

Input / output

Style

Suggested solution

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions