3
\$\begingroup\$

I'm processing strings with regexes across a set of files in a directory. For each line in a file, I run a series of try/except blocks that attempt to match a pattern; if the pattern matches the whole line, I transform the line. Once every transformation has been tried, I write the line to a new file. I have many of these try/except blocks, each followed by an if-statement (I've included only two here as an illustration). The problem is that after a few files, the script slows down so much that processing almost stalls. I don't know what in my code is causing the slowdown, but I suspect the combination of try/except blocks and if-statements. How can I streamline the transformations so that the data is processed at a reasonable speed?

Or is it that I need a more efficient iterator that does not tax memory to the same extent?

Any feedback would be much appreciated!

import re
import glob

fileCounter = 0
for infile in glob.iglob(r'\input-files\*.txt'):
    fileCounter += 1
    outfile = r'\output-files\output_%s.txt' % fileCounter
    with open(infile, "rb") as inlist, open(outfile, "wb") as outlist:
        for inline in inlist:
            inword = inline.strip('\r\n')
            #apply some text transformations
            #Transformation #1
            try: result = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', inword)
            #Transformation #2 etc.
            try: result = re.match('(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])', '', inword)
                inword = re.sub('(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])', '[=]', inword)
            outline = inword + "\n"
            outlist.write(outline)
    print "Processed file number %s" % fileCounter
print "*** Processing completed ***"
BenC
asked Sep 7, 2017 at 12:34
\$\endgroup\$

1 Answer

1
\$\begingroup\$
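  • It's not entirely clear which exceptions you expect, but I presume you are handling the case where the regex does not match. In that case re.match returns None, and calling .group() on it raises AttributeError; a bare except also hides any other error. I suggest testing the match object explicitly: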

     #Transformation #1
     match = re.match(pattern, inword)
     result = match.group() if match else None
    
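  • "after processing a few files, the script slows down": have you considered that a particular file, or even a particular line, might be slow to process? One possible explanation is catastrophic backtracking: a regex that nests quantifiers, such as the (.*\[=\].*)* groups in your patterns, can take time that grows exponentially with line length when a line almost matches but ultimately fails.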
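
As a minimal sketch of the first point, the guard and substitution patterns can be compiled once and applied from a table. This assumes Python 3 and assumes that each transformation should fire only when the guard matches the whole line, as the result == inword check in the question implies. The names RULES and apply_rules are hypothetical, and only the two transformations from the question are included.

```python
import re

# Each rule: (guard pattern, [(substitution pattern, replacement), ...]).
# The patterns are copied from the question and compiled once, up front.
RULES = [
    (
        re.compile(r'^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*'),
        [
            (re.compile(r'(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])'), '[=]'),
        ],
    ),
    (
        re.compile(r'(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*'),
        [
            (re.compile(r'(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])'), ''),
            (re.compile(r'(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])'), '[=]'),
        ],
    ),
]


def apply_rules(word, rules=RULES):
    """Apply, in order, each rule whose guard matches the whole word."""
    for guard, edits in rules:
        match = guard.match(word)
        # match is None when the guard fails, so no try/except is needed.
        if match is None or match.group() != word:
            continue
        for pattern, replacement in edits:
            word = pattern.sub(replacement, word)
    return word
```

Inside the file loop, the body for one line would then reduce to something like outlist.write(apply_rules(inword) + "\n"), and adding a transformation means adding one entry to the table.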

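To check the second point, one option is to time each match and record the lines that take unusually long. The sketch below assumes Python 3. The names time_match and find_slow_lines, the 0.05-second threshold, and the toy pattern (a+)+$ are illustrative choices, not taken from the question. The toy pattern has the same general shape as the (.*\[=\].*)* groups in the question: a quantifier nested inside another quantifier, followed by something that can fail.

```python
import re
import time

# Toy pattern, not from the question: a quantifier nested inside another
# quantifier, followed by an anchor that can fail.
NESTED = re.compile(r'(a+)+$')


def time_match(pattern, text):
    """Return (elapsed_seconds, matched) for a single pattern.match call."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return time.perf_counter() - start, matched


def find_slow_lines(lines, pattern, threshold=0.05):
    """Return (line_number, elapsed_seconds) for lines at or above threshold."""
    slow = []
    for number, line in enumerate(lines, start=1):
        elapsed, _ = time_match(pattern, line)
        if elapsed >= threshold:
            slow.append((number, elapsed))
    return slow


if __name__ == "__main__":
    # Inputs that almost match force the engine to try many ways of
    # splitting the run of 'a' characters before it gives up.
    for n in (10, 14, 18):
        elapsed, matched = time_match(NESTED, 'a' * n + 'b')
        print(n, matched, round(elapsed, 4))
```

Running find_slow_lines over each input file with one of the real guard patterns should show whether a handful of lines account for most of the running time.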
answered Sep 11, 2017 at 7:38
\$\endgroup\$
