Efficient use of regular expression and string manipulation

Question 1

The following is my solution to Java vs C++. I think the way I have used the re library is inefficient, and possibly erroneous, as I am getting TLE (time limit exceeded).

import sys
import re

cpp = re.compile("^[a-z]([a-z_]*[a-z])*$")
java = re.compile("^[a-z][a-zA-Z]*$")
r = re.compile("[A-Z][a-z]*")
lines = [line for line in sys.stdin.read().splitlines() if line != ""]
for line in lines:
    if cpp.match(line):
        line = line.split("_")
        for i in range(1, len(line)):
            line[i] = line[i].capitalize()
        print "".join(line)
    elif java.match(line):
        namelist = r.findall(line)
        for name in namelist:
            line = line.replace(name, name.replace(name[0], "_" + name[0].lower()))
        print line
    else:
        print "Error!"

For instance, is there a better way to replace parts of a string in place, instead of creating a new string and copying, as I have done here:

line = line.replace(name, name.replace(name[0], "_" + name[0].lower()))

Or is there an entirely different approach?

Sean Perry · Answer 1 · 2014-05-28 18:51:29Z

When using the re module, write patterns as raw string literals (marked with a leading 'r'), as I do in the code below. Python then leaves backslash sequences such as \b or \d untouched, so they reach the regex engine unchanged instead of being processed as string escapes first.

Using findall both to test for a match and to extract the pieces gives more succinct code that still does exactly what we want in a clear fashion.
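
To illustrate why the raw-string prefix matters, here is a minimal sketch. It is written for Python 3, unlike the Python 2 snippets in this thread, and the sample text is made up for the demonstration. The patterns in this thread contain no backslashes, so the prefix changes nothing for them; it only starts to matter once a pattern uses something like \b.

```python
import re

# Without the r prefix, Python itself turns "\b" into a backspace character
# before the re module ever sees the pattern.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))              # 5 7
print(re.findall(plain, "a cat sat"))    # []
print(re.findall(raw, "a cat sat"))      # ['cat']
```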

By putting the logic in a function you get the ability to write unit tests or do something other than print the result.

Using an exception for the failure cases makes the code clear and is more Pythonic in style.

Here is how I would have coded it.

import re
import sys

cpp = re.compile(r"(_[a-z]*)")
java = re.compile(r"([A-Z]+[a-z]*)")


class InvalidStyle(Exception):
    pass


def style_transform(input):
    if input[0].isupper():
        raise InvalidStyle(input)

    m = cpp.findall(input)
    if m:
        # import pdb; pdb.set_trace()
        # a lone "_" means the underscore was not followed by a lower-case letter
        if any(re.match(r"^[A-Z_]$", w) for w in m):
            raise InvalidStyle(input)
        pos = input.find(m[0])
        # drop the leading "_" first: "_word".capitalize() would leave the word unchanged
        return input[:pos] + "".join(w[1:].capitalize() for w in m)

    m = java.findall(input)
    if m:
        # import pdb; pdb.set_trace()
        pos = input.find(m[0])
        words = [input[:pos]] + [w.lower() for w in m]
        return "_".join(words)

    if input.lower() == input:
        return input
    else:
        raise InvalidStyle(input)


if __name__ == "__main__":
    if len(sys.argv) == 2:  # allows for debugging via pdb
        fp = open(sys.argv[1])
    else:
        fp = sys.stdin
    for line in fp.readlines():
        line = line.strip()
        if not line:
            continue
        try:
            print style_transform(line)
        except InvalidStyle as e:
            # print e, uncomment to see errors
            print "Error!"
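
On the original question of replacing pieces of a string without chained replace calls: one option is re.sub with a function as the replacement. The sketch below is not the code from this answer. It is a separate Python 3 variant that borrows the names InvalidStyle and style_transform for continuity, and the two validation patterns are an assumption about what counts as a valid name in each style.

```python
import re


class InvalidStyle(Exception):
    """Raised when a name fits neither naming style."""


# Assumed validation rules: lower-case words joined by single underscores,
# or a lower-case letter followed by letters only.
CPP_NAME = re.compile(r"^[a-z]+(_[a-z]+)*$")
JAVA_NAME = re.compile(r"^[a-z][a-zA-Z]*$")


def style_transform(name):
    if CPP_NAME.match(name):
        # The callback receives each match object: "_x" becomes "X".
        return re.sub(r"_([a-z])", lambda m: m.group(1).upper(), name)
    if JAVA_NAME.match(name):
        # Each upper-case letter "X" becomes "_x".
        return re.sub(r"[A-Z]", lambda m: "_" + m.group(0).lower(), name)
    raise InvalidStyle(name)
```

Because the logic is a plain function that raises on bad input, it can be checked with a handful of assertions, for example that style_transform("long_name") returns "longName" and that style_transform("bad_Style") raises InvalidStyle.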

Janne Karila · Answer 2 · 2014-05-29 17:49:32Z

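The online judge probably rejects your solution because this regular expression causes catastrophic backtracking: ^[a-z]([a-z_]*[a-z])*$. The nested quantifiers let the engine split a run of letters between [a-z_]* and the outer repetition in exponentially many ways, and when the overall match fails it tries them all. Trying to match a string of 24 lowercase letters followed by a non-matching character takes two seconds on my computer. Using this instead takes only 6 microseconds: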
```
^[a-z]+(_[a-z]+)*$
```
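The difference can be reproduced with a rough timing sketch like the one below. It is written for Python 3, the variable names are arbitrary, and the input is shortened to 18 letters so the slow pattern finishes quickly; each extra letter roughly doubles its running time. Absolute numbers will vary by machine.

```python
import re
import timeit

slow = re.compile(r"^[a-z]([a-z_]*[a-z])*$")
fast = re.compile(r"^[a-z]+(_[a-z]+)*$")

# A run of letters followed by a character neither pattern accepts,
# so both matches fail, but only after the slow one has backtracked heavily.
subject = "a" * 18 + "!"

slow_time = timeit.timeit(lambda: slow.match(subject), number=1)
fast_time = timeit.timeit(lambda: fast.match(subject), number=1)
print("slow: %.6f s, fast: %.6f s" % (slow_time, fast_time))
```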
To simplify the generation of the underscore-separated string, make the r regex also recognize the first word, which does not begin with an upper-case letter:
```
r = re.compile("[A-Z]?[a-z]*")
```
Then use "_".join to construct the result. I added if s because the regex now also matches an empty string at the end of the line.
```
print "_".join(s.lower() for s in r.findall(line) if s)
```
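
To make the trailing empty match visible, here is a small Python 3 sketch of the same idea. The helper name to_underscores and the sample identifier are made up for illustration.

```python
import re

word = re.compile(r"[A-Z]?[a-z]*")


def to_underscores(name):
    # "if s" drops the empty match that findall reports at the end
    return "_".join(s.lower() for s in word.findall(name) if s)


parts = word.findall("anotherExample")
print(parts)                             # ['another', 'Example', '']
print(to_underscores("anotherExample"))  # another_example
```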
