python string manipulation and processing

Question 1

I have a number of codes which I need to process, and these come through in a number of different formats which I need to manipulate first to get them in the right format:

Examples of codes:

ABC1.12 - correct format
ABC 1.22 - space between letters and numbers
ABC1.12/13 - 2 codes joined together and leading 1. missing from 13, should be ABC1.12 and ABC1.13 
ABC 1.12 / 1.13 - codes joined together and spaces

I know how to remove the spaces but am not sure how to handle the codes which have been split. I know I can use the split function to create 2 codes but not sure how I can then append the letters (and first number part) to the second code. This is the 3rd and 4th example in the list above.

WHAT I HAVE SO FAR

val = "ABC1.12/13"  # code to process (third example above)
retList = [val]
if "/" in val:
    (code1, code2) = session_codes = val.split("/", 1)
    (initial_letters, numbers) = code1.split(".", 1)
    if initial_letters not in code2:
        code2 = initial_letters + '.' + code2
    # reset list so that it returns both values
    retList = [code1, code2]

This won't really handle the splits for 4 as the code2 becomes ABC1.1.13

Question 2

@John do you know that all numbers of the form AAA 12.3/66 should be interpreted as AAA: 12.3 and AAA:1.66? How do you know that the "leading one" is stripped from the 66?

Question 3

If there is a dot in the numbered part of the string, then both sides should start with the number(s) before the dot, followed by a dot, followed by the second set of numbers, e.g. XX1.11/12 would always be XX1.11 and XX1.12, not XX1.11 and XX12. If there is no dot in the string, we can assume there is no leading number, e.g. EFG10/12 would be EFG10 and EFG12.
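
The rule described above can be sketched roughly as follows. This is a minimal sketch rather than code from the thread: the function name `expand_code` and its regular expression are assumptions made for illustration, and it assumes every code is a run of letters followed by a number part.

```python
import re

def expand_code(raw):
    """Expand a possibly joined code such as 'ABC1.12/13' into full codes."""
    # Drop all whitespace, then separate the joined parts on "/".
    first, *rest = re.sub(r"\s+", "", raw).split("/")
    # Split the first part into its letters and its number part.
    match = re.fullmatch(r"([A-Za-z]+)(\d[\d.]*)", first)
    if match is None:
        raise ValueError(f"unrecognised code: {raw!r}")
    letters, number = match.groups()
    # With a dot in the first number, later parts inherit "<digits>.";
    # without a dot there is no leading number to inherit.
    prefix = number.split(".", 1)[0] + "." if "." in number else ""
    codes = [letters + number]
    for part in rest:
        # A part that already contains a dot (e.g. '1.13') is complete.
        codes.append(letters + (part if "." in part else prefix + part))
    return codes
```

Under these assumptions, `expand_code("ABC 1.12 / 1.13")` gives `['ABC1.12', 'ABC1.13']` and `expand_code("EFG10/12")` gives `['EFG10', 'EFG12']`.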

Question 4

You can use regex for this purpose

A possible implementation would be as follows

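>>> import re
>>> def foo(st):
...     # Remove spaces, then split joined codes on "/"
...     parts = st.replace(' ', '').split("/")
...     # Separate the leading letters from the first number part
...     parts = list(re.findall(r"^([A-Za-z]+)(.*)$", parts[0])[0]) + parts[1:]
...     # Split each number part on the dot
...     parts = parts[0:1] + [x.split('.') for x in parts[1:]]
...     # Parts without a dot borrow the leading number of the first part
...     parts = parts[0:1] + ['.'.join(x) if len(x) > 1 else '.'.join([parts[1][0], x[0]]) for x in parts[1:]]
...     return [parts[0] + p for p in parts[1:]]
...
>>> foo('ABC1.12')
['ABC1.12']
>>> foo('ABC 1.22')
['ABC1.22']
>>> foo('ABC1.12/13')
['ABC1.12', 'ABC1.13']
>>> foo('ABC 1.12 / 1.13')
['ABC1.12', 'ABC1.13']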

Question 5

Thanks, this is almost perfect. The only one which seems to be wrong is ABC1.12/13: I would like ABC1.13 rather than just ABC13.

Question 6

See the answer above for a detailed explanation.

Question 7

Are you familiar with regex? That would be an angle worth exploring here. Also, consider splitting on the space character, not just the slash and decimal.

Question 8

I suggest you write a regular expression for each code pattern and then form a larger regular expression which is the union of the individual ones.
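
One way to read this suggestion is sketched below: one pattern per code shape, joined with `|` into a single expression. The pattern names (`SINGLE`, `SHORT_JOIN`, `FULL_JOIN`) and the `classify` helper are hypothetical, chosen only for illustration, and the sketch assumes spaces are removed before matching.

```python
import re

# One pattern per code shape (spaces are removed before matching).
SINGLE = r"(?P<single>[A-Za-z]+\d+(?:\.\d+)?)"               # ABC1.12
SHORT_JOIN = r"(?P<short>[A-Za-z]+\d+(?:\.\d+)?(?:/\d+)+)"   # ABC1.12/13
FULL_JOIN = r"(?P<full>[A-Za-z]+\d+\.\d+(?:/\d+\.\d+)+)"     # ABC1.12/1.13

# The union of the individual patterns.
CODE_RE = re.compile("|".join([SINGLE, SHORT_JOIN, FULL_JOIN]))

def classify(raw):
    """Return the name of the pattern matching the whole code, or None."""
    match = CODE_RE.fullmatch(raw.replace(" ", ""))
    return match.lastgroup if match else None
```

Knowing which shape matched tells you whether the later parts still need the leading number restored (`short`) or are already complete (`full`).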

Question 9

Using PyParsing

The answer by @Abhijit is good, and for this simple problem a regex may be the way to go. However, when dealing with parsing problems, you'll often need a more extensible solution that can grow with your problem. I've found that pyparsing is great for that: you write the grammar and it does the parsing:

from pyparsing import *
index = Combine(Word(alphas))
# Define what a number is and convert it to a float
number = Combine(Word(nums)+Optional('.'+Optional(Word(nums))))
number.setParseAction(lambda x: float(x[0]))
# What do extra numbers look like?
marker = Word('/').suppress()
extra_numbers = marker + number
# Define what a possible line could be
line_code = Group(index + number + ZeroOrMore(extra_numbers))
grammar = OneOrMore(line_code)

From this definition we can parse the string:

S = '''ABC1.12
ABC 1.22
XXX1.12/13/77/32.
XYZ 1.12 / 1.13
'''
print(grammar.parseString(S))

Giving:

[['ABC', 1.12], ['ABC', 1.22], ['XXX', 1.12, 13.0, 77.0, 32.0], ['XYZ', 1.12, 1.13]]

Advantages:

The numbers are now in the correct format, as we cast them to floats during parsing. Many more "numbers" are handled: look at the index "XXX", where values of the form 1.12, 13 and 32. are all parsed, regardless of whether they contain a decimal point.

Question 10

Take a look at this method. It might be the simplest and yet the best way to do it.

val = input()  # Python 3: input() returns str, which supports isnumeric()
# Find the position of the first digit
for aChar in val:
    if aChar.isnumeric():
        lastIndex = val.index(aChar)
        break
part1 = val[:lastIndex].strip()  # leading letters
part2 = val[lastIndex:]          # number part(s)
if "/" not in part2:
    print(part1 + part2)
else:
    if " " not in part2:
        # e.g. 1.12/13: reuse the number before the dot for every part
        codes = []
        divPart2 = part2.split(".")
        partCodes = divPart2[1].split("/")
        for aPart in partCodes:
            codes.append(part1 + divPart2[0] + "." + aPart)
        print(codes)
    else:
        # e.g. 1.12 / 1.13: each part is already complete
        codes = []
        divPart2 = part2.split("/")
        for aPart in divPart2:
            aPart = aPart.strip()
            codes.append(part1 + aPart)
        print(codes)

Accepted answer (the regex answer under Question 4) by Abhijit · 2012-03-26 13:25:38Z
