7

I'm trying to create a function (in Python) that takes its input (a chemical formula) and splits it into a list. For example, if the input was "HC2H3O2", it would turn it into:

molecule_list = ['H', 1, 'C', 2, 'H', 3, 'O', 2]

This works well so far, but if I input an element with a two-letter symbol, for example sodium (Na), it splits it into:

['N', 'a']

I'm searching for a way to make my function look through the string for keys found in a dictionary called elements. I'm also considering using regex for this, but I'm not sure how to implement it. This is what my function is right now:

def split_molecule(inputted_molecule):
    """Take the input and split it into a list
    eg: CO2 => ['C', 1, 'O', 2]
    """
    # step 1: convert inputted_molecule to a list
    # step 2a: if there are two periodic elements next to each other, insert a '1'
    # step 2b: if the last element is an element, append a '1'
    # step 3: convert all numbers in list to ints
    # step 1:
    # problem: it splits Na into 'N', 'a'
    # it needs to split by periodic elements
    molecule_list = list(inputted_molecule)
    # because at most, the list can double when "1" is inserted
    max_length_of_molecule_list = 2*len(molecule_list)
    # step 2a:
    for i in range(0, max_length_of_molecule_list):
        try:
            if (molecule_list[i] in elements) and (molecule_list[i+1] in elements):
                molecule_list.insert(i+1, "1")
        except IndexError:
            break
    # step 2b:
    if (molecule_list[-1] in elements):
        molecule_list.append("1")
    # step 3:
    for i in range(0, len(molecule_list)):
        if molecule_list[i].isdigit():
            molecule_list[i] = int(molecule_list[i])
    return molecule_list
asked Mar 20, 2012 at 7:27

3 Answers

6

How about

import re
print(re.findall(r'[A-Z][a-z]?|[0-9]+', 'Na2SO4MnO4'))

result

['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']

Regex explained:

Find everything that is either
 [A-Z]   # A, B, ... Z, i.e. an uppercase letter
 [a-z]   # followed by a, b, ... z, i.e. a lowercase letter
 ?       # which is optional
 |       # or
 [0-9]   # 0, 1, 2 ... 9, i.e. a digit
 +       # and perhaps some more of them

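This expression is pretty dumb since it accepts arbitrary "elements", like "Xy". You can improve it by replacing the [A-Z][a-z]? part with the actual list of element symbols, separated by |, like Ba|Na|Mn...|C|O. Put the two-letter symbols first, because alternatives are tried left to right and N would otherwise match before Na gets a chance.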
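As a rough sketch of that improvement, the pattern could be built from the keys of the elements dictionary mentioned in the question. The dictionary below is only a tiny hypothetical stand-in (the real one isn't shown in the question), and tokenize_formula is a name made up for this example.

```python
import re

# Hypothetical stand-in for the asker's `elements` dictionary (symbol -> name);
# the real one would cover the whole periodic table.
elements = {'H': 'hydrogen', 'C': 'carbon', 'N': 'nitrogen', 'O': 'oxygen',
            'Na': 'sodium', 'S': 'sulfur', 'Mn': 'manganese'}

# Longest symbols first, so that 'Na' is tried before 'N'.
symbols = sorted(elements, key=len, reverse=True)
token_re = re.compile('|'.join(map(re.escape, symbols)) + r'|[0-9]+')

def tokenize_formula(formula):
    """Return the known element symbols and numbers found in `formula`, in order."""
    return token_re.findall(formula)

print(tokenize_formula('Na2SO4MnO4'))
# ['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
```

Keep in mind that findall silently skips anything that matches no alternative, so with this dictionary 'Xy2' gives just ['2'] rather than an error. If unknown symbols should be rejected, you'd have to check that the joined tokens reproduce the input.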

Of course, regular expressions can only handle very simple formulas. To parse something like

 8(NH4)3P4Mo12O40 + 64NaNO3 + 149NH4NO3 + 135H2O

you're going to need a real parser, e.g. pyparsing (be sure to check "chemical formulas" under "Examples"). Good luck!
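To give an idea of what "a real parser" involves, here is a minimal hand-written recursive-descent sketch. It is not pyparsing and not its example; count_atoms and its helpers are names invented for illustration. It assumes a single formula with optional nested parentheses, so it doesn't handle the leading coefficients or the " + " between terms in the line above.

```python
import re
from collections import Counter

# Tokens: an element-like symbol, a number, or a parenthesis.
TOKEN_RE = re.compile(r'[A-Z][a-z]?|[0-9]+|[()]')

def count_atoms(formula):
    """Count atoms in one formula that may contain nested parentheses."""
    tokens = TOKEN_RE.findall(formula)
    if ''.join(tokens) != formula:
        raise ValueError('unexpected characters in %r' % formula)
    pos = 0

    def read_multiplier():
        # Optional number after a symbol or a closing parenthesis; default 1.
        nonlocal pos
        if pos < len(tokens) and tokens[pos].isdigit():
            pos += 1
            return int(tokens[pos - 1])
        return 1

    def parse_group():
        # Parse symbols and parenthesised sub-groups until ')' or end of input.
        nonlocal pos
        counts = Counter()
        while pos < len(tokens) and tokens[pos] != ')':
            token = tokens[pos]
            pos += 1
            if token == '(':
                inner = parse_group()
                if pos >= len(tokens):
                    raise ValueError('missing closing parenthesis')
                pos += 1  # skip the ')'
                n = read_multiplier()
                for symbol, count in inner.items():
                    counts[symbol] += count * n
            elif token.isdigit():
                raise ValueError('unexpected number %s' % token)
            else:
                counts[token] += read_multiplier()
        return counts

    result = parse_group()
    if pos != len(tokens):
        raise ValueError('unbalanced closing parenthesis')
    return dict(result)

print(count_atoms('(NH4)3P4Mo12O40'))
# {'N': 3, 'H': 12, 'P': 4, 'Mo': 12, 'O': 40}
```

The recursion is what a plain regex can't do: each "(" starts a nested group whose counts are multiplied by the number following the matching ")". Like the simple regex, this sketch accepts any symbol-shaped token, so it doesn't check that the elements exist.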

answered Mar 20, 2012 at 7:36
2
  • That's brilliant, thank you! Would you mind explaining the regex? Commented Mar 20, 2012 at 7:39
  • +1 for mentioning that you would need a real parser, instead of a regex parser Commented Mar 20, 2012 at 10:55
2

An expression like this will match all parts of interest:

[A-Z][a-z]*|\d+

You can use it with re.findall and then add a count of 1 for atoms that have none.
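A minimal sketch of that first option, reusing the function name from the question and assuming the goal is the mixed list of symbols and ints shown there:

```python
import re

def split_molecule(formula):
    """Split e.g. 'HC2H3O2' into ['H', 1, 'C', 2, 'H', 3, 'O', 2]."""
    result = []
    for token in re.findall(r'[A-Z][a-z]*|\d+', formula):
        if token.isdigit():
            result.append(int(token))
        else:
            # A symbol directly after another symbol means the previous
            # atom had no explicit count, so give it a 1.
            if result and isinstance(result[-1], str):
                result.append(1)
            result.append(token)
    # The last atom may also be missing its count.
    if result and isinstance(result[-1], str):
        result.append(1)
    return result

print(split_molecule('NaHC2H3O2'))
# ['Na', 1, 'H', 1, 'C', 2, 'H', 3, 'O', 2]
```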

Or you could use a regex for that as well:

import re
molecule = 'NaHC2H3O2'
print(re.findall(r'[A-Z][a-z]*|\d+', re.sub(r'[A-Z][a-z]*(?![\da-z])', r'\g<0>1', molecule)))

Output:

['Na', '1', 'H', '1', 'C', '2', 'H', '3', 'O', '2']

The sub adds a 1 after all atoms not followed by a number.

answered Mar 20, 2012 at 7:56
1

Here is a non-regex approach. It's a bit hackish and probably not the best, but it works:

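import string
formula = 'HC2H3O2Na'
m_list = list()
for i, x in enumerate(formula):
    if x in string.ascii_lowercase:
        # Use the position i here: formula.index(x) returns the first
        # occurrence of x and picks the wrong neighbour for repeated letters.
        m_list.append(formula[i-1]+x)
        _ = m_list.pop(len(m_list)-2)
    else:
        m_list.append(x)
print(m_list)

Output:

['H', 'C', '2', 'H', '3', 'O', '2', 'Na']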
answered Mar 20, 2012 at 7:57
