String parsing with multiple delimiters

Question 1

My data is in this format:

龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n

And I want to return:

('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')

In C I could do this in one line with sscanf, but I seem to be f̶a̶i̶l̶i̶n̶g̶ writing code like a schoolkid with Python:

def parse_line(line):
    working = line.rstrip().split(" ")
    trad, simp = working[0], working[1]
    working = " ".join(working[2:]).split("]")
    pinyin = working[0][1:]
    english = working[1][1:]
    return trad, simp, pinyin, english

Can I improve?

Question 2

This is a little hard to parse because the logical field separator, the space character, is also a valid character inside the last two fields. The brackets and slashes disambiguate it, but that obviously makes the parsing harder and uglier.

Question 3

If your code doesn't work correctly, then this question is off topic here. See the FAQ.

Question 4

@svick it works perfectly - by "failing" I meant "failing to write neat code"

Question 5

You can use regular expressions with the re module. For example, the following regular expression works with both binary strings and Unicode strings (I'm not sure which version of Python you use).

For Python 2.7.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> u = s.decode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('\xe9\xbe\x8d\xe8\x88\x9f', '\xe9\xbe\x99\xe8\x88\x9f', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", u).groups()
(u'\u9f8d\u821f', u'\u9f99\u821f', u'long2 zhou1', u'/dragon boat/imperial boat/')

For Python 3.2.3:

>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> b = s.encode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(br"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", b).groups()
(b'\xe9\xbe\x8d\xe8\x88\x9f', b'\xe9\xbe\x99\xe8\x88\x9f', b'long2 zhou1', b'/dragon boat/imperial boat/')
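As a sketch building on the pattern above, the same regular expression can be given named groups, so that the fields are available as a dictionary through groupdict(). The group names trad, simp, pinyin and english and the helper parse_entry are illustrative assumptions, not part of the original answer, and the example targets Python 3.

```python
import re

# Hypothetical group names; the pattern itself mirrors the one used above.
ENTRY_RE = re.compile(
    r"^(?P<trad>[^ ]+) (?P<simp>[^ ]+) \[(?P<pinyin>[^]]+)\] (?P<english>.+)"
)


def parse_entry(line):
    """Return the four fields as a dict, or None if the line does not match."""
    match = ENTRY_RE.match(line)
    return match.groupdict() if match else None


entry = parse_entry("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n")
print(entry["pinyin"])  # long2 zhou1
```

Compiling the pattern once is a small convenience when many lines are parsed. Returning None on a mismatch is one possible design; raising an exception would work just as well.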

Question 6

groups() and groupdict() are great!

Question 7

My goal here is clarity above all else. I think a first step is to use the maxsplit argument of split to get the first two pieces and the remainder:

trad, simp, remainder = line.rstrip().split(' ', 2)

Now, to parse the leftovers, I'm afraid I only see slightly ugly choices. Some people like regular expressions and others hate them. Without regular expressions, I think it's easiest to view the remainder as two fields separated by "] ":

pinyin, english = remainder.split("] ")
pinyin = pinyin[1:] # get rid of leading '['
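Put together, the two steps above might look like the following sketch. The function name parse_line, the maxsplit of 1 on the second split and the explicit check for the leading '[' are assumptions added for illustration.

```python
def parse_line(line):
    """Split one entry into (trad, simp, pinyin, english)."""
    # The first two fields never contain spaces, so split them off first.
    trad, simp, remainder = line.rstrip().split(" ", 2)
    # The remainder looks like "[pinyin] /english/"; split at the first "] " only.
    pinyin, english = remainder.split("] ", 1)
    if not pinyin.startswith("["):
        raise ValueError("expected '[' before the pinyin field: %r" % line)
    return trad, simp, pinyin[1:], english


print(parse_line("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"))
```

A line with too few fields makes the tuple unpacking raise ValueError, so malformed input is not silently accepted.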

Question 8

I would split around the square brackets first

def parse_string(s):
    # maxsplit=1 yields exactly two pieces, even if a later field contains brackets
    a, b = s.rstrip().split(' [', 1)
    return a.split(' ') + b.split('] ', 1)

or more explicitly

def parse_string(s):
    first, rest = s.rstrip().split(' [', 1)
    # unpacking raises ValueError unless there are exactly two fields
    trad, simp = first.split(' ')
    pinyin, english = rest.split('] ', 1)
    return trad, simp, pinyin, english
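If named access to the fields is wanted, the explicit version could return a typing.NamedTuple instead of a plain tuple. This is only a sketch: the class name Entry and the function name parse_entry are hypothetical, and it assumes Python 3.6 or later.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Hypothetical record type for one parsed line."""
    trad: str
    simp: str
    pinyin: str
    english: str


def parse_entry(s):
    # Split around the square brackets first, then split the leading pair.
    first, rest = s.rstrip().split(" [", 1)
    trad, simp = first.split(" ")
    pinyin, english = rest.split("] ", 1)
    return Entry(trad, simp, pinyin, english)


entry = parse_entry("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n")
print(entry.pinyin)  # long2 zhou1
```

Because a NamedTuple is still a tuple, callers that unpack four values keep working unchanged.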

Question 9

You could perhaps try the parse module, which you need to install from PyPI. It's intended to work as the inverse of the format method.

>>> import parse
>>> data = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> match = "{} {} [{}] {}\n"
>>> result = parse.parse(match, data)
>>> print(result)
<Result ('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/') {}>
>>> print(result[0])
龍舟

If you want to be able to access the result as a dictionary, you could name each of the parameters inside the brackets:

>>> import parse
>>> data = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> match = "{trad} {simp} [{pinyin}] {english}\n"
>>> result = parse.parse(match, data)
>>> print(result)
<Result () {'english': '/dragon boat/imperial boat/', 'trad': '龍舟', 'simp': '龙舟', 'pinyin': 'long2 zhou1'}>
>>> print(result['trad'])
龍舟

Question 10

Is there some way to specify that the first two groups shouldn't contain spaces? Because I think unexpected input should cause an error, not silently return unexpected results.
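One way to get that behaviour with only the standard library is sketched below. It uses the re module rather than any feature of the parse module, and it assumes Python 3.4 or later for fullmatch. The name parse_strict is hypothetical, and the requirement that the last field starts and ends with a slash is an assumption taken from the sample line.

```python
import re

# \S+ rejects any whitespace inside the first two fields; fullmatch rejects
# leading or trailing text that the pattern does not account for.
STRICT_RE = re.compile(r"(\S+) (\S+) \[([^\]]+)\] (/.+/)")


def parse_strict(line):
    """Return (trad, simp, pinyin, english) or raise ValueError."""
    match = STRICT_RE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("malformed entry: %r" % line)
    return match.groups()


print(parse_strict("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"))
```

A line with a doubled space or an extra word before the bracket no longer matches, so it raises ValueError instead of returning shifted fields.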

hdima · Accepted Answer · 2013-02-10 14:05:11Z

groups() and groupdict() are great! – jsj, Feb 10, 2013 at 18:21
