My data is in this format:
龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n
And I want to return:
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
In C I could do this in one line with sscanf, but I seem to be f̶a̶i̶l̶i̶n̶g̶ writing code like a schoolkid with Python:
working = line.rstrip().split(" ")
trad, simp = working[0], working[1]
working = " ".join(working[2:]).split("]")
pinyin = working[0][1:]
english = working[1][1:]
return trad, simp, pinyin, english
Can I improve?
This is a little hard to parse because the logical field separator, the space character, is also a valid character inside the last two fields. This is disambiguated with brackets and slashes, but that obviously makes the parse harder and uglier. – President James K. Polk, Feb 10, 2013 at 13:04
If your code doesn't work correctly, then this question is off topic here. See the FAQ. – svick, Feb 10, 2013 at 16:08
@svick it works perfectly - by "failing" I meant "failing to write neat code" – jsj, Feb 10, 2013 at 17:07
4 Answers
You can use regular expressions with the re module. For example, the following regular expression works with both byte strings and Unicode strings (I'm not sure which version of Python you use).
For Python 2.7.3:
>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> u = s.decode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('\xe9\xbe\x8d\xe8\x88\x9f', '\xe9\xbe\x99\xe8\x88\x9f', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", u).groups()
(u'\u9f8d\u821f', u'\u9f99\u821f', u'long2 zhou1', u'/dragon boat/imperial boat/')
For Python 3.2.3:
>>> s = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> b = s.encode("utf-8")
>>> import re
>>> re.match(r"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", s).groups()
('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/')
>>> re.match(br"^([^ ]+) ([^ ]+) \[([^]]+)\] (.+)", b).groups()
(b'\xe9\xbe\x8d\xe8\x88\x9f', b'\xe9\xbe\x99\xe8\x88\x9f', b'long2 zhou1', b'/dragon boat/imperial boat/')
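If you would rather refer to the fields by name than by position, the same pattern can be written with named groups and read back with groupdict(). This is only a sketch for Python 3 text strings: ENTRY_RE and parse_entry are names I made up for illustration, and raising ValueError on a non-matching line is my assumption about how bad input should be handled.

```python
import re

# ENTRY_RE and parse_entry are hypothetical names used for illustration.
# The pattern is the one from the answer, with each group given a name.
ENTRY_RE = re.compile(
    r"^(?P<trad>[^ ]+) (?P<simp>[^ ]+) \[(?P<pinyin>[^]]+)\] (?P<english>.+)"
)


def parse_entry(line):
    """Return the four fields as a dict, or raise ValueError on a mismatch."""
    m = ENTRY_RE.match(line)
    if m is None:
        # Assumption: a line that does not fit the format is an error.
        raise ValueError("unrecognised line: %r" % line)
    return m.groupdict()


entry = parse_entry("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n")
print(entry["pinyin"])  # prints: long2 zhou1
```

Compiling the pattern once is mainly a readability choice here; it also avoids repeating the long pattern string at every call site.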
groups() and groupdict() are great! – jsj, Feb 10, 2013 at 18:21
My goal here is clarity above all else. I think a first step is to use the maxsplit argument of split to get the first two pieces and the remainder:
trad, simp, remainder = line.rstrip().split(' ', 2)
Now, to parse the leftovers I'm afraid I only see slightly ugly choices. Some people like regular expressions and others hate them. Without regular expressions, I think it's easiest to view the remainder as two fields separated by "] ":
pinyin, english = remainder.split("] ")
pinyin = pinyin[1:] # get rid of leading '['
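Put together, the two steps might look like the sketch below. The function name parse_line is hypothetical, and the maxsplit=1 on the second split and the explicit check for the opening bracket are my additions, not part of the snippets above.

```python
def parse_line(line):
    # parse_line is a hypothetical name; it combines the two steps above.
    trad, simp, remainder = line.rstrip().split(" ", 2)
    # maxsplit=1 (an addition) leaves any later "] " in the English field alone.
    pinyin, english = remainder.split("] ", 1)
    if not pinyin.startswith("["):
        # Added check: fail loudly instead of silently dropping a character.
        raise ValueError("expected '[' before the pinyin: %r" % line)
    return trad, simp, pinyin[1:], english


print(parse_line("龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"))
```

A line with too few pieces fails at the tuple unpacking with a ValueError, so malformed input is not returned as if it were valid.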
I would split around the square brackets first
def parse_string(s):
    # maxsplit=1: split only at the first ' [' so exactly two parts come back
    a, b = s.rstrip().split(' [', 1)
    return tuple(a.split(' ') + b.split('] ', 1))
or more explicitly
def parse_string(s):
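    # maxsplit=1: split only at the first ' [' so exactly two parts come back
    first, rest = s.rstrip().split(' [', 1)
    # no maxsplit here: an extra space-separated word raises ValueError
    trad, simp = first.split(' ')
    pinyin, english = rest.split('] ', 1)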
return trad, simp, pinyin, english
You could perhaps try using the parse module, which you need to install from PyPI. It's intended to work as the inverse of the format method.
>>> import parse
>>> data = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> match = "{} {} [{}] {}\n"
>>> result = parse.parse(match, data)
>>> print(result)
<Result ('龍舟', '龙舟', 'long2 zhou1', '/dragon boat/imperial boat/') {}>
>>> print(result[0])
龍舟
If you want to be able to access the result as a dictionary, you could name each of the parameters inside the brackets:
>>> import parse
>>> data = "龍舟 龙舟 [long2 zhou1] /dragon boat/imperial boat/\n"
>>> match = "{trad} {simp} [{pinyin}] {english}\n"
>>> result = parse.parse(match, data)
>>> print(result)
<Result () {'english': '/dragon boat/imperial boat/', 'trad': '龍舟', 'simp': '龙舟', 'pinyin': 'long2 zhou1'}>
>>> print(result['trad'])
龍舟
Is there some way to specify that the first two groups shouldn't contain spaces? Because I think unexpected input should cause an error, not silently return unexpected results. – svick, Feb 10, 2013 at 17:38
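One library-independent way to get that behaviour is to check the fields after parsing. The sketch below assumes the named fields are available as a plain dict (for parse, that would be the result's named attribute). The helper name require_no_spaces and the choice of ValueError are mine, for illustration only.

```python
def require_no_spaces(fields, names=("trad", "simp")):
    """Raise ValueError if any of the named fields contains whitespace.

    Hypothetical helper: `fields` is any mapping of field name to string.
    """
    for name in names:
        value = fields[name]
        if any(ch.isspace() for ch in value):
            raise ValueError("field %r contains whitespace: %r" % (name, value))
    return fields


good = {"trad": "龍舟", "simp": "龙舟", "pinyin": "long2 zhou1",
        "english": "/dragon boat/imperial boat/"}
require_no_spaces(good)  # passes: only pinyin and english contain spaces
```

The check is deliberately limited to the first two fields, since spaces are legitimate inside the pinyin and English fields.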