I got a file which includes many lines of plain utf-8 text. Such as below, by the by, it's Chinese.
PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012年2月29日 10:13:08
The file itself was saved in utf-8 format. file name is xx.txt
here is my python code, env is python2.7
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
match = pattern.match(line.decode('utf-8'))
if match:
print match.group()
The problematic thing here is I got no results.
I wanna get the decimal string from 交易金额:0.01元, in here, which is 0.01.
Why doesn't this code work? Can anyone explain it to me, I got no clue whatsoever.
4 Answers 4
There are several issues with your code. First you should use re.compile(ur'<unicode string>'). Also it is nice to add re.UNICODE flag (not sure if really needed here though). Next one is that still you will not receive a match since \d+ doesn't handle decimals just a series of numbers, you should use \d+\.?\d+ instead (you want number, probably a dot and a number). Example code:
#coding: utf-8
text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012年2月29日 10:13:08"
import re
pattern = re.compile(ur'交易金额:(\d+\.?\d+)元', re.UNICODE)
print pattern.search(text).group(1)
Comments
You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
1 Comment
If you use utf-8, you can use flags=re.LOCALE
#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+\.?\d+)元', flags=re.LOCALE)
for line in open('xx.txt'):
match = pattern.match(line)
More details, see re.LOCALE. There is no need to convert utf-8 to unicode.
Comments
Your code has two small mistakes:
- Using a
bytesregex on aunicodestring. - Missing decimal point in the regex.
When we fix the above mistakes, we get the following code.
#coding: utf-8
import re
pattern = re.compile(r'交易金额:([\.\d]+)元')
for line in open('xx.txt'):
match = pattern.search(line)
if match:
print (match.groups())
In Python 2, it works because the regex and file are both byte strings. In Python 3, it works because both are unicode strings. This is a great example of how decoding the UTF-8 is unnecessary in 99% of programs (yet some languages insist on suggesting it).