11

I have a file containing many lines of plain UTF-8 text, like the one below. The text is Chinese, by the way.

PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012年2月29日 10:13:08

The file itself is saved as UTF-8, and its name is xx.txt.

Here is my Python code; the environment is Python 2.7.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+)元')
for line in open('xx.txt'):
    match = pattern.match(line.decode('utf-8'))
    if match:
        print match.group()

The problem is that I get no results.

I want to extract the decimal string from 交易金额:0.01元, which here is 0.01.

Why doesn't this code work? Can anyone explain it to me? I have no clue.

ThiefMaster
asked May 11, 2012 at 6:25

4 Answers

19

There are several issues with your code. First, the pattern should be a unicode string, because the line you match against has been decoded to unicode: use re.compile(ur'<unicode string>'). You can also add the re.UNICODE flag, although it is probably not needed here. Second, \d+ matches only a run of digits, so it cannot match a decimal such as 0.01; use \d+(?:\.\d+)? instead (digits, optionally followed by a dot and more digits). Third, use search() rather than match(), because the amount is not at the start of the line. Example code:

#coding: utf-8
import re
text = u"PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 订单号:W12022910079166 交易金额:0.01元 交易状态:true 2012年2月29日 10:13:08"
pattern = re.compile(ur'交易金额:(\d+(?:\.\d+)?)元', re.UNICODE)
print pattern.search(text).group(1)
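To see how the choice of number pattern matters, here is a small sketch that runs three candidate patterns over a few amounts. It assumes Python 3, where plain string literals are already Unicode (on Python 2.7 they would need the u/ur prefixes). The sample amounts other than 0.01 and the helper name extract are made up for illustration.

```python
import re

# Three candidate patterns for the amount (names are illustrative only).
candidates = {
    "digits only": r"交易金额:(\d+)元",
    "optional dot between digit runs": r"交易金额:(\d+\.?\d+)元",
    "optional fractional part": r"交易金额:(\d+(?:\.\d+)?)元",
}

# Only the first sample comes from the question; the others are invented.
samples = ["交易金额:0.01元", "交易金额:5元", "交易金额:120元"]


def extract(pattern_text, text):
    """Return the captured amount, or None when the pattern does not match."""
    found = re.search(pattern_text, text)
    return found.group(1) if found else None


results = {
    name: [extract(pattern_text, sample) for sample in samples]
    for name, pattern_text in candidates.items()
}

for name, values in results.items():
    print(name, values)
```

Only the third pattern handles all three samples: the first cannot cross the decimal point, and the second needs at least two digits, so it misses a single-digit amount.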
answered May 11, 2012 at 6:45

5

You need to use .search() since .match() is like starting your regex with ^, i.e. it only checks at the beginning of the string.
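A minimal sketch of the difference, assuming Python 3 (where string literals are already Unicode) and using the sample line from the question. The pattern is one possible way to capture the amount, not necessarily the one you will end up with.

```python
import re

# Sample log line from the question.
line = ("PROCESS:类型:关爱积分[NOTIFY] 交易号:2012022900000109 "
        "订单号:W12022910079166 交易金额:0.01元 交易状态:true "
        "2012年2月29日 10:13:08")

# Illustrative pattern: digits with an optional fractional part.
pattern = re.compile(r"交易金额:(\d+(?:\.\d+)?)元")

anchored = pattern.match(line)   # only tries position 0, where "PROCESS" sits
scanned = pattern.search(line)   # tries every position in the line

print(anchored)          # None
print(scanned.group(1))  # 0.01
```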

answered May 11, 2012 at 6:27

1 Comment

Still not working. Can you provide your code for this little task? Much appreciated.
1

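If the pattern and the file are both UTF-8 byte strings (as in Python 2), you can match them directly, optionally with flags=re.LOCALE. Note that re.LOCALE only changes how \w, \b, \s and case-insensitive matching behave, so it is not required for this pattern.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:(\d+(?:\.\d+)?)元', flags=re.LOCALE)
for line in open('xx.txt'):
    match = pattern.search(line)  # search(), because the amount is mid-line
    if match:
        print match.group(1)

For more details, see the documentation for re.LOCALE. There is no need to convert UTF-8 to unicode here, because the pattern and each line are both byte strings.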

answered Oct 31, 2016 at 10:22


0

Your code has three small mistakes:

  1. Using a bytes regex on a unicode string.
  2. Missing decimal point in the regex.
  3. Using match(), which only matches at the start of the line, instead of search().

Fixing these gives the following code.

#coding: utf-8
import re
pattern = re.compile(r'交易金额:([\.\d]+)元')
for line in open('xx.txt'):
    match = pattern.search(line)
    if match:
        print (match.groups())

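In Python 2, it works because the regex and the lines read from the file are both byte strings. In Python 3, it works because both are unicode strings, provided open() decodes the file as UTF-8 (pass encoding='utf-8' to be explicit, since the default can depend on the platform). The point is that the pattern and the text must be the same type; explicitly decoding the UTF-8 yourself is often unnecessary.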
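The same-type rule can be checked with a short sketch, assuming Python 3. The variable names are illustrative, and the text is a fragment of the sample line from the question.

```python
import re

text = "交易金额:0.01元"        # str (Unicode) in Python 3
data = text.encode("utf-8")     # the same content as UTF-8 bytes

# One pattern per type: a str pattern and its UTF-8 encoded bytes twin.
str_pattern = re.compile(r"交易金额:([.\d]+)元")
bytes_pattern = re.compile("交易金额:([.\\d]+)元".encode("utf-8"))

str_result = str_pattern.search(text).group(1)       # str against str
bytes_result = bytes_pattern.search(data).group(1)   # bytes against bytes

# Mixing the two types is rejected outright in Python 3.
try:
    str_pattern.search(data)
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(str_result, bytes_result, type(mixed_error).__name__)
```

Python 3 raises a TypeError for the mixed case. As far as I know, Python 2 does not raise here and simply fails to match, which would explain why the code in the question produced no output rather than an error.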

answered Feb 27, 2024 at 4:20
