I am new to this forum (this is my first question), so please bear with me. I am scraping a website in Swedish. It's using the ISO-8859-1 charset.
In the source it might look something like this:
<div class="fl icon-post-old"></div>
2015-11-13, 15:09
<a href="
Let's say I want to grab the date and time (this is not a real example).
threadcode=opener.open(threadurl).read()
threadcode2=threadcode.decode("ISO-8859-1")
post=re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a',str(threadcode2))
post2=re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a',str(threadcode))
print (post) #this is blank
print (post2) #this works fine
So, if I search the nicely readable, decoded Swedish string (threadcode2), the search returns nothing. However, if I run the same search on the string representation of the raw bytes (which is not very useful), it works.
Anyone of you nice programmers out there who knows what's going on here?
I can also add, if it helps, that in some cases the search actually works. For example:
post=re.findall(r'Jag vill(.*?)bil',str(threadcode2))
This would work...
I am very confused.
Your strings must be unicode strings; here is a good starting point: stackoverflow.com/questions/1327731/… – user2390182, Nov 26, 2015 at 11:33
2 Answers
Nothing to do with Swedish. I think re is borking on the multiline. If you do something like:
post = re.findall(
    r'<div class="fl icon-post-old"></div>(.*?)<a',
    threadcode2.replace('\n', '')
)
You'll get your expected result.
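As a rough sketch of why this happens, assuming Python 3: by default "." does not match a newline, so the decoded text fails wherever a line break sits between the div and the link. Calling str() on the raw bytes gives their repr, in which every newline is written as the two characters backslash and "n", which is why the search on the undecoded data appears to work. Passing re.DOTALL is an alternative to stripping the newlines. The sample page below is made up for illustration and is not the real site.

```python
import re

# Hypothetical stand-in for the downloaded page (ISO-8859-1 bytes).
raw = ('<div class="fl icon-post-old"></div>\n'
       '2015-11-13, 15:09\n'
       '<a href="#">Jag vill köpa en bil</a>').encode("ISO-8859-1")
text = raw.decode("ISO-8859-1")

pattern = r'<div class="fl icon-post-old"></div>(.*?)<a'

# "." stops at real newlines by default, so nothing matches here.
default_hits = re.findall(pattern, text)

# re.DOTALL lets "." cross newlines; the text itself is left untouched.
dotall_hits = re.findall(pattern, text, flags=re.DOTALL)

# str() on bytes returns the repr, where each newline is a literal
# backslash followed by "n", so "." can match straight through it.
repr_hits = re.findall(pattern, str(raw))

print(default_hits)
print(dotall_hits)
print(repr_hits)
```

The captured group from the DOTALL search still contains the surrounding newlines, so a .strip() on each result is probably wanted before using it.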
You should pass the re.UNICODE flag when passing unicode strings into re.findall:
post=re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a',threadcode2, flags=re.UNICODE)
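One caveat, assuming Python 3: str patterns are already compiled with re.UNICODE, and the flag governs character classes such as \w rather than whether "." matches a newline. For a page where the match spans several lines it may therefore need to be combined with re.DOTALL. The small check below uses a made-up sample sentence to show what the flag does affect.

```python
import re

# Hypothetical sample text with Swedish letters.
sample = "Jag vill köpa två bilar"

# On Python 3, a str pattern is compiled with re.UNICODE already set.
unicode_is_default = bool(re.compile(r"\w+").flags & re.UNICODE)

# With Unicode matching, \w covers letters such as "ö" and "å".
unicode_words = re.findall(r"\w+", sample, flags=re.UNICODE)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so words split at "ö" and "å".
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_is_default)
print(unicode_words)
print(ascii_words)
```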