I am new to this forum (this is my first question), so please bear with me. I am scraping a website in Swedish. It's using the ISO-8859-1 charset.

In the source it might look something like this:

<div class="fl icon-post-old"></div>
 2015年11月13日, 15:09
 <a href="

Let's say I want to grab the date and time (this is not a real example).

import re

# opener and threadurl are set up elsewhere (not shown)
threadcode = opener.open(threadurl).read()     # raw bytes from the server
threadcode2 = threadcode.decode("ISO-8859-1")  # decoded text
post = re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a', str(threadcode2))
post2 = re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a', str(threadcode))
print(post)   # this is blank
print(post2)  # this works fine

So searching the decoded, nicely readable Swedish text (threadcode2) does not seem to work. However, the same search on the undecoded data (threadcode), whose escaped representation is not very useful, works.

Does anyone know what's going on here?

If it helps, in some cases the search on the decoded text actually works. For example:

post=re.findall(r'Jag vill(.*?)bil',str(threadcode2))

This would work...

I am very confused.

A. Campbell
asked Nov 26, 2015 at 11:23

2 Answers


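This has nothing to do with Swedish. I think re is failing because the text you want to capture spans several lines: by default, . does not match a newline character. If you strip the newlines first, like this: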

post=re.findall(
 r'<div class="fl icon-post-old"></div>(.*?)<a',
 threadcode2.replace('\n','')
)

You'll get your expected result.
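
If you would rather keep the newlines, the re.DOTALL flag makes . match them as well. Below is a minimal sketch, assuming Python 3; the page fragment is made up for illustration and only modelled on the snippet in the question. It also suggests why the search on str(threadcode) appeared to work: the string representation of a bytes object escapes each newline as a literal backslash followed by n, so there is no real newline for . to stumble over.

```python
import re

PATTERN = r'<div class="fl icon-post-old"></div>(.*?)<a'

# Hypothetical page fragment, modelled on the sample in the question.
raw = b'<div class="fl icon-post-old"></div>\n 2015-11-13, 15:09\n <a href="'
text = raw.decode("ISO-8859-1")

# Without flags, '.' stops at the newline, so nothing is found.
no_flag = re.findall(PATTERN, text)

# With re.DOTALL, '.' also matches '\n', so the capture spans the lines.
with_dotall = re.findall(PATTERN, text, flags=re.DOTALL)

# str() of a bytes object yields its repr, in which newlines are escaped,
# so the plain pattern matches (but the result contains literal "\n" text).
on_repr = re.findall(PATTERN, str(raw))

print(no_flag)
print(with_dotall)
print(on_repr)
```

Calling .strip() on each captured string removes the surrounding whitespace and newlines if only the date and time are wanted.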

answered Nov 26, 2015 at 11:44

You should pass the re.UNICODE flag when passing unicode strings into re.findall:

post=re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a',threadcode2, flags=re.UNICODE)
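
A quick way to check what the flag changes, as a sketch under the assumption that the code runs on Python 3. There, str patterns are already matched in Unicode mode, so re.UNICODE mainly documents intent; it affects classes such as \w and \d, not whether . matches a newline. The sample text is hypothetical, modelled on the question.

```python
import re

PATTERN = r'<div class="fl icon-post-old"></div>(.*?)<a'

# Hypothetical decoded page fragment containing real newlines.
text = '<div class="fl icon-post-old"></div>\n 2015-11-13, 15:09\n <a href="'

# In Python 3, a str pattern compiled without flags already has re.UNICODE set.
default_is_unicode = bool(re.compile(PATTERN).flags & re.UNICODE)

# re.UNICODE alone does not let '.' cross a newline.
with_unicode = re.findall(PATTERN, text, flags=re.UNICODE)

# On input without newlines, the same call finds the text.
single_line = re.findall(PATTERN, text.replace("\n", ""), flags=re.UNICODE)

print(default_is_unicode)
print(with_unicode)
print(single_line)
```

So if the target text spans several lines, the flag may need to be combined with re.DOTALL (flags=re.UNICODE | re.DOTALL) or with the newline stripping shown in the other answer.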
answered Nov 26, 2015 at 11:31
