Swedish Unicode issues in Python

Asked 10 years, 1 month ago

Viewed 82 times

I am new to this forum (this is my first question), so please bear with me. I am scraping a website in Swedish. It's using the ISO-8859-1 charset.

In the source it might look something like this:

<div class="fl icon-post-old"></div>
 2015-11-13, 15:09
 <a href="

Let's say I want to grab the date and time (this is not a real example).

import re

# opener and threadurl are defined elsewhere (not shown)
threadcode = opener.open(threadurl).read()      # raw bytes from the site
threadcode2 = threadcode.decode("ISO-8859-1")   # decoded text
post = re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a', str(threadcode2))
post2 = re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a', str(threadcode))
print(post)   # this is blank
print(post2)  # this works fine

So, if I search the nice, readable Swedish text (threadcode2), the search does not seem to work. However, if I run the same search on the raw, escaped representation (str(threadcode), which is not very useful), it works.

Do any of you nice programmers out there know what's going on here?

I can also add, if it helps, that in some cases the search actually works. For example:

post=re.findall(r'Jag vill(.*?)bil',str(threadcode2))

This would work...

I am very confused.

edited May 3, 2016 at 3:17 by A. Campbell

asked Nov 26, 2015 at 11:23 by cbroxe

Your strings must be Unicode strings; here is a good starting point: stackoverflow.com/questions/1327731/…

– user2390182, commented Nov 26, 2015 at 11:33

2 Answers

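This has nothing to do with Swedish. I think re is choking on the multi-line input: by default, "." does not match a newline, so (.*?) cannot span the line breaks between the div and the link. If you do something like: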

post=re.findall(
 r'<div class="fl icon-post-old"></div>(.*?)<a',
 threadcode2.replace('\n','')
)

You'll get your expected result.
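An alternative to stripping the newlines is the re.DOTALL flag, which lets "." match newline characters as well. The following is a minimal sketch, assuming Python 3. The raw bytes are a made-up stand-in for whatever opener.open(threadurl).read() returns, not the real page:

```python
import re

# Hypothetical stand-in for opener.open(threadurl).read()
raw = (
    b'<div class="fl icon-post-old"></div>\n'
    b' 2015-11-13, 15:09\n'
    b' <a href="#">l\xe4nk</a>'
)
html = raw.decode("ISO-8859-1")  # real text, containing real newlines

PATTERN = r'<div class="fl icon-post-old"></div>(.*?)<a'

# Without DOTALL, "." stops at the first newline, so nothing matches.
no_flag = re.findall(PATTERN, html)

# With DOTALL, "." also matches newlines, so the capture can span lines.
with_flag = re.findall(PATTERN, html, flags=re.DOTALL)
dates = [m.strip() for m in with_flag]

# str() on bytes yields the repr "b'...'", in which each newline is the
# two characters backslash + "n"; "." matches those happily.
from_repr = re.findall(PATTERN, str(raw))

print(no_flag)
print(dates)
print(from_repr)
```

This would also explain why the search on the undecoded data appeared to work: str(threadcode) is the escaped repr of the bytes, so it contains no real newlines for "." to stop at.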

answered Nov 26, 2015 at 11:44 by dda

You should pass the re.UNICODE flag when passing Unicode strings into re.findall (on Python 3 the flag is already the default for str patterns, so it only matters on Python 2):

post=re.findall(r'<div class="fl icon-post-old"></div>(.*?)<a',threadcode2, flags=re.UNICODE)
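To see what re.UNICODE does and does not change, here is a small check, assuming Python 3. The sample sentence is made up for illustration and only borrows the 'Jag vill(.*?)bil' pattern from the question:

```python
import re

text = "Jag vill köpa en\nröd bil"  # made-up sample with a newline in it

# On Python 3, \w already matches Swedish letters in a str,
# with or without the re.UNICODE flag.
words_default = re.findall(r"\w+", text)
words_unicode = re.findall(r"\w+", text, flags=re.UNICODE)

# re.ASCII is the opposite switch: \w is limited to [a-zA-Z0-9_].
words_ascii = re.findall(r"\w+", text, flags=re.ASCII)

# re.UNICODE does not affect ".": it still stops at the newline.
across_unicode = re.findall(r"Jag vill(.*?)bil", text, flags=re.UNICODE)
across_dotall = re.findall(r"Jag vill(.*?)bil", text, flags=re.DOTALL)

print(words_default == words_unicode)
print(words_ascii)
print(across_unicode, across_dotall)
```

If the text between 'Jag vill' and 'bil' sits on a single line, the pattern matches without any flag, which fits the case in the question where the search did work.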

edited Nov 27, 2015 at 16:06

answered Nov 26, 2015 at 11:31 by babbageclunk
