using output from beautifulsoup in python

Question 1

Hey all, I am using BeautifulSoup (after struggling unsuccessfully with Scrapy for two days) to scrape StarCraft 2 league data, but I have run into a problem.

I have a table of results, and I want the string content of every link (a tag) in each row. I do it like this:

from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    # locate the results table by its id
    table = bs.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="tblt_table")
    rows = table.findAll(lambda tag: tag.name=='tr')
    rows.pop(0) #first row is header
    for row in rows:
        # collect the text of every link in this row
        tags = row.findAll(lambda tag: tag.name=='a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)

However, the output is as follows:

[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...

My question is: where does the u'' come from (is it because the strings are Unicode?), and how can I remove it? I just need the strings that are inside the u''.

Question 2

>>> l = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
>>> l[1]
u'gadget show live i..'
>>> print l[1]  # still unicode, but as you can see, print shows no u
gadget show live i..

Question 3

The u means the value is a Unicode string. It is only part of how Python displays the string, not part of its content, so you can disregard it and treat these like normal strings. You actually want this u there.

Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any non-ASCII characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method.
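
A minimal sketch of why the u appears only when a whole list is printed: printing a list shows the repr() of each element, while printing or joining the elements themselves shows only their text. The row is copied from the question's output. The `__future__` import is only there so the snippet behaves the same on Python 2 and Python 3 (on Python 3, repr() no longer shows the u prefix at all).

```python
from __future__ import print_function

# One row, as returned by the scraper in the question.
row = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']

# Printing the list shows the repr() of every element,
# which on Python 2 includes the u prefix.
print(row)

# Printing a single element, or joining the elements, shows only the text.
print(row[3])
line = u', '.join(row)
print(line)
```

The strings inside the list are unchanged in both cases; only the way they are displayed differs.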

Question 4

Just really curious: why wouldn't you recommend it? Eventually I want to output this into a .csv file, and I don't want the u's to be there (or will this be taken care of automatically?)

Question 5

@Javaaa it's just part of the representation when Python shows you the data structure. It doesn't actually show up if you output to stdout or a file.
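
As a rough illustration of the point about file output, here is a sketch that writes two of the rows from the question with the csv module. It assumes Python 3, where every str is already Unicode. `io.StringIO` stands in for a real file, and the file name in the comment is hypothetical. On Python 2 the csv module expects byte strings, so each cell would likely need to be encoded first.

```python
import csv
import io

rows = [
    [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke'],
    [u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke'],
]

# In-memory stand-in for a real file such as
# open('games.csv', 'w', newline='', encoding='utf-8')  (hypothetical name).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The written text contains only the cell values separated by commas; no u prefix or quotes from the list representation end up in the file.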

Question 6

You cannot convert unicode strings to standard strings using str(). This is the typical US advice where people only know ASCII. To convert a unicode string properly to a byte string, you need to use the some_unicode_string.encode(encoding) method. Calling str() on a unicode string is never appropriate.
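
A small sketch of the difference this comment is pointing at. On Python 2, str() on a unicode string implicitly uses the ASCII codec, so it breaks on any non-ASCII character; the snippet reproduces that by encoding to ASCII explicitly, which behaves the same way on Python 2 and 3. The value of `name` is a made-up example, not data from the question.

```python
# Hypothetical non-ASCII value ("café"), written with an escape
# so the source file itself stays pure ASCII.
name = u'caf\xe9'

# Encoding with an explicit codec that can represent the character works.
utf8_bytes = name.encode('utf-8')

# Encoding to ASCII (what Python 2's str() does implicitly) fails.
try:
    name.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding the bytes with the same codec gives the original text back.
round_trip = utf8_bytes.decode('utf-8')
```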

Question 7

@RestRisiko please don't misinterpret my ignorance of unicode as some kind of racism.

Question 8

Thanks! I was having trouble eliminating the u from a list of data with strings which contained numbers and I couldn't really figure out why it was there and what it meant.

Question 9

What you see are Python unicode strings.

Check the Python documentation

http://docs.python.org/howto/unicode.html

in order to deal correctly with unicode strings.
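
One commonly given piece of advice for handling Unicode correctly is to decode bytes into text as soon as they are read and to encode again only on output. The sketch below shows why the order matters for a call like `.read().lower()`: lowercasing raw bytes only touches ASCII letters. The byte string is a made-up example, and UTF-8 is assumed as the page encoding.

```python
# Hypothetical raw bytes as a network read might return them:
# "SØREN" encoded as UTF-8.
raw = b'S\xc3\x98REN'

# Lowercasing the bytes directly leaves the non-ASCII letter untouched.
lowered_bytes = raw.lower()

# Decoding first (assumption: the source is UTF-8) lowercases every letter.
text = raw.decode('utf-8')
lowered_text = text.lower()

# Encode again only when writing the result out.
output = lowered_text.encode('utf-8')
```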

Question 10

This all makes a lot of sense now. I had the same problem in another project using tweepy, thanks!

Rafe Kettler · Accepted Answer · 2011-04-17 15:10:56Z

edited Apr 17, 2011 at 15:17

answered Apr 17, 2011 at 15:10

5 Comments (text given above under Questions 4 to 8)

Javaaaa · 2011-04-17 15:13:48Z
Rafe Kettler · 2011-04-17 15:15:29Z
user2665694 · 2011-04-17 15:15:50Z
Rafe Kettler · 2011-04-17 15:16:38Z
Geosphere · 2015-10-28 16:06:16Z
