3

Hey all, I am using BeautifulSoup (after unsuccessfully struggling for two days with Scrapy) to scrape StarCraft 2 league data, but I am encountering a problem.

I have a table of results, and I want the string content of every a tag in each row, which I do like this:

from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:", url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    # find the results table by its id
    table = bs.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="tblt_table")
    rows = table.findAll(lambda tag: tag.name=='tr')
    rows.pop(0) #first row is header
    for row in rows:
        # collect the text of every link in this row
        tags = row.findAll(lambda tag: tag.name=='a')
        content = []
        for tagcontent in tags:
            content.append(tagcontent.string)
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)

however the output is as follows:

[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...

My question is: where does the u'' come from (is it from unicode?) and how can I remove this? I just need the strings that are in u''...

asked Apr 17, 2011 at 15:07
1
  • >>> l = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
    >>> l[1]
    u'gadget show live i..'
    >>> print l[1]  # still unicode, but as you can see, print shows no u
    gadget show live i..
    Commented Apr 17, 2011 at 15:36

2 Answers 2

3

The u means the value is a Unicode string. It is only part of how Python displays the string inside a data structure, not part of the string's contents, so you should just disregard it. Treat them like normal strings. You actually want this u there.

Be aware that all Beautiful Soup output is Unicode. That's a good thing, because if you run across any non-ASCII characters in your scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the unicode string's encode() method to turn it into a byte string in a specific encoding, such as UTF-8.
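To see where the u comes from, it may help to compare how Python displays a whole list with how it displays the individual strings. The sketch below reuses one row from the question's output; the variable names are only for illustration. Printing a list shows the repr() of each element (on Python 2 that includes the u prefix), while printing or joining the elements shows only the text.

```python
# One row copied from the output shown in the question.
row = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']

# Printing the list shows repr() of every element: quotes, and on
# Python 2 the u prefix as well.
print(row)

# Printing a single element shows only its text.
print(row[1])

# Joining the elements gives one line of text with no quotes or prefixes.
line = u', '.join(row)
print(line)
```

The strings themselves never contain a u; it only appears when Python shows the representation of the container.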

answered Apr 17, 2011 at 15:10

5 Comments

Just really curious: why wouldn't you recommend it? Eventually I want to output this into a .csv file, I don't want the u's to be there (or will this be taken care of automatically?)
@Javaaa it's just part of the representation when Python shows you the data structure. It doesn't actually show up if you output to stdout or a file.
You cannot convert unicode strings to standard strings using str(). This is the typical US advice where people only know ASCII. To convert a unicode string properly to a byte string you need to use the some_unicode_string.encode(encoding) method. Calling str() on a unicode string is never appropriate, because it implicitly uses the ASCII codec and fails on any non-ASCII character.
@RestRisiko please don't misinterpret my ignorance of unicode as some kind of racism.
Thanks! I was having trouble eliminating the u from a list of data with strings which contained numbers and I couldn't really figure out why it was there and what it meant.
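On the CSV question raised in the comments, here is a minimal sketch of writing such rows with the standard csv module. It assumes Python 3, where csv works on text directly; on Python 2 the csv module expects byte strings, so each cell would need something like cell.encode('utf-8') first. The two rows are taken from the question's output, and the in-memory buffer stands in for a real file.

```python
import csv
import io

# Two rows copied from the output shown in the question.
rows = [
    [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke'],
    [u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke'],
]

# An in-memory text buffer keeps the example self-contained. A real script
# would use open('games.csv', 'w', newline='', encoding='utf-8') instead
# ('games.csv' is a hypothetical file name).
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The written text contains only the cell values separated by commas; no u prefix or quotes from the list representation end up in the output.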
0

What you see are Python unicode strings.

Check the Python documentation

http://docs.python.org/howto/unicode.html

in order to deal correctly with unicode strings.
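As a rough illustration of what dealing correctly with unicode strings tends to mean in practice: text is encoded to bytes when it leaves the program and decoded back to text when it comes in. The sample value below is made up for the example and is not from the scraped data. The last step shows why relying on ASCII (which Python 2's str() does implicitly) breaks as soon as a non-ASCII character appears.

```python
# Illustrative text containing a non-ASCII character (not from the scraped data).
text = u'caf\u00e9'

# encode(): text -> bytes, e.g. before writing to a file or socket.
data = text.encode('utf-8')

# decode(): bytes -> text, e.g. after reading from a file or the network.
round_trip = data.decode('utf-8')

# ASCII cannot represent the accented character, so this encoding fails.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(repr(data), round_trip == text, ascii_ok)
```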

answered Apr 17, 2011 at 15:16

1 Comment

this all makes a lot of sense now, had the same problem in another project using tweepy, thanks!
