Linked Questions

390 votes
7 answers
368k views

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me: >>> from BeautifulSoup import BeautifulSoup >>&...
jkp • 81.8k
6 votes
1 answer
49k views

Possible Duplicate: Convert XML/HTML Entities into Unicode String in Python I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules. response =...
3 votes
2 answers
2k views

Possible Duplicate: Convert XML/HTML Entities into Unicode String in Python In html sources, there are tons of chars like "&# 58;" or "&# 46;" (have to put space between &# and numbers ...
-1 votes
1 answer
839 views

I have text like this: ‘The zoom animations everywhere on the new iOS 7 are literally making me nauseous and giving me a headache,’ wrote forum user Ensorceled. I understand that #...
user2784753
1 vote
1 answer
748 views

I have a html text : If I'm reading lots of articles I am trying to replace ' and other such special characters into unicode '. I did rawtxt.encode('utf-8').encode('ascii','ignore'...
Harshit • 1,217
0 votes
2 answers
1k views

Possible Duplicate: Convert XML/HTML Entities into Unicode String in Python I am reading an excel XML document using Python. I end up with a lot of characters such as é That ...
2 votes
1 answer
257 views

Possible Duplicate: Convert XML/HTML Entities into Unicode String in Python Decode HTML entities in Python string? I am using Python 2.7 and am fairly lost in unicode type. I looked up variety ...
rodling • 998
0 votes
0 answers
31 views

I have these symbols, which I am quite sure are Chinese characters. 旅行時,我生病 Does anyone know what kind of unicode is ...
Carson Yau
346 votes
37 answers
620k views

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad. I'd like something more ...
179 votes
15 answers
254k views

I have a string that is HTML encoded: '''<img class="size-medium wp-image-113"\ style="margin-left: 15px;" title="su1"\ src="...
rksprst • 6,651
16 votes
3 answers
20k views

I have a html text like this: &lt;xml ... &gt; and I want to convert it to something readable: <xml ...> Any easy (and fast) way to do it in Python?
13 votes
2 answers
13k views

I'm having trouble displaying content, my program: #! /usr/bin/python import urllib import re url = "http://yahoo.com" pattern = '''<span class="medium item-label".*?>(.*)</span>''' ...
Vor • 35.6k
7 votes
3 answers
20k views

I would like to convert HTML entities back to their human-readable format, e.g. '&pound;' to '£', '&deg;' to '°' etc. I've read several posts regarding this question Converting html source ...
D.Q. • 547
6 votes
3 answers
2k views

I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO! ...
3 votes
1 answer
3k views

I'm creating a sub-class based on 'HTMLParser' to pull out html content. Whenever I have character refs such as '&nbsp;' '&amp;' '&ndash;' '&#8230;' I'd like to replace them with ...
Dan Holman
3 votes
3 answers
2k views

When I'm processing HTML code in Python I have to use the following code because of special characters. line = string.replace(line, "&quot;", "\"") line = string.replace(line, "&apos;", "'") ...
xralf • 3,792
0 votes
4 answers
4k views

I have this text in a file - Recuérdame (notice it's a French word). When I read this file with a python script, I get this text as Recu&#xE9;rdame. I read it as a unicode string. Do I need to ...
4 votes
3 answers
2k views

We have HTML source files which contain special characters encoded as &#nnnn; like in the word: au&#223;ergew&#246;hnlich We would like to convert them into plain UTF-8: außergew&#...
dagnelies • 5,345
5 votes
2 answers
2k views

I've been reading many q&a on how to remove all the html code from a string using python but none was satisfying. I need a way to remove all the tags, preserve/convert the html entities and work ...
1 vote
1 answer
8k views

I was working through a simple example with BeautifulSoup, but I was getting weird results. Here is my code: soup = BeautifulSoup(page) print soup.prettify() stuff = soup.findAll('td', attrs={'class' : '...
0 votes
2 answers
3k views

Possible Duplicate: How to decode HTML Entities in C? This question is very similar to that one, but I need to do the same thing in C, not python. Here are some examples of what the function ...
1 vote
1 answer
3k views

Part of a website I'm trying to scrape has this weird block of hex values instead of characters. How can I decode this with python? I am using urllib.request to get the page source http://www....
0 votes
1 answer
3k views

(Edit: I'm using Python 2.7) (Edit 2: I have already checked Convert XML/HTML Entities into Unicode String in Python, the solutions do not work. Please do not flag this as already answered.) I've ...
GrantD71 • 1,885
2 votes
0 answers
4k views

I tried taking some data from the web. Example: the name 'Schindler's list' is printed as 'Schindler&#x27s List' straight from the web... I tried asking Python to print 'Schindler\x27s list' instead ...
melony • 75
0 votes
1 answer
2k views

lxml.etree.parse() has generated strings in a UTF-16 file as &#xxxx; How can I convert them back? Opening the output file in a web browser is fine. However, I still need regular strings in the output file, too. ...
3 votes
2 answers
951 views

I have a search form in my app that uses a jQuery autocomplete plugin. The plugin sends over the suggested item after running the querystring through encodeURI(q). So an item like Johnny's sports ...
Abid A • 7,866
2 votes
0 answers
2k views

Sorry for posting this again. I am getting this error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 45: ordinal not in range(128) when I run the following code strip_html(): ...
mchangun • 10.6k
1 vote
2 answers
1k views

I have this script, which reads the text from web page: page = urllib2.urlopen(url).read() soup = BeautifulSoup(page); paragraphs = soup.findAll('p'); for p in paragraphs: content = content+p....
torayeff • 9,732
0 votes
1 answer
220 views

I'm trying to decode characters which have been encoded in the following way: &#number; I tried: s.decode("utf8") and: s.decode("unicode-escape") but neither seems to work. What is the ...
tomermes • 23.5k
2 votes
1 answer
433 views

I have a string of escaped HTML markup, '&#xed;', and I want to convert it to the correct accented character 'í'. Having read around SO, this is my attempt: messy = '&#xed;' print type(messy) >>&...
1 vote
1 answer
267 views

The problem from bs4 import BeautifulSoup a=BeautifulSoup('<p class="t5">&#x20b9; 10,000 or $ 133.46</p>') b=open('file.html','w') b.write(str(a)) The result is ...
1 vote
1 answer
117 views

I have the following description I want to scrape using my program. <hr>Provides AFROTC cadets up to 13 options for practical leadership and specialized training through exposure to USAF ...
0 votes
1 answer
73 views

I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all CSS/JS that is being referred to using Beautiful Soup and modify any external link to this newly ...
swdev • 5,257
0 votes
0 answers
56 views

I am writing a program which collects data (title, author, article) from a web page with a news article. I use the Readability Python library. My problem is that content (which program) of article (if article is ...
1 vote
0 answers
30 views

I am getting an HTML entity from the server as a JSON response, for example &#128513; => 😁. Now I wish to show this emoji on my button. If I receive unicode for the emoji it works fine simply placing ...
Sheshnath • 3,428