Linked Questions
35 questions linked to/from Convert XML/HTML Entities into Unicode String in Python
390
votes
7
answers
368k
views
Decode HTML entities in Python string?
I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:
>>> from BeautifulSoup import BeautifulSoup
>>&...
6
votes
1
answer
49k
views
How do I get rid of characters like ' that appear instead of apostrophes? [duplicate]
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
I am attempting to scrape a website using Python. I import and use the urllib2, BeautifulSoup and re modules.
response =...
3
votes
2
answers
2k
views
How do I convert characters like "&#58;" to ":" in python? [duplicate]
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
In html sources, there are tons of chars like "&# 58;" or "&# 46;" (have to put space between &# and numbers ...
-1
votes
1
answer
839
views
Convert ascii characters to normal text [duplicate]
I have text like this:
‘The zoom animations everywhere on the new iOS 7 are literally making me nauseous and giving me a headache,’ wrote forum user Ensorceled.
I understand that #...
1
vote
1
answer
748
views
encoding/decoding unicode and utf-8 : Python [duplicate]
I have a html text : If I'm reading lots of articles
I am trying to replace ' and other such special characters into unicode '. I did
rawtxt.encode('utf-8').encode('ascii','ignore'...
0
votes
2
answers
1k
views
Python, XML, é type encodings [duplicate]
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
I am reading an excel XML document using Python. I end up with a lot of characters such as
é
That ...
2
votes
1
answer
257
views
Unicode encoding in python [duplicate]
Possible Duplicate:
Convert XML/HTML Entities into Unicode String in Python
Decode HTML entities in Python string?
I am using Python 2.7 and am fairly lost in unicode type. I looked up variety ...
0
votes
0
answers
31
views
Anyone know what kind of unicode is this with &# and semicolon please? how to turn it into chinese characters string in Python please? [duplicate]
I have these symbols, which I am quite sure are Chinese characters.
旅行時,我生病
Please anyone know what kind of unicode is ...
346
votes
37
answers
620k
views
Extracting text from HTML file using Python
I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more ...
179
votes
15
answers
254k
views
How do I perform HTML decoding/encoding using Python/Django?
I have a string that is HTML encoded:
'''<img class="size-medium wp-image-113"\
style="margin-left: 15px;" title="su1"\
src="...
16
votes
3
answers
20k
views
Replace html entities with the corresponding utf-8 characters in Python 2.6
I have a html text like this:
&lt;xml ... &gt;
and I want to convert it to something readable:
<xml ...>
Any easy (and fast) way to do it in Python?
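One possible approach to the request above is a minimal sketch, assuming Python 3.4 or later, where the standard library provides html.unescape. The question itself targets Python 2.6, where this function is not available. The helper name decode_entities and the sample strings are illustrative and do not come from the original question.

```python
import html


def decode_entities(text):
    """Replace named and numeric character references with the characters they stand for."""
    # html.unescape handles named (&lt;), decimal (&#58;) and hex (&#x27;) references.
    return html.unescape(text)


# Illustrative input, assumed to resemble the escaped text in the question.
print(decode_entities("&lt;xml ... &gt;"))  # prints: <xml ... >
```

The result is a Unicode str. Encoding it to UTF-8 bytes is a separate step, for example when writing it to a file.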
13
votes
2
answers
13k
views
Change ' into normal character
I'm having trouble displaying content,
my program:
#! /usr/bin/python
import urllib
import re
url = "http://yahoo.com"
pattern = '''<span class="medium item-label".*?>(.*)</span>'''
...
7
votes
3
answers
20k
views
HTMLParser.HTMLParser().unescape() doesn't work
I would like to convert HTML entities back to their human-readable form, e.g. '&pound;' to '£', '&deg;' to '°' etc.
I've read several posts regarding this question
Converting html source ...
6
votes
3
answers
2k
views
Getting international characters from a web page? [duplicate]
I want to scrape some information off a football (soccer) web page using simple python regexp's. The problem is that players such as the first chap, ÄÄRITALO, comes out as ÄÄRITALO!
...
3
votes
1
answer
3k
views
Decoding html content and HTMLParser
I'm creating a sub-class based on 'HTMLParser' to pull out html content. Whenever I have character refs such as
' ' '&' '–' '…'
I'd like to replace them with ...
3
votes
3
answers
2k
views
Make sequence of string.replace statements more readable
When I'm processing HTML code in Python I have to use the following code because of special characters.
line = string.replace(line, "&quot;", "\"")
line = string.replace(line, "'", "'")
...
0
votes
4
answers
4k
views
Python Text Encoding
I have this text in a file - Recuérdame (notice it's a French word). When I read this file with a python script, I get this text as Recuérdame.
I read it as a unicode string. Do I need to ...
4
votes
3
answers
2k
views
Unescaping HTML entities (&#nnnn;) into plain UTF-8 [closed]
We have HTML source files which contain special characters encoded as &#nnnn; like in the word:
außergewöhnlich
We would like to convert them into plain UTF-8:
außergew...
5
votes
2
answers
2k
views
Safely remove all html code from a string in python
I've been reading many q&a on how to remove all the html code from a string using python but none was satisfying. I need a way to remove all the tags, preserve/convert the html entities and work ...
1
vote
1
answer
8k
views
Simple example BeautifulSoup Python
I was working on a simple example with BeautifulSoup, but I was getting weird results.
Here is my code:
soup = BeautifulSoup(page)
print soup.prettify()
stuff = soup.findAll('td', attrs={'class' : '...
0
votes
2
answers
3k
views
convert html entities to unicode(utf-8) strings in c? [duplicate]
Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not python. Here are some examples of what the function ...
1
vote
1
answer
3k
views
How to decode html hex elements?
Part of a website I'm trying to scrape has this weird block of hex values instead of characters. How can I decode this with python?
I am using urllib.request to get the page source
http://www....
0
votes
1
answer
3k
views
Python, convert HTML entities to Unicode
(Edit: I'm using Python 2.7)
(Edit 2: I have already checked Convert XML/HTML Entities into Unicode String in Python, the solutions do not work. Please do not flag this as already answered.)
I've ...
2
votes
0
answers
4k
views
use of \x27 to convert to apostrophe not working in python
I tried taking some data from the web:
Example:the name 'Schindler's list' is printed as 'Schindler's List' straight from the web... tried asking python to print 'Schindler\x27s list' instead ...
0
votes
1
answer
2k
views
Convert &#xxxx; to normal character?
lxml.etree.parse() has generated strings in a UTF-16 file as &#xxxx;. How can I convert them back?
Opening the output file in a web browser is fine. However, I still need regular strings in the output file, too.
...
3
votes
2
answers
951
views
Decoding querystring parameter in Django view
I have a search form in my app that uses a jQuery autocomplete plugin. The plugin sends over the suggested item after running the querystring through encodeURI(q).
So an item like Johnny's sports ...
2
votes
0
answers
2k
views
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 45: ordinal not in range(128)
Sorry for posting this again. I am getting this error UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 45: ordinal not in range(128) when I run the following code strip_html():
...
1
vote
2
answers
1k
views
python reading unicode characters from html
I have this script, which reads the text from web page:
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page);
paragraphs = soup.findAll('p');
for p in paragraphs:
content = content+p....
0
votes
1
answer
220
views
python - possible encoding and decoding values
I'm trying to decode characters which have been encoded in the following way:
&#number;
I tried:
s.decode("utf8")
and:
s.decode("unicode-escape")
but neither seems to work.
What is the ...
2
votes
1
answer
433
views
Decoding html entities in python2
I have a string of escaped HTML markup, 'í', and I want to convert it to the correct accented character 'í'.
Having read around SO, this is my attempt:
messy = 'í'
print type(messy)
>>&...
1
vote
1
answer
267
views
I want to save HTML Entity (hex) from bs4 beautifulSoup object into a file
The problem
from bs4 import BeautifulSoup
a=BeautifulSoup('<p class="t5">₹ 10,000 or $ 133.46</p>')
b=open('file.html','w')
b.write(str(a))
The result is
...
1
vote
1
answer
117
views
Inquiry: Why is my regex code not reading all characters?
I have the following description that I want to scrape using my program.
<hr>Provides AFROTC cadets up to 13 options for practical leadership and specialized training
through exposure to USAF ...
0
votes
1
answer
73
views
How to properly replace the contents of text file
I am trying to make an offline copy of this website: ieeghn. Part of this task is to download all the css/js files that are referred to, using Beautiful Soup, and modify any external link to this newly ...
0
votes
0
answers
56
views
Encoding of content
I am writing a program which collects data (title, author, article) from a web page with a news article. I use the Readability Python library. My problem is that content (which program) of article (if article is ...
1
vote
0
answers
30
views
Convert HTML entity to iOS Emoji? [duplicate]
I am getting an HTML entity from the server as a JSON response, for example 😁 => 😁. Now I wish to show this emoji on my button. If I receive the Unicode for the emoji, it works fine simply placing ...