Web page special characters encoding

mattia gervaz at gmail.com
Sat Jul 10 19:17:03 EDT 2010


On Sat, 10 Jul 2010 16:24:23 +0000, mattia wrote:
> Hi all, I'm using py3k and the urllib package to download web pages. Can
> you suggest a package that can translate reserved characters in HTML
> like "&egrave;", "&ograve;", "&eacute;" into the corresponding correct
> encoding?
>
> Thanks,
> Mattia

Basically I'm trying to fetch an HTML page and strip out all the tags
to obtain just plain text. John Nagle and Christian Heimes somehow
figured out what I'm trying to do ;-)
Here is what I've done so far, thanks to your suggestions:

import lxml.html
import lxml.html.clean
import urllib.request
import urllib.parse
from html.entities import entitydefs
import re
import sys

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                         "rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"}

def replace(m):
    # Substitute a named entity such as &egrave; with its character.
    if m.group(1) in entitydefs:
        return entitydefs[m.group(1)]
    else:
        # Unknown name: keep the original "&name;" text unchanged.
        return m.group(0)

def test(page):
    req = urllib.request.Request(page, None, HEADERS)
    page = urllib.request.urlopen(req)
    # Use the charset declared in the HTTP headers, if any.
    charset = page.info().get_content_charset()
    if charset is not None:
        html = page.read().decode(charset)
    else:
        html = page.read().decode("iso-8859-1")
    html = re.sub(r"&(\w+);", replace, html)
    cleaner = lxml.html.clean.Cleaner(safe_attrs_only=True, style=True)
    html = cleaner.clean_html(html)
    # create the element tree
    tree = lxml.html.document_fromstring(html)
    txt = tree.text_content()
    for x in txt.split():
        # DOS shell is not able to print characters like u'\u20ac' - why???
        try:
            print(x)
        except UnicodeEncodeError:
            continue

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage:", sys.argv[0], "<webpage>")
        print("Example:", sys.argv[0], "http://www.bing.com")
        sys.exit()
    test(sys.argv[1])
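
The entity step in the script above could probably also be written with the
standard library alone. What follows is only a sketch under assumptions:
decode_entities and sample are illustrative names, not part of the script, and
html.unescape was added in Python 3.4, so it assumes a newer interpreter than
the py3k release used for this post. The regex variant handles named entities
only and leaves unknown names as they are; html.unescape also handles numeric
references such as &#8364;.

```python
import html
import re
from html.entities import name2codepoint


def decode_entities(text):
    """Replace named entities like &egrave;; leave unknown names untouched."""
    def _sub(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Not a known entity name: keep the original "&name;" text.
        return match.group(0)
    return re.sub(r"&(\w+);", _sub, text)


# Illustrative input, not taken from a real page.
sample = "perch&eacute; &egrave; cos&igrave; &unknown; &#8364;"

# Named entities only; the numeric reference is not matched by the regex.
print(decode_entities(sample))

# Assumes Python 3.4+: decodes named and numeric references in one call.
print(html.unescape(sample))
```

If lxml parses the page anyway, it may already resolve these references while
building the tree, in which case the explicit substitution would be redundant.
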
Any further tips will be appreciated.
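
On the question in the code comment about the DOS shell: a likely explanation
is that the console's code page (cp437 or cp850 on many Windows setups) has no
euro sign, so print() raises UnicodeEncodeError when it tries to encode
u'\u20ac'. A minimal sketch of one workaround follows, assuming that losing the
unprintable characters is acceptable; console_safe is a hypothetical helper
name, and cp437 is used only as an example code page.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the encoding cannot represent replaced by '?'."""
    # Fall back to ASCII when stdout does not report an encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Uses whatever encoding the current console reports.
print(console_safe("price: 100 \u20ac"))

# Simulates a cp437 console, which has no euro sign.
print(console_safe("price: 100 \u20ac", encoding="cp437"))
```

Replacing instead of skipping keeps every word in the output, so a word that
contains one unprintable character is not dropped entirely.
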
Ciao,
Mattia


More information about the Python-list mailing list