I am trying to scrape a webpage whose charset is declared like this:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
When I fetch the page source with Python requests, I get content like this:
কথায় বলে- মাছে ভাতে বাঙালি।</p> <p>কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
How can I get the original content out of this string in Python?
Use an HTML parser; it'll handle HTML entities for you. – Martijn Pieters, Feb 7, 2016 at 13:26
Read The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets. – bastelflp, Feb 7, 2016 at 13:49
1 Answer
These are HTML entities (character references) encoding Unicode codepoints, so the text is not really using UTF-8 at all; the page could have been encoded as ASCII without loss of functionality. Use an HTML parser, such as BeautifulSoup. It'll decode such content for you:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
... </head><body>
... কথায় বলে- মাছে ভাতে বাঙালি।</p> <p>কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
... </body></html>''', 'lxml')
>>> soup
<html><head><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n</head><body>\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 <p>\u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n</p></body></html>
>>> soup.get_text()
u"\n\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 \u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n"
>>> print soup.get_text()
কথায় বলে- মাছে ভাতে বাঙালি। কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
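If BeautifulSoup is not available, the same idea can be sketched with the standard library alone. The following is a minimal Python 3 sketch, not part of the session above: `TextExtractor`, `extract_text`, and the `sample` string are hypothetical names and data made up for illustration, with `sample` assumed to be the first word of the scraped text written as decimal character references. It relies on `html.parser.HTMLParser`, which converts character references in text when `convert_charrefs` is true, and on `html.unescape` for the case where you only have a bare string and no markup to parse.

```python
from html import unescape
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser turn references such as
        # &#2453; into the characters they stand for before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def extract_text(markup):
    """Return the text content of markup with character references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Assumed sample input: four decimal character references inside a <p> tag.
sample = "<p>&#2453;&#2469;&#2494;&#2527;</p>"

text = extract_text(sample)
print(text)

# For a bare string without markup, html.unescape alone gives the same text.
print(unescape("&#2453;&#2469;&#2494;&#2527;") == text)
```

A real HTML parser such as BeautifulSoup is still the better choice for scraped pages, since it also copes with malformed markup and lets you select elements; the sketch only shows that the decoding step itself needs nothing beyond the standard library.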
answered Feb 7, 2016 at 13:29
Martijn Pieters