
I am trying to scrape a webpage that declares its charset like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

and when I get the page source using Python's requests library, I get content like this:

&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;

How can I get the original content out of these strings in Python?

Martijn Pieters
asked Feb 7, 2016 at 13:23

1 Answer


These are HTML entities (numeric character references) that encode Unicode code points. They do not rely on UTF-8 at all; the page could have been served as ASCII without losing anything. Use an HTML parser, such as BeautifulSoup, which decodes such content for you:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
... </head><body>
... &#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;- &#2478;&#2494;&#2459;&#2503; &#2477;&#2494;&#2468;&#2503; &#2476;&#2494;&#2457;&#2494;&#2482;&#2495;&#2404;</p> <p>&#2453;&#2476;&#2495; &#2440;&#2486;&#2509;&#2476;&#2480; &#2455;&#2497;&#2474;&#2509;&#2468; &#2438;&#2480;&#2503;&#2453; &#2471;&#2494;&#2474; &#2447;&#2455;&#2495;&#2527;&#2503; &#2476;&#2482;&#2503;&#2472;, '&#2477;&#2494;&#2468;-&#2478;&#2494;&#2459; &#2454;&#2503;&#2527;&#2503; &#2476;&#2494;&#2433;&#2458;&#2503; &#2476;&#2494;&#2457;&#2509;&#2455;&#2494;&#2482;&#2495; &#2488;&#2453;&#2482;/ &#2471;&#2494;&#2472;&#2503; &#2477;&#2480;&#2494; &#2477;
... </body></html>''', 'lxml')
>>> soup
<html><head><meta content="text/html; charset=unicode-escape" http-equiv="Content-Type"/>\n</head><body>\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 <p>\u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n</p></body></html>
>>> soup.get_text()
u"\n\n\u0995\u09a5\u09be\u09df \u09ac\u09b2\u09c7- \u09ae\u09be\u099b\u09c7 \u09ad\u09be\u09a4\u09c7 \u09ac\u09be\u0999\u09be\u09b2\u09bf\u0964 \u0995\u09ac\u09bf \u0988\u09b6\u09cd\u09ac\u09b0 \u0997\u09c1\u09aa\u09cd\u09a4 \u0986\u09b0\u09c7\u0995 \u09a7\u09be\u09aa \u098f\u0997\u09bf\u09df\u09c7 \u09ac\u09b2\u09c7\u09a8, '\u09ad\u09be\u09a4-\u09ae\u09be\u099b \u0996\u09c7\u09df\u09c7 \u09ac\u09be\u0981\u099a\u09c7 \u09ac\u09be\u0999\u09cd\u0997\u09be\u09b2\u09bf \u09b8\u0995\u09b2/ \u09a7\u09be\u09a8\u09c7 \u09ad\u09b0\u09be \u09ad\n"
>>> print(soup.get_text())
কথায় বলে- মাছে ভাতে বাঙালি। কবি ঈশ্বর গুপ্ত আরেক ধাপ এগিয়ে বলেন, 'ভাত-মাছ খেয়ে বাঁচে বাঙ্গালি সকল/ ধানে ভরা ভ
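
If only a plain string needs decoding and no markup has to be stripped, the standard library may be enough. The following is a minimal sketch, assuming Python 3.4 or later, where `html.unescape` is available; the variable names `raw` and `decoded` are illustrative and not part of the original answer. It leaves tags such as `<p>` untouched, so a parser like BeautifulSoup remains the better fit for whole pages.

```python
import html

# A short fragment of the entity-encoded text from the question.
raw = "&#2453;&#2469;&#2494;&#2527; &#2476;&#2482;&#2503;"

# html.unescape() turns numeric (and named) character references into text.
decoded = html.unescape(raw)
print(decoded)

# Each reference is just a decimal code point: &#2453; is U+0995.
assert decoded[0] == chr(2453) == "\u0995"
```

Because the references are plain ASCII, the result does not depend on the charset the page declares.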
answered Feb 7, 2016 at 13:29