I am trying to parse a webpage using Python 2.7, and I want to read its entire HTML code. But the result looks like this:
<html><head><script type="text/javascript">
location.replace( "http://captcha.search.daum.net/captcha/show?url=http%3A%2F%2Fsearch.daum.net%2Fsearch%3Fw%3Dnews%26nil_search%3Dbtn%26DA%3DNTB%26enc%3Dutf8%26cluster%3Dy%26cluster_page%3D1%26q%3D%25EB%25B3%25B4%25EA%25B3%25A0%25EC%2584%259C" );
</script>
</head></html>
I think this webpage is using JavaScript. How can I parse the entire HTML code contained in the JavaScript?
My Python code is this:
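# -*- coding: utf-8 -*-
import urllib2
from bs4 import BeautifulSoup

url = "http://search.daum.net/search?w=news&nil_search=btn&DA=NTB&enc=utf8&cluster=y&cluster_page=1&q=%EB%B3%B4%EA%B3%A0%EC%84%9C"
# Fetch the raw response; this sends only urllib2's default headers.
page = urllib2.urlopen(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one.
soup = BeautifulSoup(page.read(), "html.parser")
print soup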
Tom Zych
The page is sending you JavaScript which redirects to a CAPTCHA. This means the site is trying to prevent you from reading the website. (Thom Wiggers, Aug 8, 2015 at 6:49)
I've rolled back your question because replacing the body with an image doesn't really help. (Thom Wiggers, Aug 8, 2015 at 6:53)
1 Answer
It seems some headers are required for this page to be shown properly.
Try adding headers to your HTTP request (the headers go on the request, not on the BeautifulSoup call), sending the same values your browser sends, so that you get the result you see in the browser.
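As a rough sketch of that idea: the header values below are assumptions, not values known to satisfy this site, so copy the real ones from your browser's developer tools. `build_request` and `fetch_html` are hypothetical helper names. The import fallback only exists so the same snippet also runs on Python 3. The site may still answer with the CAPTCHA redirect even with these headers.

```python
try:
    from urllib2 import Request, urlopen  # Python 2.7, as in the question
except ImportError:
    from urllib.request import Request, urlopen  # Python 3 fallback

# Assumed, illustrative values: replace them with what your browser sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "ko-KR,ko;q=0.9,en;q=0.8",
}


def build_request(url):
    """Return a Request object that carries browser-like headers."""
    return Request(url, headers=dict(BROWSER_HEADERS))


def fetch_html(url):
    """Download the page body as bytes. This performs a network call."""
    response = urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()
```

You would then pass the result to the parser, e.g. `BeautifulSoup(fetch_html(url), "html.parser")`, instead of calling `urllib2.urlopen(url)` directly.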
answered Aug 8, 2015 at 6:57
alizelzele