Error while using urllib.request.urlopen in Python

Question 1

What's wrong with this code?

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
 print(line.decode("utf-8"))
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,
ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});
delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=
[a,b];return false}};
window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}
var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
window._gjuc())&&setTimeout(_gjp,500)};
Traceback (most recent call last):
 File "<pyshell#109>", line 2, in <module>
 print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte

Question 2

Google sends you text in windows-1251 encoding, it says it in meta tag. This will work:

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
 print(line.decode("cp1251"))

Question 3

That's your failing line (last part of it):

>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
 File "<pyshell#12>", line 1, in <module>
 line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte

The failing code is from a spanish word that has accent:

>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'

You will be ok with latins decoding accordingly:

>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'

btw, Imágenes is spanish images

Question 4

Seems Google returns localized page depending on IP. For me it's Russian and cp1251 encoding. For you it's Spanish and latin-1.

Question 5

@race1 Oh I see! Interesting... I was fooled because my error was at pos 2419 after the same line the OP posted. But the one of the OP is at 2364... These are coincident answers by coincidence, arent they?

demalexx 4,7692 gold badges33 silver badges34 bronze badges · Accepted Answer · 2012-01-03 09:09:52Z

6

Google sends you text in windows-1251 encoding, it says it in meta tag. This will work:

>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
 print(line.decode("cp1251"))

Share

Improve this answer

edited Jan 3, 2012 at 10:33

Sergey's user avatar

Sergey

50.2k28 gold badges94 silver badges132 bronze badges

answered Jan 3, 2012 at 9:09

demalexx's user avatar

demalexx

4,7692 gold badges33 silver badges34 bronze badges

Sign up to request clarification or add additional context in comments.

CollectivesTM on Stack Overflow

Error while using urllib.request.urlopen in Python

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related