What's wrong with this code?
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
...     print(line.decode("utf-8"))
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,
ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});
delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=
[a,b];return false}};
window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}
var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
window._gjuc())&&setTimeout(_gjp,500)};
Traceback (most recent call last):
  File "<pyshell#109>", line 2, in <module>
    print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte
asked Jan 3, 2012 at 8:52
Sergey
2 Answers
Google is sending you the page in the windows-1251 encoding, as the charset in the meta tag says. Decoding with that codec works:
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
...     print(line.decode("cp1251"))
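Rather than hardcoding the codec, you could read it from the response. Here is a minimal sketch, assuming the server declares the charset in its Content-Type header. The names `charset_from_content_type` and `print_page` are illustrative helpers, not part of urllib, and the UTF-8 fallback is my own assumption:

```python
from email.message import Message
from urllib.request import urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present,
    # so fall back to the (assumed) default in that case.
    return msg.get_content_charset() or default


def print_page(url):
    """Hypothetical helper: print each line decoded with the declared charset."""
    with urlopen(url) as response:
        content_type = response.headers.get("Content-Type", "")
        charset = charset_from_content_type(content_type)
        for line in response:
            # errors="replace" avoids a crash if the declared charset is wrong.
            print(line.decode(charset, errors="replace"))
```

Calling `print_page("http://google.com/")` should then pick whatever charset the server announces for your locale, provided the header and the actual bytes agree.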
Here is the failing line (the last part of it):
>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
  File "<pyshell#12>", line 1, in <module>
    line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte
The failing byte comes from a Spanish word with an accented letter:
>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'
Decoding it as Latin-1 works:
>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'
By the way, "Imágenes" is Spanish for "images".
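To see why the codec matters, here is a small sketch that decodes the same bytes with several candidate codecs. `try_decodings` is just an illustrative helper name, and the list of codecs is my choice for this example:

```python
def try_decodings(raw, codecs=("utf-8", "latin-1", "cp1251")):
    """Return a dict mapping each codec to the decoded text, or None on failure."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            # The bytes are not valid in this codec.
            results[codec] = None
    return results


results = try_decodings(b"Im\xe1genes")
for codec, text in results.items():
    print(codec, "->", text)
```

Byte 0xe1 is not valid on its own in UTF-8, is 'á' in Latin-1, and is 'б' in cp1251. Note that Latin-1 maps every byte to some character, so a decode that succeeds with it does not prove it is the right codec.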
answered Jan 3, 2012 at 9:13
joaquin
2 Comments
demalexx
It seems Google returns a localized page depending on your IP. For me it's Russian with cp1251 encoding; for you it's Spanish with latin-1.
joaquin
@race1 Oh, I see! Interesting... I was fooled because my error was at position 2419, right after the same line the OP posted, but the OP's error is at 2364. So our answers agree only by coincidence, don't they?