【Problem】
When accessing the network with Python's urllib2 and similar libraries, I found that some URLs open very slowly, for example:
http://www.wheelbynet.com/docs/auto/view_ad2.php3?ad_ref=auto58XXKHTS7098
But when going through a proxy (GAE in this case), access became much faster.
So I wanted to add proxy support to Python's network access.
【Process】
1. References:
http://docs.python.org/2/library/urllib2.html
http://docs.python.org/2/library/urllib2.html#urllib2.ProxyHandler
urllib2.proxyhandler in python 2.5
Tried this code:
def initProxy(singleProxyDict = {}):
    """Add proxy support for later urllib2 auto use this proxy
    Note:
    1. tmp not support username and password
    2. after this init, later urllib2.urlopen will automatically use this proxy
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;
    proxyOpener = urllib2.build_opener(proxyHandler);
    print "proxyOpener=",proxyOpener;
    urllib2.install_opener(proxyOpener);
    urllib2.urlopen("http://www.baidu.com");
    return;
Then the GAE proxy could be seen handling the request:
INFO – [Jul 02 12:59:02] 127.0.0.1:52880 "GAE GET http://www.baidu.com HTTP/1.1" 200 10407
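As a side check (my addition, not from the original post), one can also confirm the proxy is in effect by fetching a service that echoes the client IP, such as httpbin.org, and comparing the output before and after installing the opener:

import urllib2

# install the proxy opener (proxy address is the GAE example from above),
# then ask an IP-echo service which address the server sees
proxyOpener = urllib2.build_opener(
    urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"}));
urllib2.install_opener(proxyOpener);
print urllib2.urlopen("http://httpbin.org/ip").read();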
【Summary】
The resulting function:
def initProxy(singleProxyDict = {}):
    """Add proxy support for later urllib2 auto use this proxy
    Note:
    1. tmp not support username and password
    2. after this init, later urllib2.urlopen will automatically use this proxy
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;
    proxyOpener = urllib2.build_opener(proxyHandler);
    print "proxyOpener=",proxyOpener;
    urllib2.install_opener(proxyOpener);
    return;
How to use it:
First, initialize:
crifanLib.initProxy({'http':"http://127.0.0.1:8087"});
Then use urllib2 normally:
All subsequent urllib2 network access will automatically go through this proxy. For example:
urllib2.urlopen("http://www.baidu.com");
That is all it takes.
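As a side note, install_opener changes global state; a minimal sketch (my addition, not part of crifanLib) that achieves the same thing without installing a global opener is to call open() on the opener that build_opener returns:

import urllib2

# use the opener directly instead of installing it globally
proxyHandler = urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"});
proxyOpener = urllib2.build_opener(proxyHandler);
resp = proxyOpener.open("http://www.baidu.com");
respHtml = resp.read();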
【Postscript】
1. Later a problem turned up:
the HTML obtained this way was all garbled.
At first the suspected cause was using cookies together with the proxy:
#init
crifanLib.initAutoHandleCookies();
#here use gae 127.0.0.1:8087
crifanLib.initProxy({'http':"http://127.0.0.1:8087"});
Checking the official documentation clarified one real issue: urllib2.install_opener installs an opener as the default global opener, so the second call above replaces the opener installed by the first, throwing the cookie handler away. The cookie handler and the proxy handler have to be combined into one opener, so the call was changed to:
crifanLib.initProxyAndCookie({'http':"http://127.0.0.1:8087"});
backed by this function:
def initProxyAndCookie(singleProxyDict = {}, localCookieFileName=None):
    """Init proxy and cookie
    Note:
    1. after this init, later urllib2.urlopen will auto use the proxy and auto handle cookies
    2. for proxy, tmp not support username and password
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=",proxyHandler;

    if(localCookieFileName):
        gVal['cookieUseFile'] = True;
        #print "use cookie file";

        #gVal['cj'] = cookielib.FileCookieJar(localCookieFileName); #NotImplementedError
        gVal['cj'] = cookielib.LWPCookieJar(localCookieFileName); # prefer use this
        #gVal['cj'] = cookielib.MozillaCookieJar(localCookieFileName); # second consideration

        #create cookie file
        gVal['cj'].save();
    else:
        #print "not use cookie file";
        gVal['cookieUseFile'] = False;
        gVal['cj'] = cookielib.CookieJar();

    proxyAndCookieOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(gVal['cj']), proxyHandler);
    print "proxyAndCookieOpener=",proxyAndCookieOpener;
    urllib2.install_opener(proxyAndCookieOpener);
    return;
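Distilled down to the standard library alone, the key point is to pass both handlers to a single build_opener call (a minimal sketch, independent of crifanLib):

import urllib2;
import cookielib;

# ONE opener holding both handlers; installing two separate openers
# would make the second replace the first
cj = cookielib.CookieJar();
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cj),
    urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"}));
urllib2.install_opener(opener);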
But even with the combined opener, the returned HTML was still garbled.
2. It looked like the real problem was decompressing the HTML. After some more digging and consulting references, the code could finally correctly handle HTML with
Content-Encoding: deflate
(previously only
Content-Encoding: gzip
was handled).
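The difference between the two comes down to the wbits argument of zlib.decompress; here is a minimal round-trip sketch (constructed for illustration, not from the original post):

import zlib;
import gzip;
import StringIO;

original = "<html>hello</html>";

# build a real gzip stream with the gzip module
buf = StringIO.StringIO();
gzFile = gzip.GzipFile(fileobj=buf, mode="wb");
gzFile.write(original);
gzFile.close();
gzipData = buf.getvalue();

# 16+zlib.MAX_WBITS tells zlib to expect (and skip) a gzip header
assert zlib.decompress(gzipData, 16 + zlib.MAX_WBITS) == original;

# raw deflate (no header), which servers often send for Content-Encoding: deflate;
# negative wbits means "no header, no checksum"
rawDeflate = zlib.compress(original)[2:-4]; # strip zlib header and Adler-32 checksum
assert zlib.decompress(rawDeflate, -zlib.MAX_WBITS) == original;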
3. The final conclusion:
The garbled HTML above was not caused by urllib2's install_opener at all; the real cause was that the returned HTML was compressed, i.e. gzip or deflate. It was finally handled correctly with the following code:
def getUrlRespHtml(url, postDict={}, headerDict={}, timeout=0, useGzip=True, postDataDelimiter="&"):
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter);
    respHtml = resp.read();
    if(useGzip):
        #print "---before unzip, len(respHtml)=",len(respHtml);
        respInfo = resp.info();

        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but the returned html is
        # actually uncompressed -> the response info then does not include the
        # above "Content-Encoding: gzip"
        # eg: http://blog.sina.com.cn/s/comment_730793bf010144j7_3.html
        # -> so only decode when the data really is compressed

        if("Content-Encoding" in respInfo):
            if("gzip" in respInfo['Content-Encoding']):
                respHtml = zlib.decompress(respHtml, 16+zlib.MAX_WBITS);
            if("deflate" in respInfo['Content-Encoding']):
                respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS);

    return respHtml;
which supports both gzip and deflate.
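One caveat worth adding (my note, not from the original post): some servers send zlib-wrapped data even though they declare Content-Encoding: deflate, so a more defensive variant tries the raw form first and falls back to the wrapped form:

import zlib;

def decompressDeflate(data):
    """Defensive sketch: try raw deflate first, then zlib-wrapped deflate"""
    try:
        return zlib.decompress(data, -zlib.MAX_WBITS);
    except zlib.error:
        return zlib.decompress(data);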
Note: for more about crifanLib.py, see:
http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py
Please credit the source when reposting: 在路上 » 【Solved】Using a Proxy to Access the Network in Python