【Problem】
When using Python libraries such as urllib2 to access the network, fetching some URLs is very slow, for example:
http://www.wheelbynet.com/docs/auto/view_ad2.php3?ad_ref=auto58XXKHTS7098
However, when going through a proxy (GAE here), access was much faster.
So the goal is to add proxy support to Python's network access.
【Process】
1. References:
http://docs.python.org/2/library/urllib2.html
http://docs.python.org/2/library/urllib2.html#urllib2.ProxyHandler
urllib2.proxyhandler in python 2.5
Tried this code:

import urllib2

def initProxy(singleProxyDict = {}):
    """Add proxy support, so that later urllib2 use automatically goes through this proxy.

    Note:
    1. username/password is not supported yet
    2. after this init, later urllib2.urlopen will automatically use this proxy
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=", proxyHandler;
    proxyOpener = urllib2.build_opener(proxyHandler);
    print "proxyOpener=", proxyOpener;
    urllib2.install_opener(proxyOpener);
    return;
Then the corresponding GAE proxy can be seen handling the request:
INFO - [Jul 02 12:59:02] 127.0.0.1:52880 "GAE GET http://www.baidu.com HTTP/1.1" 200 10407
【Summary】
Use the initProxy function listed above.
How to call it:
First, initialize the proxy:
crifanLib.initProxy({'http': "http://127.0.0.1:8087"});
Then use it normally:
any subsequent urllib2 network access already goes through this proxy. For example:
respHtml = urllib2.urlopen("http://www.baidu.com").read();
That is all it takes.
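For completeness, the same idea as a minimal standalone sketch (no crifanLib), assuming a local GAE/GoAgent proxy listening on 127.0.0.1:8087; swap in your own proxy address as needed:

import urllib2

# assumption: a local HTTP proxy (e.g. GoAgent's GAE proxy) at 127.0.0.1:8087
proxyHandler = urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"})
proxyOpener = urllib2.build_opener(proxyHandler)
urllib2.install_opener(proxyOpener)

# from now on, plain urllib2 calls go through the proxy
respHtml = urllib2.urlopen("http://www.baidu.com").read()
print "len(respHtml)=", len(respHtml)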
【Postscript】
1. Later, a problem turned up here:
the returned HTML was all garbled.
The first suspicion was the combination of cookies and a proxy:
#init
crifanLib.initAutoHandleCookies();

#here use gae 127.0.0.1:8087
crifanLib.initProxy({'http': "http://127.0.0.1:8087"});
Then, from the official documentation's explanation of opener installation: urllib2.install_opener() installs an opener as the default global opener, so installing a second opener replaces the first one. To get cookie handling and a proxy at the same time, both handlers have to be chained into a single opener built by build_opener().
So the code was changed to:

crifanLib.initProxyAndCookie({'http': "http://127.0.0.1:8087"});
together with:
import urllib2
import cookielib

# gVal is crifanLib's module-level state dict, e.g. gVal = {'cj': None, 'cookieUseFile': False}

def initProxyAndCookie(singleProxyDict = {}, localCookieFileName = None):
    """Init proxy and cookie support in a single opener.

    Note:
    1. after this init, later urllib2.urlopen will automatically use the proxy and handle cookies
    2. for the proxy, username/password is not supported yet
    """
    proxyHandler = urllib2.ProxyHandler(singleProxyDict);
    print "proxyHandler=", proxyHandler;

    if(localCookieFileName):
        gVal['cookieUseFile'] = True;
        #print "use cookie file";

        #gVal['cj'] = cookielib.FileCookieJar(localCookieFileName); #NotImplementedError
        gVal['cj'] = cookielib.LWPCookieJar(localCookieFileName); # preferred
        #gVal['cj'] = cookielib.MozillaCookieJar(localCookieFileName); # second consideration

        # create the cookie file
        gVal['cj'].save();
    else:
        #print "not use cookie file";
        gVal['cookieUseFile'] = False;
        gVal['cj'] = cookielib.CookieJar();

    proxyAndCookieOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(gVal['cj']), proxyHandler);
    print "proxyAndCookieOpener=", proxyAndCookieOpener;
    urllib2.install_opener(proxyAndCookieOpener);
    return;
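The key point is that urllib2.install_opener() only keeps the most recently installed opener, so the cookie processor and the proxy handler must be chained into one build_opener() call. A minimal sketch of the wrong and right patterns (proxy address reused from the GAE example above):

import urllib2
import cookielib

cookieJar = cookielib.CookieJar()

# WRONG: the second install_opener replaces the first one,
# so cookie handling would be silently lost:
#urllib2.install_opener(urllib2.build_opener(urllib2.HTTPCookieProcessor(cookieJar)))
#urllib2.install_opener(urllib2.build_opener(urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"})))

# RIGHT: chain both handlers into a single opener:
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cookieJar),
    urllib2.ProxyHandler({'http': "http://127.0.0.1:8087"}))
urllib2.install_opener(opener)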
The returned HTML was still garbled, though.
2. It then looked like the decompression of the HTML was the problem.
After a fair amount of tinkering and consulting other references, wrote code that could finally handle HTML of type:
Content-Encoding: deflate
(previously, only HTML of type:
Content-Encoding: gzip
could be handled).
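The difference between the two cases is just zlib's wbits parameter; here is a small self-contained round-trip sketch of the two decompress calls used in the code below:

import zlib

original = "<html>some page content</html>" * 10

# gzip stream (RFC 1952): wbits = 16 + MAX_WBITS tells zlib to expect a gzip header
compressor = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
gzipped = compressor.compress(original) + compressor.flush()
assert zlib.decompress(gzipped, 16 + zlib.MAX_WBITS) == original

# raw deflate stream (RFC 1951): negative wbits means no zlib header at all
compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
deflated = compressor.compress(original) + compressor.flush()
assert zlib.decompress(deflated, -zlib.MAX_WBITS) == original

# caveat: some servers send "Content-Encoding: deflate" as a zlib-wrapped
# stream (RFC 1950), which plain zlib.decompress(data) would handle instead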
3. The final conclusion:
the garbled HTML above was not caused by urllib2's install_opener and related calls, but by the returned HTML being compressed, i.e. gzip- or deflate-encoded. It was finally handled by the following code:
import zlib

def getUrlRespHtml(url, postDict = {}, headerDict = {}, timeout = 0, useGzip = True, postDataDelimiter = "&"):
    # getUrlResponse is another crifanLib helper that wraps urllib2.urlopen
    resp = getUrlResponse(url, postDict, headerDict, timeout, useGzip, postDataDelimiter);
    respHtml = resp.read();
    if(useGzip):
        #print "---before unzip, len(respHtml)=",len(respHtml);
        respInfo = resp.info();

        # typical response headers:
        # Server: nginx/1.0.8
        # Date: Sun, 08 Apr 2012 12:30:35 GMT
        # Content-Type: text/html
        # Transfer-Encoding: chunked
        # Connection: close
        # Vary: Accept-Encoding
        # ...
        # Content-Encoding: gzip

        # sometimes the request asks for gzip,deflate but the returned html is not compressed,
        # in which case the response info contains no "Content-Encoding" header
        # -> so only decode when the data really is compressed
        if("Content-Encoding" in respInfo):
            if("gzip" in respInfo['Content-Encoding']):
                # gzip stream: skip the gzip header via wbits = 16 + MAX_WBITS
                respHtml = zlib.decompress(respHtml, 16 + zlib.MAX_WBITS);
            if("deflate" in respInfo['Content-Encoding']):
                # raw deflate stream: negative wbits means no zlib header
                respHtml = zlib.decompress(respHtml, -zlib.MAX_WBITS);
    return respHtml;
which supports both gzip and deflate.
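Usage is then a one-liner. A hypothetical example, assuming crifanLib (including its getUrlResponse helper) is importable and the GAE proxy from above is running:

import crifanLib

crifanLib.initProxy({'http': "http://127.0.0.1:8087"});
respHtml = crifanLib.getUrlRespHtml("http://www.baidu.com");
print "len(respHtml)=", len(respHtml);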
Note: for more about crifanLib.py, see:
http://code.google.com/p/crifanlib/source/browse/trunk/python/crifanLib.py