【workaround】urllib.urlretrieve downloads images very slowly + 【Solved】Adding a user-agent to urllib.urlretrieve

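【Problem】

I was adding a feature to BlogstoWordpress that migrates QQ Space blogs to WordPress. It uses urllib.urlretrieve to download the QQ Space images, but the downloads turned out to be extremely slow.

By slow I mean that an image of a few dozen KB, which should normally finish in under a second, actually took several minutes to download.

Not just a bit slow, but exceptionally slow.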

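【Solution Process】

1. At first I suspected the network speed, but after checking, the network was fine.

Downloading images from other blogs, such as Baidu Space, was still fast.

2. I tried the address of one of the QQ Space images: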

http://s9.photo.store.qq.com/http_imgload.cgi?/rurl4_b=521ecb3f97727a4712812471567bebeeacf8cc423d3a56014196aaf26e799930f0dcea0c85662f530eb17e6ed5f2b0434c7a2543d6e5cfdd63024ac21d06367c454183a134219606a9224fc11e48f8248ec4b1f8

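Entered into a browser, the same image of a few dozen KB displayed instantly, so at least in the browser the download speed is normal.

3. My guess was that QQ Space checks the Referer value.

Then I came across this thread: [CPyUG:60237] Re: urllib获取文件速度很慢的问题 ("urllib is very slow at fetching files"), which mentions the User-Agent, a header related to the Referer: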

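The speed is bizarrely slow; a download takes almost a minute, yet downloading images from other sites is fine. That does seem to be the case: with {'User-agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)'} it got much faster, so perhaps Sina is checking this.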

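So I decided to try adding a User-Agent to urllib.urlretrieve (and, if that did not work, to try setting the Referer to the QQ Space address).

【Adding a user-agent to urllib.urlretrieve】

(1) Following the [CPyUG:60237] thread mentioned above, I tried to add a User-Agent to urllib.urlretrieve. After searching online for quite a while on how to do that, the only thing I found was this:

Urlretrieve and User-Agent? – Python

which gives this reference code: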

from urllib import FancyURLopener

opener = FancyURLopener({})
opener.version = 'Mozilla/5.0'
opener.retrieve('http://example.com', 'index.html')

Then I tried the following code:

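import urllib

opener = urllib.FancyURLopener({})
# the opener builds its default headers in __init__, so setting version afterwards likely comes too late
opener.version = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
opener.retrieve(fileUrl, fileToSave, reportHook)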

The problem remained: it was still very slow.

(2) Then I looked up the explanation in the Python manual:

urllib._urlopener
The public functions urlopen() and urlretrieve() create an instance of the FancyURLopener class and use it to perform their requested actions. To override this functionality, programmers can create a subclass of URLopener or FancyURLopener, then assign an instance of that class to the urllib._urlopener variable before calling the desired function. For example, applications may want to specify a different User-Agent header than URLopener defines. This can be accomplished with the following code:
import urllib

class AppURLopener(urllib.FancyURLopener):
    version = "App/1.7"

urllib._urlopener = AppURLopener()

Then I wrote the corresponding code:

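import urllib

class AppURLopener(urllib.FancyURLopener):
    # urllib sends this class attribute as the User-Agent header
    version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)"

def downloadFile(fileUrl, fileToSave):
    # urlretrieve() uses this instance instead of creating its own opener
    urllib._urlopener = AppURLopener()
    urllib.urlretrieve(fileUrl, fileToSave)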

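The result: it does appear that a User-Agent can be added to urllib.urlretrieve this way, but it had no effect at all on the problem at hand, the slow image downloads.

So it looks as though the User-Agent is not the problem.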
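For readers on Python 3, where urllib.urlretrieve has moved to urllib.request.urlretrieve, roughly the same effect can probably be had by installing a global opener with its own default headers. This is only a minimal sketch under that assumption, not the code used in the tests above; install_user_agent is a name made up for illustration.

```python
import urllib.request


def install_user_agent(user_agent):
    """Install a global opener whose requests carry the given User-Agent.

    Hypothetical helper: the function name is not part of urllib.
    """
    opener = urllib.request.build_opener()
    # addheaders holds the default headers the opener adds to each request;
    # replacing the list drops the default "Python-urllib/x.y" value
    opener.addheaders = [("User-Agent", user_agent)]
    # urlopen() and urlretrieve() both go through the installed opener
    urllib.request.install_opener(opener)
    return opener
```

Calling install_user_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT)") once, before any urllib.request.urlretrieve(fileUrl, fileToSave) call, would be the intended usage. Note that it changes global state for the whole process.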

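So I had to keep digging.

4. I found that the QQ Space image address I was trying to download sometimes gets redirected, so I first fetched the current real image address: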

resp = urllib2.urlopen(fileUrl)
realUrl = resp.geturl()

and then downloaded that realUrl. The problem remained.

5. Adding cookie support:

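import cookielib
import urllib2

# add cookie support to test whether it speeds up the picture download
# note: this opener is used by urllib2.urlopen(), not by urllib.urlretrieve()
gVal['cj'] = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(gVal['cj']))
urllib2.install_opener(opener)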

The problem remained; downloads were still very slow.
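As a rough Python 3 counterpart of the cookie setup (an assumption on my part, since everything here was tested on Python 2 only), cookielib becomes http.cookiejar and urllib2 becomes urllib.request. The helper name install_cookie_opener is hypothetical.

```python
import http.cookiejar
import urllib.request


def install_cookie_opener():
    """Install a global opener that stores and resends cookies.

    Hypothetical helper: returns the jar so callers can inspect the cookies.
    """
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replaces any opener installed earlier in the same process
    urllib.request.install_opener(opener)
    return cookie_jar, opener
```

In Python 3 both urlopen() and urlretrieve() go through the installed opener, so cookies set by one response would be sent with later requests from either function.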

6. Eventually, out of desperation, I called urllib2.urlopen manually to open the URL, got the response, read the image's binary data from it, and saved it to a local file:

import urllib2

resp = urllib2.urlopen(fileUrl)  # note: Python 2.6 has added timeout support.
realUrl = resp.geturl()  # final URL after any redirect

url = str(realUrl)
req = urllib2.Request(url)
#req.add_header('Referer', "http://user.qzone.qq.com/622007179/blog/")
#req.add_header('User-Agent', gConst['userAgentIE9'])
#req.add_header('Cache-Control', 'no-cache')
#req.add_header('Accept', '*/*')
#req.add_header('Accept-Encoding', 'gzip, deflate')
#req.add_header('Connection', 'Keep-Alive')
resp = urllib2.urlopen(req)

respHtml = resp.read()  # raw image bytes, despite the variable name

binfile = open(fileToSave, "wb")
binfile.write(respHtml)
binfile.close()

print "save pic OK"

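A few points to note about the code above:

(1) Testing showed that the Referer, the User-Agent, and even Cache-Control, all of which I had suspected, had no effect on the image download speed.

So the result for now is this: downloading the image through urllib.urlretrieve is, for some unknown reason, painfully slow, whereas opening the image address manually with urllib2.urlopen and reading the image data works at normal speed.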
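The manual workaround can be wrapped into a small function. The version below is a minimal sketch for Python 3 (an assumption; the code above is Python 2), and download_file together with its headers and timeout parameters are names chosen here for illustration. It streams the body to disk in chunks rather than reading it all at once.

```python
import shutil
import urllib.request


def download_file(url, dest_path, headers=None, timeout=30):
    """Fetch url with urlopen() and stream the body to dest_path.

    Returns the final URL, which differs from url if a redirect happened.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        with open(dest_path, "wb") as out_file:
            # Copy in chunks so large files are not held in memory at once
            shutil.copyfileobj(response, out_file)
        return response.geturl()
```

Since urlopen() follows redirects itself, the separate step of resolving realUrl first is not needed in this form; the returned value plays that role.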

【Summary】

Below is an analysis of possible causes of the slow QQ Space image downloads with urllib.urlretrieve:

The HTTP GET request as captured in Chrome:

GET /http_imgload.cgi?/rurl4_b=521ecb3f97727a4712812471567bebee7a4969da4a435c2afc49a751497b7d1b011ffa9ba753345b408a05762102c98074de25ae01872185682efd4b863d99e123e4109ea6725c199673f178dc530410eaee9b79 HTTP/1.1

And as captured with F12 in IE9:

Key	Value
Request	GET /http_imgload.cgi?/rurl4_b=521ecb3f97727a4712812471567bebeed32313b0721c3dd95eae6e8bf135143d68875a3b78a06b0c840db8d7eae6f59d05be2f71aa1b610c0e41aa934aab83fb003600cd6b42237fd24ed784c918f4311076a93f HTTP/1.1

As you can see, the HTTP protocol version is 1.1 in both cases, whereas for urllib.urlretrieve the Python 2.7 manual says:

20.5.4. urllib Restrictions
Currently, only the following protocols are supported: HTTP, (versions 0.9 and 1.0), FTP, and local files.

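That is, HTTP versions 0.9 and 1.0.

So I do not know whether the HTTP version is what causes this problem, i.e. whether the following holds:

For QQ Space images, i.e. those hosted on QQ's own servers:

downloading them in an ordinary browser over HTTP 1.1 works normally and is fast;

while downloading them from Python with urllib.urlretrieve, over HTTP 0.9 or 1.0, is very slow.

For now this is only a guess; the root cause is still unknown.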
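One way to probe this guess would be to send the same GET request over a raw socket, once as HTTP/1.0 and once as HTTP/1.1, and compare the timings. The following is only a rough sketch (Python 3, plain HTTP on port 80, no redirect handling); build_get_request and timed_fetch are hypothetical names, and the byte count includes the response headers.

```python
import socket
import time


def build_get_request(host, path, http_version="1.1"):
    """Build a minimal raw GET request for the given HTTP version."""
    lines = [
        "GET {} HTTP/{}".format(path, http_version),
        "Host: {}".format(host),
        "Connection: close",  # ask the server to close so recv() reaches EOF
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def timed_fetch(host, path, http_version, port=80, timeout=120):
    """Send the request; return (seconds elapsed, total bytes received)."""
    start = time.perf_counter()
    received = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path, http_version))
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # empty read means the server closed the connection
                break
            received += len(chunk)
    return time.perf_counter() - start, received
```

Running timed_fetch("s9.photo.store.qq.com", "/http_imgload.cgi?...", "1.0") and again with "1.1" (using a full image path in place of the elided one) would show whether the server really treats the two versions differently. A large gap would support the guess; similar timings would point elsewhere.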

Please credit the source when reposting: 在路上 » 【workaround】urllib.urlretrieve downloads images very slowly + 【Solved】Adding a user-agent to urllib.urlretrieve
