[Problem]
I have a Scrapy project whose job is to crawl content from manta.com and related pages.
The existing source code:
bs.py:
import requests
from bs4 import BeautifulSoup

seed_url = "http://www.manta.com"  # starting page; the original snippet left this undefined

r = requests.get(seed_url)
soup = BeautifulSoup(r.text)
urls = soup.find_all("a", "url")  # original used find() plus a urlls/urls typo; find_all returns the iterable
for url in urls:
    href = url.get("href")
    r2 = requests.get(href)
    soup2 = BeautifulSoup(r2.text)
scrapy.cfg:
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:

[settings]
default = manta.settings

[deploy]
#url = http://localhost:6800/
project = manta
items.py:
# Define here the models for your scraped items
#
# See documentation in:

from scrapy.item import Item, Field

class MantaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass
pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting

class MantaPipeline(object):
    def process_item(self, item, spider):
        return item
settings.py:
# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#

#BOT_NAME = 'manta'
SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'

BOT_NAME = 'EchO!/2.0'
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = True
COOKIES_DEBUG = True
RETRY_ENABLED = False

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'X-JAVASCRIPT-ENABLED': 'true',
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
}

COOKIES_DEBUG = True
Clearly, the part that matters here is the configuration in settings.py.
There are a few other files as well; the overall project layout is:
Correspondingly, when the response body currently being returned is saved as an HTML file and opened, its content is:

Oops. Before you can move on, please activate your browser cookies. Incident Id: 51880fa5aa300
That is, the normal HTML content of the target page was not retrieved.
[Resolution Process]
1. First I had to consult a reference to figure out how to run the spider.
2. It looked like the settings.py configuration was wrong, so I first used IE9's F12 developer tools to debug the site's own page logic:
Then I tried changing settings.py, but none of the changes worked.
However, note that with the corresponding settings:

COOKIES_ENABLED = True
COOKIES_DEBUG = True
the run produced:

2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Received cookies from: <200 http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama>
        Set-Cookie: SPSI=e760b4733042a6a1291db3b406fe8bfb; path=/; domain=.manta.com
2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Crawled (200) <GET http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama> (referer: http://www.manta.com)
That is, the initial visit to the page returned only a single cookie: SPSI.
Further debugging showed that IE9 receives exactly the same first response; the difference is that the returned content contains:

<script type="text/javascript"> </script>
so IE9 executes the corresponding reload script, which refreshes the page and requests the same URL again.
The second request then returns the normal page content shown above.
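This browser behavior (the first GET sets the SPSI cookie, the JS reload re-requests the same URL carrying that cookie) can be imitated outside the browser by keeping a cookie-aware session and simply fetching the page twice. A minimal sketch with requests.Session; the challenge-page marker text is taken from the response body shown earlier, and the optional session parameter is only there so the function can be exercised without network access:

```python
# Sketch: imitate the IE9 reload by requesting the same URL twice through
# one cookie-keeping session. Assumption: the anti-bot stub page can be
# recognized by the "activate your browser cookies" message seen above.

CHALLENGE_TEXT = "activate your browser cookies"

def get_real_page(url, session=None):
    """Fetch url; if the first response is the cookie-setting stub page,
    fetch it again (the session now carries the SPSI cookie)."""
    if session is None:
        import requests  # third-party; pip install requests
        session = requests.Session()
    resp = session.get(url)
    if CHALLENGE_TEXT in resp.text:
        resp = session.get(url)  # second GET resends the stored cookie
    return resp.text
```

This only works if the site's check really is "did the client store and resend the cookie"; if the reload script computes something more, this sketch would not be enough.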
The debugging results match this logic exactly:
On the first request, only the single cookie is obtained, and the returned HTML contains the reload script.
On the second request, triggered by the refresh, the real page's HTML is obtained.
As for this access pattern, where the same URL has to be requested twice before the real content comes back, it seemed hard to implement in Scrapy.
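That said, the two-pass logic is at least expressible in a Scrapy spider: a callback can recognize the stub page and yield the same URL again with dont_filter=True so the duplicate filter does not drop it. The decision itself is small enough to factor out; a sketch, where the function name and the 'retry'/'parse' labels are hypothetical, and in a real spider 'retry' would become Request(url, dont_filter=True, callback=self.parse):

```python
# Sketch of the per-response decision a spider callback would make.
CHALLENGE_TEXT = "activate your browser cookies"

def next_action(body, url, already_retried=False):
    """Decide what a spider callback should do with a response body:
    ('retry', url)  -> re-request the same URL (first, cookie-only pass)
    ('parse', body) -> the real HTML arrived, go on to extract items.
    already_retried guards against looping forever on a persistent stub page."""
    if CHALLENGE_TEXT in body and not already_retried:
        return ("retry", url)
    return ("parse", body)
```

The dont_filter=True flag matters because, without it, Scrapy's built-in duplicate filter would silently discard the second request for an already-seen URL.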
3. Following a reference, I tried adding the RedirectMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
}
The result: the same error as before. In hindsight this makes sense, since the reload is performed by JavaScript, not by an HTTP 3xx redirect that RedirectMiddleware could follow.
4. Referring to:
Capturing http status codes with scrapy spider
it seems the page jump could be implemented with a custom redirect middleware, but at this point I do not yet know how.
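One way such a custom middleware could work: a downloader middleware's process_response may return a Request instead of the Response, and Scrapy will then schedule that request for download. The sketch below shows the idea; the stand-in StubRequest/StubResponse classes are NOT Scrapy classes, they only let the flow run outside Scrapy. In a real project the middleware would be registered in DOWNLOADER_MIDDLEWARES and the attribute names it uses (response.body, request.meta, request.replace) match Scrapy's Request/Response API:

```python
CHALLENGE_TEXT = "activate your browser cookies"

class CookieChallengeRetryMiddleware(object):
    """Sketch of a downloader middleware: when the cookie-challenge stub
    page comes back, re-issue the same request once. By then the cookies
    middleware has stored SPSI, so the retried request carries it."""

    def process_response(self, request, response, spider):
        body = response.body
        if isinstance(body, bytes):
            body = body.decode("utf-8", "ignore")
        if CHALLENGE_TEXT in body and not request.meta.get("challenge_retried"):
            retry = request.replace(dont_filter=True)  # bypass the dupe filter
            retry.meta["challenge_retried"] = True     # retry at most once
            return retry  # returning a Request makes Scrapy download it again
        return response

# Minimal stand-ins (NOT Scrapy classes) just to demonstrate the flow:
class StubRequest(object):
    def __init__(self, url, meta=None, dont_filter=False):
        self.url = url
        self.meta = dict(meta or {})
        self.dont_filter = dont_filter
    def replace(self, dont_filter=False):
        return StubRequest(self.url, self.meta, dont_filter)

class StubResponse(object):
    def __init__(self, body):
        self.body = body
```

The challenge_retried flag in request.meta travels with the retried request, so a site that keeps serving the stub page cannot trap the middleware in an infinite retry loop.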
5. By now the code has been hacked into a mess; it currently looks like this:

# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#

#BOT_NAME = 'manta'
SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'

BOT_NAME = 'EchO!/2.0'
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2

COOKIES_ENABLED = True
#COOKIES_ENABLED = False
COOKIES_DEBUG = True

#RETRY_ENABLED = False
RETRY_ENABLED = True
REDIRECT_ENABLED = True
METAREFRESH_ENABLED = True

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
#USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)"

DEFAULT_REQUEST_HEADERS = {
    #'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept': 'text/html, application/xhtml+xml, */*',
    #'Accept-Language': 'en',
    'Accept-Language': 'en-US',
    #'X-JAVASCRIPT-ENABLED': 'true',
    "Cache-Control": "no-cache",
    "Connection": "Keep-Alive",
    "UA-CPU": "AMD64",
    "Accept-Encoding": "gzip, deflate",
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    #'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}

#COOKIES_DEBUG=True
It still did not work.
[Summary]
Scrapy is complex enough that the case of a URL whose response uses JavaScript to perform a redirect can probably still be handled with a middleware; I just do not yet know how to implement it.
Please credit the source when reposting: 在路上 » [Record] Crawling manta.com with Scrapy