[Notes] Scraping manta.com with Scrapy

[Problem]

I have a Scrapy project on hand that needs to scrape content from the following site and the pages related to it:

http://www.manta.com/

The existing source code is:

bs.py:

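import requests
from bs4 import BeautifulSoup

seed_url = "http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama"

r = requests.get(seed_url)
soup = BeautifulSoup(r.text, "html.parser")

# find_all returns every <a class="url"> link; find would return only the first one
urls = soup.find_all("a", "url")

for url in urls:
    href = url.get("href")

    # fetch and parse each linked page
    r2 = requests.get(href)
    soup2 = BeautifulSoup(r2.text, "html.parser")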

scrapy.cfg:

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# http://doc.scrapy.org/topics/scrapyd.html
 
[settings]
default = manta.settings
 
[deploy]
#url = http://localhost:6800/
project = manta

items.py:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
 
from scrapy.item import Item, Field
 
class MantaItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html
 
class MantaPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py:

# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
 
#BOT_NAME = 'manta'
 
SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'
 
BOT_NAME = 'EchO!/2.0'
 
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = True
COOKIES_DEBUG = True
RETRY_ENABLED = False
 
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
 
DEFAULT_REQUEST_HEADERS={
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'X-JAVASCRIPT-ENABLED': 'true',
}
 
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware':700,
}
 
 
 
 
COOKIES_DEBUG=True

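Clearly, the core of the project is the configuration in settings.py.

There are a few other files as well; the overall file layout is:

The response body returned so far, saved as an .html file and opened, reads: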

Oops.

Before you can move on, please activate your browser cookies.

Incident Id: 51880fa5aa300

That is, the HTML of the following page was not retrieved correctly:

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

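[Solution process]

1. First, refer to:

[Notes] Working through the Scrapy Tutorial

to figure out how to run the project.

2. It looked as though the settings.py configuration was wrong, so I first used the IE9 F12 developer tools to see how the site itself behaves:

I then tried changing settings.py in various ways, but nothing worked.

However, I noticed that with these settings: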

COOKIES_ENABLED = True

#COOKIES_ENABLED = False

COOKIES_DEBUG = True

the run produces:

2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Received cookies from: <200 http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama>

Set-Cookie: SPSI=e760b4733042a6a1291db3b406fe8bfb ; path=/; domain=.manta.com

2013-05-24 23:32:58+0800 [mantaspider] DEBUG: Crawled (200) <GET http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama> (referer: http://www.manta.com)

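That is, the first request to

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

returns only a single cookie: SPSI

Debugging showed that IE9 actually gets the same initial response, but because the returned content includes: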

</script>

IE9 executes the corresponding reload, which refreshes the page and reopens

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

after which it receives the normal page content.

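The debugging results match this logic.

The first request yields only the single cookie:

and the returned HTML contains the reload:

The second request, triggered by the refresh:

returns the real page HTML:

So the access logic here requires requesting

http://www.manta.com/mb_44_A0139_01/radio_television_and_publishers_advertising_representatives/alabama

twice, which seems hard to implement in Scrapy.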
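Outside Scrapy, the two-request behaviour described above can be approximated with a cookie-preserving session. The following is only a minimal sketch, assuming the second request succeeds once the SPSI cookie from the first response is sent back. The body of the reload script is not shown here, so if it also sets cookies from JavaScript this would not be enough. The names `fetch_with_cookie_retry`, `fetch_manta` and `CHALLENGE_MARKER` are made up for illustration; the marker text is taken from the error page quoted earlier.

```python
# Marker text copied from the "Oops." error page; compared in lower case.
CHALLENGE_MARKER = "please activate your browser cookies"


def fetch_with_cookie_retry(session, url, max_attempts=2):
    """Request url up to max_attempts times with the same session.

    The session keeps the cookies from the first response, so the
    second request carries them, much like the browser's reload does.
    Returns the last response received.
    """
    response = None
    for _ in range(max_attempts):
        response = session.get(url, timeout=15)
        if CHALLENGE_MARKER not in response.text.lower():
            break  # got something other than the cookie challenge page
    return response


def fetch_manta(url):
    """Convenience wrapper (hypothetical) that builds a requests.Session."""
    import requests  # third-party; imported lazily so the helper above stays dependency-free

    with requests.Session() as session:
        return fetch_with_cookie_retry(session, url)
```

Calling `fetch_manta(seed_url)` with the URL from bs.py would issue at most two requests and return the last response, whether or not the challenge page was passed, so the caller still needs to check the body.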

3. Following:

Scrapy mishandles the refresh directive in meta

I tried adding RedirectMiddleware:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware':700,
}

The result: the same error.

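4. According to:

Capturing http status codes with scrapy spider

it seems the page jump could be implemented with a custom redirect, but I do not yet know how to do that.

5. So far the code has been edited into a mess, as follows: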

# Scrapy settings for manta project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/topics/settings.html
#
 
#BOT_NAME = 'manta'
 
SPIDER_MODULES = ['manta.spiders']
NEWSPIDER_MODULE = 'manta.spiders'
 
BOT_NAME = 'EchO!/2.0'
 
DOWNLOAD_TIMEOUT = 15
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = True
#COOKIES_ENABLED = False
COOKIES_DEBUG = True
#RETRY_ENABLED = False
RETRY_ENABLED = True
 
REDIRECT_ENABLED = True
 
METAREFRESH_ENABLED = True
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'manta (+http://www.yourdomain.com)'
#USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
     
     
DEFAULT_REQUEST_HEADERS={
    #'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept': 'text/html, application/xhtml+xml, */*',
    #'Accept-Language': 'en',
    'Accept-Language': 'en-US',
    #'X-JAVASCRIPT-ENABLED': 'true',
    "Cache-Control":"no-cache",
    "Connection": "Keep-Alive",
    "UA-CPU":"AMD64",
    "Accept-Encoding":"gzip, deflate",
    "Referer":"http://www.manta.com",
         
}
 
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
    #'scrapy.contrib.downloadermiddleware.redirect.MetaRefreshMiddleware': 580,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': 600,
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware':700,
    'scrapy.contrib.downloadermiddleware.httpcache.HttpCacheMiddleware': 900,
}
 
 
#COOKIES_DEBUG=True

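It still does not work.

[Summary]

Scrapy is fairly complex. A URL whose returned JavaScript triggers a redirect can probably be handled with a middleware, but for now I do not know how to implement it.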
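One possible shape for such a middleware is sketched below. It is not a verified fix for manta.com: RedirectMiddleware only follows HTTP 3xx responses, while the log above shows a 200 response whose script triggers the reload, so this sketch instead re-issues the same request when the challenge page is detected. It assumes that resending the request with the stored SPSI cookie is sufficient, which cannot be confirmed from the output above. The class name, the meta key and the retry limit are hypothetical; the class uses the classic `process_response(request, response, spider)` hook and `Request.replace(dont_filter=True)` so the duplicate filter does not drop the repeated URL.

```python
class CookieChallengeRetryMiddleware:
    """Hypothetical downloader middleware: re-request a URL when the
    "activate your browser cookies" page comes back instead of content."""

    # Marker text copied from the error page; compared in lower case.
    CHALLENGE_MARKER = b"please activate your browser cookies"
    MAX_RETRIES = 2  # assumed limit, to avoid looping forever
    META_KEY = "cookie_challenge_retries"  # hypothetical meta key

    def process_response(self, request, response, spider):
        # Pass normal pages straight through.
        if self.CHALLENGE_MARKER not in response.body.lower():
            return response

        retries = request.meta.get(self.META_KEY, 0)
        if retries >= self.MAX_RETRIES:
            return response  # give up and let the spider see the page

        # Returning a Request makes Scrapy schedule it again; dont_filter
        # is needed because the URL is identical to the one just fetched.
        retry_request = request.replace(dont_filter=True)
        retry_request.meta[self.META_KEY] = retries + 1
        return retry_request
```

To try it, the class would be registered in DOWNLOADER_MIDDLEWARES under whatever module path it lives in (for example a hypothetical manta.middlewares) with an order below the 700 used for CookiesMiddleware, such as 650. Response hooks run from the highest order to the lowest, so CookiesMiddleware stores the SPSI cookie before this middleware asks for the page again.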

When reposting, please credit: 在路上 » [Notes] Scraping manta.com with Scrapy
