After installing Scrapy, I set out to follow the official tutorial and try it out.
1. Created a new project with:

scrapy startproject tutorial
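For reference, this is roughly the layout that `startproject` generates (a sketch from the 0.16-era template; the top-level folder name matches the project name you pass in):

```text
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```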
2. Following the tutorial's code, changed items.py to the values it gives.
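For completeness, this is the items.py the tutorial has you write (copied from the 0.16-era tutorial; it needs Scrapy installed to import):

```python
# tutorial/tutorial/items.py -- the three fields the spider fills in
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
```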
3. Created a new dmoz_spider.py containing the code given in the tutorial.
But then, annoyingly, the tutorial never explains where the dmoz in "dmoz/spiders" is located, or when that folder was supposed to have been created.
With no better option, I just had to experiment.
First I created a dmoz folder at the same level as scrapy.cfg and the tutorial folder, created a spiders folder under it, and put dmoz_spider.py inside.
Then I ran the crawl, and it failed:
E:\Dev_Root\python\Scrapy>cd tutorial

E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:47:27+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
  File "E:\dev_install_root\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "E:\dev_install_root\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
A frustrating tutorial; it clearly never explains the paths.
Later I found the answer by consulting:

scrapy newbie: tutorial. error when running scrapy crawl dmoz
Then I moved dmoz_spider.py into tutorial/tutorial/spiders, re-ran the crawl, and it worked:
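Why only that folder works: `scrapy crawl` discovers spiders through the SPIDER_MODULES setting that `startproject` writes into the project's settings.py, and that setting points at the `tutorial.spiders` package, so spiders placed anywhere else are invisible. A sketch of the relevant generated settings (from the 0.16-era template):

```python
# tutorial/tutorial/settings.py, as generated by `scrapy startproject tutorial`
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']   # where `scrapy crawl` looks for spiders
NEWSPIDER_MODULE = 'tutorial.spiders'   # where `scrapy genspider` puts new ones
```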
E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:51:40+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-11-11 19:51:40+0800 [dmoz] INFO: Spider opened
2012-11-11 19:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] INFO: Closing spider (finished)
2012-11-11 19:51:41+0800 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 530,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 13061,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 11, 11, 11, 51, 41, 506000),
         'log_count/DEBUG': 8,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2012, 11, 11, 11, 51, 40, 630000)}
2012-11-11 19:51:41+0800 [dmoz] INFO: Spider closed (finished)
On the documentation front, the Scrapy project still seems to fall well short.
Even this most basic tutorial explains the paths so poorly that it leaves people confused. Really sloppy...
4. After that, I continued following the code given in the tutorial and tested it; the final version of dmoz_spider.py was:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
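To see what parse() is doing without pulling in Scrapy, here is a stdlib-only sketch of the same idea: for every `<ul>/<li>`, grab the anchor text (a/text()), the href (a/@href), and the text trailing the anchor. It uses xml.etree on a tiny hand-made, well-formed fragment purely for illustration; real pages need an HTML-tolerant parser.

```python
# Stdlib sketch of the XPath extraction in parse() above.
import xml.etree.ElementTree as ET

FRAGMENT = """\
<div>
  <ul>
    <li><a href="/Computers/">Computers</a> - category</li>
    <li><a href="/Computers/Programming/">Programming</a></li>
  </ul>
</div>
"""

def parse(fragment):
    root = ET.fromstring(fragment)
    items = []
    for li in root.findall('.//ul/li'):       # like hxs.select('//ul/li')
        a = li.find('a')
        items.append({
            'title': a.text,                  # like site.select('a/text()')
            'link': a.get('href'),            # like site.select('a/@href')
            'desc': (a.tail or '').strip(),   # text after the anchor
        })
    return items

items = parse(FRAGMENT)
print(items)
```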
Ran it with:
scrapy crawl dmoz -o items.json -t json
and got the exported items.json:
[
  {"desc": ["\n "], "link": ["/"], "title": ["Top"]},
  {"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
  {"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
  {"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
  {"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
  {"desc": ["\n \t", "\u00a0", "\n "], "link": [], "title": []},
  {"desc": ["\n ", " \n ", "\n "], "link": ["/Computers/Programming/Languages/Python/Resources/"], "title": ["Computers: Programming: Languages: Python: Resources"]},
  ...
]
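Many of the exported rows are just empty navigation residue, so a tiny stdlib post-processing step can keep only the rows that actually carry a title and a link. The JSON sample below is a hand-copied subset of the output above:

```python
# Filter the exported items: drop rows whose title/link lists are empty.
import json

RAW = '''[
  {"desc": ["\\n "], "link": ["/"], "title": ["Top"]},
  {"desc": ["\\n \\t", "\\u00a0", "\\n "], "link": [], "title": []},
  {"desc": [], "link": ["/Computers/"], "title": ["Computers"]}
]'''

kept = [
    {"title": row["title"][0], "link": row["link"][0]}
    for row in json.loads(RAW)
    if row["title"] and row["link"]
]
print(kept)
```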
[Summary]
From a quick look at the links it provides, Scrapy does seem quite powerful.
What's left is to read through the rest when I have time; it covers almost everything and is worth tinkering with.
When reposting, please credit: 在路上 » [Record] Tinkering with the Scrapy Tutorial