After installing Scrapy, the next step was to try out the official tutorial.
1. Created a new project with:

scrapy startproject tutorial
2. Following its example code, changed items.py to the values the tutorial gives (see the sketch after this list).
3. Created dmoz_spider.py and put in the code given in the tutorial.
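For step 2, the item definition from the 0.16-era tutorial looked roughly like this. This is a sketch reconstructed from that tutorial rather than a copy of my actual file, but the field names match the spider code shown further below:

```python
# tutorial/items.py -- sketch of the tutorial's item definition
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()  # link text of each directory entry
    link = Field()   # href of each directory entry
    desc = Field()   # surrounding description text
```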
But then, annoyingly, the tutorial never explains where the dmoz in "dmoz/spiders" is located, or when that folder is supposed to have been created.
With no better option, I experimented on my own.
First I created a dmoz folder at the same level as scrapy.cfg and the tutorial folder, created a spiders folder under it, and put dmoz_spider.py inside.
Then I ran the crawl, and it failed:
```
E:\Dev_Root\python\Scrapy>cd tutorial

E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:47:27+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:47:28+0800 [scrapy] DEBUG: Enabled item pipelines:
Traceback (most recent call last):
  File "E:\dev_install_root\Python27\lib\runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "E:\dev_install_root\Python27\lib\runpy.py", line 72, in _run_code
    exec code in run_globals
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 156, in <module>
    execute()
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 76, in _run_print_help
    func(*a, **kw)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\commands\crawl.py", line 43, in run
    spider = self.crawler.spiders.create(spname, **opts.spargs)
  File "E:\dev_install_root\Python27\lib\site-packages\scrapy\spidermanager.py", line 43, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
```
What a frustrating tutorial; it clearly never explains the paths properly.
Later, with the help of this reference:

scrapy newbie: tutorial. error when running scrapy crawl dmoz

I moved dmoz_spider.py into tutorial/tutorial/spiders and re-ran the crawl, and this time it worked:
```
E:\Dev_Root\python\Scrapy\tutorial>scrapy crawl dmoz
2012-11-11 19:51:40+0800 [scrapy] INFO: Scrapy 0.16.2 started (bot: tutorial)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-11-11 19:51:40+0800 [dmoz] INFO: Spider opened
2012-11-11 19:51:40+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-11-11 19:51:40+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-11-11 19:51:41+0800 [dmoz] INFO: Closing spider (finished)
2012-11-11 19:51:41+0800 [dmoz] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 530,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 13061,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 11, 11, 11, 51, 41, 506000),
         'log_count/DEBUG': 8,
         'log_count/INFO': 4,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2012, 11, 11, 11, 51, 40, 630000)}
2012-11-11 19:51:41+0800 [dmoz] INFO: Spider closed (finished)
```
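In hindsight the reason is clear: scrapy crawl does not search arbitrary folders; it imports spiders from the Python modules listed in the SPIDER_MODULES setting. The settings.py generated by startproject (shown here as a sketch of the 0.16-era output) points at tutorial.spiders, which is why the spider file has to live in tutorial/tutorial/spiders:

```python
# tutorial/settings.py -- sketch of what "scrapy startproject tutorial" generates
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']   # where "scrapy crawl <name>" looks for spiders
NEWSPIDER_MODULE = 'tutorial.spiders'   # where "scrapy genspider" puts new spiders
```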
Documentation-wise, the Scrapy project still seems to fall quite short.
Even this most basic tutorial leaves the paths so unclear that it confuses newcomers. Really disappointing...
4. After that, I kept following the code given in the tutorial and tested it. The final version of dmoz_spider.py was:
```python
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    # the two tutorial pages, as seen in the crawl log above
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')  # each directory entry is an <li>
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
```
Ran it with:

scrapy crawl dmoz -o items.json -t json

which produced this items.json:
```
[{"desc": ["\n "], "link": ["/"], "title": ["Top"]},
{"desc": [], "link": ["/Computers/"], "title": ["Computers"]},
{"desc": [], "link": ["/Computers/Programming/"], "title": ["Programming"]},
{"desc": [], "link": ["/Computers/Programming/Languages/"], "title": ["Languages"]},
{"desc": [], "link": ["/Computers/Programming/Languages/Python/"], "title": ["Python"]},
{"desc": ["\n \t", "\u00a0", "\n "], "link": [], "title": []},
{"desc": ["\n ", " \n ", "\n "], "link": ["/Computers/Programming/Languages/Python/Resources/"], "title": ["Computers: Programming: Languages: Python: Resources"]},
...
]
```
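The -o items.json -t json flags use Scrapy's feed export to serialize the returned items as a JSON array, so the file can be consumed like any other JSON. A minimal sketch of reading it back (file name as above, key names as in the output; note every field is a list because XPath .extract() returns a list):

```python
import json

# load the exported feed
with open('items.json') as f:
    items = json.load(f)

for entry in items:
    # take the first element of each list, if present
    title = entry['title'][0] if entry['title'] else ''
    link = entry['link'][0] if entry['link'] else ''
    print("%s -> %s" % (title, link))
```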
[Summary]
From a quick look at the links it provides, Scrapy does seem to be quite powerful.
What's left is to go through the rest of the documentation when I have time; it covers nearly everything and is worth tinkering with.