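While working on:
[Solved] How to load more content from Autohome (汽车之家) pages in pyspider
I ran into part of a page's content not being displayed, apparently because of JS. At the time I simply worked around it.
Later, however, I hit a similar problem again:
in pyspider, the loaded page was missing part of its content,
while the same page opened directly in a browser displayed everything.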
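Searched for: pyspider 部分内容不显示 (pyspider part of content not showing)
Searched for: pyspider part html not show
python – Cannot preview pages in the web UI under pyspider – SegmentFault 思否
Level 2: AJAX and More HTTP – pyspider
Frequently Asked Questions – pyspider
None of these had what I needed.
Searched for: pyspider load js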
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')
The default is:
js_run_at = document-end
But this approach means writing the JS into the crawler code yourself.
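The options above can be collected in one place. This is a minimal sketch, assuming a hypothetical helper `js_crawl_kwargs` (my own name, not part of pyspider) that only builds keyword arguments for `self.crawl`; the key names `fetch_type`, `js_script` and `js_run_at` are the ones from the pyspider docs quoted above.

```python
def js_crawl_kwargs(js_script=None, js_run_at="document-end"):
    """Build keyword arguments for self.crawl() that request JS rendering.

    Hypothetical helper for illustration; only the key names come from pyspider.
    """
    if js_run_at not in ("document-start", "document-end"):
        raise ValueError("js_run_at must be 'document-start' or 'document-end'")
    kwargs = {"fetch_type": "js"}
    if js_script is not None:
        # js_script is optional in this sketch; js_run_at only matters with it
        kwargs["js_script"] = js_script
        kwargs["js_run_at"] = js_run_at
    return kwargs


# The scroll-to-bottom script from the docs example
SCROLL_TO_BOTTOM = """
function() {
    window.scrollTo(0, document.body.scrollHeight);
    return 123;
}
"""

# Inside a pyspider handler this would be used roughly as:
#   self.crawl(url, callback=self.callback, **js_crawl_kwargs(SCROLL_TO_BOTTOM))
print(sorted(js_crawl_kwargs(SCROLL_TO_BOTTOM)))
```

Keeping the arguments in a plain dict makes it easy to switch JS rendering on only for the pages that need it.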
Level 3: Render with PhantomJS – pyspider
Then I gave it a try:
➜ AutocarData pyspider phantomjs
phantomjs fetcher running on port 25555
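Then I added the JS to the code.
As a result, the webui could no longer be opened.
Clearly, when started as:
pyspider phantomjs
it did not go on to print the kind of output I had seen before: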
➜ AutocarData pyspider
phantomjs fetcher running on port 25555
[I 180503 21:20:04 result_worker:49] result_worker starting...
[I 180503 21:20:05 processor:211] processor starting...
[I 180503 21:20:05 tornado_fetcher:638] fetcher starting...
[I 180503 21:20:05 scheduler:647] scheduler starting...
[I 180503 21:20:05 scheduler:126] project autohomeBrandData updated, status:TODO, paused:False, 0 tasks
[I 180503 21:20:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:20:05 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180503 21:20:05 app:76] webui running on 0.0.0.0:5000
[I 180503 21:20:11 tornado_fetcher:188] [200] autohomeBrandData:data:,on_start data:,on_start 0s
[I 180503 21:20:14 tornado_fetcher:419] [200] autohomeBrandData:a5a0d52b797d5c51e5edbadd91f4fed9 https://www.autohome.com.cn/grade/carhtml/b.html 0.40s
[I 180503 21:20:20 tornado_fetcher:419] [200] autohomeBrandData:e33fb0c1b59473c3b497397c1651fcf9 https://car.autohome.com.cn/pic/series/3248.html#pvareaid=103448 0.09s
[I 180503 21:20:27 tornado_fetcher:419] [200] autohomeBrandData:e01cd2a8d1c898b71b9de564a8a90ad9 https://www.autohome.com.cn/spec/33986/#pvareaid=2042128 0.05s
[I 180503 21:21:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:22:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:23:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
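That is why the page could not be opened: with pyspider phantomjs only the phantomjs fetcher starts, and the "webui running on 0.0.0.0:5000" line never appears.
pyspider crawler tutorial (3): rendering pages with JS using PhantomJS | Binuxの杂货铺
About phantomjs can’t automatically load javascript web page. · Issue #718 · binux/pyspider
Problems running JS in pyspider, i.e. the js_script problem in self.crawl? – answer by djytwy – SegmentFault 思否
Pyspider with Selenium+Chrome to crawl JS dynamic pages – 简书
Searched for: pyspider phantomjs not work
pyspider with phantomjs: debugging in the webui works, but a real run does not execute, only the first index_page gets called. – answer by TuChief – SegmentFault 思否
I had a look at the help: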
➜ AutocarData pyspider --help
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME           a json file with default values for
                                  subcommands. {"webui": {"port":5001}}
  --logging-config TEXT           logging config file for built-in python
                                  logging module [default: /Users/crifan/.loc
                                  al/share/virtualenvs/AutocarData-xI-
                                  iqIq4/lib/python3.6/site-
                                  packages/pyspider/logging.conf]
  --debug                         debug mode
  --queue-maxsize INTEGER         maxsize of queue
  --taskdb TEXT                   database url for taskdb, default: sqlite
  --projectdb TEXT                database url for projectdb, default: sqlite
  --resultdb TEXT                 database url for resultdb, default: sqlite
  --message-queue TEXT            connection url to message queue, default:
                                  builtin multiprocessing.Queue
  --amqp-url TEXT                 [deprecated] amqp url for rabbitmq. please
                                  use --message-queue instead.
  --beanstalk TEXT                [deprecated] beanstalk config for beanstalk
                                  queue. please use --message-queue instead.
  --phantomjs-proxy TEXT          phantomjs proxy ip:port
  --data-path TEXT                data dir path
  --add-sys-path / --not-add-sys-path
                                  add current working directory to python lib
                                  search path
  --version                       Show the version and exit.
  --help                          Show this message and exit.

Commands:
  all            Run all the components in subprocess or...
  bench          Run Benchmark test.
  fetcher        Run Fetcher.
  one            One mode not only means all-in-one, it runs...
  phantomjs      Run phantomjs fetcher if phantomjs is...
  processor      Run Processor.
  result_worker  Run result worker.
  scheduler      Run Scheduler, only one scheduler is allowed.
  send_message   Send Message to project from command line
  webui          Run WebUI
Then I tried all mode:
➜ AutocarData pyspider all
phantomjs fetcher running on port 25555
[I 180503 21:48:32 result_worker:49] result_worker starting...
[I 180503 21:48:33 processor:211] processor starting...
[I 180503 21:48:33 tornado_fetcher:638] fetcher starting...
[I 180503 21:48:33 scheduler:647] scheduler starting...
[I 180503 21:48:33 scheduler:126] project autohomeBrandData updated, status:TODO, paused:False, 0 tasks
[I 180503 21:48:33 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:48:33 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180503 21:48:33 app:76] webui running on 0.0.0.0:5000
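Next, check whether phantomjs actually works in this mode.
The webui page did come up.
What I did not know yet was whether the JS would take effect, so I went on to test it.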
no web interface · Issue #379 · binux/pyspider
“if you have phantomjs installed, in all mode, it should enable phantomjs automatically. If you are not using all mode, yes.”
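I remember reading the same thing elsewhere:
in all mode, phantomjs is enabled automatically if it is installed, and I do have it installed here.
pyspider sample code 1: using phantomjs to solve the JS problem – microman – 博客园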
web crawler – Fail to scrape images with pyspider and phantomjs – Stack Overflow
So I added fetch_type, as documented at:
http://docs.pyspider.org/en/latest/apis/self.crawl/#fetch_type
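I tried it, and it did take effect:
the page in the webui now shows the part I wanted: the dealer guide price (经销商指导价).
(The page looks a bit messy, probably a layout issue, but the JS was executed, so the content is there.)
I also noticed during debugging that loading this page is now much slower than before:
without JS, loading the page takes under 1 second;
with phantomjs executing the JS so that the full content is rendered, loading takes 3 to 4 seconds.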
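To compare fetch times with and without JS rendering, the elapsed time can be read off the fetcher log. This is a rough sketch, assuming the log line layout seen in the pyspider output earlier in this post (status in brackets, then project:taskid, URL, and seconds); the regex and the function name `parse_fetch_line` are my own and not part of pyspider.

```python
import re

# Matches fetcher lines such as:
# [I 180503 21:20:14 tornado_fetcher:419] [200] project:taskid https://... 0.40s
FETCH_LINE = re.compile(
    r"tornado_fetcher:\d+\]\s+\[(?P<status>\d+)\]\s+"
    r"(?P<task>\S+)\s+(?P<url>\S+)\s+(?P<seconds>\d+(?:\.\d+)?)s\s*$"
)


def parse_fetch_line(line):
    """Return (status, url, seconds) for a fetcher log line, or None."""
    match = FETCH_LINE.search(line)
    if match is None:
        return None
    return int(match.group("status")), match.group("url"), float(match.group("seconds"))


# Sample line copied from the log output earlier in this post
sample = (
    "[I 180503 21:20:27 tornado_fetcher:419] [200] "
    "autohomeBrandData:e01cd2a8d1c898b71b9de564a8a90ad9 "
    "https://www.autohome.com.cn/spec/33986/#pvareaid=2042128 0.05s"
)
print(parse_fetch_line(sample))
```

Lines that do not come from the fetcher (scheduler, webui and so on) simply return None, so the function can be mapped over a whole log file.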
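[Summary]
To have pyspider load and display the parts of a page that are generated by JS, you need to:
1. Make sure phantomjs is installed.
I had already installed it earlier: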
➜ ~ phantomjs --version
2.1.1
➜ ~ which phantomjs
/usr/local/bin/phantomjs
2. Run pyspider in all mode:
pyspider all
3. In the code, add fetch_type='js':
self.crawl(eachModelDetailDict["url"], callback=self.carModelSpecPage, fetch_type='js', save=curSerieDict)

@catch_status_code_error
def carModelSpecPage(self, response):
    print("carModelSpecPage: response=", response)
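Step 1 can also be checked from Python before starting a crawl. This is a small sketch using only the standard library; `find_phantomjs` is a hypothetical helper name of mine, and it assumes the binary is called `phantomjs` and sits on PATH, as in the shell output above.

```python
import shutil
import subprocess


def find_phantomjs():
    """Return (path, version) for the phantomjs binary on PATH, or None."""
    path = shutil.which("phantomjs")
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary exists but could not be run; report the path without a version
        return path, None
    return path, result.stdout.strip() or None


found = find_phantomjs()
if found is None:
    print("phantomjs not found on PATH")
else:
    print("phantomjs at %s, version %s" % found)
```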
Notes:
1. When pyspider runs in all mode and fetches a page with JS rendering, it prints log output like this:
[I 180503 21:56:36 tornado_fetcher:419] [200] autohomeBrandData:9e52d144c2bb09b336b80aec54dfb24b https://car.autohome.com.cn/pic/series/4764.html#pvareaid=103448 0.03s
console: ua:pyspider/0.3.10 (+http://pyspider.org/)
console: app_ver:
console: app_key:
console: os.version:undefined
console: ReferenceError: Can't find variable: AHAPP
console: jq2.0
console: jsbridge: version not match, apis ignored
console: jsbridge: version not match, apis ignored
console: jsbridge: version not match, apis ignored
console: 漫游路线line数据: null
console: [object Object]
console: [object Object]
console: startup
console: 场景元素返回 [object Object]
console: .......created connection........
console: toServerAddUser [object Object] null
[200] https://www.autohome.com.cn/spec/34253/#pvareaid=2042128 5.552
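2. If you start it with:
pyspider phantomjs
it does run, but
the webui page cannot be reached at:
http://0.0.0.0:5000/
because that command starts only the phantomjs fetcher, not the webui.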
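Which components are actually listening can be checked by probing the two default ports that appear in the logs in this post (5000 for the webui, 25555 for the phantomjs fetcher). This is a minimal sketch with the standard library; `is_port_open` is my own helper, and it assumes both services run on the local machine with those default ports.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports taken from the log output: webui on 5000, phantomjs on 25555
for name, port in (("webui", 5000), ("phantomjs fetcher", 25555)):
    state = "listening" if is_port_open("127.0.0.1", port) else "not reachable"
    print("%s (port %d): %s" % (name, port, state))
```

With pyspider phantomjs I would expect only port 25555 to answer, while pyspider all should show both.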
When reposting, please credit: 在路上 » [Solved] Part of the page content not displayed in PySpider