[Solved] Part of the page content is not displayed in PySpider

pyspider crifan 5464 views 0 comments

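While working on:

[Solved] How to load more content from an Autohome page in pyspider

I ran into part of a page's content not being displayed, apparently because of JavaScript. I worked around it at the time.

The same kind of problem came up again later, though:

In pyspider, the loaded page is missing part of its content:

whereas opening the same page directly in a browser shows it: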

pyspider part of content not displayed

pyspider part html not show

python – Cannot preview the page in the web UI under pyspider – SegmentFault 思否

Level 2: AJAX and More HTTP – pyspider

Frequently Asked Questions – pyspider

None of these had what I needed.

pyspider load js

self.crawl – pyspider

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')

The default is:

js_run_at = document-end

but the JavaScript itself still has to be written into the call.
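The options mentioned above can be collected in one place. The following is a minimal sketch, assuming a hypothetical helper named build_js_crawl_kwargs (it is not part of pyspider) that assembles the keyword arguments for self.crawl. The parameter names fetch_type, js_script and js_run_at are the ones from the self.crawl documentation; the scroll script is the one from the example above.

```python
# Hypothetical helper (not part of pyspider): builds the keyword arguments
# for a self.crawl() call whose page should be rendered with JavaScript.

# The scroll script from the self.crawl documentation example.
SCROLL_TO_BOTTOM_JS = """
function() {
    window.scrollTo(0, document.body.scrollHeight);
    return 123;
}
"""

# The two values that the self.crawl documentation lists for js_run_at.
VALID_RUN_AT = ("document-start", "document-end")


def build_js_crawl_kwargs(callback, js_script=None, js_run_at="document-end"):
    """Return kwargs asking pyspider to fetch a page through the JS fetcher."""
    if js_run_at not in VALID_RUN_AT:
        raise ValueError("js_run_at must be one of %r" % (VALID_RUN_AT,))
    kwargs = {"callback": callback, "fetch_type": "js"}
    if js_script is not None:
        # js_run_at only matters when there is a script to run.
        kwargs["js_script"] = js_script
        kwargs["js_run_at"] = js_run_at
    return kwargs
```

Inside a handler this could be used as self.crawl(url, **build_js_crawl_kwargs(self.detail_page, js_script=SCROLL_TO_BOTTOM_JS)), where detail_page is a placeholder callback name.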

Level 3: Render with PhantomJS – pyspider

So I gave it a try:

➜  AutocarData pyspider phantomjs
phantomjs fetcher running on port 25555

Then I added the JS to the code.

Result:

http://0.0.0.0:5000/

could not be opened:

Clearly, running:

pyspider phantomjs

did not go on to print output like the earlier run:

➜  AutocarData pyspider
phantomjs fetcher running on port 25555
[I 180503 21:20:04 result_worker:49] result_worker starting...
[I 180503 21:20:05 processor:211] processor starting...
[I 180503 21:20:05 tornado_fetcher:638] fetcher starting...
[I 180503 21:20:05 scheduler:647] scheduler starting...
[I 180503 21:20:05 scheduler:126] project autohomeBrandData updated, status:TODO, paused:False, 0 tasks
[I 180503 21:20:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:20:05 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180503 21:20:05 app:76] webui running on 0.0.0.0:5000
[I 180503 21:20:11 tornado_fetcher:188] [200] autohomeBrandData:data:,on_start data:,on_start 0s
[I 180503 21:20:14 tornado_fetcher:419] [200] autohomeBrandData:a5a0d52b797d5c51e5edbadd91f4fed9 https://www.autohome.com.cn/grade/carhtml/b.html 0.40s
[I 180503 21:20:20 tornado_fetcher:419] [200] autohomeBrandData:e33fb0c1b59473c3b497397c1651fcf9 https://car.autohome.com.cn/pic/series/3248.html#pvareaid=103448 0.09s
[I 180503 21:20:27 tornado_fetcher:419] [200] autohomeBrandData:e01cd2a8d1c898b71b9de564a8a90ad9 https://www.autohome.com.cn/spec/33986/#pvareaid=2042128 0.05s
[I 180503 21:21:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:22:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:23:05 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0

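So that is why the page could not be opened here: only the phantomjs fetcher was started, and the webui never came up.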
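A quick way to confirm which components are actually listening is to probe the ports named in the log output. This is a minimal sketch using only the Python standard library; port_is_open is a hypothetical helper name, and the ports 25555 (phantomjs fetcher) and 5000 (webui) are simply the ones printed in the logs above.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the log output: 25555 (phantomjs fetcher), 5000 (webui).
    for name, port in (("phantomjs fetcher", 25555), ("webui", 5000)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print("%s on port %d: %s" % (name, port, state))
```

With only pyspider phantomjs running, one would expect the first port to be reported as listening and the second as not reachable, which matches the behaviour seen here.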

pyspider crawler tutorial (3): rendering pages with JS using PhantomJS | Binuxの杂货铺

About phantomjs can’t automatically load javascript web page. · Issue #718 · binux/pyspider

Problems running JS in pyspider, i.e. the js_script in self.crawl? – answer by djytwy – SegmentFault 思否

Crawling JS-driven dynamic pages in Pyspider with Selenium+Chrome – 简书

pyspider phantomjs not work

pyspider with phantomjs: debugging in the webui works, but nothing executes when run, only the first index_page call. – answer by TuChief – SegmentFault 思否

Took a look at the help:

➜  AutocarData pyspider --help
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME           a json file with default values for
                                  subcommands. {"webui": {"port":5001}}
  --logging-config TEXT           logging config file for built-in python
                                  logging module  [default: /Users/crifan/.loc
                                  al/share/virtualenvs/AutocarData-xI-
                                  iqIq4/lib/python3.6/site-
                                  packages/pyspider/logging.conf]
  --debug                         debug mode
  --queue-maxsize INTEGER         maxsize of queue
  --taskdb TEXT                   database url for taskdb, default: sqlite
  --projectdb TEXT                database url for projectdb, default: sqlite
  --resultdb TEXT                 database url for resultdb, default: sqlite
  --message-queue TEXT            connection url to message queue, default:
                                  builtin multiprocessing.Queue
  --amqp-url TEXT                 [deprecated] amqp url for rabbitmq. please
                                  use --message-queue instead.
  --beanstalk TEXT                [deprecated] beanstalk config for beanstalk
                                  queue. please use --message-queue instead.
  --phantomjs-proxy TEXT          phantomjs proxy ip:port
  --data-path TEXT                data dir path
  --add-sys-path / --not-add-sys-path
                                  add current working directory to python lib
                                  search path
  --version                       Show the version and exit.
  --help                          Show this message and exit.

Commands:
  all            Run all the components in subprocess or...
  bench          Run Benchmark test.
  fetcher        Run Fetcher.
  one            One mode not only means all-in-one, it runs...
  phantomjs      Run phantomjs fetcher if phantomjs is...
  processor      Run Processor.
  result_worker  Run result worker.
  scheduler      Run Scheduler, only one scheduler is allowed.
  send_message   Send Message to project from command line
  webui          Run WebUI

Then I tried all mode:

➜  AutocarData pyspider all
phantomjs fetcher running on port 25555
[I 180503 21:48:32 result_worker:49] result_worker starting...
[I 180503 21:48:33 processor:211] processor starting...
[I 180503 21:48:33 tornado_fetcher:638] fetcher starting...
[I 180503 21:48:33 scheduler:647] scheduler starting...
[I 180503 21:48:33 scheduler:126] project autohomeBrandData updated, status:TODO, paused:False, 0 tasks
[I 180503 21:48:33 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180503 21:48:33 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180503 21:48:33 app:76] webui running on 0.0.0.0:5000

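Checking whether it runs properly:

phantomjs

The webui page did come up:

I just did not know yet whether the JS would take effect, so I went on to test it.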

no web interface · Issue #379 · binux/pyspider

“if you have phantomjs installed, in all mode, it should enable phantomjs automatically. If you are not using all mode, yes.”

I remember reading the same thing elsewhere:

in all mode, phantomjs is enabled if it is installed, and it is installed here.
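Since all mode depends on phantomjs being installed, it can help to verify that before starting pyspider. The sketch below is only a pre-flight check written for illustration (find_phantomjs is a hypothetical helper, and this is not how pyspider itself detects phantomjs); it looks for the binary on PATH and asks it for its version with the --version flag.

```python
import shutil
import subprocess


def find_phantomjs(binary="phantomjs"):
    """Return (path, version) for the binary on PATH, or (None, None)."""
    path = shutil.which(binary)
    if path is None:
        return None, None
    try:
        result = subprocess.run(
            [path, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        # The binary exists but could not be run or did not answer in time.
        return path, None
    return path, result.stdout.strip() or None


if __name__ == "__main__":
    path, version = find_phantomjs()
    if path is None:
        print("phantomjs not found on PATH")
    else:
        print("phantomjs at %s, version %s" % (path, version))
```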

pyspider sample code 1: using phantomjs to solve the JS problem – microman – 博客园

web crawler – Fail to scrape images with pyspider and phantomjs – Stack Overflow

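So I added fetch_type, as described at:

http://docs.pyspider.org/en/latest/apis/self.crawl/#fetch_type

Tried it, and it did take effect:

the web preview now shows the part of the content I wanted: 经销商指导价 (dealer guide price)

(The page looks a bit messy, probably a layout issue), but the JS did execute, so the corresponding content was retrieved.

I also noticed that, while debugging, loading this last page was much slower than usual:

normally, without running JS, loading this page takes less than 1 second;

now, with phantomjs executing the JS so that the page content renders fully, loading takes 3 to 4 seconds, which is much slower.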

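[Summary]

To get pyspider to load and display the parts of a page's HTML that are generated by JS, you need to:

1. Make sure phantomjs is installed

I had already installed it earlier: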

~ phantomjs --version
2.1.1
➜  ~ which phantomjs
/usr/local/bin/phantomjs

2. Run pyspider in all mode:

pyspider all

3. In the code, add: fetch_type='js'

            self.crawl(eachModelDetailDict["url"], callback=self.carModelSpecPage, fetch_type='js', save=curSerieDict)
    
    @catch_status_code_error
    def carModelSpecPage(self, response):
        print("carModelSpecPage: response=", response)

Notes:

1. When pyspider runs in all mode and fetches a page that loads JS, it prints log output like this:

[I 180503 21:56:36 tornado_fetcher:419] [200] autohomeBrandData:9e52d144c2bb09b336b80aec54dfb24b https://car.autohome.com.cn/pic/series/4764.html#pvareaid=103448 0.03s
console: ua:pyspider/0.3.10 (+http://pyspider.org/)
console: app_ver:
console: app_key:
console: os.version:undefined
console: ReferenceError: Can't find variable: AHAPP
console: jq2.0
console: jsbridge: version not match, apis ignored
console: jsbridge: version not match, apis ignored
console: jsbridge: version not match, apis ignored
console: 漫游路线line数据: null
console: [object Object]
console: [object Object]
console: startup
console: 场景元素返回 [object Object]
console: .......created connection........
console: toServerAddUser [object Object] null
[200] https://www.autohome.com.cn/spec/34253/#pvareaid=2042128 5.552

2. If you run:

pyspider phantomjs

it does run, but:

you cannot reach the webui page at:

http://0.0.0.0:5000/
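Returning to the log output in note 1, the load times of plain and JS-rendered fetches can be pulled out of such lines for comparison. This is a rough sketch under two assumptions: that the trailing number is a duration in seconds, and that result lines without the tornado_fetcher prefix come from the phantomjs fetcher. parse_fetch_line is a hypothetical helper name.

```python
import re

# Matches "[200] <optional project:taskid> <url> <seconds>[s]" at end of line.
FETCH_LINE = re.compile(r"\[(\d{3})\] (?:\S+ )?(https?://\S+) ([\d.]+)s?\s*$")


def parse_fetch_line(line):
    """Return (source, status, url, seconds), or None for other log lines."""
    match = FETCH_LINE.search(line)
    if match is None:
        return None
    # Assumption: lines without the tornado_fetcher prefix are phantomjs output.
    source = "fetcher" if "tornado_fetcher" in line else "phantomjs"
    status, url, seconds = match.groups()
    return source, int(status), url, float(seconds)


if __name__ == "__main__":
    sample = [
        "[I 180503 21:56:36 tornado_fetcher:419] [200] autohomeBrandData:"
        "9e52d144c2bb09b336b80aec54dfb24b "
        "https://car.autohome.com.cn/pic/series/4764.html#pvareaid=103448 0.03s",
        "console: startup",
        "[200] https://www.autohome.com.cn/spec/34253/#pvareaid=2042128 5.552",
    ]
    for line in sample:
        print(parse_fetch_line(line))
```

The sample lines are copied from the log above; console lines and scheduler lines are skipped because they do not match the pattern.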
