[Solved] Debugging a Scrapy project with breakpoints in PyCharm on Mac

While working on:

[Record] Crawling cbeebies.com with Python's Scrapy

and having already used PyCharm before to set breakpoints and live-debug other Python programs,

I now wanted to use PyCharm to set breakpoints and debug this Scrapy project in real time as well.

That said, a Scrapy project is presumably a slightly special case: it may involve multiple threads running at the same time, and the scrapy command behaves differently depending on which directory it is run from (for example, only when running scrapy --help inside the project root does the crawl subcommand show up).

Before debugging Scrapy in PyCharm, though, I first had to resolve:

[Solved] PyCharm on Mac cannot find the installed Scrapy library: No module named scrapy

Set a breakpoint:
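(The spider's own code is not shown in this post; below is only a rough sketch of what it might look like and of a typical place to set the breakpoint. Only the spider name Cbeebies comes from this project; the file name, start URL and selectors are illustrative assumptions.)

# cbeebies/spiders/CbeebiesSpider.py -- illustrative sketch, not the actual project file
import scrapy

class CbeebiesSpider(scrapy.Spider):
    name = "Cbeebies"  # the name used by "scrapy crawl Cbeebies"
    start_urls = ["https://global.cbeebies.com/"]  # assumed start URL

    def parse(self, response):
        # a typical place to set the PyCharm breakpoint: the first line of parse(),
        # so execution pauses as soon as the first response comes back
        title = response.css("title::text").extract_first()
        self.log("page title: %s" % title)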

Next, set up a Run/Debug configuration.

Click +.

Based on the command that already ran fine on the command line:

➜  cbeebies scrapy crawl Cbeebies

2018-01-09 20:29:26 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: cbeebies)

'scheduler/enqueued/memory': 4,

'start_time': datetime.datetime(2018, 1, 9, 12, 29, 27, 127494)}

2018-01-09 20:29:39 [scrapy.core.engine] INFO: Spider closed (finished)

➜  cbeebies pwd

/Users/crifan/dev/dev_root/company/naturling/projects/scrapy/cbeebies

➜  cbeebies ll

total 48

drwxr-xr-x  10 crifan  staff   320B 12 26 23:07 cbeebies

-rw-r--r--   1 crifan  staff    19K  1  9 20:29 global.cbeebies.com.html

-rw-r--r--   1 crifan  staff   260B 12 26 22:50 scrapy.cfg

So I went to set the corresponding path and command:

But after selecting the directory, the Script path field was still empty:

It did not turn into what I had expected, namely:

Script path automatically becoming: /Users/crifan/dev/dev_root/company/naturling/projects/scrapy/cbeebies

-> It seems a .py file has to be specified.

So now the situation was:

On the command line, the way to run it is:

in the directory /Users/crifan/dev/dev_root/company/naturling/projects/scrapy/cbeebies

run: scrapy crawl Cbeebies

which is not the usual case of pointing PyCharm at some .py file as the entry point,

so I did not know how to run it in PyCharm.

Searched for: pycharm debug scrapy

Debugging a Scrapy project in PyCharm – Juejin

Debug Scrapy? – IDEs Support (IntelliJ Platform) | JetBrains

Anyone use PyCharm for Scrapy spider development? – IDEs Support (IntelliJ Platform) | JetBrains

python – How to use PyCharm to debug Scrapy projects – Stack Overflow

scrapy itself is just a .py script -> once the full path of the corresponding cmdline.py is found, the arguments can be added and it can be run directly.

➜  cbeebies which scrapy

/usr/local/bin/scrapy

➜  cbeebies ll /usr/local/bin/scrapy

-rwxr-xr-x  1 crifan  admin   242B 12 26 20:49 /usr/local/bin/scrapy

➜  cbeebies cat /usr/local/bin/scrapy

#!/usr/local/opt/python/bin/python2.7

# -*- coding: utf-8 -*-

import re

import sys

from scrapy.cmdline import execute

if __name__ == '__main__':

    sys.argv[0] = re.sub(r'(-script\.pyw?|\.exe)?$', '', sys.argv[0])

    sys.exit(execute())

As for cmdline, starting from the Python interpreter already known to be in use:

and:

I went looking and found:

➜  cbeebies ll /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/bin/python2.7

-rwxr-xr-x  1 crifan  admin    13K  5  6  2017 /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/bin/python2.7

➜  cbeebies ll /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/

total 2912

lrwxr-xr-x   1 crifan  admin    17B  5  6  2017 Headers -> include/python2.7

-r-xr-xr-x   1 crifan  admin   1.4M  5  6  2017 Python

drwxr-xr-x   4 crifan  admin   128B  5  6  2017 Resources

drwxr-xr-x  23 crifan  admin   736B  5  6  2017 bin

drwxr-xr-x   3 crifan  admin    96B  5  6  2017 include

drwxr-xr-x   5 crifan  admin   160B  5  6  2017 lib

➜  cbeebies ll /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib

total 0

lrwxr-xr-x    1 crifan  admin     9B  5  6  2017 libpython2.7.dylib -> ../Python

drwxr-xr-x    5 crifan  admin   160B  5  6  2017 pkgconfig

drwxr-xr-x  650 crifan  admin    20K  5  6  2017 python2.7

➜  cbeebies ll /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7

total 26096

-rw-r--r--     1 crifan  admin    22K  5  6  2017 BaseHTTPServer.py

lrwxr-xr-x     1 crifan  admin    54B  5  6  2017 site-packages -> ../../../../../../../../../lib/python2.7/site-packages

Then, for:

➜  cbeebies ll /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages

lrwxr-xr-x  1 crifan  admin    54B  5  6  2017 /usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages -> ../../../../../../../../../lib/python2.7/site-packages

Following it further, I finally found:

➜  cbeebies ll /usr/local/lib/python2.7/site-packages/

Automat-0.6.0.dist-info/                      exampleproj/                                  requests-2.18.4.dist-info/

OpenSSL/                                      hyperlink/                                    scrapy/

PyDispatcher-2.0.5.dist-info/                 hyperlink-17.3.1.dist-info/                   selenium/

➜  cbeebies ll /usr/local/lib/python2.7/site-packages/scrapy

total 824

-rw-r--r--   1 crifan  admin     6B 12 26 20:41 VERSION

-rw-r--r--   1 crifan  admin   1.1K 12 26 20:41 __init__.py

-rw-r--r--   1 crifan  admin   1.6K 12 26 20:49 __init__.pyc

-rw-r--r--   1 crifan  admin    77B 12 26 20:41 __main__.py

-rw-r--r--   1 crifan  admin   300B 12 26 20:49 __main__.pyc

-rw-r--r--   1 crifan  admin   990B 12 26 20:41 _monkeypatches.py

-rw-r--r--   1 crifan  admin   956B 12 26 20:49 _monkeypatches.pyc

-rw-r--r--   1 crifan  admin   5.6K 12 26 20:41 cmdline.py

-rw-r--r--   1 crifan  admin   7.1K 12 26 20:49 cmdline.pyc

-rw-r--r--   1 crifan  admin   259B 12 26 20:41 command.py

-rw-r--r--   1 crifan  admin   465B 12 26 20:49 command.pyc

drwxr-xr-x  32 crifan  admin   1.0K 12 26 20:49 commands

-rw-r--r--   1 crifan  admin   486B 12 26 20:41 conf.py

-rw-r--r--   1 crifan  admin   602B 12 26 20:49 conf.pyc

drwxr-xr-x   4 crifan  admin   128B 12 26 20:49 templates

drwxr-xr-x  80 crifan  admin   2.5K 12 26 20:49 utils

drwxr-xr-x   8 crifan  admin   256B 12 26 20:49 xlib

➜  cbeebies ll /usr/local/lib/python2.7/site-packages/scrapy/cmdline.py

-rw-r--r--  1 crifan  admin   5.6K 12 26 20:41 /usr/local/lib/python2.7/site-packages/scrapy/cmdline.py

Then take a look at it:

➜  cbeebies cat /usr/local/lib/python2.7/site-packages/scrapy/cmdline.py

from __future__ import print_function

import sys, os

import optparse

import cProfile

import inspect

import pkg_resources

import scrapy

from scrapy.crawler import CrawlerProcess

from scrapy.commands import ScrapyCommand

from scrapy.exceptions import UsageError

from scrapy.utils.misc import walk_modules

from scrapy.utils.project import inside_project, get_project_settings

from scrapy.settings.deprecated import check_deprecated_settings

def _iter_command_classes(module_name):

    # TODO: add `name` attribute to commands and and merge this function with

    # scrapy.utils.spider.iter_spider_classes

    for module in walk_modules(module_name):

        for obj in vars(module).values():

            if inspect.isclass(obj) and \

                    issubclass(obj, ScrapyCommand) and \

                    obj.__module__ == module.__name__ and \

                    not obj == ScrapyCommand:

                yield obj

if __name__ == '__main__':

    execute()

➜  cbeebies

Then I could continue with the configuration:

  • Script path: /usr/local/lib/python2.7/site-packages/scrapy/cmdline.py
  • Parameters: crawl Cbeebies

Then try debugging:

But it failed with an error:

/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py --multiproc --qt-support=auto --client 127.0.0.1 --port 63858 --file /usr/local/lib/python2.7/site-packages/scrapy/cmdline.py crawl Cbeebies

pydev debugger: process 70256 is connecting

Connected to pydev debugger (build 173.3727.137)

Scrapy 1.4.0 – no active project

Unknown command: crawl

Use "scrapy" to see available commands

Process finished with exit code 2

It seems to be saying that no corresponding project was specified?

Refer to:

https://stackoverflow.com/questions/21788939/how-to-use-pycharm-to-debug-scrapy-projects

It seems the Working directory also needs to be set.
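Presumably this is because Scrapy decides whether it is running "inside a project" by searching upward from the current working directory for a scrapy.cfg file; when cmdline.py is launched from somewhere else, no project is found and project-only commands such as crawl are not registered, hence "Unknown command: crawl". A minimal sketch of that check, using two helpers that exist in Scrapy 1.4 (scrapy.utils.conf.closest_scrapy_cfg and scrapy.utils.project.inside_project); run it from different directories to compare:

# check_project.py -- quick check of how Scrapy detects a project (illustrative, not from this post)
from scrapy.utils.conf import closest_scrapy_cfg
from scrapy.utils.project import inside_project

# walks upward from the current working directory looking for scrapy.cfg;
# returns an empty string if none is found
print("closest scrapy.cfg: %r" % closest_scrapy_cfg())

# True only when scrapy.cfg was found (or SCRAPY_SETTINGS_MODULE is set),
# which is what makes project-only commands like "crawl" available
print("inside project: %s" % inside_project())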

So add it:

Then debug again:

Sure enough, it started running under the debugger:

And execution stopped at the breakpoint:

This really works nicely.

From here on I could carry on debugging comfortably.

Note:

You can also refer to:

Running and debugging a Scrapy spider in a PyCharm virtual environment – Jianshu

Opening, running and debugging a Scrapy spider in PyCharm – CSDN blog

Crawling with Scrapy – How to Debug Your Spider – Scraping Authority

and create your own run script for the Scrapy project that launches the crawl via cmdline; in theory that should work just as well. I did not try it here, but a sketch is given below.
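Such a run script is only a few lines; a minimal, untested sketch (the file name run.py is an arbitrary choice, it just needs to sit in the project root next to scrapy.cfg, and Cbeebies is this project's spider name):

# run.py -- put this next to scrapy.cfg, then point PyCharm's "Script path" at it
from scrapy.cmdline import execute

if __name__ == '__main__':
    # equivalent to running "scrapy crawl Cbeebies" from the project root
    execute(['scrapy', 'crawl', 'Cbeebies'])

With this approach the Working directory should still be the project root, but no Parameters are needed in the Run/Debug configuration.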
