[Solved] How to match multiple attribute values on an element with PyQuery in PySpider

Author: crifan
While working on:
[Unsolved] Crawling detailed car model and series data from Autohome (汽车之家) with Python
I wanted, for elements such as:
<dl id="33" olr="6">

<dl id="34" olr="65">
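...
to use PyQuery to match:
the dl elements by their id and olr attributes,
ideally with a regex-style pattern such as \d+.
Failing that, a form like id*="" would also do,
as long as both attributes, id and olr, can be specified at once.
My current attempt: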
dlListDoc = response.doc('dl[id and orl]').items()
This raised an error:
[E 200816 09:47:57 base_handler:203] Operator expected, got <IDENT 'and' at 6>
    Traceback (most recent call last):
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<autohome_20200814>", line 75, in gradCarHtmlPage
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 300, in __call__
        result = self._copy(*args, parent=self, **kwargs)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 286, in _copy
        return self.__class__(*args, **kwargs)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 271, in __init__
        xpath = self._css_to_xpath(selector)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 282, in _css_to_xpath
        return self._translator.css_to_xpath(selector, prefix)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
        for selector in parse(css))
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 415, in parse
        return list(parse_selector_group(stream))
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 428, in parse_selector_group
        yield Selector(*parse_selector(stream))
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 436, in parse_selector
        result, pseudo_element = parse_simple_selector(stream)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 498, in parse_simple_selector
        result = parse_attrib(result, stream)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 598, in parse_attrib
        "Operator expected, got %s" % (next,))
      File "<string>", line None
    cssselect.parser.SelectorSyntaxError: Operator expected, got <IDENT 'and' at 6>
I went looking at XPath:
Xpath cheatsheet
XPath Nodes
https://www.w3schools.com/xml/xpath_nodes.asp
XPath Syntax
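Perhaps use | ?
XPath syntax explained in detail (zgkli6com's column, CSDN blog)
XPath syntax [XPath reference manual] (php中文网 online manual)
Python crawler: XPath syntax notes (简书)
XPath syntax (简书)
XPath syntax
Then it occurred to me that this is not XPath at all,
but jQuery-style selector syntax.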
pyquery match multiple property
jquery match multiple property
Multiple Attribute Selector [name="value"][name2="value2"] | jQuery API Documentation
It seems to be:
[name="value"][name2="value2"]
Multiple Selector ("selector1, selector2, selectorN") | jQuery API Documentation
As an aside: multiple elements are separated by commas:
$( "div, p, span" )
jquery – How do I select elements on multiple attribute values – Stack Overflow
https://stackoverflow.com/questions/8045071/how-do-i-select-elements-on-multiple-attribute-values
$('div[attr1="value1"][attr2="value2"]')
That is how it is written.
But here is the problem:
what I want here is:
id=\d+
How can that be done?
At the very least:
id="*"
jquery match property regex
javascript – jQuery selector regular expressions – Stack Overflow
$("div:regex(class, .*sd.*)")
javascript – Regex in Jquery Selectors – Stack Overflow
Let's try:
        dlListDoc = response.doc('dl[id][orl]').items()
        print("type(dlListDoc)=%s" % type(dlListDoc))
        print("len(dlListDoc)=%s" % len(dlListDoc))
        print("dlListDoc=%s" % dlListDoc)
Result: .items() returns a generator, which has no len(), so convert it to a list first:
        dlListDoc = response.doc('dl[id][orl]').items()
        print("type(dlListDoc)=%s" % type(dlListDoc))
        dlList = list(dlListDoc)
        print("len(dlList)=%s" % len(dlList))
        print("dlList=%s" % dlList)
Result:
no elements matched:
type(dlListDoc)=<class 'generator'>
len(dlList)=0
dlList=[]
That is not right.
dlListDoc = response.doc("dl[id*=''][orl*='']").items()
Result:
still nothing found.
dlListDoc = response.doc("dl[orl*='']").items()
Result: nothing found.
Traversing — pyquery 1.2.4 documentation
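A class can be matched with .xxx (and an id with #xxx):
>>> d = pq('<p id="hello" class="hello"><a/></p><p id="test"><a/></p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
But here the id values are not fixed; they are varying numbers,
so they cannot be written out directly.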
dlListDoc = response.doc("dl").items()
Result:
type(dlListDoc)=<class 'generator'>
len(dlList)=22
dlList=[[<dl#33>], [<dl#35>], [<dl#34>], [<dl#378>], [<dl#327>], [<dl#134>], [<dl#117>], [<dl#354>], [<dl#292>], [<dl#276>], [<dl#410>], [<dl#253>], [<dl#251>], [<dl#272>], [<dl#310>], [<dl#424>], [<dl#397>], [<dl#303>], [<dl#340>], [<dl#431>], [<dl#273>], [<dl#221>]]
This works fine.
Response – pyspider
"Response.doc
A PyQuery object of the response's content. Links have made as absolute by default.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/"
So the only thing to go on is the PyQuery documentation:
pyquery: a jquery-like library for python — pyquery 1.2.4 documentation
https://pythonhosted.org/pyquery/
pyquery – PyQuery complete API — pyquery 1.2.4 documentation
Attributes — pyquery 1.2.4 documentation
Traversing — pyquery 1.2.4 documentation
In addition:
jQuery
https://jquery.com
But in none of them did I find a way to match on both
the id and olr
attributes at once.
Let's try:
dlListDoc = response.doc("dl:regex(id, \d+)").items()
Result:
a syntax error:
    cssselect.parser.SelectorSyntaxError: Expected an argument, got <DELIM ',' at 11>
So the :regex() form does not seem to be supported.
dlListDoc = response.doc("dl:regex(id,[0-9])").items()
Result:
the same error.
dlListDoc = response.doc("dl[id]").items()
Result:
this works.
dlListDoc = response.doc("dl[olr]").items()
Result: this works too.
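Conclusion:
With PyQuery in PySpider, I could not get a selector to work that
checks both attributes of dl, id and olr, requiring numeric values (or, failing that, any value).
Unfortunately, none of the forms I tried above worked.
In the end I fell back to a single attribute:
dlListDoc = response.doc("dl[id]").items()
or:
dlListDoc = response.doc("dl[olr]").items()
It will have to do.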
