While working on:
[Unsolved] Scraping detailed car model and series data from Autohome with Python
I wanted, for markup like:
<dl id="33" olr="6">
<dl id="34" olr="65">
...
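to use PyQuery to match:
the id attribute and the olr attribute of dl
Ideally with a regex-style pattern such as \d+
Failing that, something like id*="" would also do
And to specify both attributes at once: id and olr
My current attempt: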
dlListDoc = response.doc('dl[id and orl]').items()
This raised an error:
[E 200816 09:47:57 base_handler:203] Operator expected, got <IDENT 'and' at 6>
Traceback (most recent call last):
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
    result = self._run_task(task, response)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
    return self._run_func(function, response, task)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
    ret = function(*arguments[:len(args) - 1])
  File "<autohome_20200814>", line 75, in gradCarHtmlPage
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 300, in __call__
    result = self._copy(*args, parent=self, **kwargs)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 286, in _copy
    return self.__class__(*args, **kwargs)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 271, in __init__
    xpath = self._css_to_xpath(selector)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 282, in _css_to_xpath
    return self._translator.css_to_xpath(selector, prefix)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 415, in parse
    return list(parse_selector_group(stream))
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 428, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 436, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 498, in parse_simple_selector
    result = parse_attrib(result, stream)
  File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 598, in parse_attrib
    "Operator expected, got %s" % (next,))
  File "<string>", line None
cssselect.parser.SelectorSyntaxError: Operator expected, got <IDENT 'and' at 6>
So I went looking at XPath:
XPath Nodes
Maybe use | ?
Then it occurred to me that this is not XPath here,
but jQuery (CSS selector) syntax.
pyquery match multiple property
jquery match multiple property
It seems to be:
[name="value"][name2="value2"]
As an aside: to select multiple elements, separate them with commas:
$( "div, p, span" )
jquery – How do I select elements on multiple attribute values – Stack Overflow
$('div[attr1="value1"][attr2="value2"]')
That is how it is written.
But here is the problem.
What I want here is:
id=\d+
How can that be done?
At the very least:
id="*"
jquery match property regex
$("div:regex(class, .*sd.*)")
Let me try:
dlListDoc = response.doc('dl[id][orl]').items()
print("type(dlListDoc)=%s" % type(dlListDoc))
print("len(dlListDoc)=%s" % len(dlListDoc))
print("dlListDoc=%s" % dlListDoc)
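Result: this fails, because .items() returns a generator, which has no len(). Converting it to a list first: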
dlListDoc = response.doc('dl[id][orl]').items()
print("type(dlListDoc)=%s" % type(dlListDoc))
dlList = list(dlListDoc)
print("len(dlList)=%s" % len(dlList))
print("dlList=%s" % dlList)
Result:
no elements matched:
type(dlListDoc)=<class 'generator'>
len(dlList)=0
dlList=[]
That is not right.
dlListDoc = response.doc("dl[id*=''][orl*='']").items()
Result:
still nothing found.
dlListDoc = response.doc("dl[orl*='']").items()
Result: nothing found.
For a fixed, known value you can filter with .xxx (class) or #xxx (id):
>>> d = pq('<p id="hello" class="hello"><a/></p><p id="test"><a/></p>')
>>> d('p').filter('.hello')
[<p#hello.hello>]
But here the id values are not fixed; they are arbitrary numbers,
so they cannot be written out directly.
dlListDoc = response.doc("dl").items()
Result:
type(dlListDoc)=<class 'generator'>
len(dlList)=22
dlList=[[<dl#33>], [<dl#35>], [<dl#34>], [<dl#378>], [<dl#327>], [<dl#134>], [<dl#117>], [<dl#354>], [<dl#292>], [<dl#276>], [<dl#410>], [<dl#253>], [<dl#251>], [<dl#272>], [<dl#310>], [<dl#424>], [<dl#397>], [<dl#303>], [<dl#340>], [<dl#431>], [<dl#273>], [<dl#221>]]
That works as expected.
“Response.doc
A PyQuery object of the response’s content. Links have made as absolute by default.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/”
So the only thing to go on is the PyQuery documentation.
But in
pyquery: a jquery-like library for python — pyquery 1.2.4 documentation
and also in
jQuery
I found no way to match both
the id and olr
attributes at once.
Let me try:
dlListDoc = response.doc("dl:regex(id, \d+)").items()
Result:
a syntax error
cssselect.parser.SelectorSyntaxError: Expected an argument, got <DELIM ',' at 11>
So the :regex(...) syntax is apparently not supported.
dlListDoc = response.doc("dl:regex(id,[0-9])").items()
Result:
same error.
dlListDoc = response.doc("dl[id]").items()
Result:
this works.
dlListDoc = response.doc("dl[olr]").items()
Result: this works too.
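Conclusion:
With PyQuery in PySpider, I did not manage to:
check both attributes of dl, id and olr, and require both to be numeric, or, failing that, accept any value.
None of the selectors tried above worked: :regex(...) is rejected by cssselect, and the two-attribute attempts matched nothing. Note that those attempts spelled the second attribute orl while the markup uses olr, so the two-attribute form was never actually tested with the correct name.
In the end I fell back to a single attribute: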
dlListDoc = response.doc("dl[id]").items()
or:
dlListDoc = response.doc("dl[olr]").items()
Good enough for now.