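While working on:
[Unresolved] Crawling detailed car model and series data from Autohome (汽车之家) with Python
Along the way, for HTML like this: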
    <dl id="33" olr="6">
    <dl id="34" olr="65">
    ...
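I wanted PyQuery to match:
the id and olr attributes of dl
Ideally with a regex-style pattern such as \d+
Failing that, something like id*="" would also do
And both attributes, id and olr, should be specified at the same time
My current attempt:
    dlListDoc = response.doc('dl[id and orl]').items()
This raises an error: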
    [E 200816 09:47:57 base_handler:203] Operator expected, got <IDENT 'and' at 6>
        Traceback (most recent call last):
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
            result = self._run_task(task, response)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
            return self._run_func(function, response, task)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
            ret = function(*arguments[:len(args) - 1])
          File "<autohome_20200814>", line 75, in gradCarHtmlPage
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 300, in __call__
            result = self._copy(*args, parent=self, **kwargs)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 286, in _copy
            return self.__class__(*args, **kwargs)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 271, in __init__
            xpath = self._css_to_xpath(selector)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyquery/pyquery.py", line 282, in _css_to_xpath
            return self._translator.css_to_xpath(selector, prefix)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/xpath.py", line 192, in css_to_xpath
            for selector in parse(css))
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 415, in parse
            return list(parse_selector_group(stream))
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 428, in parse_selector_group
            yield Selector(*parse_selector(stream))
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 436, in parse_selector
            result, pseudo_element = parse_simple_selector(stream)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 498, in parse_simple_selector
            result = parse_attrib(result, stream)
          File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/cssselect/parser.py", line 598, in parse_attrib
            "Operator expected, got %s" % (next,))
          File "<string>", line None
        cssselect.parser.SelectorSyntaxError: Operator expected, got <IDENT 'and' at 6>
I went looking at XPath:
XPath Nodes
Maybe use | ?
Then it occurred to me that this does not seem to be XPath here
but jQuery syntax
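For comparison, XPath itself does accept `and` inside a predicate, so the two-attribute check can be expressed there. The following is only a minimal sketch that uses lxml directly rather than PyQuery. The sample HTML is invented for illustration and uses the olr spelling from the snippet at the top. XPath 1.0 has no regex support, so the digits-only test is approximated with translate().

```python
import lxml.html

# Invented sample data: two fully numeric <dl>, one with a non-numeric id,
# and one without the olr attribute.
SAMPLE_HTML = (
    '<div>'
    '<dl id="33" olr="6"></dl>'
    '<dl id="34" olr="65"></dl>'
    '<dl id="abc" olr="7"></dl>'
    '<dl id="35"></dl>'
    '</div>'
)

root = lxml.html.fromstring(SAMPLE_HTML)

# XPath allows "and" inside a predicate: both attributes must be present.
both_attrs = root.xpath('.//dl[@id and @olr]')

# XPath 1.0 has no regex; removing all digits with translate() and comparing
# the remainder to the empty string approximates a \d+ match.
DIGITS_ONLY = (
    ".//dl[string-length(@id) > 0 and translate(@id, '0123456789', '') = ''"
    " and string-length(@olr) > 0 and translate(@olr, '0123456789', '') = '']"
)
numeric = root.xpath(DIGITS_ONLY)

print([dl.get('id') for dl in both_attrs])
print([dl.get('id') for dl in numeric])
```

The selectors passed to response.doc(...) are CSS selectors, though, so this form cannot be used there as-is.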
pyquery match multiple property
jquery match multiple property
It seems to be:
    [name="value"][name2="value2"]
As an aside: multiple elements are separated by commas:
    $("div, p, span")
jquery – How do I select elements on multiple attribute values – Stack Overflow
    $('div[attr1="value1"][attr2="value2"]')
That is how it is written.
But here is the problem:
What I want here is:
id=\d+
How can that be done?
At the very least:
id="*"
jquery match property regex
    $("div:regex(class, .*sd.*)")
Let's try it:
    dlListDoc = response.doc('dl[id][orl]').items()
    print("type(dlListDoc)=%s" % type(dlListDoc))
    print("len(dlListDoc)=%s" % len(dlListDoc))
    print("dlListDoc=%s" % dlListDoc)
Result: len() cannot be applied to the generator returned by .items(), so convert it to a list first:
    dlListDoc = response.doc('dl[id][orl]').items()
    print("type(dlListDoc)=%s" % type(dlListDoc))
    dlList = list(dlListDoc)
    print("len(dlList)=%s" % len(dlList))
    print("dlList=%s" % dlList)
Result:
No elements matched:
    type(dlListDoc)=<class 'generator'>
    len(dlList)=0
    dlList=[]
That is not right. Next attempt:
    dlListDoc = response.doc("dl[id*=''][orl*='']").items()
Result:
Still nothing found.
    dlListDoc = response.doc("dl[orl*='']").items()
Result: nothing found.
A fixed value can be matched with the .xxx (class) or #xxx (id) shorthand:
    >>> d = pq('<p id="hello" class="hello"><a/></p><p id="test"><a/></p>')
    >>> d('p').filter('.hello')
    [<p#hello.hello>]
But here the id values are not fixed; they are varying numbers,
so they cannot be written out directly.
    dlListDoc = response.doc("dl").items()
Result:
    type(dlListDoc)=<class 'generator'>
    len(dlList)=22
    dlList=[[<dl#33>], [<dl#35>], [<dl#34>], [<dl#378>], [<dl#327>], [<dl#134>], [<dl#117>], [<dl#354>], [<dl#292>], [<dl#276>], [<dl#410>], [<dl#253>], [<dl#251>], [<dl#272>], [<dl#310>], [<dl#424>], [<dl#397>], [<dl#303>], [<dl#340>], [<dl#431>], [<dl#273>], [<dl#221>]]
That works as expected.
“Response.doc
A PyQuery object of the response’s content. Links have made as absolute by default.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/”
So the PyQuery documentation is the only thing to go on.
But:
pyquery: a jquery-like library for python — pyquery 1.2.4 documentation
And also:
jQuery
But in neither of them did I find a syntax for matching both
the id and olr
attributes at the same time.
Let's try:
    dlListDoc = response.doc("dl:regex(id, \d+)").items()
Result:
Syntax error
    cssselect.parser.SelectorSyntaxError: Expected an argument, got <DELIM ',' at 11>
So this regex syntax is apparently not supported.
    dlListDoc = response.doc("dl:regex(id,[0-9])").items()
Result:
Same error.
    dlListDoc = response.doc("dl[id]").items()
Result:
That works.
    dlListDoc = response.doc("dl[olr]").items()
Result: that works too.
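Conclusion:
With PyQuery in PySpider, I could not get a selector to work that
checks both attributes of dl, id and olr, and requires both to be numeric; as a fallback, accepting any value would have been fine too.
Unfortunately none of the attempts above worked: the :regex(...) form is rejected as a syntax error, and the dl[id][orl] and *='' forms matched nothing.
In the end I fell back to a single attribute:
    dlListDoc = response.doc("dl[id]").items()
or:
    dlListDoc = response.doc("dl[olr]").items()
Good enough for now.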