While working on:
[Unsolved] Crawling detailed car model and series data from Autohome (汽车之家) with Python
I needed to take this HTML:
<ul class="rank-list-ul" 0> <li id="s3170"> <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4> <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div> <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div> </li> <li id="s692"> 。。。
and extract each node like:
<li id="s3170">
I tried several variants:
merchantRankDoc = merchantRankDocList[curIdx]
print("merchantRankDoc=%s" % merchantRankDoc)
print("type(merchantRankDoc)=%s" % type(merchantRankDoc))
# type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
merchantRankHtml = merchantRankDoc.html()
print("merchantRankHtml=%s" % merchantRankHtml)

# <li id="s3170">
# carSeriesDocGenerator = merchantRankDoc.find("li")
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
# carSeriesDocGenerator = merchantRankDoc.items("li[id*=s]")
# carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
None of them returned the li nodes. The result was almost always:
None
or just 3 child nodes.
Printing the object showed what it actually was:
merchantRankDoc=<Element ul at 0x109b69c78>
type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
that is, an:
lxml.html.HtmlElement
So the next step was to work out how to get the multiple li children of a ul that is an lxml.html.HtmlElement.
For reference:
>>> print(etree.tostring(root, pretty_print=True))
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
>>> children = list(root)
>>> for child in root:
...     print(child.tag)
child1
child2
child3
Trying it:
carSeriesDocList = list(merchantRankDoc)
print("carSeriesDocList=%s" % carSeriesDocList)
Then print the HTML:
from lxml import etree
merchantRankHtml = etree.tostring(merchantRankDoc)
print("merchantRankHtml=%s" % merchantRankHtml)
Output:
merchantRankHtml=b'<ul class="rank-list-ul"> \n \n <li id="s3170"> \n <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4><div>指导价:<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div> \n \n </li> \n \n 。。。 。。。 。。。 href="https://k.autohome.com.cn/812/#pvareaid=103459">口碑</a></div> \n \n </li> \n \n </ul> \n \n '
And to get its child elements:
carSeriesDocList = list(merchantRankDoc)
print("carSeriesDocList=%s" % carSeriesDocList)
carSeriesDocListLen = len(carSeriesDocList)
print("carSeriesDocListLen=%s" % carSeriesDocListLen)
Output:
carSeriesDocList=[<Element li at 0x109b92c28>, <Element li at 0x109b92048>, <Element li at 0x109ba2548>, <Element li at 0x109ba2b38>, <Element li at 0x109ba2048>, <Element li at 0x109ba22c8>, <Element li at 0x109ba2908>, <Element li at 0x109ba2188>, <Element li at 0x109ba26d8>, <Element li at 0x109ba2b88>, <Element li at 0x109ba2ea8>, <Element li at 0x109ba2098>, <Element li at 0x109ba2e58>, <Element li at 0x109ba2368>, <Element li at 0x109ba2138>]
carSeriesDocListLen=15
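So the li child elements can be retrieved this way,
but list() gives no way to select only the children that match a condition.
For example, what I want is:
<li id="s4871"> 。。。
but not:
<li class="dashline"></li>
So the next question was how to match only the wanted ones.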
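One way to do this matching directly on the lxml element is an XPath filter. The snippet below is only a sketch: SAMPLE_HTML is a made-up, trimmed-down stand-in for the real page, and the starts-with() expression is my assumption about what distinguishes the wanted items (an id beginning with "s"), not something taken from the original spider.

```python
import lxml.html

# Hypothetical, trimmed-down stand-in for the real page fragment.
SAMPLE_HTML = """
<ul class="rank-list-ul">
  <li id="s3170"><h4><a href="//www.autohome.com.cn/3170/">奥迪A3</a></h4></li>
  <li class="dashline"></li>
  <li id="s692"><h4><a href="//www.autohome.com.cn/692/">奥迪A4L</a></h4></li>
</ul>
"""

# For a single-element fragment, fromstring() returns that element (the ul).
ul = lxml.html.fromstring(SAMPLE_HTML.strip())

# list(element) returns every direct child, wanted or not.
all_children = list(ul)
all_ids = [li.get("id") for li in all_children]

# An XPath predicate keeps only the li children whose id starts with "s".
series_items = ul.xpath("./li[starts-with(@id, 's')]")
series_ids = [li.get("id") for li in series_items]

print(all_ids)
print(series_ids)
```

Here list(ul) still contains the dashline separator (its id is None), while the XPath version returns only the two series items.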
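Then it occurred to me: items() returns a generator, so perhaps looping over it yields PyQuery objects rather than lxml elements.
Checking confirmed it.
Whether the generator is converted to a list:
merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
merchantDocList = list(merchantDocGenerator)
print("merchantDocList=%s" % merchantDocList)
or iterated directly with a for loop, every item is a PyQuery object: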
type(merchantItem)=<class 'pyquery.pyquery.PyQuery'>
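and not an lxml element.
The next idea was to check whether items() can be called with a selector argument.
Then I realized that the reason I kept getting lxml elements instead of PyQuery objects was that I had forgotten to call items(). Adding it:
# merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()
With that change it works: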
merchantRankDocListLen=24
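The later failure to get the child elements was caused by a typo (the wrong variable). Corrected:
# carSeriesDocList = list(merchantRankDoc)
carSeriesDocList = list(carSeriesDocGenerator)
At least the logic is right now.
Next, looking at: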
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
the subsequent output is:
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x1082f8228>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
carSeriesInfoDoc=<Element h4 at 0x10831d0e8>
After switching to items(), as in merchantRankDoc.items("li[id*='s']"), the subsequent output is:
[0] eachCarSeriesDoc=<li id="s3170"> <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4><div>指导价:<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div> </li>
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
carSeriesName=奥迪A3
and the child elements under each li become accessible.
Further debugging showed the following.
For:
<ul class="rank-list-ul" 0> <li id="s3170"> 。。。 </li> <li id="s692"> 。。。 </li>
with find():
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
the return value is a PyQuery object:
type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>
Converting that PyQuery object (despite the variable name, it is not a generator) to a list:
carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))
each element is:
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
With items() instead:
carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
the return value is a generator:
type(carSeriesDocGenerator)=<class 'generator'>
Converting the generator to a list:
carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))
each element is:
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
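This matches the official documentation:
PyQuery.items(selector=None)
Iter over elements. Return PyQuery objects:
-> items() returns a generator of PyQuery (= pyquery.pyquery.PyQuery) objects
PyQuery.find(selector)
Find elements using selector traversing down from self:
-> find() returns a single PyQuery object, but iterating over it (or converting it to a list) yields plain elements (= lxml.html.HtmlElement)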
[Summary]
Here, for this HTML:
<ul class="rank-list-ul" 0> <li id="s3170"> <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4> <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div> <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div> <li id="s692"> <h4><a href="//www.autohome.com.cn/692/#levelsource=000000000_0&pvareaid=101594">奥迪A4L</a></h4> <div>指导价:<a class="red" href="//www.autohome.com.cn/692/price.html#pvareaid=101446">30.58-39.68万</a></div> <div><a href="//car.autohome.com.cn/price/series-692.html#pvareaid=103446">报价</a> <a id="atk_692" href="//car.autohome.com.cn/pic/series/692.html#pvareaid=103448">图库</a> <a data-value="692" class="js-che168link" href="//www.che168.com/china/series692/">二手车</a> <a href="//club.autohome.com.cn/bbs/forum-c-692-1.html#pvareaid=103447">论坛</a> <a href="//k.autohome.com.cn/692/#pvareaid=103459">口碑</a></div> </li> 。。。
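the goal was to get the multiple li nodes under the ul.
The earlier problems had two main causes:
- A typo
  - I used the wrong variable
- Not knowing that find() and items() give different results
  - Here I want PyQuery objects back, so items() is the right call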
For comparison, the version of the code that uses find():
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
# carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
carSeriesDocList = list(carSeriesDocGenerator)
print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
print("carSeriesDocList=%s" % carSeriesDocList)
carSeriesDocListLen = len(carSeriesDocList)
print("carSeriesDocListLen=%s" % carSeriesDocListLen)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("%s" % "-"*80)
    print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))
    # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>

    # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
    carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
    print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
    print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
    carSeriesName = carSeriesInfoDoc.text()
    print("carSeriesName=%s" % carSeriesName)
    carSeriesUrl = carSeriesInfoDoc.attr.href
    print("carSeriesUrl=%s" % carSeriesUrl)
Output:
type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[<Element li at 0x109bc3a98>, <Element li at 0x109bc36d8>, <Element li at 0x109bc3908>, <Element li at 0x109bc3b88>, <Element li at 0x109bc3e58>, <Element li at 0x109b9c908>, <Element li at 0x109bc2c78>, <Element li at 0x109bc2d68>, <Element li at 0x109bc21d8>, <Element li at 0x109bc2958>, <Element li at 0x109bc2db8>, <Element li at 0x109bc2908>, <Element li at 0x109bc27c8>]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x109bc3a98>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
[E 200815 22:15:25 base_handler:203] Empty tag name
    Traceback (most recent call last):
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<autohome_20200814>", line 138, in gradCarHtmlPage
      File "src/lxml/etree.pyx", line 1532, in lxml.etree._Element.find
      File "src/lxml/_elementpath.py", line 325, in lxml._elementpath.find
      File "src/lxml/_elementpath.py", line 102, in select
      File "src/lxml/_elementpath.py", line 103, in select
      File "src/lxml/etree.pyx", line 1437, in lxml.etree._Element.iterchildren
      File "src/lxml/etree.pyx", line 2841, in lxml.etree.ElementChildIterator.__cinit__
      File "src/lxml/etree.pyx", line 2812, in lxml.etree._ElementMatchIterator._initTagMatcher
      File "src/lxml/etree.pyx", line 2679, in lxml.etree._MultiTagMatcher.__cinit__
      File "src/lxml/etree.pyx", line 2718, in lxml.etree._MultiTagMatcher.initTagMatch
      File "src/lxml/etree.pyx", line 2749, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/etree.pyx", line 2736, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/apihelpers.pxi", line 1657, in lxml.etree._getNsTag
      File "src/lxml/apihelpers.pxi", line 1692, in lxml.etree.__getNsTag
    ValueError: Empty tag name
Switching find() to items():
carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
carSeriesDocList = list(carSeriesDocGenerator)
print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
print("carSeriesDocList=%s" % carSeriesDocList)
carSeriesDocListLen = len(carSeriesDocList)
print("carSeriesDocListLen=%s" % carSeriesDocListLen)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("%s" % "-"*80)
    print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))
    # type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>

    # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
    carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
    print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
    print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
    carSeriesName = carSeriesInfoDoc.text()
    print("carSeriesName=%s" % carSeriesName)
    carSeriesUrl = carSeriesInfoDoc.attr.href
    print("carSeriesUrl=%s" % carSeriesUrl)
makes everything work:
type(carSeriesDocGenerator)=<class 'generator'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[[<li#s3170>], [<li#s692>], [<li#s18>], [<li#s4526>], [<li#s4871>], [<li#s5240>], [<li#s2951>], [<li#s4851>], [<li#s3304>], [<li#s5765>], [<li#s19>], [<li#s509>], [<li#s812>]]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<li id="s3170"> <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4><div>指导价:<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div> </li>
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesInfoDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a>
carSeriesName=奥迪A3
carSeriesUrl=https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594
And that is all it takes.