[Solved] Getting the child elements of a PyQuery node in PySpider

Author: crifan
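While working on:
[Unsolved] Scraping detailed car model and series data from Autohome (汽车之家) with Python
I needed to extract, from the following HTML: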
    <ul class="rank-list-ul" 0>

      <li id="s3170">
        <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
        <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170"
            href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170"
            class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>

      </li>


      <li id="s692">
...
the elements of the form:
<li id="s3170">
I tried several variants:
            merchantRankDoc = merchantRankDocList[curIdx]
            print("merchantRankDoc=%s" % merchantRankDoc)
            print("type(merchantRankDoc)=%s" % type(merchantRankDoc)) # type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
            merchantRankHtml = merchantRankDoc.html()
            print("merchantRankHtml=%s" % merchantRankHtml)
            # <li id="s3170">
            # carSeriesDocGenerator = merchantRankDoc.find("li")
            carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*=s]")
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
None of them worked. The result was usually:
None
or only 3 child nodes.
Printing the object showed:
merchantRankDoc=<Element ul at 0x109b69c78>
type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
That is, an lxml.html.HtmlElement.
So the task became: given a ul that is an lxml.html.HtmlElement, how to get the li elements under it.
Reference:
The lxml.etree Tutorial
>>> print(etree.tostring(root,pretty_print=True))
<root>
  <child1/>
  <child2/>
  <child3/>
</root>

>>> children = list(root)

>>> for child in root:
...     print(child.tag)
child1
child2
child3
Trying it:
            carSeriesDocList = list(merchantRankDoc)
            print("carSeriesDocList=%s" % carSeriesDocList)
Then print the HTML:
from lxml import etree
            merchantRankHtml = etree.tostring(merchantRankDoc)
            print("merchantRankHtml=%s" % merchantRankHtml)
Output:
merchantRankHtml=b'<ul class="rank-list-ul">&#13;\n                                                &#13;\n                                                <li id="s3170">&#13;\n                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">&#22885;&#36842;A3</a></h4><div>&#25351;&#23548;&#20215;&#65306;<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46&#19975;</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">&#25253;&#20215;</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">&#22270;&#24211;</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">&#20108;&#25163;&#36710;</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">&#35770;&#22363;</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">&#21475;&#30865;</a></div>&#13;\n                                                    &#13;\n                                                </li>&#13;\n                                                &#13;\n                                                。。。
。。。
。。。
href="https://k.autohome.com.cn/812/#pvareaid=103459">&#21475;&#30865;</a></div>&#13;\n                                                    &#13;\n                                                </li>&#13;\n                                                &#13;\n                                            </ul>&#13;\n                                            &#13;\n                                            '
And to get its child elements:
            carSeriesDocList = list(merchantRankDoc)
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)
Output:
carSeriesDocList=[<Element li at 0x109b92c28>, <Element li at 0x109b92048>, <Element li at 0x109ba2548>, <Element li at 0x109ba2b38>, <Element li at 0x109ba2048>, <Element li at 0x109ba22c8>, <Element li at 0x109ba2908>, <Element li at 0x109ba2188>, <Element li at 0x109ba26d8>, <Element li at 0x109ba2b88>, <Element li at 0x109ba2ea8>, <Element li at 0x109ba2098>, <Element li at 0x109ba2e58>, <Element li at 0x109ba2368>, <Element li at 0x109ba2138>]
carSeriesDocListLen=15
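This does seem to return the li child elements,
but it offers no direct way to select only those that match a condition.
For example, what I want is: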
      <li id="s4871">
。。。
but not:
<li class="dashline"></li>
So I went looking for a way to match only those.
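For reference, a bare lxml element can filter its own children without PyQuery. The following is a minimal sketch, assuming a trimmed-down, hypothetical copy of the markup (SAMPLE_HTML is not the real page) and that the wanted li elements are exactly those carrying an id attribute:

```python
import lxml.html

# Hypothetical, shortened copy of the list markup discussed above.
SAMPLE_HTML = """
<ul class="rank-list-ul">
  <li id="s3170"><h4><a href="//www.autohome.com.cn/3170/">奥迪A3</a></h4></li>
  <li class="dashline"></li>
  <li id="s692"><h4><a href="//www.autohome.com.cn/692/">奥迪A4L</a></h4></li>
</ul>
"""

# For a fragment with a single root element, fromstring() returns that element.
ul = lxml.html.fromstring(SAMPLE_HTML)

# list(element) returns every direct child, including the separator <li>.
all_children = list(ul)

# Keep only the children that carry an id attribute.
series_items = [li for li in ul if li.get("id")]

# The same filter expressed as an XPath query over direct children.
series_items_xpath = ul.xpath("./li[@id]")

print(len(all_children), [li.get("id") for li in series_items])
```

This stays entirely in lxml, so the results are still HtmlElement objects and PyQuery methods such as text() are not available on them.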
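Then it occurred to me: items() returns a generator, so when looping over it with for, perhaps the elements are PyQuery objects rather than lxml ones.

What I found here:
Whether the generator is converted to a list: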
        merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
        merchantDocList = list(merchantDocGenerator)
        print("merchantDocList=%s" % merchantDocList)
or iterated directly with a for loop, each element is a PyQuery:
type(merchantItem)=<class 'pyquery.pyquery.PyQuery'>
and not an lxml element.
pyquery – PyQuery complete API — pyquery 1.2.4 documentation
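I checked whether items() can take a selector argument directly.
Then I realized why I had been getting lxml elements instead of PyQuery objects all along: I had forgotten to call items(). Adding it: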
        # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
        merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()
Result: it now works:
merchantRankDocListLen=24
The later failure to get child elements was caused by a typo: the wrong variable was passed to list(). Corrected:
            # carSeriesDocList = list(merchantRankDoc)
            carSeriesDocList = list(carSeriesDocGenerator)
At least the logic is now correct.
Next, looking at:
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
which outputs:
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x1082f8228>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
carSeriesInfoDoc=<Element h4 at 0x10831d0e8>
After switching to items() instead, the output becomes:
[0] eachCarSeriesDoc=<li id="s3170">&#13;
                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4><div>指导价:<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>&#13;
                                                    &#13;
                                                </li>&#13;
                                                &#13;
                                                
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4>
carSeriesName=奥迪A3
so the child elements can now be retrieved.
Further debugging showed the following.
For:
    <ul class="rank-list-ul" 0>

      <li id="s3170">
。。。
      </li>

      <li id="s692">
。。。
      </li>
With find():
carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
the return value is a PyQuery:
type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>
Converting that PyQuery (not a generator) to a list:
carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
    print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc))
each element is:
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
With items() instead:
carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
the return value is a generator:
type(carSeriesDocGenerator)=<class 'generator'>
Converting the generator to a list:
carSeriesDocList = list(carSeriesDocGenerator)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
each element is:
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
This matches the official documentation:
pyquery – PyQuery complete API — pyquery 1.2.4 documentation
PyQuery.items(selector=None)
    
Iter over elements. Return PyQuery objects:
-> items() returns a generator of PyQuery (pyquery.pyquery.PyQuery) objects
PyQuery.find(selector)
    
Find elements using selector traversing down from self:
-> find() returns a PyQuery, but iterating over it (or converting it to a list) yields the raw elements, i.e. lxml.html.HtmlElement
[Summary]
Here, for the HTML:
    <ul class="rank-list-ul" 0>

      <li id="s3170">
        <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
        <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170"
            href="//car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170"
            class="js-che168link" href="//www.che168.com/china/series3170/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>
      </li>

      <li id="s692">
        <h4><a href="//www.autohome.com.cn/692/#levelsource=000000000_0&pvareaid=101594">奥迪A4L</a></h4>
        <div>指导价:<a class="red" href="//www.autohome.com.cn/692/price.html#pvareaid=101446">30.58-39.68万</a></div>
        <div><a href="//car.autohome.com.cn/price/series-692.html#pvareaid=103446">报价</a> <a id="atk_692"
            href="//car.autohome.com.cn/pic/series/692.html#pvareaid=103448">图库</a> <a data-value="692"
            class="js-che168link" href="//www.che168.com/china/series692/">二手车</a> <a
            href="//club.autohome.com.cn/bbs/forum-c-692-1.html#pvareaid=103447">论坛</a> <a
            href="//k.autohome.com.cn/692/#pvareaid=103459">口碑</a></div>
      </li>
。。。
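the goal was to get the li nodes under the ul.
The earlier problems had two main causes:
  • A typo
    • the wrong variable name was used
  • Not knowing that find() and items() yield different results
    • PyQuery objects were wanted here, so items() is the right call
The code, first with find():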
            carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
            # carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

            carSeriesDocList = list(carSeriesDocGenerator)
            print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)

            for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                print("%s" % "-"*80)
                print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
                # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                carSeriesName = carSeriesInfoDoc.text()
                print("carSeriesName=%s" % carSeriesName)
                carSeriesUrl = carSeriesInfoDoc.attr.href
                print("carSeriesUrl=%s" % carSeriesUrl)
Output, ending in an error:
type(carSeriesDocGenerator)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[<Element li at 0x109bc3a98>, <Element li at 0x109bc36d8>, <Element li at 0x109bc3908>, <Element li at 0x109bc3b88>, <Element li at 0x109bc3e58>, <Element li at 0x109b9c908>, <Element li at 0x109bc2c78>, <Element li at 0x109bc2d68>, <Element li at 0x109bc21d8>, <Element li at 0x109bc2958>, <Element li at 0x109bc2db8>, <Element li at 0x109bc2908>, <Element li at 0x109bc27c8>]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<Element li at 0x109bc3a98>
type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
[E 200815 22:15:25 base_handler:203] Empty tag name
    Traceback (most recent call last):
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 176, in _run_task
        return self._run_func(function, response, task)
      File "/Users/crifan/.pyenv/versions/3.6.6/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 155, in _run_func
        ret = function(*arguments[:len(args) - 1])
      File "<autohome_20200814>", line 138, in gradCarHtmlPage
      File "src/lxml/etree.pyx", line 1532, in lxml.etree._Element.find
      File "src/lxml/_elementpath.py", line 325, in lxml._elementpath.find
      File "src/lxml/_elementpath.py", line 102, in select
      File "src/lxml/_elementpath.py", line 103, in select
      File "src/lxml/etree.pyx", line 1437, in lxml.etree._Element.iterchildren
      File "src/lxml/etree.pyx", line 2841, in lxml.etree.ElementChildIterator.__cinit__
      File "src/lxml/etree.pyx", line 2812, in lxml.etree._ElementMatchIterator._initTagMatcher
      File "src/lxml/etree.pyx", line 2679, in lxml.etree._MultiTagMatcher.__cinit__
      File "src/lxml/etree.pyx", line 2718, in lxml.etree._MultiTagMatcher.initTagMatch
      File "src/lxml/etree.pyx", line 2749, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/etree.pyx", line 2736, in lxml.etree._MultiTagMatcher._storeTags
      File "src/lxml/apihelpers.pxi", line 1657, in lxml.etree._getNsTag
      File "src/lxml/apihelpers.pxi", line 1692, in lxml.etree.__getNsTag
    ValueError: Empty tag name
Replacing find() with items():
            carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
            print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))

            carSeriesDocList = list(carSeriesDocGenerator)
            print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
            print("carSeriesDocList=%s" % carSeriesDocList)
            carSeriesDocListLen = len(carSeriesDocList)
            print("carSeriesDocListLen=%s" % carSeriesDocListLen)
            
            for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                print("%s" % "-"*80)
                print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
                # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                carSeriesName = carSeriesInfoDoc.text()
                print("carSeriesName=%s" % carSeriesName)
                carSeriesUrl = carSeriesInfoDoc.attr.href
                print("carSeriesUrl=%s" % carSeriesUrl)
everything works as expected:
type(carSeriesDocGenerator)=<class 'generator'>
type(carSeriesDocList)=<class 'list'>
carSeriesDocList=[[<li#s3170>], [<li#s692>], [<li#s18>], [<li#s4526>], [<li#s4871>], [<li#s5240>], [<li#s2951>], [<li#s4851>], [<li#s3304>], [<li#s5765>], [<li#s19>], [<li#s509>], [<li#s812>]]
carSeriesDocListLen=13
--------------------------------------------------------------------------------
[0] eachCarSeriesDoc=<li id="s3170">&#13;
                                                <h4><a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a></h4><div>指导价:<a class="red" href="https://www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div><div><a href="https://car.autohome.com.cn/price/series-3170.html#pvareaid=103446">报价</a> <a id="atk_3170" href="https://car.autohome.com.cn/pic/series/3170.html#pvareaid=103448">图库</a> <a data-value="3170" class="js-che168link" href="https://www.che168.com/china/series3170/">二手车</a> <a href="https://club.autohome.com.cn/bbs/forum-c-3170-1.html#pvareaid=103447">论坛</a> <a href="https://k.autohome.com.cn/3170/#pvareaid=103459">口碑</a></div>&#13;
                                                    &#13;
                                                </li>&#13;
                                                &#13;
                                                
type(eachCarSeriesDoc)=<class 'pyquery.pyquery.PyQuery'>
type(carSeriesInfoDoc)=<class 'pyquery.pyquery.PyQuery'>
carSeriesInfoDoc=<a href="https://www.autohome.com.cn/3170/#levelsource=000000000_0&amp;pvareaid=101594">奥迪A3</a>
carSeriesName=奥迪A3
carSeriesUrl=https://www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594
and that is all that is needed.
