
[Solved] PySpider: using PyQuery to extract the multiple strong strings inside the a links under a p in HTML

Category: Strings | Author: crifan
While working on:
[Record] Crawling Scholastic picture-book data with PySpider
I ran into a slightly unusual extraction case:
https://www.scholastic.com/teachers/books/lord-of-the-fleas-by-dav-pilkey/
<p class="contributors">
                By   
                
                <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a>
                , 
                illustrated by       
                <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a>
            </p>
Goal:
Besides the authors, also extract the illustrator, and keep the two apart.
Previously I used:
        authors = []
        contributors = response.doc('p[class="contributors"]')
        print("contributors=%s" % contributors)
        for eachAuthor in contributors.find('a[href] strong').items():
            print("eachAuthor=%s" % eachAuthor)
            authorText = eachAuthor.text()
            print("authorText=%s" % authorText)
            authors.append(authorText)
This gets the authors fine, but the illustrator ends up mixed into the same list.
So I need a way to tell them apart.
Using this code:
        for eachContributorContent in contributors.contents():
            print("eachContributorContent=%s" % eachContributorContent)
            contentItemType = type(eachContributorContent)
            print("contentItemType=%s" % contentItemType)
            # contentText = eachContributorContent.text()
            # print("contentText=%s" % contentText)
Debug output:
eachContributorContent=
                 
                By   
                
                
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
eachContributorContent=<Element a at 0x102ac3e58>
contentItemType=<class 'lxml.html.HtmlElement'>
eachContributorContent=
                , 
            
                
                 
                illustrated by       
                 
                
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
eachContributorContent=<Element a at 0x102ac32c8>
contentItemType=<class 'lxml.html.HtmlElement'>
eachContributorContent=
                
            
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
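So one idea is:
Check the type of each content item. If it is a string (lxml.etree._ElementUnicodeResult),
check whether it contains: illustrated by
If it does, start collecting illustrator values from that point on; until then, keep collecting authors.
If the type is lxml.html.HtmlElement, get its text via .text()
Then I tried: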
eachContributorContent.text()
That raises an error, so I looked up:
python lxml.html.HtmlElement
lxml.html.HtmlElement
lxml.etree._Element
“getset_descriptor
text = <attribute ‘text’ of ‘lxml.etree._Element’ object…”
So it looks like the attribute to use is:
text
But here text returns None:
---------- eachContributorContent=
                                 
                                By   
                                
                                
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
---------- eachContributorContent=<Element a at 0x101647458>
contentItemType=<class 'lxml.html.HtmlElement'>
contentText=None
Continuing with:
https://pythonhosted.org/pyquery/api.html#module-pyquery.pyquery
to find out how to get the text of an lxml.html.HtmlElement.
python – extracting attributes from html with lxml – Stack Overflow
The HTML here is:
<strong>xxx</strong>
so perhaps this is what I need:
https://lxml.de/api/lxml.etree._Element-class.html
“find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.”
to get the value. Trying it:
strongValue = eachContributorContent.find("strong")
print("strongValue=%s" % strongValue)
Result:
strongValue=<Element strong at 0x105b5d908>
That works, so next:
# contentText = eachContributorContent.text()
strongElement = eachContributorContent.find("strong")
print("strongElement=%s" % strongElement)
# contentText = eachContributorContent.text
contentText = strongElement.text
print("contentText=%s" % contentText)
currentList.append(contentText)
which gets the value.
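As background on why text came back as None for the a element: in lxml, an element's text attribute only holds the text that comes before its first child, and here the a starts directly with strong. The following is a minimal sketch, assuming lxml is installed; the variable names are illustrative only, and text_content() is shown as an alternative that the code in this post does not use.

```python
import lxml.html

# Same shape as one contributor link from the page.
fragment = '<a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a>'
root = lxml.html.fromstring(fragment)
# Look the element up explicitly instead of relying on what fromstring returns.
link = root.xpath("descendant-or-self::a")[0]

# .text is only the text before the first child; the <a> begins with <strong>.
direct_text = link.text

# Going through the child element, as done in this post.
strong_text = link.find("strong").text

# Alternative (not used in the post): all descendant text, concatenated.
full_text = link.text_content().strip()

print("direct_text=%r" % direct_text)
print("strong_text=%r" % strong_text)
print("full_text=%r" % full_text)
```

If a link might not contain a strong child, find("strong") returns None, so text_content() may be the more forgiving choice.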
I also tried:
            else:
                # is text
                print("Not lxml.html.HtmlElement: eachContributorContent=%s" % eachContributorContent)
                pureText = eachContributorContent.text
                print("pureText=%s" % pureText)
                if "illustrated by" in pureText:
                    print("+++ found illustrated by")
                    illustrator.append(pureText)
Result:
    AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'text'
So I needed to figure out:
how to get the text out of an lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult text
lxml.etree._elementunicoderesult to str
which encoding does the python lxml module use internally? – Stack Overflow
“$ pydoc lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
|  Method resolution order:
|      _ElementUnicodeResult
|      __builtin__.unicode
|      __builtin__.basestring
|      __builtin__.object”
So it simply inherits from unicode.
-> It can be treated as an ordinary (Unicode) string. This is Python 3, so I also wrap it in str() just to be safe:
# pureText = eachContributorContent.text
pureText = str(eachContributorContent)
And then the string is available.
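The pydoc output quoted above comes from Python 2; on Python 3 the same class derives from str. A small sketch to check this, assuming lxml is installed and using a shortened, hand-written version of the contributors paragraph (the XPath child::node() is used here to get the same mix of text and element nodes):

```python
import lxml.html

html = ('<p class="contributors"> By <a href="#"><strong> Dav Pilkey</strong></a>'
        ', illustrated by <a href="#"><strong> Dav Pilkey</strong></a></p>')
root = lxml.html.fromstring(html)
paragraph = root.xpath("descendant-or-self::p")[0]

# child::node() returns text nodes and element nodes in document order.
nodes = paragraph.xpath("child::node()")

# Text nodes come back as lxml "smart strings", which are str subclasses.
text_nodes = [node for node in nodes if isinstance(node, str)]

for text in text_nodes:
    # No .text attribute is needed: the object already is the string.
    print("type=%s stripped=%r" % (type(text).__name__, text.strip()))
```

Because these objects already are strings, the in operator and strip() work on them directly; the str() call only converts them to a plain str.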
[Summary]
The final code:
        authors = []
        illustrator = []

        # for eachAuthor in contributors.find('a[href] strong').items():
        #     print("eachAuthor=%s" % eachAuthor)
        #     authorText = eachAuthor.text()
        #     print("authorText=%s" % authorText)
        #     authors.append(authorText)

        # special: has illustrator
        # https://www.scholastic.com/teachers/books/riff-raff-sails-the-high-cheese-by-susan-schade/

        contributors = response.doc('p[class="contributors"]')
        print("contributors=%s" % contributors)
        for eachContributorItem in contributors.items():
            print("eachContributorItem=%s" % eachContributorItem)
            itemText = eachContributorItem.text()
            print("itemText=%s" % itemText)

        # for eachContributorChild in contributors.children():
        #     print("eachContributorChild=%s" % eachContributorChild)
        #     childText = eachContributorChild.text()
        #     print("childText=%s" % childText)
        currentAuthorList = authors
        for eachContributorContent in contributors.contents():
            print("---------- eachContributorContent=%s" % eachContributorContent)
            contentItemType = type(eachContributorContent)
            print("contentItemType=%s" % contentItemType)
            # note: needs "import lxml.html" at the top of the script
            if contentItemType is lxml.html.HtmlElement:
                # is element
                # contentText = eachContributorContent.text()
                strongElement = eachContributorContent.find("strong")
                print("strongElement=%s" % strongElement)
                # contentText = eachContributorContent.text
                contentText = strongElement.text
                print("contentText=%s" % contentText)
                
                strippedText = contentText.strip()
                currentAuthorList.append(strippedText)
            else:
                # is text
                print("Not lxml.html.HtmlElement: eachContributorContent=%s" % eachContributorContent)
                # pureText = eachContributorContent.text
                pureText = str(eachContributorContent)
                print("pureText=%s" % pureText)
                if "illustrated by" in pureText:
                    print("+++ found illustrated by")
                    currentAuthorList = illustrator

        print("authors=%s" % authors)
        print("illustrator=%s" % illustrator)
Output:
contributors=<p class="contributors">
                                 
                                By   
                                
                                <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a>
                                , 
                        
                                
                                 
                                illustrated by       
                                 
                                <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a>
                                
                        </p>
                        
eachContributorItem=<p class="contributors">
                                 
                                By   
                                
                                <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a>
                                , 
                        
                                
                                 
                                illustrated by       
                                 
                                <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a>
                                
                        </p>
                        
itemText=By Dav Pilkey , illustrated by Dav Pilkey
---------- eachContributorContent=
                                 
                                By   
                                
                                
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent=
                                 
                                By   
                                
                                
pureText=
                                 
                                By   
                                
                                
---------- eachContributorContent=<Element a at 0x10fbe9958>
contentItemType=<class 'lxml.html.HtmlElement'>
strongElement=<Element strong at 0x10fbe9778>
contentText= Dav Pilkey
---------- eachContributorContent=
                                , 
                        
                                
                                 
                                illustrated by       
                                 
                                
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent=
                                , 
                        
                                
                                 
                                illustrated by       
                                 
                                
pureText=
                                , 
                        
                                
                                 
                                illustrated by       
                                 
                                
+++ found illustrated by
---------- eachContributorContent=<Element a at 0x10fbe9688>
contentItemType=<class 'lxml.html.HtmlElement'>
strongElement=<Element strong at 0x10fbe9a48>
contentText= Dav Pilkey
---------- eachContributorContent=
                                
                        
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent=
                                
                        
pureText=
                                
                        
authors=['Dav Pilkey']
illustrator=['Dav Pilkey']
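That finally yields the author and illustrator lists we wanted.
I also verified it against:
How Not to Start Third Grade by Ellen TitlebaumCathy Hapka | Scholastic
and the values are extracted correctly there as well: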
authors=['Ellen Titlebaum', 'Cathy Hapka']
illustrator=['Debbie Palen']
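For trying the same approach outside PySpider, here is a condensed, standalone sketch. It makes several assumptions that differ from the code above: it uses lxml directly instead of response.doc, the function name split_contributors and the constant SAMPLE_HTML are made up for this example, isinstance replaces the exact type comparison, and text_content() replaces find("strong").text so that a link without a strong child does not raise.

```python
import lxml.html

# Hand-copied from the page snippet at the top of this post.
SAMPLE_HTML = """
<p class="contributors">
    By
    <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a>
    ,
    illustrated by
    <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a>
</p>
"""


def split_contributors(html):
    """Return (authors, illustrators) from a contributors paragraph."""
    root = lxml.html.fromstring(html)
    authors = []
    illustrators = []
    for paragraph in root.xpath('descendant-or-self::p[@class="contributors"]'):
        # Names are collected as authors until "illustrated by" shows up.
        current = authors
        for node in paragraph.xpath("child::node()"):
            if isinstance(node, lxml.html.HtmlElement):
                # Element node: take all of its text, whitespace trimmed.
                name = node.text_content().strip()
                if name:
                    current.append(name)
            elif "illustrated by" in str(node):
                # Text node with the marker: switch the target list.
                current = illustrators
    return authors, illustrators


authors, illustrators = split_contributors(SAMPLE_HTML)
print("authors=%s" % authors)
print("illustrator=%s" % illustrators)
```

The sketch keeps the core idea of the post: walk the paragraph's children in order and switch the target list once the text "illustrated by" is seen. It has only been reasoned through against the markup shown here, so pages with a different contributors layout may need extra handling.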
