While working on:
[Notes] Crawling Scholastic picture-book data with PySpider
I ran into a slightly unusual piece of content to extract:
<p class="contributors"> By <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a> , illustrated by <a href="/teachers/authors/dav-pilkey.html" target="_self"><strong> Dav Pilkey</strong></a> </p>
The goal now:
Besides the authors, also extract the illustrator, and keep the two apart.
Previously I used:
authors = []
contributors = response.doc('p[class="contributors"]')
print("contributors=%s" % contributors)
for eachAuthor in contributors.find('a[href] strong').items():
    print("eachAuthor=%s" % eachAuthor)
    authorText = eachAuthor.text()
    print("authorText=%s" % authorText)
    authors.append(authorText)
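This extracts the authors without problems, but the illustrator ends up mixed into the same list.
So I need a way to pull the illustrator out separately.
With this code: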
for eachContributorContent in contributors.contents():
    print("eachContributorContent=%s" % eachContributorContent)
    contentItemType = type(eachContributorContent)
    print("contentItemType=%s" % contentItemType)
    # contentText = eachContributorContent.text()
    # print("contentText=%s" % contentText)
the debug output is:
eachContributorContent= By 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
eachContributorContent=<Element a at 0x102ac3e58>
contentItemType=<class 'lxml.html.HtmlElement'>
eachContributorContent= , illustrated by 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
eachContributorContent=<Element a at 0x102ac32c8>
contentItemType=<class 'lxml.html.HtmlElement'>
eachContributorContent= 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
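So:
One approach is:
Check the type of each content item. If it is a string (lxml.etree._ElementUnicodeResult),
check whether it contains: illustrated by
If it does, start collecting illustrator values from that point on; otherwise keep collecting authors.
If the type is lxml.html.HtmlElement, get the text inside it with .text()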
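The approach above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions: it uses Python's standard xml.etree.ElementTree as a stand-in for PyQuery/lxml (the fragment happens to be well-formed XML, which real pages generally are not), and split_contributors is a hypothetical helper name, not part of any library. Instead of contents(), it relies on the ElementTree convention that text before the first child is stored in .text and text following an element is stored in that element's .tail.

```python
import xml.etree.ElementTree as ET

# The fragment from the page. It is well-formed XML, so the standard
# library parser can read it (an assumption that will not hold for
# arbitrary HTML, where lxml/PyQuery is the better tool).
HTML = (
    '<p class="contributors"> By '
    '<a href="/teachers/authors/dav-pilkey.html" target="_self">'
    '<strong> Dav Pilkey</strong></a>'
    ' , illustrated by '
    '<a href="/teachers/authors/dav-pilkey.html" target="_self">'
    '<strong> Dav Pilkey</strong></a>'
    ' </p>'
)


def split_contributors(fragment):
    """Hypothetical helper: return (authors, illustrators) for one <p>."""
    paragraph = ET.fromstring(fragment)
    authors, illustrators = [], []
    current = authors  # names go here until "illustrated by" is seen

    # Text before the first child element is stored in paragraph.text
    if "illustrated by" in (paragraph.text or ""):
        current = illustrators

    for child in paragraph:
        # Gather all text inside the element, including the nested <strong>
        name = "".join(child.itertext()).strip()
        if name:
            current.append(name)
        # Text that follows the element is stored in child.tail
        if "illustrated by" in (child.tail or ""):
            current = illustrators

    return authors, illustrators


authors, illustrators = split_contributors(HTML)
print("authors=%s" % authors)
print("illustrator=%s" % illustrators)
```

The single `current` reference is the whole trick: every name is appended to whichever list it points at, and the marker text only switches the target.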
I then tried:
eachContributorContent.text()
This raises an error. Searching for:
python lxml.html.HtmlElement
“getset_descriptor
text = <attribute ‘text’ of ‘lxml.etree._Element’ object…”
it looks like the attribute to use is:
text
But using text here returns None:
---------- eachContributorContent= By 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
---------- eachContributorContent=<Element a at 0x101647458>
contentItemType=<class 'lxml.html.HtmlElement'>
contentText=None
Reading on,
to find out how to get the text from an lxml.html.HtmlElement.
The HTML inside the element here is:
<strong>xxx</strong>
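Since the text sits inside the child strong element rather than directly in the a element, perhaps this is what is needed: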
“find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.”
to get at the value. Trying it:
strongValue = eachContributorContent.find("strong")
print("strongValue=%s" % strongValue)
Result:
strongValue=<Element strong at 0x105b5d908>
That works, so the text can then be read with:
# contentText = eachContributorContent.text()
strongElement = eachContributorContent.find("strong")
print("strongElement=%s" % strongElement)
# contentText = eachContributorContent.text
contentText = strongElement.text
print("contentText=%s" % contentText)
currentList.append(contentText)
which yields the value.
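To illustrate why .text was None on the a element but not on strong, here is a small sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in, since lxml follows the same ElementTree convention for .text; the variable names are made up for illustration. With lxml itself, HtmlElement.text_content() is another way to collect the text of a whole subtree.

```python
import xml.etree.ElementTree as ET

link = ET.fromstring(
    '<a href="/teachers/authors/dav-pilkey.html">'
    '<strong> Dav Pilkey</strong></a>'
)

# .text only covers text that appears before the first child element,
# and here <strong> starts right after the opening <a> tag.
link_text = link.text

# The name is the .text of the nested <strong> element.
strong_text = link.find("strong").text

# itertext() walks the whole subtree, so it does not depend on the nesting.
all_text = "".join(link.itertext())

print("link_text=%r" % link_text)
print("strong_text=%r" % strong_text)
print("all_text=%r" % all_text)
```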
Next, for the text items:
else:
    # is text
    print("Not lxml.html.HtmlElement: eachContributorContent=%s" % eachContributorContent)
    pureText = eachContributorContent.text
    print("pureText=%s" % pureText)
    if "illustrated by" in pureText:
        print("+++ found illustrated by")
        illustrator.append(pureText)
Result:
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'text'
So the next thing to figure out is
how to get the text out of an lxml.etree._ElementUnicodeResult. Searching for:
lxml.etree._ElementUnicodeResult text
lxml.etree._elementunicoderesult to str
“$ pydoc lxml.etree._ElementUnicodeResult
lxml.etree._ElementUnicodeResult = class _ElementUnicodeResult(__builtin__.unicode)
| Method resolution order:
| _ElementUnicodeResult
| __builtin__.unicode
| __builtin__.basestring
| __builtin__.object”
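It simply inherits from unicode.
So it can be treated as an ordinary (Unicode) string. The pydoc output above comes from Python 2; under Python 3, which is what runs here, the class inherits from str. To be on the safe side, it can still be converted with str():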
# pureText = eachContributorContent.text
pureText = str(eachContributorContent)
That gives a plain string.
[Summary]
The final code:
import lxml.html  # needed for the lxml.html.HtmlElement type check below

authors = []
illustrator = []
# for eachAuthor in contributors.find('a[href] strong').items():
#     print("eachAuthor=%s" % eachAuthor)
#     authorText = eachAuthor.text()
#     print("authorText=%s" % authorText)
#     authors.append(authorText)

# special: has illustrator
# https://www.scholastic.com/teachers/books/riff-raff-sails-the-high-cheese-by-susan-schade/
contributors = response.doc('p[class="contributors"]')
print("contributors=%s" % contributors)
for eachContributorItem in contributors.items():
    print("eachContributorItem=%s" % eachContributorItem)
    itemText = eachContributorItem.text()
    print("itemText=%s" % itemText)

# for eachContributorChild in contributors.children():
#     print("eachContributorChild=%s" % eachContributorChild)
#     childText = eachContributorChild.text()
#     print("childText=%s" % childText)

# names are appended to authors until "illustrated by" is seen
currentAuthorList = authors
for eachContributorContent in contributors.contents():
    print("---------- eachContributorContent=%s" % eachContributorContent)
    contentItemType = type(eachContributorContent)
    print("contentItemType=%s" % contentItemType)
    if contentItemType is lxml.html.HtmlElement:
        # is element
        # contentText = eachContributorContent.text()
        strongElement = eachContributorContent.find("strong")
        print("strongElement=%s" % strongElement)
        # contentText = eachContributorContent.text
        contentText = strongElement.text
        print("contentText=%s" % contentText)
        strippedText = contentText.strip()
        currentAuthorList.append(strippedText)
    else:
        # is text
        print("Not lxml.html.HtmlElement: eachContributorContent=%s" % eachContributorContent)
        # pureText = eachContributorContent.text
        pureText = str(eachContributorContent)
        print("pureText=%s" % pureText)
        if "illustrated by" in pureText:
            print("+++ found illustrated by")
            # from here on, names go into the illustrator list
            currentAuthorList = illustrator

print("authors=%s" % authors)
print("illustrator=%s" % illustrator)
Output:
contributors=<p class="contributors"> By <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a> , illustrated by <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a> </p>
eachContributorItem=<p class="contributors"> By <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a> , illustrated by <a href="https://www.scholastic.com/teachers/authors/dav-pilkey.html"><strong> Dav Pilkey</strong></a> </p>
itemText=By Dav Pilkey , illustrated by Dav Pilkey
---------- eachContributorContent= By 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent= By 
pureText= By 
---------- eachContributorContent=<Element a at 0x10fbe9958>
contentItemType=<class 'lxml.html.HtmlElement'>
strongElement=<Element strong at 0x10fbe9778>
contentText= Dav Pilkey
---------- eachContributorContent= , illustrated by 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent= , illustrated by 
pureText= , illustrated by 
+++ found illustrated by
---------- eachContributorContent=<Element a at 0x10fbe9688>
contentItemType=<class 'lxml.html.HtmlElement'>
strongElement=<Element strong at 0x10fbe9a48>
contentText= Dav Pilkey
---------- eachContributorContent= 
contentItemType=<class 'lxml.etree._ElementUnicodeResult'>
Not lxml.html.HtmlElement: eachContributorContent= 
pureText= 
authors=['Dav Pilkey']
illustrator=['Dav Pilkey']
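At last the author list and the illustrator list are parsed out separately.
I also verified it against another page,
and the values are parsed correctly there too: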
authors=['Ellen Titlebaum', 'Cathy Hapka']
illustrator=['Debbie Palen']
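Please credit the source when reposting: 在路上 » [Solved] Using PyQuery in PySpider to extract the multiple strong strings inside the a links under a p element in HTML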