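While working on:
[Unsolved] Automating the browser with puppeteer on Mac to run a Baidu search
I got to the point where pyppeteer can trigger a Baidu search. The next step is to extract the search results.
First, figure out how to match the elements.
From
[Summary] Viewing the HTML source of elements on the Baidu home page with Chrome or Chromium
the element of interest is: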
<h3 class="t"><a data-click="{ 'F':'778317EA', 'F1':'9D73F1E4', 'F2':'4CA6DE6B', 'F3':'54E5243F', 'T':'1616767238', 'y':'EFBCEFBE' }" href="https://www.baidu.com/link?url=nDSbU9I2MSInD6Tq7Je06wZD-CiTQ-ckokscP4kiXneJcS0UWUPIqWHMjLDyn5uW&wd=&eqid=919e8ff000236bc300000004605de906" target="_blank"><em>crifan</em> (<em>Crifan</em> Li) · GitHub</a></h3>
Now work out a selector that matches this element:
h3ASelector = "h3[class^='t'] a"
aElemList = await page.querySelectorAll(h3ASelector)
print("aElemList=%s" % aElemList)
This matches the elements.
Continuing the investigation.
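To sanity-check offline what the selector `h3[class^='t'] a` should pick up, here is a minimal sketch that uses Python's standard `html.parser` instead of pyppeteer. The class name `ResultLinkParser`, the shortened `url=EXAMPLE` link and the second `<h3 class="other">` block are made up for illustration. It only mimics this one selector and is not a general CSS engine.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect (text, href) for every <a> inside an <h3> whose class starts with 't'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_h3 = False   # inside a matching <h3>
        self._in_a = False    # inside an <a> within that <h3>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # class^='t' is a prefix match on the whole class attribute value
        if tag == "h3" and (attrs.get("class") or "").startswith("t"):
            self._in_h3 = True
        elif tag == "a" and self._in_h3:
            self._in_a = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Text inside nested tags such as <em> is collected too, like textContent
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append(("".join(self._text), self._href))
            self._in_a = False
        elif tag == "h3":
            self._in_h3 = False


# Hypothetical sample: the first <h3> mirrors the Baidu markup, the second must be skipped
sample_html = (
    '<h3 class="t"><a href="https://www.baidu.com/link?url=EXAMPLE" target="_blank">'
    '<em>crifan</em> (<em>Crifan</em> Li) · GitHub</a></h3>'
    '<h3 class="other"><a href="https://example.com/skip">skipped</a></h3>'
)

parser = ResultLinkParser()
parser.feed(sample_html)
parser.close()
links = parser.links
print(links)
```

Note that `[class^='t']` is a prefix match, so it would also match a class such as `title`. If only the exact class `t` is wanted, `h3.t a` is the stricter selector.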
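Next, figure out how to extract an element's value.
Searched for: puppeteer extract text
I then noticed that the official documentation
already shows exactly this: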
element = await page.querySelector('h1')
title = await page.evaluate('(element) => element.textContent', element)
Alternatively, this also works:
# elements = await page.xpath('//div[@class="title-box"]/a')
elements = await page.querySelectorAll(".title-box a")
for item in elements:
    print(await item.getProperty('textContent'))
    # <pyppeteer.execution_context.JSHandle object at 0x000002220E7FE518>

    # Get the text
    title_str = await (await item.getProperty('textContent')).jsonValue()

    # Get the link
    title_link = await (await item.getProperty('href')).jsonValue()
Continue writing the code:
searchResultANum = len(searchResultAList)
print("searchResultANum=%s" % searchResultANum)
for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    print("aTextJSHandle=%s" % aTextJSHandle)
    title = await aTextJSHandle.jsonValue()
    print("type(title)=%s" % type(title))
    print("title=%s" % title)
    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
Debugging this led to a separate problem:
[Solved] page.querySelectorAll in pyppeteer returns no results at runtime
After that was fixed, the code
resultASelector = "h3[class^='t'] a"
searchResultAList = await page.querySelectorAll(resultASelector)
print("searchResultAList=%s" % searchResultAList)
searchResultANum = len(searchResultAList)
print("searchResultANum=%s" % searchResultANum)
for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    print("aTextJSHandle=%s" % aTextJSHandle)
    title = await aTextJSHandle.jsonValue()
    print("type(title)=%s" % type(title))
    print("title=%s" % title)
    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
ran correctly on the first try.
Output:
searchResultAList=[<pyppeteer.element_handle.ElementHandle object at 0x10309e860>, <pyppeteer.element_handle.ElementHandle object at 0x10309e278>, <pyppeteer.element_handle.ElementHandle object at 0x10309e0f0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0b00>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0710>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0198>, <pyppeteer.element_handle.ElementHandle object at 0x1030b06d8>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0160>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0ef0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b30b8>]
searchResultANum=10
-------------------- [1] --------------------
type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
type(title)=<class 'str'>
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=1l0yCSpJcbVYGRoCdaXUal3NIDZXTn9F1Q4XwZcd6KzpAmUVPLca_wIIqFVlaBs6
[Summary]
The final solution first uses:
################################################################################
# Wait page reload complete
################################################################################

SearchFoundWordsSelector = 'span.nums_text'
SearchFoundWordsXpath = "//span[@class='nums_text']"

# await page.waitForSelector(SearchFoundWordsSelector)
# await page.waitFor(SearchFoundWordsSelector)
# await page.waitForXPath(SearchFoundWordsXpath)
# Note: all of the above raise an exception:
#   ElementHandleError
#   Evaluation failed: TypeError: MutationObserver is not a constructor
# so change to the following

# # Method 1: just wait
# await page.waitFor(2000) # millisecond

# Method 2: wait until the element shows up
SingleWaitSeconds = 1
while not await page.querySelector(SearchFoundWordsSelector):
    print("Still not found %s, wait %s seconds" % (SearchFoundWordsSelector, SingleWaitSeconds))
    await asyncio.sleep(SingleWaitSeconds)
    # pass
to make sure the page content has finished loading.
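The polling loop above never gives up if the element never appears. A minimal sketch of the same idea with an upper time limit might look like the following. The function name `wait_for_selector_by_polling`, its `interval` and `timeout` parameters and the `FakePage` stand-in are assumptions for illustration, not part of pyppeteer. The only pyppeteer behaviour relied on is that `page.querySelector` returns `None` when nothing matches.

```python
import asyncio
import time


async def wait_for_selector_by_polling(page, selector, interval=1.0, timeout=30.0):
    """Poll page.querySelector(selector) until it returns an element.

    Raises asyncio.TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        element = await page.querySelector(selector)
        if element:
            return element
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(
                "%s not found within %s seconds" % (selector, timeout))
        await asyncio.sleep(interval)


class FakePage:
    """Hypothetical stand-in for a pyppeteer page, used only to exercise the helper."""

    def __init__(self, succeed_on_call):
        self.calls = 0
        self.succeed_on_call = succeed_on_call

    async def querySelector(self, selector):
        # Return None until the configured call number is reached
        self.calls += 1
        return "element" if self.calls >= self.succeed_on_call else None


fake_page = FakePage(succeed_on_call=3)
found = asyncio.run(
    wait_for_selector_by_polling(fake_page, "span.nums_text", interval=0.01, timeout=1.0))
print("found=%s after %d polls" % (found, fake_page.calls))
```

With a real page the call would be `await wait_for_selector_by_polling(page, SearchFoundWordsSelector)` inside the existing coroutine.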
Then it uses:
################################################################################
# Extract result
################################################################################

resultASelector = "h3[class^='t'] a"
searchResultAList = await page.querySelectorAll(resultASelector)
# print("searchResultAList=%s" % searchResultAList)
searchResultANum = len(searchResultAList)
print("Found %s search result:" % searchResultANum)
for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    # print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    # type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
    # print("aTextJSHandle=%s" % aTextJSHandle)
    # aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
    title = await aTextJSHandle.jsonValue()
    # print("type(title)=%s" % type(title))
    # type(title)=<class 'str'>
    print("title=%s" % title)
    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
to extract the wanted results.
Output:
Found 10 search result:
-------------------- [1] --------------------
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=eGTzEXXlMw-hnvXYSFk8t4VSZPck1dougn7YhfCwBf3ZzGJEHdZYsoAQK-4GBJuP
-------------------- [2] --------------------
title=crifan – 在路上
baiduLinkUrl=http://www.baidu.com/link?url=l6jXejlgARrWj34ODgKWZ9BeNKwyYZLRhLb5B8oDFVqNpHoco8a_qbAdD1m-t_cf
-------------------- [3] --------------------
title=crifan简介_crifan的专栏-CSDN博客_crifan
baiduLinkUrl=http://www.baidu.com/link?url=IIqPM5wuVE_QP7S357-1bJWGGU1kpFcAZ945BaXUQNpaDzXihf_98wAVi05Gk6-8Qu4aGLv2Rv65WJm6Qr5kk_
-------------------- [4] --------------------
title=crifan的微博_微博
baiduLinkUrl=http://www.baidu.com/link?url=NnqeMlu4Jr_Ld-zoui8pbQO4eRMMO9pLd_DHXagqcdZ46NF4CSuyEziKSTpqCNEi
-------------------- [5] --------------------
title=Crifan的电子书大全 | crifan.github.io
baiduLinkUrl=http://www.baidu.com/link?url=uOZ-AmgNBNr3mGdETezIjTvtedH_ueM6-LNOc2QxbjcNeS8LuVBY-kirwogX7qLl
-------------------- [6] --------------------
title=GitHub - crifan/crifanLib: crifan's library
baiduLinkUrl=http://www.baidu.com/link?url=t42I1rYfn32DGw9C6cw_5lB-z1worKzEuROOtWj-Jyf1l2IBNBcz-l85hSKv9s9T
-------------------- [7] --------------------
title=在路上www.crifan.com - 网站排行榜
baiduLinkUrl=http://www.baidu.com/link?url=WwLwfXA72vK08Obyx2hwqA3-wmq8jAisi4VVSt2R0Ml3ccCy_yxeYfxD2xouAX-i5AyUU1U_2EghwVbJ2p-ipa
-------------------- [8] --------------------
title=crifan的专栏_crifan_CSDN博客-crifan领域博主
baiduLinkUrl=http://www.baidu.com/link?url=Cmcn2mXwiZr87FBGQBq85Np0hgGTP_AK2yLUW6GDeA21r7Q5WvUOUjaKZo5Jhb0f
-------------------- [9] --------------------
title=User crifan - Stack Overflow
baiduLinkUrl=http://www.baidu.com/link?url=yGgsq1z2vNDAAeWY-5VDWbHv7e7zPILHI4GVFPZd6MaFrGjYHsb3Onir1Vi6vvZqD7QAGJrZehIYZpcBfh_Gq_
-------------------- [10] --------------------
title=crifan - Bing 词典
baiduLinkUrl=http://www.baidu.com/link?url=UatxhUBL3T_1ikPco5OazvJaWkVqCeCHh4eoA6AX_lP4t_Bx3GVHlMHZjgu3YAwE
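If the results are needed as data rather than printed lines, the extraction loop could be wrapped into a coroutine that returns a list of dictionaries. The sketch below assumes only the three pyppeteer calls already used in this post (`querySelectorAll`, `getProperty`, `jsonValue`). The names `prop_value` and `extract_search_results`, the `Stub*` classes and the sample title and URL are hypothetical and exist only so the snippet runs without a browser.

```python
import asyncio


async def prop_value(handle, name):
    """Read a DOM property as a plain Python value (getProperty, then jsonValue)."""
    js_handle = await handle.getProperty(name)
    return await js_handle.jsonValue()


async def extract_search_results(page, selector="h3[class^='t'] a"):
    """Return one {'title', 'url'} dict per element matched by `selector`."""
    results = []
    for a_elem in await page.querySelectorAll(selector):
        title = await prop_value(a_elem, "textContent")
        url = await prop_value(a_elem, "href")
        results.append({"title": (title or "").strip(), "url": url})
    return results


# --- Hypothetical stand-ins for pyppeteer objects, for an offline dry run ---
class StubJSHandle:
    def __init__(self, value):
        self._value = value

    async def jsonValue(self):
        return self._value


class StubElement:
    def __init__(self, props):
        self._props = props

    async def getProperty(self, name):
        return StubJSHandle(self._props.get(name))


class StubPage:
    def __init__(self, elements):
        self._elements = elements

    async def querySelectorAll(self, selector):
        return self._elements


stub_page = StubPage([
    StubElement({"textContent": " sample title ",
                 "href": "http://www.baidu.com/link?url=EXAMPLE"}),
])
results = asyncio.run(extract_search_results(stub_page))
print(results)
```

With a real page, `results = await extract_search_results(page)` would replace the loop, and the list could then be written to JSON or CSV.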
When reposting, please credit: 在路上 » [Solved] Extracting information from Baidu search results with pyppeteer