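While working on:
[Unresolved] Automating the browser with puppeteer on Mac to run a Baidu search
I got to the point where pyppeteer can trigger a Baidu search. The next step is to extract the search results.
First, figure out how to match the elements.
According to
[Notes] Viewing the HTML source of the elements on the Baidu home page with Chrome or Chromium
the element to match looks like this: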
```html
<h3 class="t"><a data-click="{
            'F':'778317EA',
            'F1':'9D73F1E4',
            'F2':'4CA6DE6B',
            'F3':'54E5243F',
            'T':'1616767238',
            'y':'EFBCEFBE'
            }" href="https://www.baidu.com/link?url=nDSbU9I2MSInD6Tq7Je06wZD-CiTQ-ckokscP4kiXneJcS0UWUPIqWHMjLDyn5uW&wd=&eqid=919e8ff000236bc300000004605de906" target="_blank"><em>crifan</em> (<em>Crifan</em> Li) · GitHub</a></h3>
```
Now write a selector that matches this element:
```python
h3ASelector = "h3[class^='t'] a"
aElemList = await page.querySelectorAll(h3ASelector)
print("aElemList=%s" % aElemList)
```
The elements are found:
![](https://www.crifan.com/files/pic/uploads/2021/05/916c2c9b32ad458a959776030e3396c4.jpg)
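One detail of the selector `h3[class^='t'] a` is worth spelling out: `^=` is a prefix match on the whole `class` attribute value, so it is looser than `h3.t`. The sketch below illustrates this offline with Python's standard `html.parser` instead of a browser. The `H3LinkCollector` class, the `collect_links` helper and the two extra `<h3>` rows in `SAMPLE_HTML` are made up for illustration; this is not how pyppeteer evaluates selectors, only a rough model of which elements would match.

```python
from html.parser import HTMLParser


class H3LinkCollector(HTMLParser):
    """Collect <a> elements nested in an <h3> whose class value starts with 't'."""

    def __init__(self):
        super().__init__()
        self.h3_stack = []   # one bool per open <h3>: does its class start with 't'?
        self.current = None  # the link currently being read, or None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "h3":
            # [class^='t'] compares against the full attribute string
            self.h3_stack.append((attr_map.get("class") or "").startswith("t"))
        elif tag == "a" and any(self.h3_stack):
            self.current = {"href": attr_map.get("href"), "text": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.links.append(self.current)
            self.current = None
        elif tag == "h3" and self.h3_stack:
            self.h3_stack.pop()


# Illustrative input: the first row mimics the Baidu result above (shortened URL),
# the other two are invented to show what the prefix match does and does not accept.
SAMPLE_HTML = """
<h3 class="t"><a href="https://www.baidu.com/link?url=abc" target="_blank"><em>crifan</em> (<em>Crifan</em> Li) · GitHub</a></h3>
<h3 class="title"><a href="https://example.com/prefix">prefix match</a></h3>
<h3 class="c-title t"><a href="https://example.com/skipped">not matched</a></h3>
"""


def collect_links(html):
    parser = H3LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = collect_links(SAMPLE_HTML)
for link in links:
    print(link)
```

So `class="title"` would also match, while `class="c-title t"` would not, even though it contains the class `t`. If Baidu ever changes the class list on these headings, the selector may need adjusting.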
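Continue investigating.
Next, figure out how to extract an element's value.
Search: puppeteer extract text
I then noticed that the official site
simply does it like this: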
```python
element = await page.querySelector('h1')
title = await page.evaluate('(element) => element.textContent', element)
```
Another option is:
```python
# elements = await page.xpath('//div[@class="title-box"]/a')
elements = await page.querySelectorAll(".title-box a")
for item in elements:
    print(await item.getProperty('textContent'))
    # <pyppeteer.execution_context.JSHandle object at 0x000002220E7FE518>

    # get the text
    title_str = await (await item.getProperty('textContent')).jsonValue()

    # get the link
    title_link = await (await item.getProperty('href')).jsonValue()
```
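The two-step pattern above (`getProperty` returns a `JSHandle`, `jsonValue` unwraps it into a Python value) can be wrapped in a small helper. This is a minimal sketch: `read_property` and `read_link` are names invented here, and `FakeJSHandle` / `FakeElementHandle` are hypothetical stand-ins that only mimic the two pyppeteer methods used, so the snippet runs without a browser.

```python
import asyncio


async def read_property(element, name):
    """Return the plain Python value of one DOM property of an element handle."""
    js_handle = await element.getProperty(name)  # a JSHandle wrapper, not a str
    return await js_handle.jsonValue()           # unwrap into a Python value


async def read_link(element):
    """Return (text, href) for an <a> element handle, with the text stripped."""
    text = await read_property(element, "textContent")
    href = await read_property(element, "href")
    return (text or "").strip(), href


# --- Hypothetical stand-ins, only so the sketch runs without pyppeteer ---
class FakeJSHandle:
    def __init__(self, value):
        self._value = value

    async def jsonValue(self):
        return self._value


class FakeElementHandle:
    def __init__(self, **props):
        self._props = props

    async def getProperty(self, name):
        return FakeJSHandle(self._props.get(name))


fake_link = FakeElementHandle(
    textContent="  crifan – 在路上 ",
    href="http://www.baidu.com/link?url=example",
)
result = asyncio.run(read_link(fake_link))
print(result)
```

With a real page, the same `read_link` would presumably be called on each handle returned by `page.querySelectorAll(...)`.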
Continue writing the code:
```python
searchResultANum = len(searchResultAList)
print("searchResultANum=%s" % searchResultANum)
for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    print("aTextJSHandle=%s" % aTextJSHandle)
    title = await aTextJSHandle.jsonValue()
    print("type(title)=%s" % type(title))
    print("title=%s" % title)

    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
```
Debugging led to this problem:
[Solved] page.querySelectorAll in pyppeteer returns no results at runtime
After that, the code:
```python
resultASelector = "h3[class^='t'] a"
searchResultAList = await page.querySelectorAll(resultASelector)
print("searchResultAList=%s" % searchResultAList)
searchResultANum = len(searchResultAList)
print("searchResultANum=%s" % searchResultANum)

for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    print("aTextJSHandle=%s" % aTextJSHandle)
    title = await aTextJSHandle.jsonValue()
    print("type(title)=%s" % type(title))
    print("title=%s" % title)

    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
```
ran correctly on the first try:
![](https://www.crifan.com/files/pic/uploads/2021/05/aacca79675704b24827907485f2a4ce9.jpg)
Output:
```
searchResultAList=[<pyppeteer.element_handle.ElementHandle object at 0x10309e860>, <pyppeteer.element_handle.ElementHandle object at 0x10309e278>, <pyppeteer.element_handle.ElementHandle object at 0x10309e0f0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0b00>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0710>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0198>, <pyppeteer.element_handle.ElementHandle object at 0x1030b06d8>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0160>, <pyppeteer.element_handle.ElementHandle object at 0x1030b0ef0>, <pyppeteer.element_handle.ElementHandle object at 0x1030b30b8>]
searchResultANum=10
-------------------- [1] --------------------
type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
type(title)=<class 'str'>
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=1l0yCSpJcbVYGRoCdaXUal3NIDZXTn9F1Q4XwZcd6KzpAmUVPLca_wIIqFVlaBs6
```
[Summary]
In the end, first use:
```python
################################################################################
# Wait page reload complete
################################################################################

SearchFoundWordsSelector = 'span.nums_text'
SearchFoundWordsXpath = "//span[@class='nums_text']"

# await page.waitForSelector(SearchFoundWordsSelector)
# await page.waitFor(SearchFoundWordsSelector)
# await page.waitForXPath(SearchFoundWordsXpath)
# Note: all of the above raise an exception: ElementHandleError Evaluation failed: TypeError: MutationObserver is not a constructor
#   so change to the following

# # Method 1: just wait
# await page.waitFor(2000) # millisecond

# Method 2: wait until the element shows up (needs `import asyncio` at the top of the script)
SingleWaitSeconds = 1
while not await page.querySelector(SearchFoundWordsSelector):
    print("Still not found %s, wait %s seconds" % (SearchFoundWordsSelector, SingleWaitSeconds))
    await asyncio.sleep(SingleWaitSeconds)
    # pass
```
to make sure the page content has finished loading.
Then use:
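The polling loop has no upper bound, so it would spin forever if `span.nums_text` never appears (for example, if Baidu shows a captcha page instead). A possible variant with a timeout is sketched below. The helper name `wait_for_element`, its `interval` / `timeout` parameters and the `FakePage` class are assumptions made for illustration; the only pyppeteer behavior relied on is that `page.querySelector` returns `None` when nothing matches.

```python
import asyncio
import time


async def wait_for_element(page, selector, interval=1.0, timeout=30.0):
    """Poll page.querySelector(selector) until it returns an element.

    Raises TimeoutError if nothing is found before `timeout` seconds elapse.
    """
    deadline = time.monotonic() + timeout
    while True:
        element = await page.querySelector(selector)
        if element is not None:
            return element
        if time.monotonic() >= deadline:
            raise TimeoutError("%s not found within %s seconds" % (selector, timeout))
        await asyncio.sleep(interval)


# --- Hypothetical stand-in page: the element "appears" after a few polls ---
class FakePage:
    def __init__(self, appear_after_calls):
        self.calls = 0
        self._appear_after = appear_after_calls

    async def querySelector(self, selector):
        self.calls += 1
        return "element" if self.calls > self._appear_after else None


fake_page = FakePage(appear_after_calls=2)
found = asyncio.run(
    wait_for_element(fake_page, "span.nums_text", interval=0.01, timeout=1.0)
)
print(found, fake_page.calls)
```

In the real script this would be awaited as `await wait_for_element(page, SearchFoundWordsSelector)` in place of the `while` loop.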
```python
################################################################################
# Extract result
################################################################################

resultASelector = "h3[class^='t'] a"
searchResultAList = await page.querySelectorAll(resultASelector)
# print("searchResultAList=%s" % searchResultAList)
searchResultANum = len(searchResultAList)
print("Found %s search result:" % searchResultANum)

for curIdx, aElem in enumerate(searchResultAList):
    curNum = curIdx + 1
    print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
    aTextJSHandle = await aElem.getProperty('textContent')
    # print("type(aTextJSHandle)=%s" % type(aTextJSHandle))
    # type(aTextJSHandle)=<class 'pyppeteer.execution_context.JSHandle'>
    # print("aTextJSHandle=%s" % aTextJSHandle)
    # aTextJSHandle=<pyppeteer.execution_context.JSHandle object at 0x10309c9b0>
    title = await aTextJSHandle.jsonValue()
    # print("type(title)=%s" % type(title))
    # type(title)=<class 'str'>
    print("title=%s" % title)

    baiduLinkUrl = await (await aElem.getProperty("href")).jsonValue()
    print("baiduLinkUrl=%s" % baiduLinkUrl)
```
to extract the desired results.
Output:
```
Found 10 search result:
-------------------- [1] --------------------
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=eGTzEXXlMw-hnvXYSFk8t4VSZPck1dougn7YhfCwBf3ZzGJEHdZYsoAQK-4GBJuP
-------------------- [2] --------------------
title=crifan – 在路上
baiduLinkUrl=http://www.baidu.com/link?url=l6jXejlgARrWj34ODgKWZ9BeNKwyYZLRhLb5B8oDFVqNpHoco8a_qbAdD1m-t_cf
-------------------- [3] --------------------
title=crifan简介_crifan的专栏 - CSDN博客_crifan
baiduLinkUrl=http://www.baidu.com/link?url=IIqPM5wuVE_QP7S357-1bJWGGU1kpFcAZ945BaXUQNpaDzXihf_98wAVi05Gk6-8Qu4aGLv2Rv65WJm6Qr5kk_
-------------------- [4] --------------------
title=crifan的微博_微博
baiduLinkUrl=http://www.baidu.com/link?url=NnqeMlu4Jr_Ld-zoui8pbQO4eRMMO9pLd_DHXagqcdZ46NF4CSuyEziKSTpqCNEi
-------------------- [5] --------------------
title=Crifan的电子书大全 | crifan.github.io
baiduLinkUrl=http://www.baidu.com/link?url=uOZ-AmgNBNr3mGdETezIjTvtedH_ueM6-LNOc2QxbjcNeS8LuVBY-kirwogX7qLl
-------------------- [6] --------------------
title=GitHub - crifan/crifanLib: crifan's library
baiduLinkUrl=http://www.baidu.com/link?url=t42I1rYfn32DGw9C6cw_5lB-z1worKzEuROOtWj-Jyf1l2IBNBcz-l85hSKv9s9T
-------------------- [7] --------------------
title=在路上www.crifan.com - 网站排行榜
baiduLinkUrl=http://www.baidu.com/link?url=WwLwfXA72vK08Obyx2hwqA3-wmq8jAisi4VVSt2R0Ml3ccCy_yxeYfxD2xouAX-i5AyUU1U_2EghwVbJ2p-ipa
-------------------- [8] --------------------
title=crifan的专栏_crifan_CSDN博客 - crifan领域博主
baiduLinkUrl=http://www.baidu.com/link?url=Cmcn2mXwiZr87FBGQBq85Np0hgGTP_AK2yLUW6GDeA21r7Q5WvUOUjaKZo5Jhb0f
-------------------- [9] --------------------
title=User crifan - Stack Overflow
baiduLinkUrl=http://www.baidu.com/link?url=yGgsq1z2vNDAAeWY-5VDWbHv7e7zPILHI4GVFPZd6MaFrGjYHsb3Onir1Vi6vvZqD7QAGJrZehIYZpcBfh_Gq_
-------------------- [10] --------------------
title=crifan - Bing 词典
baiduLinkUrl=http://www.baidu.com/link?url=UatxhUBL3T_1ikPco5OazvJaWkVqCeCHh4eoA6AX_lP4t_Bx3GVHlMHZjgu3YAwE
```
Result:
![](https://www.crifan.com/files/pic/uploads/2021/05/08f58af2b9514fe8ad874e40d1522d64.jpg)
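As a possible alternative to fetching `textContent` and `href` handle by handle, pyppeteer's `page.querySelectorAllEval` (its counterpart of Puppeteer's `$$eval`) can run one JavaScript function over all matched nodes and return plain data in a single call. The sketch below was not tested against Baidu. `collect_results`, `COLLECT_LINKS_JS` and the `FakePage` stand-in with its canned data are assumptions for illustration, so only the Python-side handling is exercised here.

```python
import asyncio

# Runs inside the page; receives the matched nodes as an array.
COLLECT_LINKS_JS = "(nodes) => nodes.map(n => ({title: n.textContent, url: n.href}))"


async def collect_results(page, selector="h3[class^='t'] a"):
    """Return a list of {'title', 'url'} dicts for all elements matching selector."""
    raw = await page.querySelectorAllEval(selector, COLLECT_LINKS_JS)
    return [
        {"title": (item.get("title") or "").strip(), "url": item.get("url")}
        for item in (raw or [])
    ]


# --- Hypothetical stand-in page returning canned data instead of a real DOM ---
class FakePage:
    def __init__(self, canned):
        self._canned = canned
        self.seen_selector = None

    async def querySelectorAllEval(self, selector, pageFunction, *args):
        self.seen_selector = selector
        return self._canned


fake_page = FakePage([
    {"title": " crifan – 在路上 ", "url": "http://www.baidu.com/link?url=example1"},
    {"title": "User crifan - Stack Overflow", "url": "http://www.baidu.com/link?url=example2"},
])
results = asyncio.run(collect_results(fake_page))
for idx, item in enumerate(results, start=1):
    print("[%d] title=%s url=%s" % (idx, item["title"], item["url"]))
```

Returning a list of dicts rather than printing makes it easier to save the results (for example as JSON) or post-process the Baidu redirect links later.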
Please credit the source when reposting: 在路上 » [Solved] Extracting information from Baidu search results with pyppeteer