最新消息:20210816 当前crifan.com域名已被污染,为防止失联,请关注(页面右下角的)公众号

【已解决】Selenium中Python解析百度搜索结果第一页获取标题列表

搜索 crifan 1612浏览 0评论
折腾:
【未解决】Mac中用Selenium自动操作浏览器实现百度搜索
期间,已经用Selenium实现了百度首页的输入并搜索,显示出搜索结果了:
接下来,想办法实现,解析出搜索结果的标题的列表
此处尽量用多种方式去获取到最终结果,以演示如何写代码解析网页内容。
其中一种,和此处最相关的是,用Selenium自带的函数。

然后去:
【已解决】Selenium中如何实现百度搜索结果标题元素的定位
用Selenium的代码是:
# get search result item list
searchResultAList = chromeDriver.find_elements_by_xpath("//h3[contains(@class, 't')]/a")
print("searchResultAList=%s" % searchResultAList)
for curIdx, curSearchResultAElem in enumerate(searchResultAList):
  print("%s [%d] %s" % ("-"*20, curIdx, "-"*20))
  aHref = curSearchResultAElem.get_attribute("href")
  print("aHref=%s" % aHref)
  aText = curSearchResultAElem.text
  print("aText=%s" % aText)
输出:
-------------------- [0] --------------------
aHref=http://www.baidu.com/link?url=LMF5vQH-QgOuEhaq5huV3bLHlwVSDbVwv2g6vUYJ9AjmaCyIWKuL8f1YR5uOGzUc
aText=在路上on the way - 走别人没走过的路,让别人有路可走
-------------------- [1] --------------------
aHref=http://www.baidu.com/link?url=n4QoZVrJ5gncFIpJZhRcdmoA-oEmVewHEriXaLesj_wuypw3ZGebZ8sgC56-3ueD
aText=crifan – 在路上
-------------------- [2] --------------------
aHref=https://www.baidu.com/link?url=N1OgXdaJPO9zLGVbKm5lrRD53sKIgPXnWg_4yMTtm0Do5kyKMHPNxMOHUXePGIgEuLTGc9LFgtxuHXZu1oTZba&wd=&eqid=eb6d0c370028fb9b00000004605df63e
aText=crifan简介_crifan的专栏-CSDN博客_crifan
-------------------- [3] --------------------
aHref=http://www.baidu.com/link?url=trh6Nvw5xKEgQIi5OMxz6Dpaol45mKCetSyFG6jspja4suH7tcgRrzFOvJTarNmW
aText=crifan的微博_微博
-------------------- [4] --------------------
aHref=http://www.baidu.com/link?url=rj2E9lqd9iFHWNV-9O-CXLOSMAJLazrFdp0-ERlbbVZyqKK_DzA21oBIV42W1NfQ
aText=Crifan的电子书大全 | crifan.github.io
-------------------- [5] --------------------
aHref=http://www.baidu.com/link?url=LgitikJywqZ5Cp-kCVddlAalVnhpUn7oRC_PRJlU_SB2NKPDSr4zcGpgsKanlx9S
aText=GitHub - crifan/crifanLib: crifan's library
-------------------- [6] --------------------
aHref=http://www.baidu.com/link?url=ag8E9gi5fxiiAetDLFkyFcRt0JDpYeUzT2JkJ19j-WjEY6qpYKsXCxN14pkDS0fYFH6fkIeOS0wl3u1diuVBPK
aText=在路上www.crifan.com - 网站排行榜
-------------------- [7] --------------------
aHref=http://www.baidu.com/link?url=sLYSlrlBaGvNq0iT1bAOFXWU1_owJB3Zpw35xI_esHFcHfToQ5J920ypHXOWBraj
aText=crifan的专栏_crifan_CSDN博客-crifan领域博主
-------------------- [8] --------------------
aHref=http://www.baidu.com/link?url=naLo4Rd4SAqiJ6PPtU6KAWJ9p5wNXnMwejFMcPoHuHwUUrlx2a2PRibCeFrR1yO1hcsDwFUXVVNIBJI03mHBca
aText=User crifan - Stack Overflow
-------------------- [9] --------------------
aHref=http://www.baidu.com/link?url=wm4YOCeoG-84H2glTjRfwGZ1JY9slAu1MeUtuAQVE9yKSK-14IeyeY1b-BfxWKH3
aText=crifan - Bing 词典
效果:
另外
心得和总结 · Selenium知识总结
  • get_attribute(name)
  • get_property(name)
回头都试试
其中可见,对于,从复杂的html代码解析出所需要的值,往往比较费精力
对此,其实Python中有更专业的库干这个:BeautifulSoup
接着去想办法看看,能否获取Selenium的当前页面的html,然后在用BeautifulSoup去解析获取所需的值
那去搞清楚:
【已解决】Selenium中如何获取到当前页面的html源码
然后去写代码:
# Method 2: use BeautifulSoup to extract title list
curHtml = chromeDriver.page_source
curSoup = BeautifulSoup(curHtml, 'html.parser')
beginTP = re.compile("^t.*")
searchResultH3List = curSoup.find_all("h3", {"class": beginTP})
print("searchResultH3List=%s" % searchResultH3List)
经过调试,是可以找到H3的元素的:
继续调试:
for curIdx, searchResultH3Item in enumerate(searchResultH3List):
  print("%s [%d] %s" % ("-"*20, curIdx, "-"*20))
  aElem = searchResultH3Item.find("a")
  print("aElem=%s" % aElem)
  baiduLinkUrl = aElem.attrs["href"]
  print("baiduLinkUrl=%s" % baiduLinkUrl)
  title = aElem.text
  print("title=%s" % title)
是可以的:
【总结】
至此,用BeautifulSoup的代码去解析出百度搜索结果的列表:
# Method 2: use BeautifulSoup to extract title list
curHtml = chromeDriver.page_source
curSoup = BeautifulSoup(curHtml, 'html.parser')
beginTP = re.compile("^t.*")
searchResultH3List = curSoup.find_all("h3", {"class": beginTP})
print("searchResultH3List=%s" % searchResultH3List)
for curIdx, searchResultH3Item in enumerate(searchResultH3List):
  print("%s [%d] %s" % ("-"*20, curIdx, "-"*20))
  aElem = searchResultH3Item.find("a")
  # print("aElem=%s" % aElem)
  baiduLinkUrl = aElem.attrs["href"]
  print("baiduLinkUrl=%s" % baiduLinkUrl)
  title = aElem.text
  print("title=%s" % title)
输出:
-------------------- [0] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=DVUbOETLyMZLC5c_V7RJReScFExnTjXjyTsO_QO_0rOL0vSE4mMNIPaZLH7iIaHI
title=在路上on the way - 走别人没走过的路,让别人有路可走
-------------------- [1] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=xA8mzlRBwfRb_I-PgUMj9_COWGmdEr-GcNo-DlxCqYzTKYsjqpLrmQImHO5X41Qy
title=crifan – 在路上
-------------------- [2] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=v8Doo53CgO-cFNYo-Wp2FKL8zfOxvuzhOmwSeTLzCqGA_AOjbYcjYdovqikkMmJifiQhJ6dLSMC_UW0VERBRma
title=crifan简介_crifan的专栏-CSDN博客_crifan
-------------------- [3] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=69wfHGVLYJIn71DQl_6aD9bf2LAthOALzmUxqZLgYKL_v44CcN7JPV0fZdsgDQnw
title=crifan的微博_微博
-------------------- [4] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=SMjZmBSy1a9rX7NH-vufC_7X2Q5aqYT1dZQKHpttphLiMkTfr6ZgRFeUT3K8PNW7
title=Crifan的电子书大全 | crifan.github.io
-------------------- [5] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=TH0Qi8mZBJO7jC1kHTPW9v1xAiSmC2TgDwWA2di1cX0Eph8cJr6wRQFDES61P_DN
title=GitHub - crifan/crifanLib: crifan's library
-------------------- [6] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=owqfQOM_pEdizGhyYOvBblTE5Z0qQTr3D23ndhxxoIS0K28x4f2xVYMJdb6jwRb30vZHpm1MQDunbkBczT3Vrq
title=在路上www.crifan.com - 网站排行榜
-------------------- [7] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=w2P8G7ENsLi9vs6gO5RTX-PH4d_nzPud16wY1Er2ouGTQ4caZODnyj4PY2dTh1rI
title=crifan的专栏_crifan_CSDN博客-crifan领域博主
-------------------- [8] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=Altkwc-vb6UWaZfpx3B5QRvBpeA6jvcvlmasdkl5-31FY8QmvI1YQaYlQwrrRT2h0QxoI4QCfGFgITJKORD0da
title=User crifan - Stack Overflow
-------------------- [9] --------------------
baiduLinkUrl=http://www.baidu.com/link?url=LIG9Iz3l1_GxXuk1-XgSQUzL49Rm4q7pTCekyI_ehU4yrSKCWsEc-c6ya598vsml
title=crifan - Bing 词典
效果:

转载请注明:在路上 » 【已解决】Selenium中Python解析百度搜索结果第一页获取标题列表

发表我的评论
取消评论

表情

Hi,您需要填写昵称和邮箱!

  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址
80 queries in 0.191 seconds, using 22.11MB memory