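While scraping data earlier, in:
[Solved] Autohome (汽车之家) car model and series data: supporting the old-style series pages
I later found that part of the data was missing.
Now let's look into the cause.
Take the brand 红旗 (Hongqi) as an example,
looking at the models currently on sale:
红旗H5 has 12 models on sale,
but the data for the 2020 model year was not scraped:
2020款 1.5T DCT旗悦版
Check the result data to see whether the 2020款 红旗H5 data is really missing
->
Sure enough:
the year here is null.
Opening the corresponding page
shows that it is the 2020 model.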
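A check like this can also be scripted instead of done by eye. The following is a minimal sketch, assuming the exported results are a list of [project, record, url] entries as in the JSON output shown later in this post; the function name find_missing_year and the sample records are made up for illustration and are not part of the crawler.

```python
def find_missing_year(results):
    """Return the carModelName of every record whose carModelYear is empty or None."""
    missing = []
    for project, record, url in results:
        if not record.get("carModelYear"):
            missing.append(record.get("carModelName"))
    return missing


# Hypothetical sample data in the [project, record, url] shape
sample_results = [
    ["autohome_20200819",
     {"carModelName": "2020款 1.5T DCT旗悦版", "carModelYear": None},
     "https://www.autohome.com.cn/spec/46112/"],
    ["autohome_20200819",
     {"carModelName": "2019款 (placeholder name)", "carModelYear": "2019款"},
     ""],
]

print(find_missing_year(sample_results))
```

Any name it prints belongs to a record whose year still needs to be filled in.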
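So I debugged to find the cause.
It turned out that, across several nested for loops, the car dict was assigned directly instead of being copied,
so the 2020款 data that had in fact already been scraped: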
------------------------------ [0] ------------------------------
carModelYear=2020款
carModelEmissionStandards=国VI
carModelPower=1.5T
carModelGearBox=7挡双离合
carModelName=2020款 1.5T DCT旗悦版
carModelSpecUrl=https://www.autohome.com.cn/spec/46112/#pvareaid=3454492
typeDefaultListDoc=<generator object PyQuery.items at 0x10ae5cba0>
typeDefaultList=[[<span.type-default>], [<span.type-default>]]
spanTypeDefault0=<span class="type-default">前置前驱</span>
carModelDriveType=前置前驱
spanTypeDefault1=<span class="type-default">7挡双离合</span>
carModelGearBox=7挡双离合
carModelMsrp=14.58万
was later overwritten and turned into the 2019款 data:
So I modified the code.
The core part:
carModelDict = copy.deepcopy(carSeriesDict)

for curLiIdx, eachHatADoc in enumerate(haltAListDoc):
    curHaltCarDict = copy.deepcopy(carModelDict)
    ...

for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapListDoc):
    print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30))
    curSpecWrapCarDict = copy.deepcopy(carModelDict)
    ...
    for curDlIdx, eachDlDoc in enumerate(dlDocList):
        print("%s [%d] %s" % ('='*30, curDlIdx, '='*30))
        curDlCarDict = copy.deepcopy(curSpecWrapCarDict)
        ...
        for curDdIdx, eachDdDoc in enumerate(ddListDoc):
            print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))
            curDdCarDict = copy.deepcopy(curDlCarDict)
            ...
            self.send_message(self.project_name, curDdCarDict, url=carModelSpecUrl)
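The aliasing problem that these deepcopy calls avoid can be reproduced outside the crawler. The sketch below is a simplified illustration, not the crawler's actual code path: the helper names collect_shared and collect_copied are invented here, and the dict holds only two keys.

```python
import copy


def collect_shared(base, years):
    """Buggy pattern: every iteration mutates and stores the same dict object."""
    collected = []
    for year in years:
        cur = base  # plain assignment: cur is just another name for base
        cur["carModelYear"] = year
        collected.append(cur)
    return collected


def collect_copied(base, years):
    """Fixed pattern: every iteration works on its own independent copy."""
    collected = []
    for year in years:
        cur = copy.deepcopy(base)
        cur["carModelYear"] = year
        collected.append(cur)
    return collected


series = {"carSeriesName": "红旗H5"}
years = ["2020款", "2019款"]

shared = [d["carModelYear"] for d in collect_shared(dict(series), years)]
copied = [d["carModelYear"] for d in collect_copied(dict(series), years)]

print(shared)  # ['2019款', '2019款'] : the 2020款 entry was overwritten
print(copied)  # ['2020款', '2019款']
```

With plain assignment, every stored entry points at the same object, so the last write wins. With copy.deepcopy each loop level keeps its own state, which is why each nested loop in the core part above starts from a copy of its parent's dict.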
After that, the car info is output correctly
and the 2020款 info is scraped:
[
  [
    "autohome_20200819",
    {
      "carBrandId": "91",
      "carBrandLogoUrl": "https://car3.autoimg.cn/cardfs/series/g26/M05/AE/94/100x100_f40_autohomecar__wKgHEVs9tm6ASWlTAAAUz_2mWTY720.png",
      "carBrandName": "红旗",
      "carMerchantName": "一汽红旗",
      "carMerchantUrl": "https://car.autohome.com.cn/price/brand-91-190.html#pvareaid=2042363",
      "carModelDriveType": "前置前驱",
      "carModelEmissionStandards": "国VI",
      "carModelGearBox": "7挡双离合",
      "carModelGroupName": "1.5升 涡轮增压 169马力 国VI",
      "carModelMsrp": "14.58万",
      "carModelName": "2020款 1.5T DCT旗悦版",
      "carModelPower": "1.5T",
      "carModelSpecUrl": "https://www.autohome.com.cn/spec/46112/#pvareaid=3454492",
      "carModelYear": "2020款",
      "carSeriesId": "4410",
      "carSeriesLevelId": "4",
      "carSeriesLevelName": "中型车",
      "carSeriesMainImgUrl": "https://car2.autoimg.cn/cardfs/product/g3/M04/92/40/380x285_0_q87_autohomecar__ChsEkV8G1BiAFN2JAAlzGHoYv9M868.jpg",
      "carSeriesMaxPrice": "19.08",
      "carSeriesMinPrice": "14.58",
      "carSeriesMsrp": "14.58-19.08万",
      "carSeriesMsrpUrl": "https://www.autohome.com.cn/4410/price.html#pvareaid=101446",
      "carSeriesName": "红旗H5",
      "carSeriesUrl": "https://www.autohome.com.cn/4410/#levelsource=000000000_0&pvareaid=101594"
    },
    "https://www.autohome.com.cn/spec/46112/#pvareaid=3454492"
  ],
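Re-running the whole crawl after this should then give complete data.
I also polished a few details, for example:
supporting pages where an electric car has 3 type-default spans instead of 2:
Code: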
if typeDefaultList:
    """
    Normal:
        <p>
            <span class="type-default">前置前驱</span>
            <span class="type-default">7挡双离合</span>
        </p>
    Special case:
        https://www.autohome.com.cn/4605/
        <p>
            <span class="type-default">电动</span>
            <span class="type-default">前置前驱</span>
            <span class="type-default">AMT(组合10挡)</span>
        </p>
    """
    # spanTypeDefault0 = typeDefaultList[0]
    spanTypeDefault0 = typeDefaultList[-2]
    print("spanTypeDefault0=%s" % spanTypeDefault0)
    carModelDriveType = spanTypeDefault0.text()
    print("carModelDriveType=%s" % carModelDriveType)

    # spanTypeDefault1 = typeDefaultList[1]
    spanTypeDefault1 = typeDefaultList[-1]
    print("spanTypeDefault1=%s" % spanTypeDefault1)
    carModelGearBox = spanTypeDefault1.text()
    print("carModelGearBox=%s" % carModelGearBox)
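Indexing from the end is what lets the same two lines handle both layouts: the drive type and the gearbox are always the last two spans, whether or not an extra leading span such as 电动 is present. Here is a small sketch of the idea on plain strings. The helper name pick_drive_type_and_gearbox is hypothetical, and the guard for fewer than two spans is an added assumption that the code above does not have (there, [-2] on a one-element list would raise IndexError).

```python
def pick_drive_type_and_gearbox(type_default_texts):
    """Return (driveType, gearBox) taken from the last two type-default texts.

    Returns two empty strings when fewer than two texts are available
    (an assumption made for this sketch).
    """
    if len(type_default_texts) < 2:
        return "", ""
    return type_default_texts[-2], type_default_texts[-1]


normal = ["前置前驱", "7挡双离合"]
electric = ["电动", "前置前驱", "AMT(组合10挡)"]

print(pick_drive_type_and_gearbox(normal))    # ('前置前驱', '7挡双离合')
print(pick_drive_type_and_gearbox(electric))  # ('前置前驱', 'AMT(组合10挡)')
```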
And also:
when the data-sift attributes are absent, so the model's year is empty,
extract the year from the model's name:
if not curDdCarDict["carModelYear"]:
    foundYearType = re.search("(?P<yearType>\d{4}款)", carModelName)
    if foundYearType:
        yearType = foundYearType.group("yearType")
        print("yearType=%s" % yearType)
        carModelYear = yearType
        print("extract year=%s from modelName=%s" % (carModelYear, carModelName))
        curDdCarDict["carModelYear"] = carModelYear
This extracts, from:
2019款 50T Pro
the year:
2019款
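The extraction step can be tried on its own. The following is a minimal standalone sketch using the same pattern, written as a raw string so that \d is not treated as a string escape; the function name extract_model_year is made up for this example.

```python
import re


def extract_model_year(model_name):
    """Return the 'NNNN款' year prefix found in a model name, or "" if there is none."""
    found = re.search(r"(?P<yearType>\d{4}款)", model_name)
    return found.group("yearType") if found else ""


print(extract_model_year("2019款 50T Pro"))  # 2019款
print(repr(extract_model_year("50T Pro")))   # ''
```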
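Along the way I also cleaned up the overall code structure:
every part that could be extracted into a function has been extracted,
which makes the logic easier to review later and easier to debug.
The final complete code: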
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-08-25 21:48:28
# Project: autohome_20200825

import string
import re
import copy
from lxml import etree
from pyspider.libs.base_handler import *

AutohomeHost = "https://www.autohome.com.cn"
CarSpecPrefix = "%s/spec" % AutohomeHost # "https://www.autohome.com.cn/spec/%s/"

class Handler(BaseHandler):
    UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"

    crawl_config = {
        "headers": {
            "User-Agent": UserAgent_Mac_Chrome,
        }
    }

    def genSpecUrl(self, specId):
        # return "%s/%s" % (CarSpecPrefix, specId)
        return "%s/%s/" % (CarSpecPrefix, specId)

    def genConfigSpecUrl(self, specId):
        configSpecTemplate = "https://car.autohome.com.cn/config/spec/%s.html"
        # https://car.autohome.com.cn/config/spec/43593.html
        return configSpecTemplate % specId

    def to10KPrice(self, originPrice):
        tenKPrice = ""
        # 19.08 / '19.08' -> '19.08万'
        if isinstance(originPrice, str):
            tenKPrice = "%s万" % originPrice
        elif isinstance(originPrice, float):
            tenKPrice = "%.2f万" % originPrice
        elif isinstance(originPrice, int):
            tenKPrice = "%s.00万" % originPrice
        return tenKPrice

    def extractSpecId(self, specUrl):
        carSpedId = ""
        # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
        # https://www.autohome.com.cn/spec/2304/
        foundSpecId = re.search("spec/(?P<specId>\d+)", specUrl)
        print("foundSpecId=%s" % foundSpecId)
        if foundSpecId:
            carSpedId = foundSpecId.group("specId")
            print("carSpedId=%s" % carSpedId)
        return carSpedId

    # @every(minutes=24 * 60)
    def on_start(self):
        # autohomeEntryUrl = "https://www.autohome.com.cn/car/"
        # self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
        for eachLetter in list(string.ascii_lowercase):
            letterUpper = eachLetter.upper()

            # # for debug
            # letterUpper = "H"

            print("letterUpper=%s" % letterUpper)
            self.crawl("https://www.autohome.com.cn/grade/carhtml/%s.html" % eachLetter,
                save={"initials": letterUpper},
callback=self.gradCarHtmlPage) # # @config(age=10 * 24 * 60 * 60) # def carBrandListCallback(self, response): # print("response.url=%s" % response.url) # # <div vos="gs" class="uibox" id="boxA" style=""> # for eachVosGs in response.doc('div[vos="gs"]').items(): # print("eachVosGs=%s" % eachVosGs) # # self.crawl(each.attr.href, callback=self.detail_page) # # @config(priority=2) # def detail_page(self, response): # return { # "url": response.url, # "title": response.doc('title').text(), # } @catch_status_code_error def gradCarHtmlPage(self, response): print("gradCarHtmlPage: response=", response) # picSeriesItemList = response.doc('.rank-list-ul li div a[href*="/pic/series"]').items() # print("picSeriesItemList=", picSeriesItemList) # print("len(picSeriesItemList)=%s"%(len(picSeriesItemList))) # for each in picSeriesItemList: # self.crawl(each.attr.href, callback=self.picSeriesPage) saveDict = response.save print("saveDict=", saveDict) initials = saveDict["initials"] print("initials=", initials) respText = response.text # print("respText=", respText) """ <dl id="33" olr="6"> <dt><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50" src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a> <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div> </dt> """ # brandDoc = response.doc('dl dt') # print("brandDoc=%s" % brandDoc) # brandListDoc = response.doc('dl[id and orl] dt') # dlListDoc = response.doc('dl[id and orl]').items() # dlListDoc = response.doc("dl[id*=''][orl*='']").items() # dlListDoc = response.doc("dl[orl*='']").items() # dlListDoc = response.doc("dl").items() # dlListDoc = response.doc("dl:regex(id, \d+)").items() # dlListDoc = response.doc("dl:regex(id,[0-9])").items() # dlListDoc = response.doc("dl[id]").items() dlListDoc = response.doc("dl[olr]").items() print("type(dlListDoc)=%s" % type(dlListDoc)) dlList = list(dlListDoc) 
print("len(dlList)=%s" % len(dlList)) print("dlList=%s" % dlList) for curBrandIdx, eachDlDoc in enumerate(dlList): print("%s [%d] %s" % ('#'*30, curBrandIdx, '#'*30)) dtDoc = eachDlDoc.find("dt") # print("dtDoc=%s" % dtDoc) # <a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50" src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a> brandLogoDoc = dtDoc.find('a img') # print("brandLogoDoc=%s" % brandLogoDoc) carBrandLogoUrl = brandLogoDoc.attr["src"] print("carBrandLogoUrl=%s" % carBrandLogoUrl) # <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div> brandNameDoc = dtDoc.find('div a') # print("brandNameDoc=%s" % brandNameDoc) carBrandName = brandNameDoc.text() print("carBrandName=%s" % carBrandName) # <div class="h3-tit"><a href="//car.autohome.com.cn/price/brand-33-9.html#pvareaid=2042363">一汽-大众奥迪</a></div> # merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items() # ddDoc = eachDlDoc.find("dd") ddDoc = eachDlDoc.find("dd") # print("ddDoc=%s" % ddDoc) merchantDocGenerator = ddDoc.items("div[class='h3-tit'] a") merchantDocList = list(merchantDocGenerator) # print("merchantDocList=%s" % merchantDocList) merchantDocLen = len(merchantDocList) print("merchantDocLen=%s" % merchantDocLen) # <ul class="rank-list-ul" 0> # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']") # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items() merchantRankDocGenerator = ddDoc.items("ul[class='rank-list-ul']") merchantRankDocList = list(merchantRankDocGenerator) # print("merchantRankDocList=%s" % merchantRankDocList) merchantRankDocListLen = len(merchantRankDocList) print("merchantRankDocListLen=%s" % merchantRankDocListLen) for curIdx, merchantItem in enumerate(merchantDocList): # for curIdx, merchantItem in enumerate(merchantDocGenerator): # print("%s" % "="*80) print("%s [%d] %s" % ('='*30, 
curIdx, '='*30)) # print("type(merchantItem)=%s" % type(merchantItem)) # print("[%d] merchantItem=%s" % (curIdx, merchantItem)) # print("[%d] merchantItem=%s" % (curIdx, merchantItem)) carMerchantName = merchantItem.text() print("carMerchantName=%s" % carMerchantName) merchantItemAttr = merchantItem.attr # print("merchantItemAttr=%s" % merchantItemAttr) carMerchantUrl = merchantItemAttr["href"] print("carMerchantUrl=%s" % carMerchantUrl) # curSubBrandDict = { # "brandName": brandName, # "carBrandLogoUrl": carBrandLogoUrl, # "carMerchantName": carMerchantName, # "carMerchantUrl": carMerchantUrl, # } # self.send_message(self.project_name, curSubBrandDict, url=carMerchantUrl) merchantRankDoc = merchantRankDocList[curIdx] # print("merchantRankDoc=%s" % merchantRankDoc) # print("type(merchantRankDoc)=%s" % type(merchantRankDoc)) # type(merchantRankDoc)=<class 'lxml.html.HtmlElement'> # merchantRankHtml = etree.tostring(merchantRankDoc) # type(merchantRankDoc)=<class 'pyquery.pyquery.PyQuery'> # merchantRankHtml = merchantRankDoc.html() # print("merchantRankHtml=%s" % merchantRankHtml) # <li id="s3170"> # carSeriesDocGenerator = merchantRankDoc.find("li") # carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']") carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']") # print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator)) carSeriesDocList = list(carSeriesDocGenerator) # print("type(carSeriesDocList)=%s" % type(carSeriesDocList)) # print("carSeriesDocList=%s" % carSeriesDocList) carSeriesDocListLen = len(carSeriesDocList) # print("carSeriesDocListLen=%s" % carSeriesDocListLen) for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList): print("%s [%d] %s" % ('-'*30, curSeriesIdx, '-'*30)) # print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc)) # print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'> # <h4><a 
href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4> carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a") # print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc)) # print("carSeriesInfoDoc=%s" % carSeriesInfoDoc) carSeriesName = carSeriesInfoDoc.text() print("carSeriesName=%s" % carSeriesName) carSeriesUrl = carSeriesInfoDoc.attr.href print("carSeriesUrl=%s" % carSeriesUrl) # <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div> # 厂商指导价=厂商建议零售价格=MSRP=Manufacturer's Suggested Retail Price # carSeriesMsrpDoc = eachCarSeriesDoc.find("div a") carSeriesMsrpDoc = eachCarSeriesDoc.find("div a[class='red']") # print("carSeriesMsrpDoc=%s" % carSeriesMsrpDoc) carSeriesMsrp = carSeriesMsrpDoc.text() print("carSeriesMsrp=%s" % carSeriesMsrp) carSeriesMsrpUrl = carSeriesMsrpDoc.attr.href print("carSeriesMsrpUrl=%s" % carSeriesMsrpUrl) carSeriesDict = { "carBrandName": carBrandName, "carBrandLogoUrl": carBrandLogoUrl, "carMerchantName": carMerchantName, "carMerchantUrl": carMerchantUrl, "carSeriesName": carSeriesName, "carSeriesUrl": carSeriesUrl, "carSeriesMsrp": carSeriesMsrp, "carSeriesMsrpUrl": carSeriesMsrpUrl, } # self.send_message(self.project_name, carSeriesDict, url=carSeriesUrl) self.crawl(carSeriesUrl, callback=self.carSeriesDetailPage, save=carSeriesDict, ) def on_message(self, project, msg): print("on_message: msg=%s" % msg) return msg @catch_status_code_error def carSeriesDetailPage(self, response): carSeriesDict = response.save print("carSeriesDict=%s" % carSeriesDict) carSeriesUrl = response.url print("carSeriesUrl=%s" % carSeriesUrl) carSeriesMainImgUrl = "" carSeriesId = "" carSeriesLevelId = "" carSeriesMsrp = "" carSeriesMinPrice = "" carSeriesMaxPrice = "" carSeriesHtml = response.text print("type(carSeriesHtml)=%s" % type(carSeriesHtml)) # <class 'str'> # print("carSeriesHtml=%s" % carSeriesHtml) foundLevelId = re.search("var\s+levelid\s+=", carSeriesHtml) 
print("foundLevelId=%s" % foundLevelId) isNewLayoutHtml = bool(foundLevelId) print("isNewLayoutHtml=%s" % isNewLayoutHtml) foundShowCityId = re.search("var\s+showCityId\s+=", carSeriesHtml) print("foundShowCityId=%s" % foundShowCityId) isOldLayoutHtml = bool(foundShowCityId) print("isOldLayoutHtml=%s" % isOldLayoutHtml) if isOldLayoutHtml: # Q开头 # https://www.autohome.com.cn/grade/carhtml/q.html # -> # 东风悦达起亚-千里马 # https://www.autohome.com.cn/142/#levelsource=000000000_0&pvareaid=101594 # 其他: # # 一汽丰田-花冠 # https://www.autohome.com.cn/109/#levelsource=000000000_0&pvareaid=101594 # # 昶洧-昶洧 SUV # https://www.autohome.com.cn/4550/#levelsource=000000000_0&pvareaid=101594 """ <div class="car_detail " id="tab1-2"> <div class="models"> <!--年代--> <div class="header"> <div class="car_price"> <span class="years">2005款</span> <span class="price">指导价(停售):<strong class="red">6.28万-9.18万</strong></span> <span class="price">二手车价格:<strong class="red"><a class='cd60000' href='//www.che168.com/china/qiya/qianlima/a0_0msdgscncgpiltocsp1exs276/?pvareaid=103693'>0.39万-1.30万</a></strong></span> 。。。 <div class="car_detail current" id="tab1-1"> <div class="models"> <!--年代--> <div class="header"> <div class="car_price"> <span class="years">2006款</span> <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span> 。。。 """ carDetailDivGenerator = response.doc("div[class^='car_detail']").items() print("carDetailDivGenerator=%s" % carDetailDivGenerator) carDetailDivList = list(carDetailDivGenerator) print("carDetailDivList=%s" % carDetailDivList) for curDivIdx, eachCarDetailDoc in enumerate(carDetailDivList): print("%s [%d] %s" % ('#'*30, curDivIdx, '#'*30)) if curDivIdx == 0: # use first car model as series: main img, msrp, ... 
""" <div class="models_info"> <dl class='models_pics'> <dt><a href='//car.autohome.com.cn/photolist/series/2305/23796.html?pvareaid=101468'><img src='https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg' width='240' height='180' /></a></dt> """ # modelMainImgDocListGenerator = response.doc("div[class='models_info'] dl[class='models_pics'] dt a img").items() # modelMainImgDocList = list(modelMainImgDocListGenerator) # firstModelMainImgDoc = modelMainImgDocList[0] firstModelMainImgDoc = eachCarDetailDoc.find("div[class='models_info'] dl[class='models_pics'] dt a img") firstModelMainImgUrl = firstModelMainImgDoc.attr["src"] print("firstModelMainImgUrl=%s" % firstModelMainImgUrl) carSeriesMainImgUrl = firstModelMainImgUrl print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl) carSeriesDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl # <div class="car_price"> # <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span> carPriceStrongDocGenerator = eachCarDetailDoc.items("div[class='car_price'] span[class='price'] strong[class='red']") print("carPriceStrongDocGenerator=%s" % carPriceStrongDocGenerator) if carPriceStrongDocGenerator: carPriceStrongDocList = list(carPriceStrongDocGenerator) print("carPriceStrongDocList=%s" % carPriceStrongDocList) carPriceStrongDoc = carPriceStrongDocList[0] print("carPriceStrongDoc=%s" % carPriceStrongDoc) carPriceMinMax = carPriceStrongDoc.text() print("carPriceMinMax=%s" % carPriceMinMax) if carPriceMinMax: foundMinMax = re.search("(?P<minPrice>[\d\.]+)万-(?P<maxPrice>[\d\.]+)万", carPriceMinMax) print("foundMinMax=%s" % foundMinMax) if foundMinMax: minPrice = foundMinMax.group("minPrice") print("minPrice=%s" % minPrice) minPriceFloat = float(minPrice) print("minPriceFloat=%s" % minPriceFloat) maxPrice = foundMinMax.group("maxPrice") print("maxPrice=%s" % maxPrice) maxPriceFloat = float(maxPrice) print("maxPriceFloat=%s" % maxPriceFloat) averageMsrcPrice = (minPriceFloat + maxPriceFloat) / 2.0 
print("averageMsrcPrice=%s" % averageMsrcPrice) # carSeriesMsrp = "%.2f万" % averageMsrcPrice carSeriesMsrp = self.to10KPrice(averageMsrcPrice) print("carSeriesMsrp=%s" % carSeriesMsrp) # carSeriesMinPrice = "%.2f万" % minPriceFloat carSeriesMinPrice = self.to10KPrice(minPriceFloat) print("carSeriesMinPrice=%s" % carSeriesMinPrice) # carSeriesMaxPrice = "%.2f万" % maxPriceFloat carSeriesMaxPrice = self.to10KPrice(maxPriceFloat) print("carSeriesMaxPrice=%s" % carSeriesMaxPrice) carSeriesDict["carSeriesMsrp"] = carSeriesMsrp carSeriesDict["carSeriesMinPrice"] = carSeriesMinPrice carSeriesDict["carSeriesMaxPrice"] = carSeriesMaxPrice print("") self.processSingleCarDetailDiv(carSeriesDict, eachCarDetailDoc) elif isNewLayoutHtml: carModelDict = copy.deepcopy(carSeriesDict) # carSeriesUrl=https://www.autohome.com.cn/2123/#levelsource=000000000_0&pvareaid=101594 foundSeriesId = re.search("www\.autohome\.com\.cn/(?P<seriesId>\d+)/", carSeriesUrl) carSeriesId = foundSeriesId.group("seriesId") # carSeriesId = int(carSeriesId) print("carSeriesId=%s" % carSeriesId) # 2123 carModelDict["carSeriesId"] = carSeriesId """ <div class="information-pic"> <div class="pic-main"> 。。。 <picture> 。。。 <img sizes="380px" width="380" height="285" src="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg" srcset="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 380w, //car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/760x570_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 760w"> </picture> """ mainImgDoc = response.doc("div[class='information-pic'] div[class='pic-main'] picture img") print("mainImgDoc=%s" % mainImgDoc) carSeriesMainImgUrl = mainImgDoc.attr["src"] print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl) carModelDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl """ <script type="text/javascript"> 。。。 var seriesid = '2123'; var seriesname='哈弗H6'; var yearid = '0'; var 
brandid = '181'; var levelid = '17'; var levelname='紧凑型SUV'; var fctid = '4'; var SeriesMinPrice='9.80'; var SeriesMaxPrice='14.10'; """ infoKeyList = [ "seriesid", # "seriesname", # has got # "yearid", # no need "brandid", "levelid", "levelname", # "fctid", # unknown meaning "SeriesMinPrice", "SeriesMaxPrice", ] InfoDict = {} for eachInfoKey in infoKeyList: curPattern = "var\s+%s\s*=\s*'(?P<infoValue>[^']+)'\s*;" % eachInfoKey print("curPattern=%s" % curPattern) foundInfo = re.search(curPattern, carSeriesHtml) print("foundInfo=%s" % foundInfo) # if foundInfo: infoValue = foundInfo.group("infoValue") print("infoValue=%s" % infoValue) InfoDict[eachInfoKey] = infoValue print("InfoDict=%s" % InfoDict) # if "seriesid" in InfoDict: carSeriesId = InfoDict["seriesid"] # 2123 carModelDict["carSeriesId"] = carSeriesId # carModelDict["carSeriesName"] = InfoDict["seriesname"] # 哈弗H6 # if "brandid" in InfoDict: carModelDict["carBrandId"] = InfoDict["brandid"] # 181 # if "levelid" in InfoDict: carSeriesLevelId = InfoDict["levelid"] # 17 carModelDict["carSeriesLevelId"] = carSeriesLevelId # if "levelname" in InfoDict: carModelDict["carSeriesLevelName"] = InfoDict["levelname"] # 紧凑型SUV # if "SeriesMinPrice" in InfoDict: carSeriesMinPrice = InfoDict["SeriesMinPrice"] # 9.80 carModelDict["carSeriesMinPrice"] = self.to10KPrice(carSeriesMinPrice) # if "SeriesMaxPrice" in InfoDict: carSeriesMaxPrice = InfoDict["SeriesMaxPrice"] # 14.10 carModelDict["carSeriesMaxPrice"] = self.to10KPrice(carSeriesMaxPrice) """ <div class="series-list"> 。。。 <li class="more-dropdown"> <a href="javascript:void(0);" target="_self" data-toggle="tab" class="tab-disabled" data-target="#specWrap-3">停售款 <i class="athm-iconfont athm-iconfont-arrowdown"></i></a> <ul class="dropdown-con" id="haltList"> <li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="11691">2019款</a></li> ... 
<li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="3100">2011款</a></li> </ul> </li> """ haltADocGenerator = response.doc("li[class='more-dropdown'] ul[id='haltList'] li a").items() print("type(haltADocGenerator)=%s" % type(haltADocGenerator)) print("haltADocGenerator=%s" % haltADocGenerator) haltADocList = list(haltADocGenerator) print("haltADocList=%s" % haltADocList) for curLiIdx, eachHatADoc in enumerate(haltADocList): print("%s [%d] %s" % ('%'*30, curLiIdx, '%'*30)) self.processSingleHaltA(carModelDict, eachHatADoc) # """ # <div class="information-summary"> # <dl class="information-price"> # ... # <dd class="type"> # <span class="type__item">紧凑型车</span> # """ # carLevelDoc = response.doc("div[class='information-summary'] dl[class='information-price'] dd[class='type'] span[class='type__item']").eq(0) # print("carLevelDoc=%s" % carLevelDoc) # carSeriesLevelName = carLevelDoc.text() # print("carSeriesLevelName=%s" % carSeriesLevelName) # carModelDict["carSeriesLevelName"] = carSeriesLevelName carSeriesContentDoc = response.doc("div[class='series-content']") # print("carSeriesContentDoc=%s" % carSeriesContentDoc) # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap']") # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap active']") carSpecWrapDocGenerator = carSeriesContentDoc.items("div[class^='spec-wrap']") print("carSpecWrapDocGenerator=%s" % carSpecWrapDocGenerator) carSpecWrapDocList = list(carSpecWrapDocGenerator) print("carSpecWrapDocList=%s" % carSpecWrapDocList) for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapDocList): print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30)) self.processSingleSpecWrapDiv(carModelDict, eachSpecWrapDoc) def processSingleCarDetailDiv(self, carSeriesDict, curCarDetailDoc): print("in processSingleCarDetailDiv") curCarModelGroupDict = copy.deepcopy(carSeriesDict) # <span class="years">2006款</span> modelYearDoc = curCarDetailDoc.find("span[class='years']") 
print("modelYearDoc=%s" % modelYearDoc) carModelYear = modelYearDoc.text() print("carModelYear=%s" % carModelYear) curCarModelGroupDict["carModelYear"] = carModelYear """ <div class="modelswrap"> <!-- 信息 start --> <div class="models_info"> <dl class='models_prop'> <dt>发动机:</dt> <dd><span>1.3L</span><span>1.6L</span></dd> </dl> <dl class='models_prop'> <dt>变速箱:</dt> <dd><span>手动</span><span>自动</span></dd> <dt>车身结构:</dt> <dd><span>三厢</span></dd> </dl> """ # modelsPropDdList = curCarDetailDoc.find("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd") modelsPropDdGenerator = curCarDetailDoc.items("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd") print("modelsPropDdGenerator=%s" % modelsPropDdGenerator) modelsPropDdList = list(modelsPropDdGenerator) print("modelsPropDdList=%s" % modelsPropDdList) engineValueDoc = modelsPropDdList[0] print("engineValueDoc=%s" % engineValueDoc) engineValue = engineValueDoc.text() print("engineValue=%s" % engineValue) gearBoxValueDoc = modelsPropDdList[1] print("gearBoxValueDoc=%s" % gearBoxValueDoc) gearBoxValue = gearBoxValueDoc.text() print("gearBoxValue=%s" % gearBoxValue) bodyStructureValueDoc = modelsPropDdList[2] print("bodyStructureValueDoc=%s" % bodyStructureValueDoc) bodyStructureValue = bodyStructureValueDoc.text() print("bodyStructureValue=%s" % bodyStructureValue) carModelGearBox = gearBoxValue print("carModelGearBox=%s" % carModelGearBox) curCarModelGroupDict["carModelGearBox"] = carModelGearBox # 手动自动 curCarModelGroupDict["carModelDriveType"] = "" curCarModelGroupDict["carModelEmissionStandards"] = "" carModelPower = engineValue print("carModelPower=%s" % carModelPower) curCarModelGroupDict["carModelPower"] = carModelPower carModelGroupName = "%s %s %s" % (engineValue, gearBoxValue, bodyStructureValue) print("carModelGroupName=%s" % carModelGroupName) curCarModelGroupDict["carModelGroupName"] = carModelGroupName """ <table class='models_tab tableline' cellspacing='0' 
cellpadding='0' border='0'> <tr> <td class='name_d'> <div class='name'><a title='2006款 1.6L MT特别版GL' href='spec/2304/'>2006款 1.6L MT特别版GL</a></div> </td> <td class='price_d'> <div class='price01'>8.18万</div> </td> """ modelsTrDocGenerator = curCarDetailDoc.items("table[class^='models_tab'] tr") print("modelsTrDocGenerator=%s" % modelsTrDocGenerator) modelsTrDocList = list(modelsTrDocGenerator) print("modelsTrDocList=%s" % modelsTrDocList) for curTabIdx, eachModelTrDoc in enumerate(modelsTrDocList): print("%s [%d] %s" % ('='*30, curTabIdx, '='*30)) self.processSingleModelsTr(curCarModelGroupDict, eachModelTrDoc) def processSingleModelsTr(self, curCarModelGroupDict, curModelTrDoc): curTrCarModeDict = copy.deepcopy(curCarModelGroupDict) print("curModelTrDoc=%s" % curModelTrDoc) nameADoc = curModelTrDoc.find("td[class='name_d'] div[class='name'] a") print("nameADoc=%s" % nameADoc) carModelName = nameADoc.text() print("carModelName=%s" % carModelName) carModelSpecUrl = nameADoc.attr["href"] # bug -> wrong url: # https://www.autohome.com.cn/142/spec/2304/ # need repace # https://www.autohome.com.cn/142/spec/2304/ # to # https://www.autohome.com.cn/spec/2304/ foundSpecId = re.search("spec/(?P<specId>\d+)", carModelSpecUrl) carModelSpecId = foundSpecId.group("specId") print("carModelSpecId=%s" % carModelSpecId) # 2304 carModelSpecUrl = self.genSpecUrl(carModelSpecId) print("carModelSpecUrl=%s" % carModelSpecUrl) priceDivDoc = curModelTrDoc.find("td[class='price_d'] div[class='price01']") print("priceDivDoc=%s" % priceDivDoc) carModelMsrp = priceDivDoc.text() print("carModelMsrp=%s" % carModelMsrp) if "暂无" in carModelMsrp: carModelMsrp = "" print("carModelMsrp=%s" % carModelMsrp) curTrCarModeDict["carModelName"] = carModelName curTrCarModeDict["carModelSpecUrl"] = carModelSpecUrl curTrCarModeDict["carModelMsrp"] = carModelMsrp self.send_message(self.project_name, curTrCarModeDict, url=carModelSpecUrl) # self.processCarSpecConfig(curTrCarModeDict) def 
processSingleHaltA(self, carModelDict, curHatADoc): curHaltCarDict = copy.deepcopy(carModelDict) print("curHatADoc=%s" % curHatADoc) yearName = curHatADoc.text() print("yearName=%s" % yearName) yearId = curHatADoc.attr["data-yearid"] print("yearId=%s" % yearId) # getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (curHaltCarDict["carSeriesId"], yearId, curHaltCarDict["carSeriesLevelId"]) carSeriesId = curHaltCarDict["carSeriesId"] carSeriesLevelId = curHaltCarDict["carSeriesLevelId"] if carSeriesId and carSeriesLevelId: getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carSeriesId, yearId, carSeriesLevelId) # https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=2123&syearid=10379&levelid=17 print("getHaltSpecUrl=%s" % getHaltSpecUrl) self.crawl(getHaltSpecUrl, callback=self.haltCarSpecCallback, save=curHaltCarDict, ) def processSingleSpecWrapDiv(self, curCarModelDict, curSpecWrapDoc): curSpecWrapCarDict = copy.deepcopy(curCarModelDict) # print("curSpecWrapDoc=%s" % curSpecWrapDoc) """ <!--即将上市 start--> <div class="spec-wrap active" id="specWrap-1"> <dl class="halt-spec"> <dt> <div class="spec-name"> <span>参数配置未公布</span> </div> <dl class="halt-spec"> <dt> <div class="spec-name"> <span>1.5升 涡轮增压 169马力 国VI</span> </div> """ # dlDoc = curSpecWrapDoc.find("dl[class='']") # dlDoc = curSpecWrapDoc.find("dl") dlListDocGenerator = curSpecWrapDoc.items("dl") print("dlListDocGenerator=%s" % dlListDocGenerator) dlDocList = list(dlListDocGenerator) print("dlDocList=%s" % dlDocList) for curDlIdx, eachDlDoc in enumerate(dlDocList): print("%s [%d] %s" % ('='*30, curDlIdx, '='*30)) self.processSingleSpecDl(curSpecWrapCarDict, eachDlDoc) def processSingleSpecDl(self, curSpecWrapCarDict, curDlDoc): curDlCarDict = copy.deepcopy(curSpecWrapCarDict) # print("curDlDoc=%s" % curDlDoc) """ <dt> <div class="spec-name"> <span>1.5升 涡轮增压 169马力 
        国VI</span>
        """
        dtDoc = curDlDoc.find("dt")
        # print("dtDoc=%s" % dtDoc)
        groupSpecNameSpanDoc = dtDoc.find("div[class='spec-name'] span")
        print("groupSpecNameSpanDoc=%s" % groupSpecNameSpanDoc)
        carModelGroupName = ""
        if groupSpecNameSpanDoc:
            carModelGroupName = groupSpecNameSpanDoc.text()
        print("carModelGroupName=%s" % carModelGroupName)
        curDlCarDict["carModelGroupName"] = carModelGroupName

        # <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
        ddListDoc = curDlDoc.items("dd")
        print("ddListDoc=%s" % ddListDoc)
        for curDdIdx, eachDdDoc in enumerate(ddListDoc):
            print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))
            self.processSingleSiftDd(curDlCarDict, eachDdDoc)

    def processSingleSiftDd(self, curDlCarDict, curDdDoc):
        print("in processSingleSiftDd")
        # copy first, so that each <dd> (one car model) gets its own dict
        curDdCarDict = copy.deepcopy(curDlCarDict)

        curDdAttr = curDdDoc.attr
        """
        Normal:
            <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
            ...
        Special case, no sift attributes:
            <dd data-electricspecid="47050">
        """
        # print("curDdAttr=%s" % curDdAttr)
        carModelYear = curDdAttr["data-sift1"]
        print("carModelYear=%s" % carModelYear)
        carModelEmissionStandards = curDdAttr["data-sift2"]
        print("carModelEmissionStandards=%s" % carModelEmissionStandards)
        carModelPower = curDdAttr["data-sift3"]
        print("carModelPower=%s" % carModelPower)
        carModelGearBox = curDdAttr["data-sift4"]
        print("carModelGearBox=%s" % carModelGearBox)
        curDdCarDict["carModelYear"] = carModelYear
        curDdCarDict["carModelEmissionStandards"] = carModelEmissionStandards
        curDdCarDict["carModelPower"] = carModelPower
        curDdCarDict["carModelGearBox"] = carModelGearBox

        """
        <div class="spec-name">
            <div class="name-param">
                <p data-gcjid="41511" id="spec_41511">
                    <a href="/spec/41511/#pvareaid=3454492" class="name">2020款 1.5GDIT 自动铂金舒适版</a>
                    <span class="athm-badge athm-badge--grey is-plain">停产在售</span>
                    <span class="athm-badge athm-badge--orange">特惠</span></p>
                <p><span class="type-default">前置前驱</span><span class="type-default">7挡双离合</span></p>
            </div>
        </div>
        """
        specNameDoc = curDdDoc.find("div[class='spec-name']")
        # print("specNameDoc=%s" % specNameDoc)
        specADoc = specNameDoc.find("p a[class='name']")
        # print("specADoc=%s" % specADoc)
        carModelName = specADoc.text()
        print("carModelName=%s" % carModelName) # 2020款 1.5GDIT 自动铂金舒适版
        carModelSpecUrl = specADoc.attr["href"]
        print("carModelSpecUrl=%s" % carModelSpecUrl) # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492

        typeDefaultListDoc = specNameDoc.items("p span[class='type-default']")
        print("typeDefaultListDoc=%s" % typeDefaultListDoc)
        typeDefaultList = list(typeDefaultListDoc)
        print("typeDefaultList=%s" % typeDefaultList)
        carModelDriveType = ""
        carModelGearBox = ""
        if typeDefaultList:
            """
            Normal:
                <p>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">7挡双离合</span>
                </p>
            Special case: https://www.autohome.com.cn/4605/
                <p>
                    <span class="type-default">电动</span>
                    <span class="type-default">前置前驱</span>
                    <span class="type-default">AMT(组合10挡)</span>
                </p>
            """
            # index from the end, so an extra leading span (e.g. 电动) does not shift the fields
            # spanTypeDefault0 = typeDefaultList[0]
            spanTypeDefault0 = typeDefaultList[-2]
            print("spanTypeDefault0=%s" % spanTypeDefault0)
            carModelDriveType = spanTypeDefault0.text()
            print("carModelDriveType=%s" % carModelDriveType)
            # spanTypeDefault1 = typeDefaultList[1]
            spanTypeDefault1 = typeDefaultList[-1]
            print("spanTypeDefault1=%s" % spanTypeDefault1)
            carModelGearBox = spanTypeDefault1.text()
            print("carModelGearBox=%s" % carModelGearBox)

        curDdCarDict["carModelName"] = carModelName
        if not curDdCarDict["carModelYear"]:
            # no data-sift1: fall back to extracting the year from the model name
            foundYearType = re.search(r"(?P<yearType>\d{4}款)", carModelName)
            if foundYearType:
                yearType = foundYearType.group("yearType")
                print("yearType=%s" % yearType)
                carModelYear = yearType
                print("extract year=%s from modelName=%s" % (carModelYear, carModelName))
                curDdCarDict["carModelYear"] = carModelYear

        curDdCarDict["carModelSpecUrl"] = carModelSpecUrl
        curDdCarDict["carModelDriveType"] = carModelDriveType # 前置前驱
        curDdCarDict["carModelGearBox"] = carModelGearBox # 7挡双离合

        """
        <div class="spec-guidance">
            <p class="guidance-price">
                <span>10.40万</span>
                <a href="//j.autohome.com.cn/pc/carcounter?type=1&specId=41511&pvareaid=3454617"><i class="athm-iconpng athm-iconpng-calculator"></i></a>
            </p>
        </div>

        <div class="spec-guidance">
            <p class="guidance-price">
                <span><span>暂无</span></span>
        """
        specGuidanceDoc = curDdDoc.find("div[class='spec-guidance']")
        # print("specGuidanceDoc=%s" % specGuidanceDoc)
        guidancePriceSpanDoc = specGuidanceDoc.find("p[class='guidance-price'] span")
        # print("guidancePriceSpanDoc=%s" % guidancePriceSpanDoc)
        carModelMsrp = guidancePriceSpanDoc.text()
        print("carModelMsrp=%s" % carModelMsrp)
        if "暂无" in carModelMsrp:
            # 暂无 = no price available yet
            carModelMsrp = ""
            print("carModelMsrp=%s" % carModelMsrp)
        curDdCarDict["carModelMsrp"] = carModelMsrp

        self.send_message(self.project_name, curDdCarDict, url=carModelSpecUrl)
        # self.processCarSpecConfig(curDdCarDict)

    @catch_status_code_error
    def haltCarSpecCallback(self, response):
        prevCarModelDict = response.save
        carModelDict = copy.deepcopy(prevCarModelDict)
        print("carModelDict=%s" % carModelDict)

        respJson = response.json
        print("respJson=%s" % respJson)
        """
        [
          {
            "name": "1.5升 涡轮增压 169马力",
            "speclist": [
              {
                "specid": 36955,
                "specname": "2019款 红标 1.5GDIT 自动舒适版",
                "specstate": 40,
                "minprice": 102000,
                "maxprice": 102000,
                "fueltype": 1,
                "fueltypedetail": 1,
                "driveform": "前置前驱",
                "drivetype": "前驱",
                "gearbox": "7挡双离合",
                "evflag": "",
                "newcarflag": "",
                "subsidy": "",
                "paramisshow": 1,
                "videoid": 0,
                "link2sc": "http://www.che168.com/china/hafu/hafuh6/7_8/",
                "price2sc": "7.58万",
                "price": "10.20万",
                "syear": 2019
              },
              {
                "specid": 36956,
                "specname": "2019款 红标 1.5GDIT 自动都市版",
                "specstate": 40,
                "minprice": 109000,
                "maxprice": 109000,
                "fueltype": 1,
                "fueltypedetail": 1,
                "driveform": "前置前驱",
                "drivetype": "前驱",
                "gearbox": "7挡双离合",
                "evflag": "",
                "newcarflag": "",
                "subsidy": "",
                "paramisshow": 1,
                "videoid": 0,
                "link2sc": "",
                "price2sc": "",
                "price": "10.90万",
                "syear": 2019
              },
              ...
        """
        if respJson:
            for eachModelGroupDict in respJson:
                modelGroupName = eachModelGroupDict["name"]
                modelSpecList = eachModelGroupDict["speclist"]
                for eachModelDict in modelSpecList:
                    # copy per model, so that models do not overwrite each other
                    curCarModelDict = copy.deepcopy(carModelDict)

                    carModelYear = "%s款" % eachModelDict["syear"]
                    # carModelSpecUrl = "%s/%s" % (CarSpecPrefix, eachModelDict["specid"])
                    carModelSpecUrl = self.genSpecUrl(eachModelDict["specid"])

                    curCarModelDict["carModelGroupName"] = modelGroupName
                    curCarModelDict["carModelYear"] = carModelYear
                    curCarModelDict["carModelEmissionStandards"] = ""
                    curCarModelDict["carModelPower"] = ""
                    curCarModelDict["carModelDriveType"] = eachModelDict["drivetype"]
                    curCarModelDict["carModelGearBox"] = eachModelDict["gearbox"]
                    curCarModelDict["carModelName"] = eachModelDict["specname"]
                    curCarModelDict["carModelSpecUrl"] = carModelSpecUrl
                    curCarModelDict["carModelMsrp"] = eachModelDict["price"]

                    self.send_message(self.project_name, curCarModelDict, url=carModelSpecUrl)
                    # self.processCarSpecConfig(curCarModelDict)

    @catch_status_code_error
    def processCarSpecConfig(self, curCarModelDict):
        carModelDict = copy.deepcopy(curCarModelDict)
        print("processCarSpecConfig: carModelDict=%s" % carModelDict)

        carModelSpecUrl = carModelDict["carModelSpecUrl"]
        print("carModelSpecUrl=%s" % carModelSpecUrl)
        carModelSpecId = self.extractSpecId(carModelSpecUrl)
        print("carModelSpecId=%s" % carModelSpecId)
        carModelDict["carModelSpecId"] = carModelSpecId # 43593
        carConfigSpecUrl = self.genConfigSpecUrl(carModelSpecId) # https://car.autohome.com.cn/config/spec/43593.html
        print("carConfigSpecUrl=%s" % carConfigSpecUrl)
        self.crawl(carConfigSpecUrl,
            fetch_type="js",
            callback=self.carConfigSpecCallback,
            save=carModelDict,
        )

    @catch_status_code_error
    def carConfigSpecCallback(self, response):
        curCarModelDict = response.save
        print("curCarModelDict=%s" % curCarModelDict)
        carModelDict = copy.deepcopy(curCarModelDict)

        configSpecHtml = response.text
        # print("configSpecHtml=%s" % configSpecHtml)
        # print("")
        """
        <table class="tbcs" id="tab_0" style="width: 932px;">
            <tbody>
                <tr>
                    <th class="cstitle" show="1" pid="tab_0" id="nav_meto_0" colspan="5">
                        <h3><span>基本参数</span></h3>
                    </th>
                </tr>
                <tr data-pnid="1_-1" id="tr_0">
        """
        tbodyDoc = response.doc("table[id='tab_0'] tbody")
        print("tbodyDoc=%s" % tbodyDoc)

        # energy type: 纯电动 (pure electric) / 燃油 (fuel) / 插电式混合动力 (plug-in hybrid)
        carEnergyType = self.getItemFirstValue(tbodyDoc, 2)
        carModelDict["carEnergyType"] = carEnergyType

        if carEnergyType == "燃油":
            # carModelEmissionStandards =
            print("TODO: 燃油")
        elif carEnergyType == "纯电动":
            # release time
            carReleaseTime = self.getItemFirstValue(tbodyDoc, 3) # 2019.11
            carModelDict["carReleaseTime"] = carReleaseTime

            # MIIT pure-electric range in km (工信部纯电续航里程)
            carMiitEnduranceMileagePureElectric = self.getItemFirstValue(tbodyDoc, 4) # 265
            carModelDict["carMiitEnduranceMileagePureElectric"] = carMiitEnduranceMileagePureElectric

            # fast-charge time in hours (快充时间)
            carQuickCharge = self.getItemFirstValue(tbodyDoc, 5) # 0.6
            carModelDict["carQuickCharge"] = carQuickCharge

            # slow-charge time in hours (慢充时间)
            carSlowCharge = self.getItemFirstValue(tbodyDoc, 6) # 17
            carModelDict["carSlowCharge"] = carSlowCharge

            # fast-charge capacity percentage (快充电量百分比)
            carQuickChargePercent = self.getItemFirstValue(tbodyDoc, 7) # 80
            carModelDict["carQuickChargePercent"] = carQuickChargePercent

            # max power in kW (最大功率)
            carMaxPower = self.getItemFirstValue(tbodyDoc, 8) # 100
            carModelDict["carMaxPower"] = carMaxPower

            # max torque in N·m (最大扭矩)
            carMaxTorque = self.getItemFirstValue(tbodyDoc, 9) # 290
            carModelDict["carMaxTorque"] = carMaxTorque

            # electric motor horsepower in Ps (电动机)
            carHorsePowerElectric = self.getItemFirstValue(tbodyDoc, 10) # 136
            carModelDict["carHorsePowerElectric"] = carHorsePowerElectric

            # length*width*height in mm (长*宽*高)
            carSize = self.getItemFirstValue(tbodyDoc, 11) # 4237*1785*1548
            carModelDict["carSize"] = carSize

            # body structure (车身结构)
            carBodyStructure = self.getItemFirstValue(tbodyDoc, 12) # 5门5座SUV
            carModelDict["carBodyStructure"] = carBodyStructure

            # top speed in km/h (最高车速)
            carMaxSpeed = self.getItemFirstValue(tbodyDoc, 13) # 150
            carModelDict["carMaxSpeed"] = carMaxSpeed

            # official 0-100km/h acceleration in s (官方0-100km/h加速)
            carOfficialSpeedupTime = self.getItemFirstValue(tbodyDoc, 14) # -
            carModelDict["carOfficialSpeedupTime"] = carOfficialSpeedupTime

            # measured 0-100km/h acceleration in s (实测0-100km/h加速)
            carActualTestSpeedupTime = self.getItemFirstValue(tbodyDoc, 15) # -
            carModelDict["carActualTestSpeedupTime"] = carActualTestSpeedupTime

            # measured 100-0km/h braking distance in m (实测100-0km/h制动)
            carActualTestBrakeDistance = self.getItemFirstValue(tbodyDoc, 16) # -
            carModelDict["carActualTestBrakeDistance"] = carActualTestBrakeDistance

            # measured range in km (实测续航里程)
            carActualTestEnduranceMileage = self.getItemFirstValue(tbodyDoc, 17) # -
            carModelDict["carActualTestEnduranceMileage"] = carActualTestEnduranceMileage

            # measured fast-charge time in hours (实测快充时间)
            carActualTestQuickCharge = self.getItemFirstValue(tbodyDoc, 18) # -
            carModelDict["carActualTestQuickCharge"] = carActualTestQuickCharge

            # measured slow-charge time in hours (实测慢充时间)
            carActualTestSlowCharge = self.getItemFirstValue(tbodyDoc, 19) # -
            carModelDict["carActualTestSlowCharge"] = carActualTestSlowCharge

            # whole-vehicle warranty (整车质保)
            firstDivDoc = self.getItemFirstValue(tbodyDoc, 20, isRespDoc=True)
            # <div>三<span class="hs_kw7_configxv"></span>10<span class="hs_kw1_configxv"></span>公里</div>
            print("firstDivDoc=%s" % firstDivDoc)
            firstDivHtml = firstDivDoc.html()
            # carWholeWarranty = firstDivDoc.text() # 三10公里
            print("firstDivHtml=%s" % firstDivHtml)
            # 三<span class="hs_kw7_configCC"></span>10<span class="hs_kw1_configCC"></span>公里
            # carWholeQualityQuarantee = re.sub("[^<>]+(?P<firstSpan><span.+?></span>)[^<>]+(?P<secondSpan><span.+?></span>)[^<>]+", )
            foundYearDistance = re.search("(?P<warrantyYear>[^<>]+)<span.+?></span>(?P<distanceNumber>[^<>]+)<span.+?></span>(?P<distanceUnit>[^<>]+)", firstDivHtml)
            warrantyYear = foundYearDistance.group("warrantyYear")
            distanceNumber = foundYearDistance.group("distanceNumber")
            distanceUnit = foundYearDistance.group("distanceUnit")
            carWholeWarranty = "%s年或%s万%s" % (warrantyYear, distanceNumber, distanceUnit)
            print("carWholeWarranty=%s" % carWholeWarranty) # 三年或10万公里
            carModelDict["carWholeWarranty"] = carWholeWarranty
        elif carEnergyType == "插电式混合动力":
            print("TODO: 插电式混合动力")
        else:
            errMsg = "TODO: add support %s!" % carEnergyType
            raise Exception(errMsg)

    @catch_status_code_error
    def getItemFirstValue(self, rootDoc, trNumber, isRespDoc=False):
        """
        <tr data-pnid="1_-1" id="tr_2">
            <th>
                <div id="1149"><a href="https://car.autohome.com.cn/baike/detail_7_18_1149.html#pvareaid=2042252">能源类型</a>
                </div>
            </th>
            <td style="background:#F0F3F8;">
                <div>纯电动</div>
            </td>

        <tr data-pnid="1_-1" id="tr_3">
            <th>
                <div id="0">上市<span class="hs_kw40_configxv"></span></div>
            </th>
            <td style="background:#F0F3F8;">
                <div>2019.11</div>
            </td>
            <td>
                <div>2019.11</div>
            </td>
            <td>
                <div></div>
            </td>
            <td>
                <div></div>
            </td>
        </tr>
        """
        trQuery = "tr[id='tr_%s']" % trNumber
        # print("trQuery=%s" % trQuery)
        trDoc = rootDoc.find(trQuery)
        # print("trDoc=%s" % trDoc)
        tdDocGenerator = trDoc.items("td")
        # print("tdDocGenerator=%s" % tdDocGenerator)
        tdDocList = list(tdDocGenerator)
        # print("tdDocList=%s" % tdDocList)
        firstTdDoc = tdDocList[0]
        # print("firstTdDoc=%s" % firstTdDoc)
        firstTdDivDoc = firstTdDoc.find("div")
        print("firstTdDivDoc=%s" % firstTdDivDoc)
        if isRespDoc:
            respItem = firstTdDivDoc
        else:
            firstItemValue = firstTdDivDoc.text()
            respItem = firstItemValue
        print("respItem=%s" % respItem)
        return respItem
The code above is provided for reference.
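The root cause was dict aliasing: assigning a dict to a new name inside a loop does not copy it, so every iteration keeps writing into the same object and later models overwrite earlier ones. Below is a minimal sketch of that effect, separate from the crawler itself. The helper names `collect_shared` and `collect_copied` and the sample rows are hypothetical; only the field names mirror the ones used above.

```python
import copy

def collect_shared(series_dict, model_rows):
    """Buggy pattern: plain assignment, so all results are the same dict object."""
    results = []
    for row in model_rows:
        cur = series_dict            # no copy is made here
        cur.update(row)              # overwrites what the previous iteration stored
        results.append(cur)
    return results

def collect_copied(series_dict, model_rows):
    """Fixed pattern: each iteration works on its own deep copy."""
    results = []
    for row in model_rows:
        cur = copy.deepcopy(series_dict)
        cur.update(row)
        results.append(cur)
    return results

# hypothetical sample rows, in the order they would be parsed
rows = [{"carModelYear": "2020款"}, {"carModelYear": "2019款"}]

shared = collect_shared({"carSeriesName": "红旗H5"}, rows)
copied = collect_copied({"carSeriesName": "红旗H5"}, rows)

shared_years = [r["carModelYear"] for r in shared]
copied_years = [r["carModelYear"] for r in copied]
print(shared_years)  # the 2020款 entry has been replaced by 2019款
print(copied_years)  # both years survive
```

In the sketch a shallow `dict(series_dict)` would also work, because the values are plain strings; `copy.deepcopy` is the safer default once a dict may hold nested lists or dicts.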
After this change, the records that were wrong before now come out correctly.
Searching the results for:
2020款 1.5T DCT旗悦版
the matching record now has the year 2020款:
https://www.autohome.com.cn/spec/46112/#pvareaid=3454492 91 https://car3.autoimg.cn/cardfs/series/g26/M05/AE/94/100x100_f40_autohomecar__wKgHEVs9tm6ASWlTAAAUz_2mWTY720.png 红旗 一汽红旗 https://car.autohome.com.cn/price/brand-91-190.html#pvareaid=2042363 前置前驱 国VI 7挡双离合 1.5升 涡轮增压 169马力 国VI 14.58万 2020款 1.5T DCT旗悦版 1.5T https://www.autohome.com.cn/spec/46112/#pvareaid=3454492 2020款 4410 4 中型车 https://car2.autoimg.cn/cardfs/product/g3/M04/92/40/380x285_0_q87_autohomecar__ChsEkV8G1BiAFN2JAAlzGHoYv9M868.jpg 19.08万 14.58万 14.58-19.08万 https://www.autohome.com.cn/4410/price.html#pvareaid=101446 红旗H5 https://www.autohome.com.cn/4410/#levelsource=000000000_0&pvareaid=101594 {}
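To check that no other record still has an empty year, the exported results can be scanned for the same symptom that exposed the bug. The following is only a sketch: it assumes the records are available as a list of dicts with `carModelName` and `carModelYear` keys, and `find_missing_year` and the sample records are hypothetical. The regular expression is the same one used as a fallback in `processSingleSiftDd`.

```python
import re

# same pattern as the year fallback in processSingleSiftDd
YEAR_PATTERN = re.compile(r"(?P<yearType>\d{4}款)")

def find_missing_year(records):
    """Return (modelName, recoveredYear) for every record whose carModelYear is empty.

    recoveredYear is the year parsed from the model name, or None if the name has none.
    """
    missing = []
    for rec in records:
        if not rec.get("carModelYear"):
            name = rec.get("carModelName", "")
            found = YEAR_PATTERN.search(name)
            recovered = found.group("yearType") if found else None
            missing.append((name, recovered))
    return missing

# hypothetical sample: one good record, one with a null year, one with no year at all
sample_records = [
    {"carModelName": "2020款 1.5T DCT旗悦版", "carModelYear": "2020款"},
    {"carModelName": "2020款 1.5T DCT旗悦版", "carModelYear": None},
    {"carModelName": "unknown", "carModelYear": ""},
]
missing = find_missing_year(sample_records)
for name, recovered in missing:
    print("missing year: name=%s, recoverable=%s" % (name, recovered))
```

An empty result from such a scan would suggest the fix covers all series; any remaining entries point at pages whose structure the parser does not handle yet.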
When reposting, please credit: 在路上 » [Solved] Missing car model and series data for some models such as 红旗H5