折腾:
【记录】用PySpider去爬取scholastic的绘本书籍数据
期间,去真正Run批量爬取,结果看到输出的log中出错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | [I 181016 09 : 10 : 38 result_worker: 33 ] result ScholasticStorybook:b9571bf852d2a10a2f14a999e5f8c51b https: / / www.scholastic.com / content / scholastic / books2 / house - of - robots - by - chris - grabenstein - > { 'originUrl' : ' https: / / www.sch [E 181016 09 : 10 : 38 result_worker: 63 ] Object of type 'ObjectId' is not JSON serializable Traceback (most recent call last): File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/result/result_worker.py" , line 54 , in run self .on_result(task, result) File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/result/result_worker.py" , line 38 , in on_result result = result File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/database/sqlite/resultdb.py" , line 58 , in save return self ._replace(tablename, * * self ._stringify(obj)) File "/Users/crifan/.local/share/virtualenvs/crawler_scholastic_storybook-ttmbK5Yf/lib/python3.6/site-packages/pyspider/database/sqlite/resultdb.py" , line 44 , in _stringify data[ 'result' ] = json.dumps(data[ 'result' ]) File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py" , line 231 , in dumps return _default_encoder.encode(obj) File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py" , line 199 , in encode chunks = self .iterencode(o, _one_shot = True ) File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py" , line 257 , in iterencode return _iterencode(o, 0 ) File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py" , line 180 , in default o.__class__.__name__) TypeError: Object of type 'ObjectId' is not JSON serializable |

以为出错的url是:
后来发现不是。
然后以为是ObjectId,是之前接触到的MongoDB中的变量类型
所以去把代码改为:
1 2 3 4 5 6 7 8 9 10 | class ResultMongo( object ): ... def on_result( self , result): """save result to mongodb""" print ( "ResultMongo on_result: result=%s" % result) respResult = None if result: respResult = self .collection.insert(result) print ( "respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69 # return respResult |
即:不去保存和返回mongodb的insert返回的ObjectId类型的变量了。
结果问题依旧。
怀疑是:
1 2 3 4 5 6 7 8 | class Handler(BaseHandler): mongo = ResultMongo() print ( "mongo=%s" % mongo) 。。。 def on_result( self , result): print ( "PySpider on_result: result=%s" % result) self .mongo.on_result(result) # 执行插入数据的操作 super (Handler, self ).on_result(result) # 调用原有的数据存储 |
中
1 | super(Handler, self).on_result(result) |
的问题,都想要去注释掉呢,反正其实也用不到,数据都已保存到MongoDB了。
然后对于此处错误,开始以为只有一处出错,后来看log才发现:

是多处都出错了
-》证明不是某个url的问题
-》基本上确定就是:
1 | super(Handler, self).on_result(result) |
的问题
-》估计是json里面嵌套的值,此处PySPider中无法直接保存,而出错的。
-》打算去注释掉。
先不这么做,先去注释掉要保存的数据中的recommendations:

因为recommendations是对象的列表而不是普通变量类型的列表

-》而其他要保存的字段都是普通的类型,应该不会出现无法保存的问题。
然后再去运行看看:
如果没有出现此处问题,就说明之前猜测是对的。
竟然还是出错,问题依旧:

pyspider TypeError: Object of type ‘ObjectId’ is not JSON serializable
pyspider result_worker TypeError Object of type not JSON serializable
确定就是PySPider中的result_worker的报的错
且此处是不支持ObjectId
-》而ObjectId本身是pymongo的类型
-》所以要去搞清楚,此处保存的数据中,到底哪里包含了:
ObjectId
而此处感觉能和ObjectId有关系的,就只有这一处,所以强制转换为str吧:
1 2 3 4 5 6 7 8 9 10 11 | def on_result( self , result): """save result to mongodb""" print ( "ResultMongo on_result: result=%s" % result) respResult = None if result: respResult = self .collection.insert(result) print ( "type(respResult)=%s" % type (respResult)) respResult = str (respResult) print ( "type(respResult)=%s" % type (respResult)) print ( "respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69 return respResult |
先去调试确保输出是str:
1 2 3 | type (respResult)=<class 'bson.objectid.ObjectId' > type (respResult)=<class 'str' > respResult=5bc54e0bbfaa44fcce305d8d |
然后再去批量爬取,结果:
问题依旧。
去看Results:
果然是空的:

都没有保存成功。
感觉:
1 | super(Handler, self).on_result(result) # 调用原有的数据存储 |
的写法,难道有问题?
先去加上提调试代码:
1 2 | for eachValue in respDict.values(): print( "eachValue=%s, type(eachValue)=%s" , eachValue, type (eachValue)) |
看看保存数据的类型是否全是普通类型
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | eachValue = https: / / www.scholastic.com / content / scholastic / books2 / house - of - robots - by - chris - grabenstein , type (eachValue) = < class 'str' > eachValue = https: / / www.scholastic.com / teachers / books / house - of - robots - by - chris - grabenstein / , type (eachValue) = < class 'str' > eachValue = House of Robots, type (eachValue) = < class 'str' > eachValue = https: / / www.scholastic.com / content5 / media / products / 49 / 9780545912549_mres .jpg , type (eachValue) = < class 'str' > eachValue = [ 'Chris Grabenstein' , 'James Patterson' ], type (eachValue) = < class 'list' > eachValue = [ 'Juliana Neufeld' ], type (eachValue) = < class 'list' > eachValue = , type (eachValue) = < class 'str' > eachValue = 0 , type (eachValue) = < class 'int' > eachValue = [ '3-5' , '6-8' ], type (eachValue) = < class 'list' > eachValue = T, type (eachValue) = < class 'str' > eachValue = 750L , type (eachValue) = < class 'str' > eachValue = , type (eachValue) = < class 'str' > eachValue = 50 , type (eachValue) = < class 'str' > eachValue = Fiction, type (eachValue) = < class 'str' > eachValue = It 's never been easy for Sammy Hayes-Rodriguez to fit in, so he' s dreading the day when his genius mom insists he bring her newest invention to school: a walking, talking robot he calls E - for "Error." Sammy 's no stranger to robots; his house is full of them. But this one not only thinks it' s Sammy 's brother; it' s actually even nerdier than Sammy. Will E be Sammy 's one-way ticket to Loserville? Or will he prove to the world that it' s cool to be square? It 's a roller-coaster ride for Sammy to discover the amazing secret E holds that could change his family forever, if all goes well on the trial run!, type(eachValue)=<class ' str '> eachValue = 336 , type (eachValue) = < class 'int' > eachValue = 9780545912549 , type (eachValue) = < class 'str' > eachValue = [ 'Fitting In' , 'Inventors and Inventions' , 'Middle School' , 'Siblings' ], type (eachValue) = < class 'list' > eachValue = [{ 'url' : ' https: / / www.scholastic.com / content / scholastic / books2 / my - sister - the - vampire - 11 - vampire - school - dropout - by - sienna - mer ', ' title ': ' Vampire School Dropout? '}, {' url ': ' https: / / www.scholastic.com / content / scholastic / books2 / middle - school - my - brother - is - a - big - fat - liar - by - james - patterson ', ' title ': ' My Brother Is a Big, Fat Liar '}, {' url ': ' https: / / www.scholastic.com / content / scholastic / books2 / candy - apple - 11 - the - sister - switch - by - jane - b - mason ', ' title ': ' The Sister Switch '}], type(eachValue)=<class ' list '> |
好像是没问题的:
除了recommendations外,
都是str或int,或str的list,都是普通变量,没有ObjectId
算了,还是:
要么注释掉:super(Handler, self).on_result(result) # 调用原有的数据存储
要么想办法找到正确的写法?
先去找找是否有更好的写法
pyspider result worker
pyspider result worker on_result
pyspider on_result
demo.pyspider.org 部署经验 | Binuxの杂货铺
“super(Handler, self).on_result(result)”
是作者自己这么写的,说明没问题。
Pyspider操作指南 | 思维之海
算了,去注释掉:
1 2 3 4 | def on_result(self, result): print( "PySpider on_result: result=%s" % result) self.mongo.on_result(result) # 执行插入数据的操作 # super(Handler, self).on_result(result) # 调用原有的数据存储 |
结果:
终于没有错误了,但是Results中也不会有数据保存了:


【总结】
此处PySpider保存的json字典中,没有特殊的值的类型,都是普通的str,int,str的list等,
尤其是没有(pymongo的)ObjectId
并且:(实际上是没关系,但是以防万一)我ResultMongo的on_result中的(真正是)ObjectId的变量,也去转为str了。
但是结果用:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 | from pyspider.libs.base_handler import * import re import json # import html import lxml from bs4 import BeautifulSoup from urllib.parse import quote_plus from pymongo import MongoClient class ResultMongo( object ): def __init__( self ): print ( "ResultMongo __init__" ) self .client = createMongoClient() print ( "self.client=%s" % self .client) self .db = self .client[MONGODB_DB_NAME] print ( "self.db=%s" % self .db) # self.db=Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'Scholastic') self .collection = self .db[MONGODB_COLLECTION_NAME] print ( "self.collection=%s" % self .collection) # self.collection=Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), 'Scholastic'), 'Storybook') def __del__( self ): print ( "ResultMongo __del__" ) self .client.close() def on_result( self , result): """save result to mongodb""" print ( "ResultMongo on_result: result=%s" % result) respResult = None if result: respResult = self .collection.insert(result) print ( "type(respResult)=%s" % type (respResult)) respResult = str (respResult) print ( "type(respResult)=%s" % type (respResult)) print ( "respResult=%s" % respResult) # respResult=5bc45fad7f4d3847b78e8c69 return respResult class Handler(BaseHandler): mongo = ResultMongo() print ( "mongo=%s" % mongo) # for debug for eachValue in respDict.values(): print ( "eachValue=%s, type(eachValue)=%s" % (eachValue, type (eachValue))) ... return respDict def on_result( self , result): print ( "PySpider on_result: result=%s" % result) self .mongo.on_result(result) # 执行插入数据的操作 super (Handler, self ).on_result(result) # 调用原有的数据存储 |
但是竟然竟然报错:
TypeError: Object of type ‘ObjectId’ is not JSON serializable
所以很是诡异。
最后没办法,只有去注释掉PySpider中的保存数据:
1 | # super(Handler, self).on_result(result) |
而规避此问题。
-》由此,当然PySpider中webui点击Results的话,也是看不到结果,是空的了。
TODO:
如果以后有机会和时间,再去深入研究,为何出现这么奇怪的问题,找到根本原因。
【后记1】
后来开始真正批量爬取时,又出现此错误了:

所以看来是其他方面的问题,不是此处的问题。
转载请注明:在路上 » 【部分解决】PySPider出错:TypeError: Object of type ‘ObjectId’ is not JSON serializable