After:
【记录】分析xxxapp中数据来源和如何爬取-1
and
【记录】分析xxxapp中数据来源和如何爬取-2
the next step is to work out the logic for crawling the data.
The crawling plan is laid out below.
Crawling logic:
First, fetch get_course_list with category=all:
https://childapi.xxx.com/course/get_course_list?sign=4e89d3ae91d6e5c2a05908be477e913d&uid=0&sort=new&level=all&nature_style=all&start=0&category_id=1&auth_token=0&ishow=0&nature_area=all&rows=10&nature_id=all&stamp=1536819997
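Each request uses rows=10, and start increases by 10 each time: start=0, 10, 20, and so on.
Repeat this to fetch the complete list, which includes the id of every course.
In testing, there were at most 36,000+ course ids.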
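The paging described above can be sketched as a small URL builder. This is only an illustration: the function names are hypothetical, and `sign` and `stamp` are left as placeholders because these notes do not cover how the app computes them, or whether a captured value stays valid once `start` changes.

```python
from urllib.parse import urlencode

BASE_URL = "https://childapi.xxx.com/course/get_course_list"


def course_list_url(start, rows=10, sign="SIGN", stamp="STAMP"):
    """Build the URL of one get_course_list page; only `start` changes per page."""
    params = {
        "sign": sign,    # placeholder, not a real signature
        "uid": 0,
        "sort": "new",
        "level": "all",
        "nature_style": "all",
        "start": start,
        "category_id": 1,
        "auth_token": 0,
        "ishow": 0,
        "nature_area": "all",
        "rows": rows,
        "nature_id": "all",
        "stamp": stamp,  # placeholder, not a real timestamp
    }
    return BASE_URL + "?" + urlencode(params)


def course_list_urls(page_count, rows=10):
    """Return the URLs for start=0, rows, 2*rows, ... for `page_count` pages."""
    return [course_list_url(page * rows, rows) for page in range(page_count)]
```

Calling `course_list_urls(3)` yields the URLs for start=0, 10, and 20.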
Then fetch the details of each course:
https://childapi.xxx.com/course/detail_new?course_id=61288
Save the mp4 video, mp3 audio, srt subtitles, and the JSON detail info.
For each course, also fetch:
the users who have already dubbed it:
https://childapi.xxx.com/course/last_show_peoples?sign=da0a90e3c7ee2f7656cbae6575c3521d&stamp=1536895345&uid=0&auth_token=0&course_id=59901&start=0&rows=20
and:
the like ranking:
https://childapi.xxx.com/StudyShow/course_show?sign=da0a90e3c7ee2f7656cbae6575c3521d&stamp=1536895345&uid=0&auth_token=0&course_id=59901&start=0&rows=20
Keep requesting with rows=20 and start at successive multiples of 20 until the returned data is an empty list [].
the play ranking:
https://childapi.xxx.com/StudyShow/viewTop?sign=4ccb35d6f3b4cfecc81c9b02462c5604&stamp=1536895896&uid=0&auth_token=0&course_id=59901&start=0&rows=20
Keep requesting with rows=20 and start at successive multiples of 20 until the returned data is an empty list [].
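The "page until data is empty" rule used by these ranking endpoints can be sketched as one helper. This is a minimal sketch under assumptions: `fetch_page` is a hypothetical callable you supply that performs the HTTP request and returns the parsed JSON as a dict with a `data` key, and `max_pages` is an added safety cap that is not part of the original notes.

```python
def fetch_all(fetch_page, rows=20, max_pages=10000):
    """Collect items page by page until the response's `data` list is empty.

    fetch_page(start, rows) is assumed to return a dict such as {"data": [...]}.
    """
    items = []
    for page in range(max_pages):
        response = fetch_page(page * rows, rows)
        data = response.get("data") or []
        if not data:
            # An empty list [] marks the end of the ranking.
            break
        items.extend(data)
    return items
```

Keeping the network call outside the helper means the same loop works for last_show_peoples, course_show, and viewTop, and it can be exercised with an in-memory stand-in before hitting the real API.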
In addition, to collect more users, separately crawl:
bottom tab Rankings -> Top students (学霸):
https://childapi.xxx.com/top/sign_top?sign=17c411d4c4d011c614b7cef633fc68ce&stamp=1536909875&uid=0&auth_token=0&area_id=0&start=40&rows=20
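Keep requesting with rows=20 and start at successive multiples of 20 until the returned data is an empty list []. In testing, this never returned more than 200 entries.
bottom tab Rankings -> Popularity (人气):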
https://childapi.xxx.com/top/shownews_top_redis?sign=f00a8402d9ea4f26a0eee153aba71fb3&stamp=1536910053&uid=0&time_type=1&area_id=0&ranking_type=0&start=40&auth_token=0&rows=20
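Keep requesting with rows=20 and start at successive multiples of 20 until the returned data is an empty list []. In testing, this never returned more than 200 entries.
For each user:
fetch the details: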
https://childapi.xxx.com/member?sign=a5d2f9bd3f98214ab3cf8de47e14833c&stamp=1536917132&uid=0&auth_token=0&member_id=4864238
Then fetch each user's list of works:
https://childapi.xxx.com/member/show_list?sign=9fbde32314aff6b77be9b64b2464f78a&timestamp=1536917132&uid=0&auth_token=0&start=0&member_id=4864238&rows=20
Each of a user's works is a show, so fetch the show's details next:
https://childapi.xxx.com/show/detail?show_id=60377593
In addition, to discover more users, fetch each user's follows and fans:
https://childapi.xxx.com/member/follows?auth_token=0&member_id=32527428&rows=20&start=0&timestamp=1536919513254&uid=0&sign=e56d81d7373be0b53ecc8fe16cb2df2c
and:
https://childapi.xxx.com/member/fans?auth_token=0&member_id=13494467&rows=20&start=0&timestamp=1536919109752&uid=0&sign=4be774e38c0619529b1c7556011fd9ba
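To widen coverage further and collect more courses, users, and shows, use every id obtained to fetch the related content wherever possible.
For example:
from a show's detail, take the user and the course.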
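One way to organize this "follow every id you find" expansion is a breadth-first traversal with a seen set, so that no course, user, or show is fetched twice. The sketch below is an assumption about how this could be structured, not the method the notes prescribe: `expand` is a hypothetical callable that fetches one item and returns the (kind, id) pairs found in its response, and `limit` is an added optional cap.

```python
from collections import deque


def crawl(seeds, expand, limit=None):
    """Visit (kind, id) pairs breadth-first, following ids found along the way.

    seeds:  iterable of (kind, id) pairs, e.g. ("show", "144138897")
    expand: callable (kind, id) -> iterable of (kind, id) pairs discovered
    limit:  optional cap on the number of items visited
    """
    seen = set()
    queue = deque()
    for item in seeds:
        if item not in seen:
            seen.add(item)
            queue.append(item)

    visited = []
    while queue and (limit is None or len(visited) < limit):
        kind, item_id = queue.popleft()
        visited.append((kind, item_id))
        for found in expand(kind, item_id):
            if found not in seen:
                # Mark as seen when queued so duplicates are never enqueued.
                seen.add(found)
                queue.append(found)
    return visited
```

With a show as the seed, `expand` would return the show's user and course, the user would in turn yield more shows, follows, and fans, and so on until no new ids appear.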
The results are saved in this layout:
- course
  - 61288
    - course_61288_video.mp4
    - course_61288_audio.mp3
    - course_61288_subtitle.srt
    - course_61288_info.json
      - stores the course's information
    - course_61288_shows.json
      - stores how many shows the course has
      - updated whenever a show's details are fetched later
  - 48050
    - …
- user
  - 4864238
    - show
      - 144138897
        - show_144138897_video.mp4
        - show_144138897_info.json
          - stores the show's detailed information
      - 135839205
        - …
    - user_4864238_info.json
      - stores the user's detailed information
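The file names in this layout follow a regular pattern, so they can be derived from the ids. The helpers below are a sketch of that mapping; the function names and the `root` output folder are hypothetical, and nesting each show under its user's folder is my reading of the layout above.

```python
from pathlib import Path


def course_paths(root, course_id):
    """Return the files kept for one course under <root>/course/<course_id>/."""
    folder = Path(root) / "course" / str(course_id)
    prefix = f"course_{course_id}_"
    return {
        "video": folder / (prefix + "video.mp4"),
        "audio": folder / (prefix + "audio.mp3"),
        "subtitle": folder / (prefix + "subtitle.srt"),
        "info": folder / (prefix + "info.json"),
        "shows": folder / (prefix + "shows.json"),
    }


def user_info_path(root, user_id):
    """Return <root>/user/<user_id>/user_<user_id>_info.json."""
    return Path(root) / "user" / str(user_id) / f"user_{user_id}_info.json"


def show_paths(root, user_id, show_id):
    """Return the files kept for one show, nested under its user's folder."""
    folder = Path(root) / "user" / str(user_id) / "show" / str(show_id)
    prefix = f"show_{show_id}_"
    return {
        "video": folder / (prefix + "video.mp4"),
        "info": folder / (prefix + "info.json"),
    }
```

Before writing a file, its folder still has to be created, for example with `path.parent.mkdir(parents=True, exist_ok=True)`.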
Note:
Examples:
(1)course_61288_info.json
{ "id": "61288", "title": "我们有很多共同点", "description": "麦洛找了很多它和小猫的共同点,试图让小猫不那么害羞,可是小猫并不吃这一套。", "video": "", "video_srt": " ", "audio": " ", "pic": " https://img.xxx.cn/2018-09-10/5b9634a3d742d.jpg ", "if_subtitle": "0", "dif_level": "3", "subtitle_en": " https://cdn2.xxx.cn/2018-09-10/15365514894833.srt ", "subtitle_num": "6", "category_id": "10", "shows": "353", "views": "2270", "editor": "xxxMargaret", "editor_uid": "0", "tag": "小兔子麦洛,野猫", "status": "1", "create_time": "2016-01-26 11:38", "isalbum": "0", "top": "0", "ifshow": "1", "update_time": "1536658200", "sort": "-249308", "check": "1", "copyright": "1", "is_vip": "0", "show_peoples": "350", "score_peoples": "38", "is_score": "2", "is_needbuy": "0", "duration": "28", "bookOriginalId": "0", "permit_client": "2,4,1,3", "assign_times": "0", "copy": "视频片段摘自:“Milo and the wild cat _ Cartoon for kids”,本视频仅供免费学习使用!如需观看完整版,请支持正版!", "redirect": { "title": "", "url": "", "sort": "0" }, "editors": [ { "title": "上传", "nickname": "xxxMargaret", "uid": 0 }, { "title": "听译", "nickname": "妞妞的蛋卷", "uid": 0 }, { "title": "审校", "nickname": "酥酥Jesus", "uid": 0 }, { "title": "制作", "nickname": "徐婷", "uid": 0 } ], "share_talk": "", "share_pic": " https://img.xxx.cn/2018-09-10/5b9634a3d742d.jpg ", "share_url": " https://child.xxx.cn/index.php?m=home&c=Activity&a=childshare_video&course=MDAwMDAwMDAwMLGdpquCe8yh ", "score_type": 4, "score_weight": [ { "low": "0", "height": "55", "weight": "1.20" }, { "low": "56", "height": "70", "weight": "1.15" }, { "low": "71", "height": "80", "weight": "1.10" }, { "low": "81", "height": "90", "weight": "1.05" }, { "low": "91", "height": "100", "weight": "1.00" } ], "album_id": null, "is_strate": "0", "if_strate_buy": "0", "strate_audio_id": "0", "category": "今日更新", "nature": "", "album_title": "", "share_title": "我发现了一个超有意思的今日更新片段", "share_desc": "《我们有很多共同点》,快来围观吧", "share_friend": "我发现了一个超有意思的今日更新片段《我们有很多共同点》", "skip_url": "", "strate_url": "", "strate_isbuy": 0, "strate_pic": " https://img.xxx.cn/strate_detail.png ", "album_isbuy": 0, "feedback_url": " https://child.xxx.cn/home/basic/course_feedback?uid=0&course_id=61288 ", "video_adver": [], "course_adver": [ { "id": "6427", "title": "想和老外“无障碍交流”?来这里", "pic": " https://img.xxx.cn/2018-09-11/5b97cdc9be2b8.jpg ", "type": "custom", "son_type": "", "show_id": "0", "is_share": "0", "content": "", "scheme_url": "", "sub_title": "", "show_type": "0", "sort": "0", "shows": "10985", "views": "392", "weight": "10", "score_type": "2", "score": "2", "share_pic": " https://img.xxx.cn/2018-09-11/5b97cdc9be2b8.jpg ", "html": "", "clickreport": [], "displayreport": [], "url": " http://shaoer.xxx.com/basic/slider?adv=MDAwMDAwMDAwMLGdsquBrqKh " } ] }
(2)user_4864238_info.json
{ "id": "4864238", "uc_id": "8638661", "nickname": "夏竹", "avatar": " https://img.xxx.cn/2018-06-30/5b3734cb58bee.jpg ", "mobile": "", "app_type": "2", "version": "3.9.0", "push_info": "{\"comments\":\"1\"}", "signature": "其实皮一下很开心", "birthday": "2017-09-24", "sex": "2", "school": "1435652", "area": "4403", "type": "2", "status": "1", "reg_time": "1465926412", "fans": "973", "uid": "4864238", "follows": "1516", "views": "228", "photos": "71", "guestbooks": "0", "shows": "118", "words": "0", "collects": "4", "school_str": "h华x新小学", "is_black": "0", "support_collect": "4", "cover": " https://img.xxx.cn/cover_default.jpg ", "medal": [], "is_crown": "0", "search_id": "4927133", "ugcactive": { "icon": "", "title": "字幕组", "sub_title": "新活动", "url": " http://ugctest.xxx.cn/app/index/index? " }, "is_following": "0", "is_follow": "0", "follow_nickname": "", "user_number": "8862171", "score_time": "", "vip_endtime": "0", "is_vip": "0", "libu_level": "0", "claims_url": " https://child.xxx.cn/home/basic/report?type=3&tyid=0&uid=MDAwMDAwMDAwMLB0nm8&member_id=MDAwMDAwMDAwMLF3yGSBe67esaR0cg ", "dav": "", "dv_type": "0", "dv_status": "0", "libu_vip_endtime": "0", "libu_vip": "0", "libu_first": "0" }
(3)show_144138897_info.json
{ "id": "144138897", "uid": "4864238", "course_id": "56931", "album_id": "3353", "video": " ", "create_time": "2018-07-12 16:46", "score": "91", "comments": "0", "supports": "2", "views": "0", "diamonds": "0", "nickname": "夏竹", "avatar": "", "school": "1435652", "area": "4403", "birthday": "2017-09-24", "course_title": "你是谁呢", "pic": " https://img.xxx.cn/2018-07-03/5b3b28404a45b.jpg ", "permit_client": "2,4,1,3", "permit_show": "1", "is_crown": "0", "school_str": "", "dav": "", "dv_type": "0", "is_vip": "0" }
(4)course_61288_shows.json
{ "total": 1, "shows": [ { "id": "155469749", "uid": "2248187", "nickname": "Aegla", "course_id": "15159", "course_title": "不要低估他", "video": " " } ] }
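Since the course's shows file is meant to be updated each time a show's details are fetched, the update step could look roughly like the sketch below. It assumes that `total` is simply the number of shows saved so far and that shows are de-duplicated by `id`; neither rule is stated in the notes, and the function name is hypothetical.

```python
import json
from pathlib import Path


def add_show_to_course(shows_file, show):
    """Merge one show record into a course_<id>_shows.json file and return it."""
    path = Path(shows_file)
    if path.exists():
        record = json.loads(path.read_text(encoding="utf-8"))
    else:
        record = {"total": 0, "shows": []}

    known_ids = {saved["id"] for saved in record["shows"]}
    if show["id"] not in known_ids:
        record["shows"].append(show)
    # Assumption: total mirrors the number of shows stored in this file.
    record["total"] = len(record["shows"])

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return record
```

Writing with `ensure_ascii=False` keeps Chinese titles such as "不要低估他" readable in the saved file.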
When reposting, please credit: 在路上 » 【记录】整理爬取趣配音app的数据的逻辑