【Background】
In Python, my code had already produced a variable of type dict.
The part of it holding the comment content was:
"cms_body" : "%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E",
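I then used:
unquotedCmsBody = urllib.unquote(mainCmtDataDict['cms_body'])
to get the decoded content. But the result still looked like the raw, percent-encoded content. Only the English (ASCII) parts had been unquoted.
The remaining Chinese parts stayed garbled and could not be displayed.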
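A likely explanation for the symptom, offered here as an assumption rather than something the post verified: when Python 2's urllib.unquote receives a unicode object, it turns each %XX escape into one code point (as if the bytes were Latin-1) instead of one byte, so a three-byte UTF-8 character becomes three unrelated characters. The sketch below imitates that in Python 3, where `unquote` takes an `encoding` argument. The variable names and the short excerpt of `cms_body` are chosen only for illustration.

```python
# Python 3 sketch; the post itself uses Python 2's urllib.unquote.
from urllib.parse import unquote

# Short excerpt taken from the end of the cms_body value above.
quoted = "title%3D%22%E9%A1%B6%22"

# Assumption: decoding each %XX as Latin-1 (one code point per byte)
# mimics what Python 2's unquote does with a unicode argument.
garbled = unquote(quoted, encoding="latin-1")

# Re-encoding as Latin-1 restores the original bytes, which are valid UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

The ASCII escapes (%3D, %22) come out right either way, which matches the observation that only the English parts were unquoted correctly.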
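【Solution process】
1. I tried it myself in IDLE, but the result was no clearer.
I thought I already understood encodings in Python well, yet this problem still left me confused and unable to solve it.
2. I spent a long time searching for material and experimenting, still without success.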
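3. In the end I followed this article:
python中unicode、utf8、gbk等编码问题 (Code Monkey·程序猿, a blog for python/php/iOS programmers)
which says:
“Special note: taking the raw form of utf8- or gbk-encoded text, prefixing it with u, and then converting it to unicode is wrong and will never work. So how do you get rid of the u? It cannot be converted directly, so the only option is to run u'%E9%95%BF%E6%98%A5%E5%B8%82' through str() to drop the u, after which everything is OK.
>>> urllib.unquote(str(s)).decode('utf8')”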
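Using:
import logging
import urllib  # Python 2
cmdBodyStr = str(mainCmtDataDict['cms_body'])  # unicode -> plain str
logging.info("cmdBodyStr=%s", cmdBodyStr)
urldecodedCmsBody = urllib.unquote(cmdBodyStr)  # str of UTF-8 bytes
logging.info("urldecodedCmsBody=%s", urldecodedCmsBody)
finally solved the problem: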
LINE 754 INFO cmdBodyStr=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E
LINE 756 INFO urldecodedCmsBody=阿杰姐妹继续泪流满面……(*^__^*) 这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />
LINE 758 INFO curType=<type 'str'>
After tidying up, the code became:
import logging
import urllib  # Python 2
cmsBodyUni = mainCmtDataDict['cms_body']
logging.info("cmsBodyUni=%s", cmsBodyUni)
curType = type(cmsBodyUni)
logging.info("curType=%s", curType)
# Step 1: unicode -> plain str
cmdBodyStr = str(cmsBodyUni)
logging.info("cmdBodyStr=%s", cmdBodyStr)
curType = type(cmdBodyStr)
logging.info("curType=%s", curType)
# Step 2: unquote the str, giving UTF-8 encoded bytes
urlunquotedCmsBodyStr = urllib.unquote(cmdBodyStr)
logging.info("urlunquotedCmsBodyStr=%s", urlunquotedCmsBodyStr)
curType = type(urlunquotedCmsBodyStr)
logging.info("curType=%s", curType)
# Step 3: decode the UTF-8 bytes back to unicode
urlunquotedCmsBodyUni = urlunquotedCmsBodyStr.decode("UTF-8")
curType = type(urlunquotedCmsBodyUni)
logging.info("curType=%s", curType)
logging.info("urlunquotedCmsBodyUni=%s", urlunquotedCmsBodyUni)
Output:
LINE 755 INFO cmsBodyUni=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E
LINE 757 INFO curType=<type 'unicode'>
LINE 759 INFO cmdBodyStr=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E
LINE 761 INFO curType=<type 'str'>
LINE 763 INFO urlunquotedCmsBodyStr=阿杰姐妹继续泪流满面……(*^__^*) 这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />
LINE 765 INFO curType=<type 'str'>
LINE 768 INFO curType=<type 'unicode'>
LINE 769 INFO urlunquotedCmsBodyUni=阿杰姐妹继续泪流满面……(*^__^*) 这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />
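【Summary】
In this case I already had a unicode string whose content was quoted (percent-encoded), for example:
%E8%BD%AC%E5%8F%91%E5%BE%AE%E5%8D%9A
To get the corresponding Chinese content from it:
1. First convert the unicode string to a plain str. This is safe here because a percent-encoded string contains only ASCII characters.
quotedStringStrType = str(quotedStringUnicodeType)
2. Then decode it with urllib.unquote to get the actual Chinese content.
urlunquotedOriginalStr = urllib.unquote(quotedStringStrType)
Note:
The string obtained from this final decoding is UTF-8 encoded. Call .decode("UTF-8") on it, as in the tidied-up code above, if you need a unicode object.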
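The steps above can be wrapped in a small helper. This is only a sketch: the name `unquote_utf8` is made up for illustration, and the import fallback is an assumption so that the same code runs on Python 2 (the version used in this post) and on Python 3, where the byte-level function is `urllib.parse.unquote_to_bytes`.

```python
try:
    # Python 3: unquote_to_bytes returns the raw bytes behind the escapes.
    from urllib.parse import unquote_to_bytes
except ImportError:
    # Python 2: unquote on a plain str already returns a byte string.
    from urllib import unquote as unquote_to_bytes


def unquote_utf8(quoted):
    """Decode a percent-encoded string whose escapes are UTF-8 bytes.

    Hypothetical helper, not part of urllib.
    """
    # Step 1: force a byte string; percent-encoded text is pure ASCII.
    as_bytes = quoted.encode("ascii")
    # Step 2: turn every %XX escape into the byte it stands for.
    raw = unquote_to_bytes(as_bytes)
    # Step 3: interpret those bytes as UTF-8 to get real text.
    return raw.decode("utf-8")


print(unquote_utf8("%E8%BD%AC%E5%8F%91%E5%BE%AE%E5%8D%9A"))
```

On Python 3 alone the helper is not strictly needed, since `urllib.parse.unquote` decodes the escapes as UTF-8 by default.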
Please credit the source when reposting: 在路上 » 【Solved】 Garbled output from unquote in Python