
【Solved】Garbled text from unquote in Python

Python crifan 23177 views 0 comments

【Background】

In Python, my code had already obtained the relevant dict variable.

The part of it that holds the comment content is:

  "cms_body" : "%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E",

I then used:

unquotedCmsBody = urllib.unquote(mainCmtDataDict['cms_body']);

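to decode the content, but the result still looked like the original, percent-encoded content, although the other, English (ASCII) parts had been unquoted:

[Screenshot: npp change to utf8 for python unquote]

The remaining Chinese part stayed garbled and could not be displayed.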
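A likely explanation for this symptom, offered as an assumption rather than something the original post states: when Python 2's urllib.unquote is given a unicode object, each %XX escape becomes the single character U+00XX instead of a raw byte, so the UTF-8 byte sequence is never reassembled into Chinese characters. The sketch below imitates that behaviour in Python 3 (not the Python 2 used in this post) by asking urllib.parse.unquote to decode as Latin-1; the sample string is the one quoted later in this post.

```python
from urllib.parse import unquote

# Percent-encoded UTF-8 bytes of a three-character Chinese string.
quoted = "%E9%95%BF%E6%98%A5%E5%B8%82"

# Assumption: decoding as Latin-1 mimics Python 2's unquote() on a unicode
# object, where every %XX turns into the single character U+00XX.
garbled = unquote(quoted, encoding="latin-1")
print(len(garbled))  # 9 characters, one per escaped byte

# Nothing is lost: map the characters back to bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")
print(len(repaired))  # 3 characters, the intended text
```

If this assumption holds, it also explains why the ASCII parts looked fine: for bytes below 0x80, the character U+00XX and the byte XX mean the same thing.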

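【Solution process】

1. I tried it myself in IDLE, and the result was:

[Screenshot: python shell 2.7.3 urllib unquote]

I thought I already understood encodings in Python well, but this problem still left me confused and I could not solve it.

2. I searched for a long time and tried many things, but none of them worked.

3. Finally, I referred to:

Encoding issues with unicode, utf8, gbk, etc. in Python - Code Monkey·程序猿, a python/php/iOS programmer's blog.

which says:

Pay special attention: prefixing the raw utf8- or gbk-encoded value with u and then converting it to unicode is wrong and will never work. So how do you get rid of the u? str() cannot convert it directly either, so the only option is to pass u'%E9%95%BF%E6%98%A5%E5%B8%82' through str() to drop the u, after which everything is OK.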

>>> urllib.unquote(str(s)).decode('utf8')
u'\u957f\u6625\u5e02'
>>> print urllib.unquote(str(s)).decode('utf8')
长春市
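For readers on Python 3, a rough equivalent of the session above can be sketched as follows. This is an illustration, not part of the referenced article: urllib.parse.unquote_to_bytes plays the role of urllib.unquote(str(s)), and the explicit decode corresponds to .decode('utf8').

```python
from urllib.parse import unquote, unquote_to_bytes

s = "%E9%95%BF%E6%98%A5%E5%B8%82"

# Step 1: undo the percent-encoding and keep the result as raw bytes.
raw = unquote_to_bytes(s)

# Step 2: interpret those bytes as UTF-8 to get text.
text = raw.decode("utf-8")
print(text)

# In Python 3, unquote() performs both steps with UTF-8 as its default.
print(text == unquote(s))
```

Keeping the two steps separate makes it easy to swap in another encoding (for example "gbk") when the quoted bytes are not UTF-8.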

Using:

        cmdBodyStr = str(mainCmtDataDict['cms_body']);
        logging.info("cmdBodyStr=%s", cmdBodyStr);
        urldecodedCmsBody = urllib.unquote(cmdBodyStr);
        logging.info("urldecodedCmsBody=%s", urldecodedCmsBody);

finally solved the problem:

LINE 754 INFO cmdBodyStr=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E

LINE 756 INFO urldecodedCmsBody=阿杰姐妹继续泪流满面……(*^__^*)&nbsp;这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />

LINE 758 INFO curType=<type 'str'>

After tidying up, the code used is:

        cmsBodyUni = mainCmtDataDict['cms_body'];
        logging.info("cmsBodyUni=%s", cmsBodyUni);
        curType = type(cmsBodyUni);
        logging.info("curType=%s", curType);
        cmdBodyStr = str(cmsBodyUni);
        logging.info("cmdBodyStr=%s", cmdBodyStr);
        curType = type(cmdBodyStr);
        logging.info("curType=%s", curType);
        urlunquotedCmsBodyStr = urllib.unquote(cmdBodyStr);
        logging.info("urlunquotedCmsBodyStr=%s", urlunquotedCmsBodyStr);
        curType = type(urlunquotedCmsBodyStr);
        logging.info("curType=%s", curType);
        urlunquotedCmsBodyUni = urlunquotedCmsBodyStr.decode("UTF-8");
        curType = type(urlunquotedCmsBodyUni);
        logging.info("curType=%s", curType);
        logging.info("urlunquotedCmsBodyUni=%s", urlunquotedCmsBodyUni);

Output:

LINE 755 INFO cmsBodyUni=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E

LINE 757 INFO curType=<type 'unicode'>

LINE 759 INFO cmdBodyStr=%E9%98%BF%E6%9D%B0%E5%A7%90%E5%A6%B9%E7%BB%A7%E7%BB%AD%E6%B3%AA%E6%B5%81%E6%BB%A1%E9%9D%A2%E2%80%A6%E2%80%A6%28%2A%5E__%5E%2A%29%26nbsp%3B%E8%BF%99%E7%A7%8D%E8%A1%A8%E8%BE%BE%E5%A5%BD%E5%96%9C%E5%89%A7%E5%93%A6%EF%BC%81%E4%BA%8C%E5%85%AC%E4%B8%BB%E5%A4%A7%E4%BA%BA%E6%88%90%E7%BB%A9%E4%B8%8D%E9%94%99%E5%93%A6%EF%BC%81%E4%B9%88%E4%B9%88%E5%93%92%3Cimg%20src%3D%22http%3A%2F%2Fwww.sinaimg.cn%2Fuc%2Fmyshow%2Fblog%2Fmisc%2Fgif%2FE___0088EN00SIGT.gif%22%20style%3D%22margin%3A1px%3Bcursor%3Apointer%3B%22%20onclick%3D%22window.open%28%27http%3A%2F%2Fblog.sina.com.cn%2Fmyshow2010%27%29%22%20border%3D%220%22%20title%3D%22%E9%A1%B6%22%20%2F%3E

LINE 761 INFO curType=<type 'str'>

LINE 763 INFO urlunquotedCmsBodyStr=阿杰姐妹继续泪流满面……(*^__^*)&nbsp;这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />

LINE 765 INFO curType=<type 'str'>

LINE 768 INFO curType=<type 'unicode'>

LINE 769 INFO urlunquotedCmsBodyUni=阿杰姐妹继续泪流满面……(*^__^*)&nbsp;这种表达好喜剧哦!二公主大人成绩不错哦!么么哒<img src="http://www.sinaimg.cn/uc/myshow/blog/misc/gif/E___0088EN00SIGT.gif" style="margin:1px;cursor:pointer;" onclick="window.open('http://blog.sina.com.cn/myshow2010')" border="0" title="顶" />

[Screenshot: python str then urllib unquote]

【Summary】

In this case, the string already obtained is of type unicode, and its content is quoted, that is, percent-encoded with % signs, for example:

%E8%BD%AC%E5%8F%91%E5%BE%AE%E5%8D%9A

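To get the corresponding Chinese content from it, you need to:

1. First convert the current unicode string to a plain str

quotedStringStrType = str(quotedStringUnicodeType)

2. Then decode it with urllib.unquote to get the real Chinese content

urlunquotedOriginalStr = urllib.unquote(quotedStringStrType)

Note:

The string returned by urllib.unquote here is a str holding UTF-8 encoded bytes. If a unicode object is needed, call .decode("UTF-8") on it, as in the tidied-up code above.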
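The two steps, plus the final decode, can be wrapped in one small helper. The version below is a minimal sketch for Python 3 rather than the Python 2 used in this post, and the function name unquote_to_text is hypothetical. It assumes the quoted input is pure ASCII, which is the same condition under which str() succeeds on a unicode object in Python 2.

```python
from urllib.parse import unquote_to_bytes


def unquote_to_text(quoted, encoding="utf-8"):
    """Hypothetical helper: percent-decode `quoted` and return text."""
    # Step 1: text -> byte string. A properly quoted value is pure ASCII,
    # so this raises UnicodeEncodeError if unquoted non-ASCII slipped in.
    raw_quoted = quoted.encode("ascii")
    # Step 2: undo the percent-encoding, yielding the original raw bytes.
    raw = unquote_to_bytes(raw_quoted)
    # Final step: interpret the bytes in the expected encoding.
    return raw.decode(encoding)


result = unquote_to_text("%E8%BD%AC%E5%8F%91%E5%BE%AE%E5%8D%9A")
print(result)
```

Passing the encoding as a parameter is a design choice for this sketch: the quoted bytes carry no record of their encoding, so the caller has to know it.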
