【Decoding HTML entities in Python】
When working in Python, you sometimes need to process HTML code.
HTML code, in turn, sometimes contains so-called entities.
Broadly speaking, HTML entities fall into two classes:
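- name entity: an entity identified by name, written &xxx;. For example, &copy; stands for the copyright sign: ©.
- Note: such (special) characters often cannot be represented in encodings like GBK. So if you print the Unicode character © in the Windows cmd console (GBK by default), you will see "漏" rather than "©", because the UTF-8 bytes 0xC2 0xA9 are read as a single GBK character. Conversely, if you encode the Unicode "©" as UTF-8 and write it through logging to a UTF-8 encoded file, "©" is displayed correctly.
- code point entity: an entity identified by the Unicode value of the character, called the Unicode code point (also written "code point" or "codepoint"). It is written &#xxx;, where xxx is a number, either decimal or hexadecimal (prefixed with x). For example, &copy; == &#169; == &#xa9; == ©: all of them denote the special character "©".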
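The relationship between the two entity classes, and the GBK problem from the note, can be checked with a small sketch. It assumes Python 3, where the htmlentitydefs tables live in html.entities; the variable names are only illustrative.

```python
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs

# The name "copy" and the numbers 169 / 0xA9 all identify the same code point.
code_point = name2codepoint["copy"]
copyright_sign = chr(code_point)
print(code_point, hex(code_point))

# GBK has no byte sequence for U+00A9, so encoding to GBK fails.
try:
    copyright_sign.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# UTF-8 encodes the character as two bytes; a GBK console misreads
# those two bytes as one Chinese character.
utf8_bytes = copyright_sign.encode("utf-8")
misread = utf8_bytes.decode("gbk")
print(gbk_ok, utf8_bytes)
```

Here misread holds the character that a GBK console would show in place of the copyright sign.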
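To convert every HTML entity, whether a name entity or a code point entity, into the special character it stands for, I consulted a few references and put together the following function for everyone to use: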
import re

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so the name htmlentitydefs is only available between Python 2.3 and Python 2.7.
try:
    import htmlentitydefs
except ImportError:
    import html.entities as htmlentitydefs
    unichr = chr

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char
    (and then encode to decodedEncoding if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; to the unicode char U+00A9 (copyright sign),
        then encode to decodedEncoding if decodedEncoding is not empty

    Note:
    1. Some special chars can NOT be represented in some encodings, such as U+00A9 in GBK
    2. The three passes run one after another, so already-escaped text such as
       &amp;#169; is decoded twice

    Related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.entitydefs:
    # a dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    #   'zwnj': '&#8204;', 'aring': '\xe5', 'gt': '>', 'yen': '\xa5',
    # htmlentitydefs.name2codepoint:
    # a dictionary that maps HTML entity names to the Unicode codepoints
    #   'aring': 229, 'gt': 62, 'sup': 8835, 'Ntilde': 209,
    # htmlentitydefs.codepoint2name:
    # a dictionary that maps Unicode codepoints to HTML entity names
    #   8704: 'forall', 8194: 'ensp', 8195: 'emsp', 8709: 'empty',

    def decodeName(matched):
        name = matched.group("entityName")
        if name in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[name])
        return matched.group(0)  # unknown name: keep the text unchanged

    def decodeCodepoint(matched, groupName, base):
        try:
            return unichr(int(matched.group(groupName), base))
        except (ValueError, OverflowError):
            return matched.group(0)  # invalid code point: keep the text unchanged

    # http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    # entity names may contain digits (eg: sup2, frac12), so allow them after the first letter
    decodedEntityName = re.sub(r'&(?P<entityName>[a-zA-Z][a-zA-Z0-9]*);', decodeName, origHtml)
    decodedCodepointInt = re.sub(r'&#(?P<codePointInt>[0-9]+);',
        lambda matched: decodeCodepoint(matched, "codePointInt", 10), decodedEntityName)
    decodedCodepointHex = re.sub(r'&#[xX](?P<codePointHex>[a-fA-F0-9]+);',
        lambda matched: decodeCodepoint(matched, "codePointHex", 16), decodedCodepointInt)

    decodedHtml = decodedCodepointHex
    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
    return decodedHtml
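One detail worth checking in any regex-based decoder is double decoding: when name entities and numeric entities are decoded in separate passes, text that was escaped on purpose (such as &amp;#169;) gets decoded twice. The following is a minimal sketch for Python 3, not the function above; decode_single_pass and decode_names_then_numbers are hypothetical names used only to contrast the two behaviours.

```python
import re
from html.entities import name2codepoint

# One pattern covering hex, decimal and named entities.
ENTITY = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));")

def _replace(match):
    hex_digits, dec_digits, name = match.groups()
    try:
        if hex_digits:
            return chr(int(hex_digits, 16))
        if dec_digits:
            return chr(int(dec_digits))
        return chr(name2codepoint[name])
    except (KeyError, ValueError, OverflowError):
        return match.group(0)  # unknown or invalid entity: leave it alone

def decode_single_pass(text):
    """Decode every entity exactly once."""
    return ENTITY.sub(_replace, text)

def decode_names_then_numbers(text):
    """Two sequential passes, shown only to illustrate the pitfall."""
    step1 = re.sub(
        r"&([a-zA-Z][a-zA-Z0-9]*);",
        lambda m: chr(name2codepoint[m.group(1)])
        if m.group(1) in name2codepoint else m.group(0),
        text,
    )
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), step1)

escaped = "&amp;#169; means &#169;"
once = decode_single_pass(escaped)          # the escaped reference survives as text
twice = decode_names_then_numbers(escaped)  # the escaped reference is decoded too
print(ascii(once))
print(ascii(twice))
```

Whether this matters depends on the input: for HTML that never contains escaped entity text, both approaches give the same result.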
【Converting between name entities and code point entities】
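To convert a name entity into a code point entity, for example from &nbsp; to &#160;,
or to convert a code point entity into a name entity, for example from &#160; to &nbsp;,
you can use the following functions, which I put together: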
# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so the name htmlentitydefs is only available between Python 2.3 and Python 2.7.
try:
    import htmlentitydefs
except ImportError:
    import html.entities as htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into entity code point
    eg: from &nbsp; to &#160;

    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.name2codepoint:
    #   'aring': 229, 'gt': 62, 'sup': 8835, 'Ntilde': 209,
    # nameToCodepointDict:
    #   "&aring;": "&#229;", "&gt;": "&#62;", "&sup;": "&#8835;", "&Ntilde;": "&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint  # "&aring;" -> "&#229;"

    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict:
        # plain string replacement is enough here, no regular expression is needed
        htmlWithCodepoint = htmlWithCodepoint.replace(key, nameToCodepointDict[key])
    return htmlWithCodepoint

#------------------------------------------------------------------------------
def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's entity code point into entity name
    eg: from &#160; to &nbsp;
    Only the decimal form (&#160;) is converted, not the hex form (&#xa0;)

    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.codepoint2name:
    #   8704: 'forall', 8194: 'ensp', 8195: 'emsp', 8709: 'empty',
    # codepointToNameDict:
    #   "&#8704;": "&forall;", "&#8194;": "&ensp;", "&#8195;": "&emsp;", "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName  # "&#160;" -> "&nbsp;"

    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict:
        htmlWithEntityName = htmlWithEntityName.replace(key, codepointToNameDict[key])
    return htmlWithEntityName
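As an alternative to looping over every known entity, the same conversion can be sketched with one regular-expression pass and a callback that looks each match up in the table. This assumes Python 3 (html.entities); name_to_codepoint and codepoint_to_name are hypothetical names, not part of any library.

```python
import re
from html.entities import codepoint2name, name2codepoint

def name_to_codepoint(html_text):
    """Rewrite &name; as &#NNN; for every name found in name2codepoint."""
    def swap(match):
        name = match.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return match.group(0)  # unknown name: leave unchanged
    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", swap, html_text)

def codepoint_to_name(html_text):
    """Rewrite decimal &#NNN; as &name; when the code point has a name."""
    def swap(match):
        code_point = int(match.group(1))
        if code_point in codepoint2name:
            return "&%s;" % codepoint2name[code_point]
        return match.group(0)  # no name for this code point: leave unchanged
    return re.sub(r"&#([0-9]+);", swap, html_text)

sample = "a&nbsp;b &copy; &unknown;"
numeric = name_to_codepoint(sample)
restored = codepoint_to_name(numeric)
print(numeric)
print(restored)
```

The callback form scans the text once regardless of how many entity names exist, and it leaves unrecognized names and unnamed code points as they are.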
Tips:
1. To learn more about HTML entities, see:
2. For more of the Python functions I have collected and summarized, see:
3. Of course, to decode HTML entities you can also follow the material summarized in appendix 1 and use unescape() from HTMLParser (Python 2):
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
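The session above is Python 2. On Python 3 the HTMLParser module was renamed to html.parser, and the standard library offers html.unescape() (available since Python 3.4) for the same job. A short sketch, with illustrative variable names:

```python
import html

# Named, decimal and hexadecimal references all decode to the same character.
decoded_named = html.unescape("&copy; 2010")
decoded_decimal = html.unescape("&#169; 2010")
decoded_hex = html.unescape("&#xa9; 2010")

print(decoded_named == decoded_decimal == decoded_hex)
print(ascii(decoded_named))
```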
If you are interested, go ahead and experiment with it yourself.
【References】
1. Decoding HTML Entities to Text in Python
2. Decode HTML entities in Python string?