【Summary】Decoding HTML entities in Python + converting name entities to code point entities + converting code point entities to name entities

【Decoding HTML entities in Python】

When working in Python, you sometimes need to process HTML code.

HTML code may in turn contain so-called entities.

Broadly speaking, HTML entities fall into two categories:

name entity: an entity referred to by name, written as &xxx;. For example, &copy; stands for the copyright symbol: ©.

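Note: such special characters often cannot be represented in encodings such as GBK. So if you output the Unicode character © to the Windows cmd console (GBK by default), you may only see "漏" instead of '©', which is what the UTF-8 bytes of © look like when they are interpreted as GBK. By contrast, if you encode the Unicode "©" as UTF-8 and write it through logging to a UTF-8 encoded file, it shows up correctly as "©".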

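code point entity: an entity referred to by the character's Unicode value, known as the Unicode code point (also written code point or codepoint). It is written as &#xxx;, where xxx is a number, either decimal or hexadecimal (prefixed with x). For example, &copy; == &#169; == &#xa9; == &#xA9; all refer to the special character '©'.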
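
To make that equivalence concrete, here is a minimal sketch. It assumes Python 3, where the htmlentitydefs module is named html.entities, and the variable names are only illustrative. It does not decode any HTML; it just shows that the name, decimal, and hexadecimal spellings lead to the same code point.

```python
from html.entities import name2codepoint

# Code point behind the name entity &copy;
copy_codepoint = name2codepoint["copy"]

# The same code point written as decimal (&#169;) and as hex (&#xa9; / &#xA9;)
decimal_codepoint = int("169")
hex_lower_codepoint = int("a9", 16)
hex_upper_codepoint = int("A9", 16)

# All spellings resolve to the single character U+00A9
copyright_char = chr(copy_codepoint)
print(copy_codepoint, decimal_codepoint, hex_lower_codepoint, hex_upper_codepoint)
print(copyright_char == "\u00a9")
```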

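To convert HTML entities, whether name entities or code point entities, into the special characters they stand for, I consulted a few references and put together the following function for everyone's convenience: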

import re

#Note: The htmlentitydefs module has been renamed to html.entities in Python 3.0.
# so htmlentitydefs is only available between Python 2.3 and Python 2.7
import htmlentitydefs

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char (and then encode to decodedEncoding encoding char if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; or &#xA9; to unicode '©', then encode to decodedEncoding if decodedEncoding is not empty
    
    Note:
    Some special char can NOT show in some encoding, such as ©  can NOT show in GBK

    Related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    decodedHtml = ""

    #A dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    # 'zwnj': '&#8204;',
    # 'aring': '\xe5',
    # 'gt': '>',
    # 'yen': '\xa5',
    #logging.debug("htmlentitydefs.entitydefs=%s", htmlentitydefs.entitydefs);
    
    #A dictionary that maps HTML entity names to the Unicode codepoints
    # 'aring': 229,
    # 'gt': 62,
    # 'sup': 8835,
    # 'Ntilde': 209,
    #logging.debug("htmlentitydefs.name2codepoint=%s", htmlentitydefs.name2codepoint);
    
    #A dictionary that maps Unicode codepoints to HTML entity names
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',
    #logging.debug("htmlentitydefs.codepoint2name=%s", htmlentitydefs.codepoint2name);

    #http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    def decodeSingleEntity(matched):
        # decode one matched entity; leave it unchanged if it can not be decoded
        try:
            if matched.group("codePointHex"):
                return unichr(int(matched.group("codePointHex"), 16))
            if matched.group("codePointInt"):
                return unichr(int(matched.group("codePointInt")))
            return unichr(htmlentitydefs.name2codepoint[matched.group("entityName")])
        except (KeyError, ValueError, OverflowError):
            # unknown entity name, or number out of range for unichr
            return matched.group(0)

    # decode all three forms in a single pass, so that text produced by one replacement
    # (eg: &amp;#169; -> &#169;) is not decoded a second time
    #logging.info("origHtml=%s", origHtml)
    decodedHtml = re.sub(r'&(?:#[xX](?P<codePointHex>[a-fA-F\d]{1,6})|#(?P<codePointInt>\d{1,7})|(?P<entityName>[a-zA-Z][a-zA-Z\d]{1,9}));', decodeSingleEntity, origHtml)
    #logging.info("decodedHtml=%s", decodedHtml)
    
    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
        #print "after encode into decodedEncoding=%s, decodedHtml=%s"%(decodedEncoding, decodedHtml);
        
    return decodedHtml
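
For readers on Python 3, the same idea can be sketched roughly as follows. This is not the function from crifanLib.py, only an illustration under stated assumptions: Python 3, html.entities in place of htmlentitydefs, chr in place of unichr, and the names ENTITY_PATTERN, decode_one_entity and decode_html_entity are made up for this sketch. It decodes all three entity forms in one pass and leaves anything it does not recognize untouched.

```python
import re
from html.entities import name2codepoint

# Matches &#xHEX; or &#DEC; or &name; with one pattern
ENTITY_PATTERN = re.compile(
    r"&(?:#[xX](?P<hex>[0-9a-fA-F]+)|#(?P<dec>[0-9]+)|(?P<name>[a-zA-Z][a-zA-Z0-9]*));"
)

def decode_one_entity(matched):
    """Return the character for one matched entity, or the entity itself if unknown."""
    try:
        if matched.group("hex"):
            return chr(int(matched.group("hex"), 16))
        if matched.group("dec"):
            return chr(int(matched.group("dec")))
        return chr(name2codepoint[matched.group("name")])
    except (KeyError, ValueError, OverflowError):
        # unknown name, or a number that is not a valid code point
        return matched.group(0)

def decode_html_entity(orig_html, decoded_encoding=""):
    """Decode entities to str; return bytes if decoded_encoding is given."""
    decoded_html = ENTITY_PATTERN.sub(decode_one_entity, orig_html)
    if decoded_encoding:
        # characters missing from the target encoding are dropped silently
        return decoded_html.encode(decoded_encoding, "ignore")
    return decoded_html

print(decode_html_entity("&copy; &#169; &#xa9; &#xA9; &unknownname;"))
print(decode_html_entity("&copy; 2010", "utf-8"))
print(decode_html_entity("&copy; 2010", "gbk"))
```

Because the encode step uses 'ignore', a character that the target encoding cannot represent, such as © in GBK, simply disappears from the result instead of raising an error. If that silent loss is not what you want, 'replace' or 'xmlcharrefreplace' are alternative error handlers worth considering.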

【Converting between name entities and code point entities】

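To convert a name entity into a code point entity, for example from &nbsp; to &#160;,

or to convert a code point entity into a name entity, for example from &#160; to &nbsp;,

you can use the corresponding functions below, which I put together: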

#Note: The htmlentitydefs module has been renamed to html.entities in Python 3.0.
# so htmlentitydefs is only available between Python 2.3 and Python 2.7
import htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into entity code point
    eg: from &nbsp; to &#160; 
    
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """

    # 'aring':  229,
    # 'gt':     62,
    # 'sup':    8835,
    # 'Ntilde': 209,
    
    # "&aring;": "&#229;",
    # "&gt;":    "&#62;",
    # "&sup;":   "&#8835;",
    # "&Ntilde;":"&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint

    #"&aring;" -> "&#229;"
    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict:
        # plain string replacement: no regular expression is needed for a fixed string
        htmlWithCodepoint = htmlWithCodepoint.replace(key, nameToCodepointDict[key])
    return htmlWithCodepoint

#------------------------------------------------------------------------------
def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's entity code point into entity name
    eg: from &#160; to &nbsp;
    
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',
    
    # "&#8704;": "&forall;",
    # "&#8194;": "&ensp;",
    # "&#8195;": "&emsp;",
    # "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName

    #"&#160;" -> "&nbsp;"
    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict:
        # plain string replacement: no regular expression is needed for a fixed string
        htmlWithEntityName = htmlWithEntityName.replace(key, codepointToNameDict[key])
    return htmlWithEntityName
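
As a rough Python 3 counterpart of the two functions above, the conversion can also be done with one regular expression pass and a lookup per match, instead of one replacement per dictionary entry. This is only a sketch under assumptions: Python 3 with html.entities, and the function names entity_name_to_codepoint and entity_codepoint_to_name are invented here for illustration. Entities with no counterpart in the tables are left as they are.

```python
import re
from html.entities import codepoint2name, name2codepoint

def entity_name_to_codepoint(html_text):
    """Rewrite known &name; entities as decimal &#N; entities."""
    def replace(matched):
        name = matched.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return matched.group(0)  # unknown name: keep as-is
    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", replace, html_text)

def entity_codepoint_to_name(html_text):
    """Rewrite decimal &#N; entities as &name; when a name exists."""
    def replace(matched):
        codepoint = int(matched.group(1))
        if codepoint in codepoint2name:
            return "&%s;" % codepoint2name[codepoint]
        return matched.group(0)  # no name for this code point: keep as-is
    return re.sub(r"&#([0-9]+);", replace, html_text)

print(entity_name_to_codepoint("a&nbsp;b &copy; &bogus;"))
print(entity_codepoint_to_name("a&#160;b &#169; &#65;"))
```

Note that this sketch only handles decimal code point entities in the second function, matching the behavior of the original; hexadecimal forms such as &#xa0; would need an extra branch.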

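Tips:

1. To learn more about HTML entities, see:

Latin-1 Entities (http://www.htmlhelp.com/reference/html40/entities/latin1.html)

Special Entities (http://www.htmlhelp.com/reference/html40/entities/special.html)

2. For more of the Python functions I have collected, see:

crifan's Python library: crifanLib.py

3. Alternatively, to decode HTML entities you can follow what is summarized in the reference material below and use unescape() from HTMLParser: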

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
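
The session above is Python 2. Assuming Python 3.4 or later, the standard library offers the same decoding as the function html.unescape (the HTMLParser.unescape method was deprecated and later removed in Python 3.9). A minimal sketch, with illustrative variable names:

```python
import html

# Name, decimal and hexadecimal entities all decode to the same text
decoded_name = html.unescape("&copy; 2010")
decoded_decimal = html.unescape("&#169; 2010")
decoded_hex = html.unescape("&#xa9; 2010")

print(decoded_name)
print(decoded_name == decoded_decimal == decoded_hex)
```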

If you are interested, go ahead and try it out yourself.

【References】

1. Decoding HTML Entities to Text in Python (http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/)

2. Decode HTML entities in Python string?

When reposting, please credit the source: 在路上 » 【Summary】Decoding HTML entities in Python + converting name entities to code point entities + converting code point entities to name entities
