
【Summary】Decoding HTML entities in Python + converting name entities to code point entities + converting code point entities to name entities

Python crifan

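【Decoding HTML entities in Python】

When working in Python, you sometimes need to process HTML code.

HTML code, in turn, sometimes contains so-called entities.

HTML entities fall into two broad categories:

  • name entity: an entity referred to by name, in the form &xxx;. For example, &copy; stands for the copyright symbol: ©.
    • Note: such special characters often cannot be displayed in encodings like GBK. If you try to show the Unicode character © in the Windows cmd console (GBK by default), you only see "漏" (the UTF-8 bytes 0xC2 0xA9 read as GBK) instead of '©'. Correspondingly, if you encode the Unicode "©" as UTF-8 and write it through logging to a UTF-8 encoded file, "©" displays correctly.
  • code point entity: an entity referred to by the Unicode value of the character, known as the Unicode code point (also written code point or codepoint). The form is &#xxx;, where xxx is a number, either decimal or hexadecimal (prefixed with x). For example, &copy; == &#169; == &#xa9; == &#xA9;, all of which refer to the special character '©'.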
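As a quick check of the equivalence described above, the sketch below (Python 3, where htmlentitydefs is named html.entities) shows that the name form and both numeric forms resolve to the same code point, and reproduces the GBK limitation from the note. The variable names are illustrative only, and the GBK behaviour shown is that of Python's built-in gbk codec, which may differ in detail from what a particular console does.

```python
from html.entities import name2codepoint  # Python 3 name of htmlentitydefs

# The name entity and both numeric forms point at the same code point.
name_value = name2codepoint["copy"]   # from &copy;
decimal_value = int("169")            # from &#169;
hex_value = int("A9", 16)             # from &#xA9;
copyright_char = chr(name_value)

# GBK has no slot for U+00A9, so a strict encode fails.
try:
    copyright_char.encode("gbk")
    gbk_can_encode = True
except UnicodeEncodeError:
    gbk_can_encode = False

# UTF-8 stores the character as the two bytes C2 A9 ...
utf8_bytes = copyright_char.encode("utf-8")
# ... and reading those two bytes as GBK yields an unrelated Chinese character.
misread_as_gbk = utf8_bytes.decode("gbk")

print(name_value, decimal_value, hex_value, gbk_can_encode, misread_as_gbk)
```

This is why the decoding function that follows takes an optional target encoding and encodes with 'ignore': characters that the target encoding cannot represent are dropped rather than raising an error.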

To convert HTML entities, whether name entities or code point entities, into the characters they stand for, I consulted a few references and put together the function below for everyone to use:

import re

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs itself is only available between Python 2.3 and Python 2.7
try:
    import htmlentitydefs
except ImportError:
    import html.entities as htmlentitydefs

try:
    unichr
except NameError:
    unichr = chr  # Python 3: chr() already returns a unicode char

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char (and then encode to decodedEncoding encoding char if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; or &#xA9; to unicode '©', then encode to decodedEncoding if decodedEncoding is not empty

    Note:
    Some special char can NOT show in some encoding, such as © can NOT show in GBK

    Related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # A dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    # 'zwnj': '&#8204;',
    # 'aring': '\xe5',
    # 'gt': '>',
    # 'yen': '\xa5',
    #logging.debug("htmlentitydefs.entitydefs=%s", htmlentitydefs.entitydefs)

    # A dictionary that maps HTML entity names to the Unicode codepoints
    # 'aring': 229,
    # 'gt': 62,
    # 'sup': 8835,
    # 'Ntilde': 209,
    #logging.debug("htmlentitydefs.name2codepoint=%s", htmlentitydefs.name2codepoint)

    # A dictionary that maps Unicode codepoints to HTML entity names
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',
    #logging.debug("htmlentitydefs.codepoint2name=%s", htmlentitydefs.codepoint2name)

    #http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    # Decode all three forms in ONE pass, so that text produced by one replacement
    # (eg: the '&' from '&amp;') is never decoded a second time.
    # Names may contain digits (eg: sup2, frac12) and code points may have any number of digits.
    entityPattern = r'&(?:#(?P<codePointInt>\d+)|#[xX](?P<codePointHex>[a-fA-F\d]+)|(?P<entityName>[a-zA-Z][a-zA-Z\d]*));'

    def decodeMatched(matched):
        try:
            if matched.group("codePointInt"):
                return unichr(int(matched.group("codePointInt")))
            if matched.group("codePointHex"):
                return unichr(int(matched.group("codePointHex"), 16))
            return unichr(htmlentitydefs.name2codepoint[matched.group("entityName")])
        except (KeyError, ValueError, OverflowError):
            # unknown entity name or invalid code point: keep the original text
            return matched.group(0)

    #logging.info("origHtml=%s", origHtml)
    decodedHtml = re.sub(entityPattern, decodeMatched, origHtml)
    #logging.info("decodedHtml=%s", decodedHtml)

    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
        #print "after encode into decodedEncoding=%s, decodedHtml=%s"%(decodedEncoding, decodedHtml)

    return decodedHtml
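One detail worth illustrating is pass order. When names and numeric references are decoded in separate passes, text that was escaped on purpose can be decoded twice: &amp;#169; is the HTML source for the literal text "&#169;", yet two passes turn it into ©. The sketch below, written for Python 3 with hypothetical helper names that are not part of the original code, contrasts that with a single combined pass. It only handles names and decimal references, to keep the comparison short.

```python
import re
from html.entities import name2codepoint


def decode_in_two_passes(text):
    """Decode names first, then decimal references (hypothetical helper)."""
    def replace_name(m):
        name = m.group(1)
        return chr(name2codepoint[name]) if name in name2codepoint else m.group(0)

    text = re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", replace_name, text)
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), text)


def decode_in_one_pass(text):
    """Decode names and decimal references in one scan (hypothetical helper)."""
    def replace(m):
        if m.group("dec"):
            return chr(int(m.group("dec")))
        name = m.group("name")
        return chr(name2codepoint[name]) if name in name2codepoint else m.group(0)

    pattern = r"&(?:#(?P<dec>\d+)|(?P<name>[a-zA-Z][a-zA-Z0-9]*));"
    return re.sub(pattern, replace, text)


escaped = "&amp;#169;"  # HTML source for the literal text "&#169;"
two_pass_result = decode_in_two_passes(escaped)
one_pass_result = decode_in_one_pass(escaped)
print(repr(two_pass_result), repr(one_pass_result))
```

re.sub() never rescans text it has just inserted, so the single-pass version leaves the "&" produced from &amp; alone.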

 

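【Converting between name entities and code point entities】

To convert a name entity into a code point entity, for example from &nbsp; to &#160;

or to convert a code point entity into a name entity, for example from &#160; to &nbsp;

you can use the corresponding functions below, which I put together: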

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs itself is only available between Python 2.3 and Python 2.7
try:
    import htmlentitydefs
except ImportError:
    import html.entities as htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into entity code point
    eg: from &nbsp; to &#160;

    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """

    # 'aring':  229,
    # 'gt':     62,
    # 'sup':    8835,
    # 'Ntilde': 209,

    # "&aring;":  "&#229;",
    # "&gt;":     "&#62;",
    # "&sup;":    "&#8835;",
    # "&Ntilde;": "&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint

    #"&aring;" -> "&#229;"
    # the keys are literal strings, so a plain replace() is enough (no regex needed)
    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict:
        htmlWithCodepoint = htmlWithCodepoint.replace(key, nameToCodepointDict[key])
    return htmlWithCodepoint

#------------------------------------------------------------------------------
def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's entity code point (decimal form only) into entity name
    eg: from &#160; to &nbsp;

    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',

    # "&#8704;": "&forall;",
    # "&#8194;": "&ensp;",
    # "&#8195;": "&emsp;",
    # "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName

    #"&#160;" -> "&nbsp;"
    # the keys are literal strings, so a plain replace() is enough (no regex needed)
    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict:
        htmlWithEntityName = htmlWithEntityName.replace(key, codepointToNameDict[key])
    return htmlWithEntityName
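The two functions above only recognise the decimal form of a code point entity. As a possible variation, the Python 3 sketch below does both conversions with one regular expression pass each and also accepts the hexadecimal form (&#xA0;). The function names name_to_codepoint and codepoint_to_name are hypothetical, chosen for this illustration, and entities without a known counterpart are left untouched.

```python
import re
from html.entities import codepoint2name, name2codepoint


def name_to_codepoint(html_text):
    """Rewrite known name entities as decimal code point entities."""
    def replace(m):
        name = m.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return m.group(0)  # unknown name: leave untouched

    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", replace, html_text)


def codepoint_to_name(html_text):
    """Rewrite decimal or hex code point entities as name entities when a name exists."""
    def replace(m):
        digits = m.group(1)
        codepoint = int(digits[1:], 16) if digits[0] in "xX" else int(digits)
        if codepoint in codepoint2name:
            return "&" + codepoint2name[codepoint] + ";"
        return m.group(0)  # no name defined for this code point

    return re.sub(r"&#([xX][0-9a-fA-F]+|\d+);", replace, html_text)


as_codepoints = name_to_codepoint("a&nbsp;b &copy; &bogus;")
as_names = codepoint_to_name("&#160;&#xA9;&#65;")
print(as_codepoints)
print(as_names)
```

Here &bogus; and &#65; pass through unchanged, since html.entities defines no such name and no name for the letter A.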

 

Tips:

1. To learn more about HTML entities, see:

Latin-1 Entities

Special Entities

2. For more of the Python functions I have collected, see:

crifan's Python library: crifanLib.py

3. Of course, to decode HTML entities you can also use unescape() from HTMLParser, as summarized in reference 1:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

If you are interested, experiment with it yourself.
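The session above is Python 2. In Python 3 the HTMLParser module became html.parser, and the standard library has offered a module-level html.unescape() function since Python 3.4 (the unescape() method on the parser object was removed in Python 3.9). A minimal sketch of the same experiment, with illustrative variable names:

```python
import html

# Name, decimal and hexadecimal forms all decode to the same character.
decoded_name = html.unescape("&copy; 2010")
decoded_decimal = html.unescape("&#169; 2010")
decoded_hex = html.unescape("&#xA9; 2010")

print(repr(decoded_name), repr(decoded_decimal), repr(decoded_hex))
```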

 

【References】

1. Decoding HTML Entities to Text in Python

2. Decode HTML entities in Python string?
