【Problem】
In C#, I want to strip the HTML tags from a string and remove the HTML comments at the same time.
【Solution Process】
1. Reference:
How can I strip HTML tags from a string in ASP.NET?
So I tried:
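public string htmlRemoveTag(string html)
{
    // LoadHtml throws on a null string, so guard the input
    // (checking htmlDoc for null right after "new" can never trigger)
    if (html == null)
    {
        return "";
    }

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);

    // concatenate the text of every top-level node
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}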
The result: all the tags are removed.
However, the HTML comments:
<!-- A+ Content Begins Here --> <!-- BRAND LOGO --> <!-- TITLE --> Frigidaire Mini Air Conditioner <!-- GENERAL DESCRIPTION --> Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet. It quickly cools a room on hot days and quie...
are still in the output.
2. Next, remove the comments as well.
Use:
public string htmlRemoveTag(string html)
{
    // LoadHtml throws on a null string, so guard the input
    if (html == null)
    {
        return "";
    }

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);

    // 1. remove all comments
    // (1) get all comment nodes using XPath;
    //     SelectNodes returns null (not an empty collection) when nothing matches
    var commentNodes = htmlDoc.DocumentNode.SelectNodes("//comment()");
    if (commentNodes != null)
    {
        foreach (HtmlAgilityPack.HtmlNode comment in commentNodes)
        {
            // (2) remove the comment node itself
            comment.ParentNode.RemoveChild(comment);
        }
    }

    // 2. get all content
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}
This achieves the goal. The result is the text content of the HTML, with no tags and no comments:
Frigidaire Mini Air Conditioner Frigidaire’s FRA052XT7 5,000 BTU 115-Volt Window-Mounted Mini-Compact Air Conditioner is perfect for rooms up to 150 square feet. It quickly cools a room on hot days and quiet operation keeps you cool without keeping you awake. This unit features mechanical rotary controls and top, full-width, 2-way air direction control. The antimicrobial mesh filter with side, slide-out access cleans the air removing harmful bacteria. Low voltage start-up conserves energy and saves you money ...
【Summary】
To strip the HTML tags and also drop the comments, you can use:
using HtmlAgilityPack;

public string htmlRemoveTag(string html)
{
    // LoadHtml throws on a null string, so guard the input
    if (html == null)
    {
        return "";
    }

    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(html);

    // 1. remove all comments
    // (1) get all comment nodes using XPath;
    //     SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection commentNodes = htmlDoc.DocumentNode.SelectNodes("//comment()");
    if (commentNodes != null)
    {
        foreach (HtmlNode comment in commentNodes)
        {
            // (2) remove the comment node itself
            comment.ParentNode.RemoveChild(comment);
        }
    }

    // 2. get all content
    string filteredHtml = "";
    foreach (var node in htmlDoc.DocumentNode.ChildNodes)
    {
        filteredHtml += node.InnerText;
    }
    return filteredHtml;
}
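As a rough usage sketch, the same approach can be wrapped in a small console program. The names HtmlCleaner, StripTagsAndComments, and Program, as well as the two sample strings, are made up for illustration and are not from the original post. The sketch assumes the HtmlAgilityPack NuGet package is referenced. It also checks the result of SelectNodes for null, since that method returns null rather than an empty collection when the input contains no comments.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

// Hypothetical helper class; the name is illustrative only.
public static class HtmlCleaner
{
    public static string StripTagsAndComments(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        var htmlDoc = new HtmlAgilityPack.HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection comments = htmlDoc.DocumentNode.SelectNodes("//comment()");
        if (comments != null)
        {
            foreach (HtmlNode comment in comments)
            {
                comment.ParentNode.RemoveChild(comment);
            }
        }

        // Collect the text of every remaining top-level node.
        var text = new StringBuilder();
        foreach (HtmlNode node in htmlDoc.DocumentNode.ChildNodes)
        {
            text.Append(node.InnerText);
        }
        return text.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        // Sample inputs invented for this sketch.
        string withComments = "<!-- TITLE --><p>Frigidaire Mini Air Conditioner</p>";
        string withoutComments = "<b>No comments here</b>";

        Console.WriteLine(HtmlCleaner.StripTagsAndComments(withComments));
        Console.WriteLine(HtmlCleaner.StripTagsAndComments(withoutComments));
    }
}
```

The StringBuilder is only a design choice to avoid repeated string concatenation on large pages; the `+=` loop behaves the same for small inputs.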
Please credit when reposting: 在路上 » [Solved] C#: Strip HTML Tags and Remove Comments at the Same Time