【Problem】
In C#, I was using HtmlAgilityPack to parse:
http://www.amazon.com/Kindle-Fire-HD/dp/B0083PWAPW/ref=lp_1055398_1_2?ie=UTF8&qid=1369721900&sr=1-2
Specifically, this text in the page's HTML:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)
The corresponding source turned out to be:
<span>World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (<a href="#" id="kpp-popover-0">compared to the iPad mini</a><script type="text/javascript"> + ‘<img src="http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif"/>’ }); </script>)</span>
After parsing it with HtmlAgilityPack, however, the InnerText of that node came out as:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini\namznJQ.available(‘jQuery’, function() { \n(function ($) {\namznJQ.available(‘popover’, function() {\n\tvar content = ‘<h2 style=\"font-size: 17px;\">Two Antennas, Better Bandwidth</h2>’ \n\n\t+ ‘<img src=\"http://g-ec2.images-amazon.com/images/G/01/kindle/dp/2012/KT/tate_feature-wifi._V395653267_.gif\"/>’\n\t\n\t$(‘#kpp-popover-0’).amazonPopoverTrigger({\n\t\tliteralContent: content,\n\t\tcloseText: ‘Close’,\n\t\ttitle: ‘ ’,\n\t\twidth: 550,\n\t\tlocation: ‘centered’\n\t});\n\n});\n}(jQuery)); \n}); \n\n)
rather than the expected:
World’s first tablet with dual-band, dual-antenna Wi-Fi for over 35% faster downloads and streaming (compared to the iPad mini)
In other words, the JavaScript needs to be stripped out of InnerText.
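The behavior can be looked at in isolation with a small console program. This is only a sketch: the markup is a shortened, made-up stand-in for the Amazon bullet above, and the class and method names (InnerTextDemo, GetSpanInnerText) are hypothetical. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

static class InnerTextDemo
{
    // Loads the given HTML and returns the InnerText of the first <span>.
    public static string GetSpanInnerText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        return span.InnerText;
    }

    static void Main()
    {
        // Simplified stand-in markup (an assumption, not the real page source).
        string html =
            "<span>faster downloads (<a href=\"#\">compared to the iPad mini</a>" +
            "<script type=\"text/javascript\">var content = 'popover';</script>)</span>";

        // Print the text so the <script> body, if included, is visible.
        Console.WriteLine(GetSpanInnerText(html));
    }
}
```

Running it on the real bullet node is what produced the long InnerText shown above.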
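【Solution process】
1. I went back to a post I had read before:
An apology to HtmlAgilityPack: you really are the better tool for parsing HTML
and the related question:
C#: HtmlAgilityPack extract inner text
Then, after debugging for quite a while, I ended up with: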
//remove sub node from current html node
//eg: pass "script" to remove <script type="text/javascript"> children
//note: the XPath "script" only matches direct children of curHtmlNode,
//and the node is modified in place (the return value is the same instance)
public HtmlNode removeSubHtmlNode(HtmlNode curHtmlNode, string subNodeToRemove)
{
    HtmlNode afterRemoved = curHtmlNode;

    //SelectNodes returns null when nothing matches
    HtmlNodeCollection foundAllSub = curHtmlNode.SelectNodes(subNodeToRemove);
    if ((foundAllSub != null) && (foundAllSub.Count > 0))
    {
        //foundAllSub is a separate collection, so removing children here is safe
        foreach (HtmlNode subNode in foundAllSub)
        {
            curHtmlNode.RemoveChild(subNode);
        }
    }

    //foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
    //{
    //    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //    //Additional information: Collection was modified; enumeration operation may not execute.
    //    afterRemoved.RemoveChild(subNode);
    //    curHtmlNode.RemoveChild(subNode);
    //    //subNode.Remove();
    //}

    return afterRemoved;
}

//usage: both calls modify curBulletNode in place
HtmlNode curBulletNode = allBulletNodeList[idx];
HtmlNode noJsNode = crl.removeSubHtmlNode(curBulletNode, "script");
HtmlNode noStyleNode = crl.removeSubHtmlNode(curBulletNode, "style");
string bulletStr = noStyleNode.InnerText;
which solved the problem.
Two things are worth noting here:
1. In the example given in that answer, using
htmlDoc.DocumentNode.Descendants("script")
to find the child nodes, and then
script.Remove();
to delete them, works.
2. But here, doing something similar on the current HTML node:
foreach (var subNode in afterRemoved.Descendants(subNodeToRemove))
{
    //An unhandled exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
    //Additional information: Collection was modified; enumeration operation may not execute.
    afterRemoved.RemoveChild(subNode);
    curHtmlNode.RemoveChild(subNode);
    //subNode.Remove();
}
produces the error noted in the comments:
Additional information: Collection was modified; enumeration operation may not execute.
That is, removing items from a collection while it is being enumerated is not allowed.
That is why I had to find another way to get the same remove effect.
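Another common way around this exception is to copy the matches into a list before removing anything, so the loop walks the copy instead of the live node tree. The following is a minimal sketch of that idea, not the code used in this post: ScriptStripper and InnerTextWithout are hypothetical names, and the sample markup is made up. It uses Descendants plus Remove(), so it also catches tags nested deeper than direct children.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class ScriptStripper
{
    // Removes every descendant with one of the given tag names from 'node'
    // (modifying it in place) and returns the remaining InnerText.
    public static string InnerTextWithout(HtmlNode node, params string[] tagNames)
    {
        foreach (string tagName in tagNames)
        {
            // ToList() takes a snapshot, so Remove() does not disturb
            // the collection that the foreach is enumerating.
            foreach (HtmlNode match in node.Descendants(tagName).ToList())
            {
                match.Remove();
            }
        }
        return node.InnerText;
    }

    static void Main()
    {
        // Simplified stand-in markup (an assumption, not the real page source).
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<span>faster downloads (<a href=\"#\">compared to the iPad mini</a>" +
            "<script type=\"text/javascript\">var content = 'popover';</script>)</span>");

        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        Console.WriteLine(InnerTextWithout(span, "script", "style"));
    }
}
```

The trade-off is one extra list allocation per tag name, which is negligible for a handful of nodes like the bullet items here.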
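【Summary】
Getting this kind of removal to work was utterly exhausting...
Removing child nodes under the root node is easy;
removing nodes under some particular current node is hard. (Later, while debugging, I found that subNode.Remove(); had in fact already removed the node successfully; the foreach loop then tried to keep going, and that is what raised the error.)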
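The same pattern (the removal succeeds, then the loop fails on its next step) can be reproduced without HtmlAgilityPack at all. The sketch below uses a plain List<string> as an analogy; it says nothing about HtmlAgilityPack's internals, and EnumerationDemo and RemoveDuringForeach are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class EnumerationDemo
{
    // Removes items inside a foreach over the same list and reports what happened.
    public static string RemoveDuringForeach(List<string> items)
    {
        try
        {
            foreach (string item in items)
            {
                // This call succeeds, but it invalidates the enumerator,
                // so the next loop step throws InvalidOperationException.
                items.Remove(item);
            }
            return "completed";
        }
        catch (InvalidOperationException)
        {
            return "threw with " + items.Count + " item(s) left";
        }
    }

    static void Main()
    {
        var tags = new List<string> { "script", "style", "span" };
        Console.WriteLine(RemoveDuringForeach(tags)); // prints "threw with 2 item(s) left"
    }
}
```

One item is gone by the time the exception is raised, which matches what the debugger showed for subNode.Remove().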
Please credit the source when reposting: 在路上 » [Solved] When parsing HTML with HtmlAgilityPack in C#, InnerText contains JavaScript that needs to be removed