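Working on:
[Unsolved] Using an HTML parsing library in PHP to parse and process Evernote HTML source
Along the way, I switched to trying a different library: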
PHP Simple HTML DOM Parser – Browse Files at SourceForge.net
PHP Simple HTML DOM Parser – Browse /simplehtmldom/1.9 at SourceForge.net
The downloaded package includes examples,
so I tried them out.
Along the way I referred to the documentation:
Attribute Filters
[attribute*=value]
Matches elements that have the specified attribute and it contains a certain value.
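To make the "attribute contains value" idea concrete, here is a minimal sketch of the same kind of match. It deliberately does not use simple_html_dom: it swaps in PHP's bundled DOM extension (DOMDocument plus DOMXPath), where contains(@style, "...") plays the role of [style*=...]. The helper name findCodeBlocks and the sample markup are assumptions made up for illustration only.

```php
<?php
// Sketch only: built-in DOM extension instead of simple_html_dom.
// findCodeBlocks() is a hypothetical helper, not part of any library.

function findCodeBlocks(string $htmlFragment): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The XML encoding prefix is a common way to make loadHTML treat input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $htmlFragment);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $matches = [];
    // contains(@style, ...) is a substring match on the attribute value,
    // comparable to the [style*="en-codeblock"] selector.
    foreach ($xpath->query('//div[contains(@style, "en-codeblock")]') as $node) {
        $matches[] = $node;
    }
    return $matches;
}

// Made-up sample: two divs carry the marker in their style, two do not.
$sample = '<div>intro</div>'
    . '<div style="padding: 8px;-en-codeblock:true;"><div>line 1</div></div>'
    . '<div style="color: red;">not code</div>'
    . '<div style="-en-codeblock:true;"><div>line 2</div></div>';

$blocks = findCodeBlocks($sample);
echo count($blocks), "\n";
```

Because the match is a plain substring test, it also hits when the marker sits at the very end of a long style value, as it does in Evernote code blocks.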
I also referred to:
Extract contents from
echo file_get_html('http://www.google.com/')->plaintext;
Using this code:
<?php
include_once('./simple_html_dom.php');

// Evernote note body used as test input; it contains two en-codeblock divs.
$originEvernoteHtml = '<div><br /></div><div>此处包含要测试的内容,包括code代码:</div>'
    // First code block: every line is wrapped as <div><span style=...>
    . '<div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902);-en-codeblock:true;">'
    . '<div><span style="font-size: 12px; font-family: Monaco;">some code include</span></div>'
    . '<div><span style="font-size: 12px; font-family: Monaco;">little &lt;</span></div>'
    . '<div><span style="font-size: 12px; font-family: Monaco;">greater &gt;</span></div>'
    . '<div><span style="font-size: 12px; font-family: Monaco;">at &amp;</span></div>'
    . '<div><span style="font-size: 12px; font-family: Monaco;">和其他字符</span></div></div>'
    . '<div>希望同步后,不要:</div><div>有多余的code</div><div>html字符不要被转义</div><div><br /></div>'
    . '<div>另外再去看看,之前出bug的代码</div><div>好像是中间包含多个空行?的代码</div>'
    // Second code block: plain and nested divs, with several empty lines
    . '<div style="box-sizing: border-box; padding: 8px; font-family: Monaco, Menlo, Consolas, &quot;Courier New&quot;, monospace; font-size: 12px; color: rgb(51, 51, 51); border-top-left-radius: 4px; border-top-right-radius: 4px; border-bottom-right-radius: 4px; border-bottom-left-radius: 4px; background-color: rgb(251, 250, 248); border: 1px solid rgba(0, 0, 0, 0.14902);-en-codeblock:true;">'
    . '<div># Author: Crifan Li</div><div># Function: Batch make for all gitbooks</div><div># Version: 20190716</div><div>#</div><div># [Note]</div>'
    . '<div># 1. this makefile should be located in</div><div># /Users/crifan/dev/dev_root/gitbook/gitbook_src_root/common</div>'
    . '<div><div><br /></div><div><br /></div></div>'
    . '<div><div>SUB_BOOKS=$(shell ls ../books)</div><div><br /></div></div>'
    . '<div><div>BOOKS_SRC_ROOT=$(shell cd ../books &amp;&amp; pwd)</div><div><br /></div></div>'
    . '<div><div><br /></div><div><br /></div></div>'
    . '<div># Batch make for all gitbooks</div>'
    . '<div><div>help debug_dir init sync_content clean_all website pdf epub mobi all upload commit deploy:</div><div><br /></div></div>'
    . '<div> @echo "Current path="`pwd`;</div><div> @echo "LS_OUTPUT="$(SUB_BOOKS);</div><div> @echo "BOOKS_SRC_ROOT="$(BOOKS_SRC_ROOT);</div>'
    . '<div><div> @for each_item in $(SUB_BOOKS); \</div><div><br /></div></div>'
    . '<div><div> do \</div><div><br /></div></div>'
    . '<div><div> if [ -d $(BOOKS_SRC_ROOT)/$$each_item ]; then \</div><div><br /></div></div>'
    . '<div><div> cd $(BOOKS_SRC_ROOT)/$$each_item; \</div><div><br /></div></div>'
    . '<div><div> echo `pwd`; \</div><div><br /></div></div>'
    . '<div><div> if [ -f Makefile ]; then \</div><div><br /></div></div>'
    . '<div><div> make $@ || exit "$$?"; \</div><div><br /></div></div>'
    . '<div><div> fi; \</div><div><br /></div></div>'
    . '<div><div> cd ..; \</div><div><br /></div></div>'
    . '<div><div> fi; \</div><div><br /></div></div>'
    . '<div> done;</div></div>'
    . '<div>看看效果</div><div><br /></div>';

// $originEvernoteHtml = "<div>" . $originEvernoteHtml . "</div>";
// $originEvernoteHtml = "<html><head><title>parse evernote html</title></head><body>" . $originEvernoteHtml . "</body></html>";

$html = str_get_html($originEvernoteHtml);
// print $html;

// Select every div whose style attribute contains "en-codeblock"
$codeBlockList = $html->find('div[style*="en-codeblock"]');
foreach ($codeBlockList as $codeBlockHtml) {
    // print $codeBlockHtml;
    $codeBlockStr = $codeBlockHtml->save();
    print $codeBlockStr;
}
?>
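It really does find both code blocks:
The output rendered in the web page:
shows 2 code snippets.
Nice.
So I continued debugging:
printing the HTML as a string,
and in particular replacing the code block div with pre,
and removing the nested divs inside that div.
The result: for code in the form
<div><span style, the line breaks are lost:
whereas the code block in the div further down does keep its line breaks:
When I have time, I will check other code blocks to see whether line breaks are kept in the normal case,
and how to make sure the div with span style also keeps its line breaks.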
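One possible direction for the div-to-pre step, as a rough sketch rather than a tested fix for the note above. It again swaps in PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom. The assumption is that each innermost div inside a code block (a div with no div descendants) represents exactly one line, whether its text sits directly in the div or inside a span, and that a div holding only <br /> is an empty line. The function name codeBlocksToPre and the wrapper id are made up for illustration.

```php
<?php
// Sketch only: rebuild each en-codeblock div as a <pre> with explicit newlines.
// codeBlocksToPre() is a hypothetical helper, not a simple_html_dom function.

function codeBlocksToPre(string $htmlFragment): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment so its top-level nodes can be serialized again later.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrapper">' . $htmlFragment . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy matches into an array first, because the tree is modified in the loop.
    $blocks = iterator_to_array($xpath->query('//div[contains(@style, "en-codeblock")]'));

    foreach ($blocks as $block) {
        $lines = [];
        // Innermost divs only: one per line, in document order.
        // Limitation: text placed directly in a div that also has child divs is ignored.
        foreach ($xpath->query('.//div[not(.//div)]', $block) as $leaf) {
            $lines[] = $leaf->textContent; // span wrappers drop out, <br /> gives ''
        }
        $pre = $doc->createElement('pre');
        // A text node lets the serializer re-escape <, > and & on output.
        $pre->appendChild($doc->createTextNode(implode("\n", $lines)));
        $block->parentNode->replaceChild($pre, $block);
    }

    // Serialize only what was inside the wrapper.
    $wrapper = $xpath->query('//div[@id="wrapper"]')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Made-up sample mixing the span-wrapped form, an empty line and a plain div.
$sample = '<div style="padding: 8px;-en-codeblock:true;">'
    . '<div><span style="font-family: Monaco;">little &lt;</span></div>'
    . '<div><span style="font-family: Monaco;">at &amp;</span></div>'
    . '<div><br /></div>'
    . '<div>end</div>'
    . '</div>';

$result = codeBlocksToPre($sample);
echo $result, "\n";
```

Joining the lines with "\n" inside a pre means line breaks no longer depend on how the browser lays out each div or span, which is the part that behaved differently between the two code blocks.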