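【Background】
While working on:
[Unresolved] Export table data from a copy-protected PDF and convert it to XML
I decided to try PDFMiner on the PDF, which is encrypted and does not allow copying, to see whether it could be converted to text or HTML.
【Process】
1. Find the home page.
Go to:
https://pypi.python.org/pypi/pdfminer/
and download the package.
2. After extracting it, install: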
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pdfminer
copying pdfminer\arcfour.py -> build\lib\pdfminer
copying pdfminer\ascii85.py -> build\lib\pdfminer
copying pdfminer\ccitt.py -> build\lib\pdfminer
copying pdfminer\cmapdb.py -> build\lib\pdfminer
copying pdfminer\converter.py -> build\lib\pdfminer
copying pdfminer\encodingdb.py -> build\lib\pdfminer
copying pdfminer\fontmetrics.py -> build\lib\pdfminer
copying pdfminer\glyphlist.py -> build\lib\pdfminer
copying pdfminer\image.py -> build\lib\pdfminer
copying pdfminer\latin_enc.py -> build\lib\pdfminer
copying pdfminer\layout.py -> build\lib\pdfminer
copying pdfminer\lzw.py -> build\lib\pdfminer
copying pdfminer\pdfcolor.py -> build\lib\pdfminer
copying pdfminer\pdfdevice.py -> build\lib\pdfminer
copying pdfminer\pdfdocument.py -> build\lib\pdfminer
copying pdfminer\pdffont.py -> build\lib\pdfminer
copying pdfminer\pdfinterp.py -> build\lib\pdfminer
copying pdfminer\pdfpage.py -> build\lib\pdfminer
copying pdfminer\pdfparser.py -> build\lib\pdfminer
copying pdfminer\pdftypes.py -> build\lib\pdfminer
copying pdfminer\psparser.py -> build\lib\pdfminer
copying pdfminer\rijndael.py -> build\lib\pdfminer
copying pdfminer\runlength.py -> build\lib\pdfminer
copying pdfminer\utils.py -> build\lib\pdfminer
copying pdfminer\__init__.py -> build\lib\pdfminer
running build_scripts
creating build\scripts-2.7
copying and adjusting tools\pdf2txt.py -> build\scripts-2.7
copying and adjusting tools\dumppdf.py -> build\scripts-2.7
copying and adjusting tools\latin2ascii.py -> build\scripts-2.7
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\arcfour.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ascii85.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ccitt.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\cmapdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\converter.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\encodingdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\fontmetrics.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\glyphlist.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\image.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\latin_enc.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\layout.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\lzw.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfcolor.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdevice.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdocument.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdffont.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfinterp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfpage.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdftypes.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\psparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\rijndael.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\runlength.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\arcfour.py to arcfour.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ascii85.py to ascii85.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ccitt.py to ccitt.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\cmapdb.py to cmapdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\converter.py to converter.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\encodingdb.py to encodingdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\fontmetrics.py to fontmetrics.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\glyphlist.py to glyphlist.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\image.py to image.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\latin_enc.py to latin_enc.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\layout.py to layout.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\lzw.py to lzw.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfcolor.py to pdfcolor.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdevice.py to pdfdevice.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdocument.py to pdfdocument.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdffont.py to pdffont.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfinterp.py to pdfinterp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfpage.py to pdfpage.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfparser.py to pdfparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdftypes.py to pdftypes.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\psparser.py to psparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\rijndael.py to rijndael.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\runlength.py to runlength.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\__init__.py to __init__.pyc
running install_scripts
copying build\scripts-2.7\dumppdf.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\latin2ascii.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\pdf2txt.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer-20131113-py2.7.egg-info

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>
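A quick way to confirm that the install actually landed in site-packages is to ask Python whether the package can be located. This is only a minimal sketch: it assumes Python 3 (the log above is from Python 2.7), and `is_installed` is a hypothetical helper name, not part of PDFMiner.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "pdfminer" is the package directory created by the install log above.
    print("pdfminer importable:", is_installed("pdfminer"))
```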
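With the install done, the next step was to try it out.
In:
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools
I found:
pdf2txt.py
and ran it against the PDF.
3. Surprisingly, it failed with an error: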
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>ls
spec183r21.0.pdf  xml

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd -
The system cannot find the path specified.

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>ls
Makefile  PKG-INFO  build  cmaprsrc  docs  pdfminer  samples  setup.py  tools

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>cd tools

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>ls
Makefile  conv_afm.py  conv_cmap.py  conv_glyphlist.py  dumppdf.py  latin2ascii.py  pdf2html.cgi  pdf2txt.py  prof.py  runapp.py

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>
4. Adding the -t option did not help either:
D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -t html -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range
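The two attempts above can also be scripted, which makes it easier to retry with different output types and to capture the traceback instead of reading it off the console. The following is a rough sketch, not a tested recipe: it assumes Python 3, uses only the -t and -o flags that appear in the commands above, and the helper names `build_pdf2txt_cmd` and `run_pdf2txt` are made up for illustration.

```python
import subprocess
import sys


def build_pdf2txt_cmd(script, pdf_path, out_path, out_type="html"):
    """Build the argument list for pdf2txt.py using the -t and -o flags."""
    # Each path is a separate list element, so spaces in a path need no quoting.
    return [sys.executable, script, "-t", out_type, "-o", out_path, pdf_path]


def run_pdf2txt(script, pdf_path, out_path, out_type="html"):
    """Run the conversion and return (succeeded, stderr_text) without raising."""
    proc = subprocess.run(
        build_pdf2txt_cmd(script, pdf_path, out_path, out_type),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode == 0, proc.stderr


if __name__ == "__main__":
    # Hypothetical relative paths; substitute the real locations.
    ok, err = run_pdf2txt("pdf2txt.py", "spec183r21.0.pdf", "hart183.html")
    print("converted" if ok else "failed:\n" + err)
```

With the PDF from this post, `run_pdf2txt` would presumably return a failure together with the same IndexError traceback shown above.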
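5. I then looked into whether PDFMiner could decrypt the PDF, but I could not find any options for that...
【Summary】
In the end I gave up on PDFMiner. For now, because of a bug in the program, it cannot be used to convert this PDF to HTML or text.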
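Before hunting for decryption options, it can help to confirm that the file really declares encryption. In the PDF format an encrypted document carries an /Encrypt entry in its trailer dictionary, so a raw byte search gives a crude hint. This is a heuristic sketch only (Python 3 assumed, helper names hypothetical): the literal could in principle appear elsewhere in uncompressed content, and it says nothing about which permissions are restricted.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: report whether the raw bytes mention an /Encrypt entry."""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path):
    """Read a PDF from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())


if __name__ == "__main__":
    # Hypothetical relative path; substitute the real location.
    print(file_looks_encrypted("spec183r21.0.pdf"))
```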