
【Record】Trying PDFMiner to convert a non-copyable PDF to text or HTML

Category: Work and Technology. Author: crifan.

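【Background】

While working on:

【Unresolved】Exporting table data from a copy-protected PDF and converting it to XML

I decided to try PDFMiner on the PDF, which is encrypted and does not allow copying, to see whether it could be converted to text or HTML.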

 

【Process】

1. Find the home page:

PDFMiner

Go to:

https://pypi.python.org/pypi/pdfminer/

and download:

pdfminer-20131113.tar.gz

2. Extract it and install:

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pdfminer
copying pdfminer\arcfour.py -> build\lib\pdfminer
copying pdfminer\ascii85.py -> build\lib\pdfminer
copying pdfminer\ccitt.py -> build\lib\pdfminer
copying pdfminer\cmapdb.py -> build\lib\pdfminer
copying pdfminer\converter.py -> build\lib\pdfminer
copying pdfminer\encodingdb.py -> build\lib\pdfminer
copying pdfminer\fontmetrics.py -> build\lib\pdfminer
copying pdfminer\glyphlist.py -> build\lib\pdfminer
copying pdfminer\image.py -> build\lib\pdfminer
copying pdfminer\latin_enc.py -> build\lib\pdfminer
copying pdfminer\layout.py -> build\lib\pdfminer
copying pdfminer\lzw.py -> build\lib\pdfminer
copying pdfminer\pdfcolor.py -> build\lib\pdfminer
copying pdfminer\pdfdevice.py -> build\lib\pdfminer
copying pdfminer\pdfdocument.py -> build\lib\pdfminer
copying pdfminer\pdffont.py -> build\lib\pdfminer
copying pdfminer\pdfinterp.py -> build\lib\pdfminer
copying pdfminer\pdfpage.py -> build\lib\pdfminer
copying pdfminer\pdfparser.py -> build\lib\pdfminer
copying pdfminer\pdftypes.py -> build\lib\pdfminer
copying pdfminer\psparser.py -> build\lib\pdfminer
copying pdfminer\rijndael.py -> build\lib\pdfminer
copying pdfminer\runlength.py -> build\lib\pdfminer
copying pdfminer\utils.py -> build\lib\pdfminer
copying pdfminer\__init__.py -> build\lib\pdfminer
running build_scripts
creating build\scripts-2.7
copying and adjusting tools\pdf2txt.py -> build\scripts-2.7
copying and adjusting tools\dumppdf.py -> build\scripts-2.7
copying and adjusting tools\latin2ascii.py -> build\scripts-2.7
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\arcfour.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ascii85.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\ccitt.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\cmapdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\converter.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\encodingdb.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\fontmetrics.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\glyphlist.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\image.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\latin_enc.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\layout.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\lzw.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfcolor.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdevice.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfdocument.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdffont.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfinterp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfpage.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdfparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\pdftypes.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\psparser.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\rijndael.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\runlength.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
copying build\lib\pdfminer\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\arcfour.py to arcfour.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ascii85.py to ascii85.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\ccitt.py to ccitt.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\cmapdb.py to cmapdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\converter.py to converter.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\encodingdb.py to encodingdb.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\fontmetrics.py to fontmetrics.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\glyphlist.py to glyphlist.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\image.py to image.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\latin_enc.py to latin_enc.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\layout.py to layout.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\lzw.py to lzw.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfcolor.py to pdfcolor.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdevice.py to pdfdevice.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfdocument.py to pdfdocument.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdffont.py to pdffont.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfinterp.py to pdfinterp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfpage.py to pdfpage.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdfparser.py to pdfparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\pdftypes.py to pdftypes.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\psparser.py to psparser.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\rijndael.py to rijndael.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\runlength.py to runlength.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer\__init__.py to __init__.pyc
running install_scripts
copying build\scripts-2.7\dumppdf.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\latin2ascii.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
copying build\scripts-2.7\pdf2txt.py -> D:\tmp\dev_install_root\Python27_x64\Scripts
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pdfminer-20131113-py2.7.egg-info

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>
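
A quick way to confirm that the install landed in the interpreter you expect is to try importing the package. This is a minimal sketch; `is_importable` is a hypothetical helper name, not part of PDFMiner.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "pdfminer" is the package directory that setup.py copied into site-packages.
    print("pdfminer importable:", is_importable("pdfminer"))
```

If this prints False, the package was probably installed into a different Python than the one running the script.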

With the install done, the next step is to try it.

In:

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools

there is:

pdf2txt.py

so I tried running it.

3. Unexpectedly, it failed:

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>ls
spec183r21.0.pdf  xml

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd -
The system cannot find the path specified.

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf>cd D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>ls
Makefile  PKG-INFO  build     cmaprsrc  docs      pdfminer  samples   setup.py  tools

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113>cd tools

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>ls
Makefile           conv_afm.py        conv_cmap.py       conv_glyphlist.py  dumppdf.py         latin2ascii.py     pdf2html.cgi       pdf2txt.py         prof.py            runapp.py

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>
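
The traceback ends at `objs[index*2]` in `pdfdocument.py`. Why this particular file triggers it is not established here. The sketch below only illustrates the general failure mode, where a loop indexes pairs in a list that holds fewer entries than the declared count. It is a simplified stand-in, not PDFMiner's actual code, and `read_object_ids`, `read_object_ids_safely` and `declared_count` are hypothetical names.

```python
def read_object_ids(objs, declared_count):
    """Collect every even-indexed entry, trusting `declared_count` blindly."""
    ids = []
    for index in range(declared_count):
        # Raises IndexError when objs holds fewer than declared_count pairs.
        ids.append(objs[index * 2])
    return ids


def read_object_ids_safely(objs, declared_count):
    """Same loop, but clamp the count to the pairs actually present."""
    usable = min(declared_count, len(objs) // 2)
    return [objs[index * 2] for index in range(usable)]
```

With `objs = [10, 0, 11, 42]`, a declared count of 3 makes the first function raise `IndexError: list index out of range`, while the clamped version returns only the two ids that are really there.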

4. Adding the -t option did not help either:

D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools>pdf2txt.py -t html -o hart183.html D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf\spec183r21.0.pdf
Traceback (most recent call last):
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 110, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "D:\tmp\dev_tools\python\pdf\pdfminer-20131113.tar\dist\pdfminer-20131113\pdfminer-20131113\tools\pdf2txt.py", line 103, in main
    caching=caching, check_extractable=True):
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfpage.py", line 123, in get_pages
    doc = PDFDocument(parser, caching=caching)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 309, in __init__
    xref.load(parser)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pdfminer\pdfdocument.py", line 194, in load
    objid1 = objs[index*2]
IndexError: list index out of range
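
Both attempts pass a long Windows path on the command line. When scripting such runs, passing the arguments as a list avoids quoting and line-wrap problems. This is a minimal sketch: only the `-t` and `-o` options seen above are used, `build_pdf2txt_command` and `run_pdf2txt` are hypothetical helpers, and the script is assumed to be `pdf2txt.py` in the current directory.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, out_path, out_type="html",
                          script="pdf2txt.py"):
    """Return the argument list for one pdf2txt.py run."""
    return [sys.executable, script, "-t", out_type, "-o", out_path, pdf_path]


def run_pdf2txt(pdf_path, out_path, out_type="html", script="pdf2txt.py"):
    """Run the converter and return (exit_code, stderr_text)."""
    cmd = build_pdf2txt_command(pdf_path, out_path, out_type, script)
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    _, err = proc.communicate()
    return proc.returncode, err.decode("utf-8", "replace")
```

Capturing stderr this way keeps the traceback available for logging instead of only scrolling past in the console.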

5. Next I looked for a way to have PDFMiner decrypt the file, but found no such option.
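
Before looking for a decryption option, it can help to confirm that the file really declares encryption. An encrypted PDF carries an `/Encrypt` entry in its trailer dictionary, so a byte search is a rough first check. This is a heuristic sketch using only the standard library: it can give a false positive if the same bytes appear elsewhere in the file, and `looks_encrypted` and `file_looks_encrypted` are hypothetical helper names.

```python
def looks_encrypted(pdf_bytes):
    """Rough check: does the raw file content contain an /Encrypt key?"""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path):
    """Read `path` in binary mode and apply the same rough check."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

For example, `file_looks_encrypted("spec183r21.0.pdf")` would report whether the PDF used in this post carries that marker.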

 

【Summary】

In the end I gave up on PDFMiner: for now, because of a bug in the program, it cannot convert this PDF to HTML or text.


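Latest reader comments (1)

  1. The latest version of PDFMiner can crack PDFs encrypted with an empty password, but if the file has a real password, it will probably be difficult no matter what you try. Without the decryption algorithm, these PDFs cannot be converted directly.
    WJFFOX (2016-05-13)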