[Record] Trying to use pyPdf to convert a non-copyable PDF to text or HTML

D:\tmp\dev_tools\python\pdf\pyPdf-1.13\pyPdf-1.13>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pyPdf
copying pyPdf\filters.py -> build\lib\pyPdf
copying pyPdf\generic.py -> build\lib\pyPdf
copying pyPdf\pdf.py -> build\lib\pyPdf
copying pyPdf\utils.py -> build\lib\pyPdf
copying pyPdf\xmp.py -> build\lib\pyPdf
copying pyPdf\__init__.py -> build\lib\pyPdf
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\filters.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\generic.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\pdf.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\xmp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\filters.py to filters.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\generic.py to generic.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\pdf.py to pdf.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\xmp.py to xmp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\__init__.py to __init__.pyc
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf-1.13-py2.7.egg-info

Then try it out.

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Unresolved] Export table data from a non-copyable PDF and convert it to XML
https://www.crifan.com/non_copy_pdf_table_data_export_to_xml/
 
Author:     Crifan Li
Version:    2014-01-26
Contact:    https://www.crifan.com/about/me
"""
 
import os
import glob
from pyPdf import PdfFileReader
 
def pdf_table_to_xml():
    """Operate PDF file, extract table data, save to xml"""
    parent = "D:/tmp/tmp_dev_root/python/answer_question/self/pdf_table_to_xml/pdf"
    os.chdir(parent)
    pdfFilename = "spec183r21.0.pdf";
    filename = os.path.abspath(pdfFilename)
 
    input = PdfFileReader(file(filename, "rb"))
    for page in input.pages:
        print page.extractText()
 
if __name__ == "__main__":
    pdf_table_to_xml();

Running it fails with an error saying the file has not been decrypted:

D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml>pdf_table_to_xml.py
Traceback (most recent call last):
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 29, in <module>
    pdf_table_to_xml();
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 25, in pdf_table_to_xml
    for page in input.pages:
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 78, in __getitem__
    len_self = len(self)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 73, in __len__
    return self.lengthFunction()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 431, in getNumPages
    self._flatten()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
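
pyPdf raises this exception when the PDF is encrypted and decrypt() has not been called on the reader. Non-copyable PDFs are often encrypted with an empty user password, so one thing that could be tried here is calling decrypt("") before touching the pages. The sketch below is only an assumption about what might help: it was not run against spec183r21.0.pdf, and it may well fail on that file, because pyPdf 1.13 only supports the Standard security handler with algorithm codes 1 and 2 and raises NotImplementedError for anything else. The helper names try_empty_password and dump_text are made up for illustration.

```python
def try_empty_password(reader):
    """Try to unlock an encrypted reader with an empty user password.

    `reader` is expected to behave like pyPdf's PdfFileReader: it has an
    `isEncrypted` attribute and a `decrypt(password)` method that returns
    0 when the password does not match and a non-zero value when it does.
    Returns True if the pages should be readable afterwards.
    """
    if not reader.isEncrypted:
        return True  # nothing to unlock
    try:
        return reader.decrypt("") != 0
    except NotImplementedError:
        # pyPdf raises this for encryption schemes it cannot handle
        return False


def dump_text(pdf_path):
    """Print the text of every page if the empty password unlocks the file."""
    # Imported here so the helper above can be used without pyPdf installed.
    from pyPdf import PdfFileReader  # pyPdf 1.13 on Python 2.7, as installed above

    with open(pdf_path, "rb") as stream:
        reader = PdfFileReader(stream)
        if not try_empty_password(reader):
            print("empty password did not unlock the file")
            return
        for page in reader.pages:
            print(page.extractText())
```

Calling dump_text("spec183r21.0.pdf") from the pdf directory would be the equivalent of the script above. Even if decryption succeeds, extractText() only returns text that is actually stored in the PDF, so a scanned or image-only document would still come out empty.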

4. Next, try to solve the problem above:

No solution found.

Among the search results:

How can I read a pdf web page? | DaniWeb

says that the code works fine for other PDFs, so the bug was simply ignored...

[Summary]

For now, pyPdf cannot convert the non-copyable PDF above into the desired text or HTML either.
