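【Background】
While working on:
【Unsolved】Exporting table data from a non-copyable PDF and converting it to XML
I decided to try pyPdf to convert a PDF whose content cannot be copied into text or HTML.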
【Process】
1. Following:
Convert PDF to text with pyPDF and PDFMiner: First Impression | victorwyee
I found:
and downloaded:
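2. But the installer could not find Python:
The cause appears to be:
I have the x64 build of Python installed, and this installer does not recognize it...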
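The guess above is that the installer could not see the 64-bit interpreter. A minimal sketch for confirming which build the `python` command actually runs; the function name `python_bitness` is made up for illustration, and the snippet says nothing about how the installer itself looks for Python:

```python
import platform
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    # Show version, bitness and the path of the interpreter being used
    print("Python %s, %d-bit (%s)" % (
        platform.python_version(), python_bitness(), sys.executable))
```

If this reports 64-bit, installing from the source package with `python setup.py install` avoids depending on the installer's detection at all.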
3. Download again:
Then extract it and install:
D:\tmp\dev_tools\python\pdf\pyPdf-1.13\pyPdf-1.13>python setup.py install
running install
running build
running build_py
creating build
creating build\lib
creating build\lib\pyPdf
copying pyPdf\filters.py -> build\lib\pyPdf
copying pyPdf\generic.py -> build\lib\pyPdf
copying pyPdf\pdf.py -> build\lib\pyPdf
copying pyPdf\utils.py -> build\lib\pyPdf
copying pyPdf\xmp.py -> build\lib\pyPdf
copying pyPdf\__init__.py -> build\lib\pyPdf
running install_lib
creating D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\filters.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\generic.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\pdf.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\utils.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\xmp.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
copying build\lib\pyPdf\__init__.py -> D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\filters.py to filters.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\generic.py to generic.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\pdf.py to pdf.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\utils.py to utils.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\xmp.py to xmp.pyc
byte-compiling D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf\__init__.py to __init__.pyc
running install_egg_info
Writing D:\tmp\dev_install_root\Python27_x64\Lib\site-packages\pyPdf-1.13-py2.7.egg-info
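Before pointing the library at a PDF, a quick import check confirms that the package really landed in site-packages. A minimal sketch; the helper name `can_import` is my own and not part of pyPdf:

```python
def can_import(module_name):
    """Return True if this interpreter can import module_name."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # After the install above, this is expected to report True
    print("pyPdf importable: %s" % can_import("pyPdf"))
```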
Then give it a try.
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
【Unsolved】Exporting table data from a non-copyable PDF and converting it to XML
https://www.crifan.com/non_copy_pdf_table_data_export_to_xml

Author: Crifan Li
Version: 2014-01-26
Contact: https://www.crifan.com/about/me
"""

import os
import glob
from pyPdf import PdfFileReader

def pdf_table_to_xml():
    """Operate PDF file, extract table data, save to xml"""
    parent = "D:/tmp/tmp_dev_root/python/answer_question/self/pdf_table_to_xml/pdf"
    os.chdir(parent)
    pdfFilename = "spec183r21.0.pdf";
    filename = os.path.abspath(pdfFilename)

    input = PdfFileReader(file(filename, "rb"))
    for page in input.pages:
        print page.extractText()

if __name__ == "__main__":
    pdf_table_to_xml();
Running it fails with an error saying the file has not been decrypted:
D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml>pdf_table_to_xml.py
Traceback (most recent call last):
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 29, in <module>
    pdf_table_to_xml();
  File "D:\tmp\tmp_dev_root\python\answer_question\self\pdf_table_to_xml\pdf_table_to_xml.py", line 25, in pdf_table_to_xml
    for page in input.pages:
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 78, in __getitem__
    len_self = len(self)
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\utils.py", line 73, in __len__
    return self.lengthFunction()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 431, in getNumPages
    self._flatten()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 596, in _flatten
    catalog = self.trailer["/Root"].getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 480, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\generic.py", line 165, in getObject
    return self.pdf.getObject(self).getObject()
  File "D:\tmp\dev_install_root\Python27_x64\lib\site-packages\pyPdf\pdf.py", line 655, in getObject
    raise Exception, "file has not been decrypted"
Exception: file has not been decrypted
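4. Next I tried to fix the error above:
I did not find a solution.
Among the search results:
How can I read a pdf web page? | DaniWeb
the poster says the same code works fine on other PDFs, so they simply ignored this bug...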
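For reference, the "file has not been decrypted" exception is raised when pages are read from an encrypted file before `decrypt()` has been called. The sketch below shows the usual pattern of checking `isEncrypted` and trying a password first. It rests on assumptions I have not verified against this PDF: that the file is protected with an empty user password, and that pyPdf supports its encryption algorithm (I assume `decrypt()` raises `NotImplementedError` otherwise). The function name `extract_text` is hypothetical, and `reader` only needs to behave like pyPdf's `PdfFileReader`.

```python
def extract_text(reader, password=""):
    """Return a list with the text of each page, or None if decryption fails.

    `reader` is expected to offer isEncrypted, decrypt() and pages,
    the way pyPdf's PdfFileReader does.
    """
    if reader.isEncrypted:
        try:
            # decrypt() returns 0 when the password does not match
            unlocked = reader.decrypt(password)
        except NotImplementedError:
            # Assumption: raised when the encryption algorithm is unsupported
            return None
        if not unlocked:
            return None
    return [page.extractText() for page in reader.pages]
```

With pyPdf this would be called as `extract_text(PdfFileReader(open(filename, "rb")))`. A `None` result would mean the empty password did not unlock the file, in which case this route is a dead end as well.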
【Summary】
For now, pyPdf cannot convert the non-copyable PDF above into the text or HTML I wanted.