
【Notes】Extracting Text and Data from PDF Files with Python

Python · crifan · 2013-05-20

【Background】

I have a PDF file and want to use Python to extract some information from it.

【Process】

1. A quick search turned up:

pyPdf

http://pybrary.net/pyPdf/

One of its listed features is:

“extracting document information (title, author, …),”

which looks like what is needed here.

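Its newer incarnation, a fork that continues the project, is

PyPDF2

http://knowah.github.io/PyPDF2/

On closer reading, though, its documentation mostly covers generating and manipulating PDFs, and says very little about extracting information from them.

It also mentions, in passing, another library that is likewise meant for generating PDFs.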
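The "extracting document information" feature quoted above can be tried with a few lines. This is only a minimal sketch, assuming a recent PyPDF2 release (3.x), where the reader class is `PdfReader`; the pyPdf of 2013 spelled the same operations `PdfFileReader`, `getDocumentInfo()` and `extractText()`. The helper names `clean_info` and `read_pdf_info` are hypothetical, chosen for illustration.

```python
def clean_info(raw_info):
    """Turn a PDF Info mapping such as {'/Title': ...} into a plain dict."""
    if not raw_info:
        # Some PDFs carry no Info dictionary at all.
        return {}
    return {str(key).lstrip("/"): str(value) for key, value in raw_info.items()}


def read_pdf_info(pdf_path):
    """Return (metadata dict, text of the first page) for the PDF at pdf_path."""
    # Imported lazily: PyPDF2 is a third-party package (assumed: PyPDF2 3.x).
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    info = clean_info(reader.metadata)
    # Text extraction quality depends heavily on how the PDF was produced.
    first_page_text = reader.pages[0].extract_text() if len(reader.pages) > 0 else ""
    return info, first_page_text


# Example call (hypothetical file name):
# info, text = read_pdf_info("example.pdf")
# print(info.get("Title"), info.get("Author"))
```

The metadata only covers fields such as title and author; for the text shown on the pages themselves, the result of `extract_text()` may need further cleanup.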

2. I also found another library, but it too is mainly for generating PDFs:

“Simple PDF generation for Python (FPDF PHP port) AKA fpdf.py”

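3. Later, the article

python提取pdf与word中的相关信息 (Extracting information from PDF and Word files with Python)

pointed me to PDFMiner. Its introduction suggests it is a good fit for this task: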

What’s It?

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.

When I have time, I can pick this up and keep experimenting with PDFMiner.
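As a starting point for that experiment, here is a minimal sketch of pulling text out of a PDF and picking fields from it. It assumes the maintained fork `pdfminer.six` (not the original 2013 PDFMiner release), whose `pdfminer.high_level.extract_text` returns the whole document as one string. The `find_fields` helper, the "Label: value" layout and the label names are hypothetical, since the layout of the actual PDF is not known.

```python
import re


def extract_pdf_text(pdf_path):
    """Return all text of the PDF at pdf_path as a single string."""
    # Imported lazily: pdfminer.six is a third-party package (assumption).
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def find_fields(text, labels):
    """Collect 'Label: value' lines from extracted text (hypothetical layout)."""
    found = {}
    for label in labels:
        # Accept both the ASCII and the full-width colon after the label.
        pattern = r"^\s*" + re.escape(label) + r"\s*[:：]\s*(.+?)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            found[label] = match.group(1)
    return found


# Example call (hypothetical file name and labels):
# text = extract_pdf_text("example.pdf")
# print(find_fields(text, ["Name", "Date"]))
```

Separating the extraction step from the field matching keeps the matching logic easy to check against plain strings before running it on a real PDF.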

