[Record] Trying pdftohtml to convert a copy-protected PDF file to HTML while preserving table formatting

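[Background]

While working on:

[Unsolved] Exporting table data from a copy-protected PDF and converting it to XML

I decided to try pdftohtml to convert a copy-protected PDF file to text or HTML.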

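[The process]

1. Continuing with the reference:

Howto Convert PDF files to HTML files | Ubuntu Geek

I tracked down pdftohtml (it ships in the poppler-utils package) and got it working; with the -nodrm option added, it did convert the PDF to HTML:

The log is as follows: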

crifan@crifan-Ubuntu:~$ sudo apt-get install poppler-utils
[sudo] password for crifan: 
Reading package lists... Done
Building dependency tree
Reading state information... Done
poppler-utils is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 26 not upgraded.
crifan@crifan-Ubuntu:~$ pdf
pdf2dsc      pdffonts     pdfseparate  pdftoppm     pdfunite     
pdf2ps       pdfimages    pdftocairo   pdftops      
pdfdetach    pdfinfo      pdftohtml    pdftotext    
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/
19#21#_(101~303).dwg             spec183r21.0.pdf
examples.desktop                 test_share
python_beginner_tutorial.html    unbuntu 13.04 in virtualbox.png
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/spec183r21.0.pdf /home/crifan/develop/
crosstool-ng/ ubuntu_share/ 
crifan@crifan-Ubuntu:~$ pdftohtml /media/sf_win7_to_ubuntu/spec183r21.0.pdf /home/crifan/develop/^C
crifan@crifan-Ubuntu:~$ pwd
/home/crifan
crifan@crifan-Ubuntu:~$ cd develop/
crifan@crifan-Ubuntu:~/develop$ mkdir pdf_to_html
crifan@crifan-Ubuntu:~/develop$ cd pdf_to_html/
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml^Cmedia/sf_win7_to_ubuntu/spec183r21.0.pdf hart183.html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml --help
I/O Error: Couldn't open file '--help': --help.
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml -h
pdftohtml version 0.20.5
Copyright 2005-2012 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1999-2003 Gueorgui Ovtcharov and Rainer Dorsch
Copyright 1996-2011 Glyph & Cog, LLC
 
Usage: pdftohtml [options] <PDF-file> [<html-file> <xml-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -q                : don't print any messages or errors
  -h                : print usage information
  -help             : print usage information
  -p                : exchange .pdf links by .html
  -c                : generate complex document
  -s                : generate single document that includes all pages
  -i                : ignore images
  -noframes         : generate no frames
  -stdout           : use standard output
  -zoom <fp>        : zoom the pdf document (default 1.5)
  -xml              : output for XML post-processing
  -hidden           : output hidden text
  -nomerge          : do not merge paragraphs
  -enc <string>     : output text encoding name
  -dev <string>     : output device name for Ghostscript (png16m, jpeg etc)
  -fmt <string>     : image file format for Splash output (png or jpg)
  -v                : print copyright and version info
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -nodrm            : override document DRM settings
  -wbt <fp>         : word break threshold (default 10 percent)
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ pdftohtml -nodrm /media/sf_win7_to_ubuntu/spec183r21.0.pdf hart183.html
Document has copy-protection bit set.
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22
Page-23
Page-24
Page-25
Page-26
Page-27
Page-28
Page-29
Page-30
Page-31
Page-32
Page-33
Page-34
Page-35
Page-36
Page-37
Page-38
Page-39
Page-40
 link to page 41 Page-41
Page-42
Page-43
Page-44
Page-45
Page-46
Page-47
Page-48
Page-49
Page-50
Page-51
Page-52
Page-53
Page-54
Page-55
Page-56
Page-57
Page-58
Page-59
Page-60
Page-61
Page-62
Page-63
Page-64
Page-65
Page-66
Page-67
Page-68
Page-69
Page-70
Page-71
Page-72
Page-73
Page-74
Page-75
Page-76
Page-77
Page-78
Page-79
Page-80
Page-81
Page-82
Page-83
Page-84
Page-85
Page-86
Page-87
Page-88
Page-89
Page-90
Page-91
Page-92
Page-93
Page-94
Page-95
Page-96
Page-97
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ ls
hart183-1_1.png  hart183-1_2.png  hart183-2_1.png  hart183.html  hart183_ind.html  hart183s.html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ cp * /media/sf_win7_to_ubuntu/^C
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ mkdir /media/sf_win7_to_ubuntu/pdf_to_html
crifan@crifan-Ubuntu:~/develop/pdf_to_html$ cp * /media/sf_win7_to_ubuntu/pdf_to_html/
crifan@crifan-Ubuntu:~/develop/pdf_to_html$
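
The steps in the log above can be condensed into a small shell sketch. It is only an illustration under assumptions: the function names build_cmd and convert_pdf are hypothetical, the paths are the ones from this session, and -nodrm is used exactly as listed in the pdftohtml -h output. The last line is a dry run that only prints the command; nothing is converted until convert_pdf is called.

```shell
#!/bin/sh
# Sketch only: wraps the pdftohtml call that worked in the session above.
# build_cmd and convert_pdf are hypothetical helper names, not part of poppler.

# Print the pdftohtml command line for an input PDF and an output name.
# -nodrm overrides the document's DRM settings (see `pdftohtml -h`).
build_cmd() {
    printf 'pdftohtml -nodrm %s %s\n' "$1" "$2"
}

# Run the conversion after checking that the tool and the input file exist.
convert_pdf() {
    pdf=$1
    out=$2
    if ! command -v pdftohtml >/dev/null 2>&1; then
        echo "pdftohtml not found; try: sudo apt-get install poppler-utils" >&2
        return 1
    fi
    if [ ! -r "$pdf" ]; then
        echo "cannot read input file: $pdf" >&2
        return 1
    fi
    pdftohtml -nodrm "$pdf" "$out"
}

# Dry run: print the command used in this session without executing it.
build_cmd /media/sf_win7_to_ubuntu/spec183r21.0.pdf hart183.html
```

To run the conversion for real, call convert_pdf with the same two arguments from the output directory. In the session above that produced hart183.html, hart183_ind.html, hart183s.html and a few PNG images.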