mirror of
https://github.com/Ponce/slackbuilds
synced 2024-11-29 13:00:32 +01:00
PDFMiner is a tool for extracting information from PDF documents. Unlike
other PDF-related tools, it focuses entirely on getting and analyzing
text data. PDFMiner allows one to obtain the exact location of text in a
page, as well as other information such as fonts or lines. It includes a
PDF converter that can transform PDF files into other text formats (such
as HTML). It has an extensible PDF parser that can be used for purposes
other than text analysis.

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py

pdf2txt.py extracts text contents from a PDF file. It cannot recognize
text drawn as images, since that would require optical character
recognition. It also extracts locations, font names/sizes, and writing
direction. It requires a password for password-protected PDF documents.
You cannot extract any text from a PDF document that does not have
extraction permission.

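As a rough illustration of how pdf2txt.py might be driven from a script, the sketch below builds a command line and runs it. The helper names (build_pdf2txt_command, run_pdf2txt) are hypothetical and not part of PDFMiner. It assumes pdf2txt.py is on the PATH and accepts -t (output type), -o (output file) and -P (password), as in the upstream documentation; check the usage message of the installed version before relying on it.

```python
import subprocess
from typing import List, Optional

# Output types assumed from the upstream pdf2txt.py usage message.
OUTPUT_TYPES = ("text", "html", "xml", "tag")


def build_pdf2txt_command(pdf_path: str,
                          out_path: str,
                          output_type: str = "text",
                          password: Optional[str] = None) -> List[str]:
    """Build an argument list for pdf2txt.py (hypothetical helper)."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError("unsupported output type: %r" % (output_type,))
    cmd = ["pdf2txt.py", "-t", output_type, "-o", out_path]
    if password is not None:
        # -P supplies the password for a password-protected document.
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def run_pdf2txt(pdf_path: str, out_path: str, **kwargs) -> None:
    """Run pdf2txt.py; raises CalledProcessError if the tool fails."""
    subprocess.run(build_pdf2txt_command(pdf_path, out_path, **kwargs),
                   check=True)
```

For example, run_pdf2txt("manual.pdf", "manual.html", output_type="html") would be expected to write an HTML rendering of manual.pdf, assuming the file exists and permits extraction.
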
dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML
format. This program is primarily for debugging purposes, but it can
also extract some meaningful contents (e.g. images).
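
A similar sketch for dumppdf.py is shown below. It captures the pseudo-XML dump as a string. The helper names are hypothetical, and the -a flag (dump all objects) is taken from the upstream usage message, so treat it as an assumption for the installed version.

```python
import subprocess
from typing import List


def build_dumppdf_command(pdf_path: str, dump_all: bool = True) -> List[str]:
    """Build an argument list for dumppdf.py (hypothetical helper)."""
    cmd = ["dumppdf.py"]
    if dump_all:
        # -a is assumed to dump every object in the file.
        cmd.append("-a")
    cmd.append(pdf_path)
    return cmd


def dump_pdf(pdf_path: str) -> str:
    """Run dumppdf.py and return whatever it writes to standard output."""
    result = subprocess.run(build_dumppdf_command(pdf_path),
                            check=True,
                            stdout=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout
```

Calling dump_pdf("manual.pdf") would be expected to return the dump as text, which can then be searched or saved while debugging a problematic file.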