mirror of
https://github.com/Ponce/slackbuilds
synced 2024-12-05 00:55:44 +01:00
8ee80adc21
Signed-off-by: Willy Sudiarto Raharjo <willysr@slackbuilds.org>
PDFMiner is a tool for extracting information from PDF documents. Unlike
other PDF-related tools, it focuses entirely on getting and analyzing
text data. PDFMiner allows one to obtain the exact location of text on a
page, as well as other information such as fonts or lines. It includes a
PDF converter that can transform PDF files into other text formats (such
as HTML). It has an extensible PDF parser that can be used for purposes
other than text analysis.

PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.

pdf2txt.py

pdf2txt.py extracts the text contents of a PDF file. It cannot recognize
text drawn as images. It also extracts text locations, font names and
sizes, and writing direction. It requires a password for
password-protected PDF documents. Text cannot be extracted from a PDF
document that does not grant extraction permission.

dumppdf.py

dumppdf.py dumps the internal contents of a PDF file in pseudo-XML
format. This program is primarily for debugging purposes, but it can
also be used to extract some meaningful content (e.g. images).
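
As a rough illustration of how the two tools might be invoked, the
sketch below only assembles the command lines and prints them; it does
not run anything. It assumes both scripts are on PATH. The flags used
(-o output file, -t output type, -P password, -a dump all objects) come
from the upstream PDFMiner usage text, not from this README, so check
them against the installed version. The helper names and file names are
hypothetical.

```python
import shlex


def pdf2txt_command(pdf_path, out_path, output_type="text", password=None):
    """Build an argv list for pdf2txt.py (nothing is executed here)."""
    # Assumed upstream flags: -o output file, -t output type (e.g. text, html).
    cmd = ["pdf2txt.py", "-o", out_path, "-t", output_type]
    if password is not None:
        # Assumed upstream flag: -P supplies the document password.
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


def dumppdf_command(pdf_path, password=None):
    """Build an argv list for dumppdf.py that dumps all objects."""
    # Assumed upstream flag: -a dumps every object in the file.
    cmd = ["dumppdf.py", "-a"]
    if password is not None:
        cmd += ["-P", password]
    cmd.append(pdf_path)
    return cmd


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them.
    print(shlex.join(pdf2txt_command("manual.pdf", "manual.html", "html")))
    print(shlex.join(dumppdf_command("manual.pdf")))
```

Either list can be handed to subprocess.run(cmd, check=True) once the
flags have been confirmed for the installed version. Building the
command as a list rather than a single string avoids shell quoting
problems with file names that contain spaces.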