PerlMonks  

Re: PDF to Text

by greenFox (Vicar)
on Jan 24, 2005 at 08:37 UTC


in reply to PDF to Text

I don't know the answer to your question, but a Super Search for "convert pdf to text" reveals quite a few nodes on this topic, including (as a quick sample): Can I convert a pdf to html with PDF::Extract??, pdf2txt?, Extract text from PDF and Reading PDF files. A quick skim through those nodes suggests the following modules might help: PDF::Extract and PDF::API2. Searching for pdf on CPAN reveals a few more potential candidates. Good luck and do let us know how you get on :)

--
Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

Replies are listed 'Best First'.
Re^2: PDF to Text
by chrism01 (Friar) on Jan 27, 2005 at 01:31 UTC
    I have actually had a look at those modules, but all they do is create or manipulate PDFs. For example, PDF::API2 has a method, $string = $pdf->stringify, but this just dumps the file into a string that is still in PDF format, i.e. you get a load of binary rubbish.
    As for PDF::Extract ("Extracting sub PDF documents from a multi page PDF document"), the output is again PDF.
    I just need the bare ASCII text that pdftotext gives, except that pdftotext has the odd random glitch that corrupts the layout of the output.
    If I can't predict the layout, I can't parse it.
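
    For anyone landing here with the same problem, one possible approach is to call pdftotext from Perl and tidy its output before parsing. This is only a rough sketch, not a cure for the glitches themselves: it assumes pdftotext is installed and on the PATH, it uses the tool's -layout option, and normalize_pages is a hypothetical helper name invented for this illustration.

```perl
use strict;
use warnings;

# Run the external pdftotext tool and return its output as one string.
# Assumption: pdftotext is on the PATH. '-layout' asks it to keep the
# physical layout, and '-' as the output file sends the text to STDOUT.
sub pdf_to_text {
    my ($pdf_file) = @_;
    open my $fh, '-|', 'pdftotext', '-layout', $pdf_file, '-'
        or die "Cannot run pdftotext: $!";
    local $/;                      # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh or die "pdftotext failed on $pdf_file (exit status $?)";
    return $text;
}

# Hypothetical helper: tidy the raw text so later parsing sees fewer surprises.
# Splits on form feeds (pdftotext's page separator), strips trailing
# whitespace, and squeezes runs of blank lines down to a single blank line.
sub normalize_pages {
    my ($text) = @_;
    my @pages;
    for my $page (split /\f/, $text) {
        $page =~ s/\r\n?/\n/g;        # unify line endings
        $page =~ s/[ \t]+$//mg;       # drop trailing spaces/tabs on every line
        $page =~ s/\n{3,}/\n\n/g;     # at most one blank line in a row
        $page =~ s/\A\n+|\n+\z//g;    # trim blank lines at the page edges
        push @pages, $page if length $page;
    }
    return @pages;
}
```

    Usage would be along the lines of my @pages = normalize_pages(pdf_to_text('report.pdf')); where report.pdf is a placeholder file name. Whether this makes the layout predictable enough to parse depends on the documents involved.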
