PerlMonks  

Parse PDF to text

by doubledecker (Scribe)
on May 18, 2011 at 11:09 UTC ( [id://905450]=perlquestion )

doubledecker has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

What is the best way to parse PDF to text? I tried the pdftotext utility, but its output needs to be parsed again to pull out the data I want.

Any suggestions are appreciated.

Replies are listed 'Best First'.
Re: Parse PDF to text
by Corion (Patriarch) on May 18, 2011 at 11:12 UTC
Re: Parse PDF to text
by LanX (Saint) on May 18, 2011 at 11:27 UTC
Re: Parse PDF to text
by soliplaya (Beadle) on May 18, 2011 at 15:47 UTC
    Hi.

    We use the "poppler" library (http://poppler.freedesktop.org/) to extract the text of PDFs (several hundred of them per day), with generally very good results. You still have to process the resulting text to extract what you want, though.

    But you should be aware that not all PDFs "are" text. Many documents that are presented as PDF and look like text are in fact scanned images of text embedded in a PDF. There can also be a mixture of real text and text images in the same PDF. None of the "PDF text extractors" will help you with those, and the only real way to deal with them is to convert them back to images and run OCR on them.
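One rough way to spot the scanned-image case is to check whether each page yields any text at all. The sketch below is only an illustration, not something from this thread: it assumes pdftotext is on the PATH and that pages in its output are separated by form feeds. The helper name pages_needing_ocr and the 10-character threshold are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based numbers of pages whose extracted text has fewer than
# $min_chars non-whitespace characters. Assumes pages are separated by
# form feeds, as in pdftotext output. The default threshold is arbitrary.
sub pages_needing_ocr {
    my ($text, $min_chars) = @_;
    $min_chars = 10 unless defined $min_chars;

    my @pages = split /\f/, $text, -1;
    # Drop the empty remainder that follows the final form feed.
    pop @pages if @pages && $pages[-1] eq '';

    my @suspect;
    for my $i (0 .. $#pages) {
        my $count = () = $pages[$i] =~ /\S/g;   # count non-whitespace chars
        push @suspect, $i + 1 if $count < $min_chars;
    }
    return @suspect;
}

# Usage (hypothetical script name): perl check_pages.pl some.pdf
if (@ARGV) {
    my $pdf = shift @ARGV;
    # '-' as the output file makes pdftotext write to stdout.
    open my $fh, '-|', 'pdftotext', $pdf, '-'
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };
    close $fh or die "pdftotext failed\n";
    $text = '' unless defined $text;

    my @suspect = pages_needing_ocr($text);
    print @suspect
        ? "Pages that may need OCR: @suspect\n"
        : "All pages yielded text\n";
}
```

A page flagged this way is only a candidate: a page that is blank on purpose looks the same as a scanned one, so the result is best treated as a hint about where to look.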

Re: Parse PDF to text
by runrig (Abbot) on May 18, 2011 at 15:22 UTC
    My experience has been that when you need to parse the document, the pdftotext utility does the best job of preserving the layout of the original. YMMV.

    Update: I have not tried "poppler" mentioned below. I downloaded it, tried to compile it (and failed), and don't have time ATM to mess with compiling issues :-(
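If the layout is preserved, column-style data can often be recovered by splitting each line on runs of spaces. The following is a minimal sketch under that assumption, using the -layout option of pdftotext. The helper name split_layout_line and the "two or more whitespace characters" rule are illustrative choices, not something the posters described, and real documents may need a different rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line of layout-preserving output into columns, assuming that
# columns are separated by two or more whitespace characters.
sub split_layout_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;              # trim both ends
    return split /\s{2,}/, $line;
}

# Usage (hypothetical script name): perl layout_columns.pl some.pdf
if (@ARGV) {
    my $pdf = shift @ARGV;
    # -layout keeps the physical layout; '-' sends the text to stdout.
    open my $fh, '-|', 'pdftotext', '-layout', $pdf, '-'
        or die "Cannot run pdftotext: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;        # skip blank lines
        my @cols = split_layout_line($line);
        print join(' | ', @cols), "\n";
    }
    close $fh or die "pdftotext failed\n";
}
```

Splitting on two or more spaces keeps values such as "Widget A" in one column, but it breaks down when adjacent columns are separated by a single space.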

Re: Parse PDF to text
by tune (Curate) on May 18, 2011 at 14:08 UTC
Re: Parse PDF to text
by Khen1950fx (Canon) on May 18, 2011 at 20:37 UTC
    I usually get good results with Text::FromAny.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::FromAny;

    my $tFromAny = Text::FromAny->new(
        file => '/root/Desktop/some.pdf');
    print my $text = $tFromAny->text, "\n";

      I tried pdftotext and got good results, but the output still needs a lot of parsing. Let me give Text::FromAny a try, and I will post my updates.
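Whichever extractor is used, the second parsing pass usually comes down to matching patterns in the extracted text. As a hedged illustration only, the sketch below pulls "Label: value" pairs out of a block of text. The helper name extract_fields and the sample labels are hypothetical, and the pattern would need adjusting to the actual documents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect "Label: value" pairs from extracted text into a hash reference.
# Assumes one pair per line and a label that starts with a letter; lines
# that do not fit this shape are ignored.
sub extract_fields {
    my ($text) = @_;
    my %fields;
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*([A-Za-z][\w ]*?)\s*:\s*(\S.*?)\s*$/) {
            $fields{$1} = $2;
        }
    }
    return \%fields;
}

# Example with made-up input; in practice $text would come from
# pdftotext or Text::FromAny.
my $sample = "Invoice No: 12345\nDate: 2011-05-18\nsome other line\n";
my $fields = extract_fields($sample);
print "$_ => $fields->{$_}\n" for sort keys %$fields;
```

Going line by line keeps a label with an empty value from swallowing the next line, at the cost of missing values that wrap across lines.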
