PerlMonks  

extract text from pdf

by jeteve (Pilgrim)
on Nov 08, 2006 at 12:30 UTC ( [id://582868]=perlquestion )

jeteve has asked for the wisdom of the Perl Monks concerning the following question:

Hi wise monks.

I wonder what the simplest way is to extract text from a PDF in Perl.

Of course I can use pdftotext on the command line, but that involves managing temporary files.

So I'm looking for a pure-Perl solution (or one linked to a C library).

I had a look at PDF::API2, but it's geared more toward creating PDFs.

CAM::PDF seemed to fit my needs, but I can't manage to extract text with it.

I also had a look at SWISH, but internally it uses ... pdftotext :)

Any ideas?

-- Nice photos of naked perl sources here !

Re: extract text from pdf
by fenLisesi (Priest) on Nov 08, 2006 at 12:54 UTC
    CAM::PDF was recommended in the earlier thread "How to read the contents of PDF".

    Update: I tried a few things with this module. It works well with some pdf files, but seems to fail in various ways for others. I couldn't get it to work with a few simple pdf files I created and exported from OpenOffice. The module comes with a small script named getpdftext.pl that may help you. Cheers.
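A minimal sketch of page-by-page extraction with CAM::PDF, assuming its `new`, `numPages`, and `getPageText` methods. The helper name `cam_pdf_text` is made up for illustration, and, as noted above, the quality of the result depends heavily on how the PDF was produced.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every page of a PDF,
# as extracted by CAM::PDF, joined with newlines.
sub cam_pdf_text {
    my ($file) = @_;

    require CAM::PDF;    # loaded lazily, only when the helper is called

    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file as a PDF\n";

    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($n);
        # getPageText may return undef for pages it cannot decode
        push @pages, defined $text ? $text : '';
    }
    return join "\n", @pages;
}

# Usage (assumes a file named whatever.pdf exists):
#   print cam_pdf_text('whatever.pdf');
```

Since everything stays in memory, no temporary file is needed, but PDFs with custom font encodings may still come out garbled.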

Re: extract text from pdf
by mk. (Friar) on Nov 08, 2006 at 13:10 UTC
    have you tried File::Extract::PDF?!
    it uses CAM::PDF internally, but maybe you'll have better luck with it.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    *women.pm
      I did try both of those .. without success.

      I have a PDF that I created with OpenOffice. pdftotext is able to extract text from it, whereas CAM::PDF (or File::Extract::PDF) gives me garbage characters.

      [jerome@saab pdf]$ getpdftext.pl -v ~/faxTaxHabitation2005.pdf
      ! " #  $  % # & ' ( "  ) * + + + ...
      And pdftotext:
      [jerome@saab pdf]$ pdftotext ~/faxTaxHabitation2005.pdf txt
      [jerome@saab pdf]$ tail txt
      Merci de bien vouloir me confirmer ces informations par retour de fax afin que je puisse proceder au paiment le plus rapidement possible au numero suivant : *************
      Cordiales salutations.
      ...

      The ideal would be a perl module linked to the xpdf C code .. :)

      -- Nice photos of naked perl sources here !

Re: extract text from pdf
by Anonymous Monk on Nov 08, 2006 at 16:02 UTC
    What do you mean by "involves managing temporary files"?
    open my $fh, "pdftotext whatever.pdf - |" or die "Cannot run pdftotext: $!";
    ... read text from $fh ...

      If I just want the PDF's text to use for something else (saving it in a database, ...), I find this line quite convenient:

      my $txt = `pdftotext whatever.pdf -` or die 'ERROR running pdftotext';
      say $txt;
      Or, if the file name is in a variable and the PDF file contains umlauts or other non-ASCII characters:
      my $command_line = qq{pdftotext -enc 'UTF-8' '$path' -};
      my $text = `$command_line` or die 'ERROR running pdftotext';
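Interpolating a path into a shell command line breaks when the file name contains a quote or other shell metacharacters. A minimal sketch of one way around that, assuming pdftotext is on the PATH: use the list form of a piped open so that no shell is involved. The helper names `capture_stdout` and `pdf_text` are made up for illustration.

```perl
use strict;
use warnings;

# Run a command without going through a shell and return
# everything it prints on standard output.
sub capture_stdout {
    my @cmd = @_;

    # List-form piped open: each argument is passed to the program
    # as-is, so no shell quoting is needed.
    open my $fh, '-|', @cmd or die "Cannot run $cmd[0]: $!";
    binmode $fh, ':encoding(UTF-8)';

    local $/;    # slurp mode: read all output at once
    my $out = <$fh>;

    close $fh or die "$cmd[0] failed (exit status " . ($? >> 8) . ")\n";
    return defined $out ? $out : '';
}

# Hypothetical helper: extract the text of one PDF via pdftotext,
# writing to stdout ("-") instead of a temporary file.
sub pdf_text {
    my ($path) = @_;
    return capture_stdout('pdftotext', '-enc', 'UTF-8', $path, '-');
}

# Usage (assumes a file named whatever.pdf exists):
#   my $text = pdf_text('whatever.pdf');
```

Unlike the backtick versions, this dies only when the command cannot be started or exits with a non-zero status, not when a PDF simply contains no text.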
Re: extract text from pdf
by caelifer (Scribe) on Nov 08, 2006 at 15:39 UTC
    Not really a Perl solution, but... Acrobat Reader 7 supports a 'Save As Text' option. Why not try that? Obviously, this won't work for documents made from images, but nothing short of OCR will.

    -BR
