Parse PDF to text

doubledecker has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parse PDF to text by Corion (Patriarch) on May 18, 2011 at 11:12 UTC
Have you tried Super Search for "pdf text"?
Re: Parse PDF to text by LanX (Saint) on May 18, 2011 at 11:27 UTC
See the tips and included links in "Parsing PDFs by text position?". Cheers, Rolf
Re: Parse PDF to text by soliplaya (Beadle) on May 18, 2011 at 15:47 UTC
Hi. We use the "poppler" library (http://poppler.freedesktop.org/) to extract the text of PDFs (several hundred of them per day), with generally very good results. You still have to process the resulting text to extract what you want, though. But you should be aware that not all PDFs "are" text. Many documents presented as PDF that look like text are in fact a scanned image of text embedded in a PDF. There can also be a mixture of real text and text images in the same PDF. None of the "PDF text extractors" will help you with those; the only real way to deal with them is to convert them back to images and run OCR on them.
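
One rough way to spot the image-only pages described above is to check which pages come back empty from the extractor. The sketch below assumes the text was produced by pdftotext, which ends each page with a form feed character. The helper name `pages_without_text` and the sample string are made up for illustration, and an empty page is only a hint that OCR may be needed, not proof.

```perl
use strict;
use warnings;

# Return the 1-based numbers of pages that contain no visible text.
# Assumes $text is pdftotext output, where each page ends with a form feed (\f).
sub pages_without_text {
    my ($text) = @_;
    my @pages = split /\f/, $text, -1;
    pop @pages if @pages && $pages[-1] !~ /\S/;   # drop empty tail after the final \f
    my @empty;
    for my $i (0 .. $#pages) {
        push @empty, $i + 1 unless $pages[$i] =~ /\S/;
    }
    return @empty;
}

# Hypothetical three-page extract: page 2 yielded no text at all.
my $sample = "Invoice 42\nTotal: 10.00\n\f\f Appendix A\n\f";
print "Pages with no extractable text: @{[ pages_without_text($sample) ]}\n";
```

Pages flagged this way are candidates for rendering to an image and running through OCR, as suggested above.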
Re: Parse PDF to text by runrig (Abbot) on May 18, 2011 at 15:22 UTC
My experience has been that when you need to parse the document, the pdftotext utility does the best job of preserving the layout of the original. YMMV. Update: I have not tried "poppler" mentioned below. I downloaded it, tried to compile it (and failed), and don't have time ATM to mess with compiling issues :-(
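
For reference, calling pdftotext from Perl with layout preservation might look like the sketch below. The `-layout` option and the `-` output argument (write to standard output) are real pdftotext options; the helper names are placeholders for illustration, and the script expects the PDF file name as its first command-line argument.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext; "-" sends the text to STDOUT.
sub pdftotext_command {
    my ($pdf_file) = @_;
    return ('pdftotext', '-layout', $pdf_file, '-');
}

# Run pdftotext and return its output as one string.
sub pdf_to_text {
    my ($pdf_file) = @_;
    # List-form pipe open bypasses the shell, so spaces in file names need no quoting.
    open my $fh, '-|', pdftotext_command($pdf_file)
        or die "Cannot run pdftotext: $!";
    local $/;                       # slurp mode: read everything at once
    my $text = <$fh>;
    close $fh or die "pdftotext failed on $pdf_file (exit status $?)";
    return $text;
}

# Only runs when a file name is given on the command line.
print pdf_to_text($ARGV[0]) if @ARGV;
```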
Re: Parse PDF to text by tune (Curate) on May 18, 2011 at 14:08 UTC
Try CAM::PDF -- tune
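
A minimal sketch of what text extraction with CAM::PDF could look like, using its `numPages` and `getPageText` methods. The helper name `cam_pdf_page_texts` is hypothetical, the module has to be installed from CPAN, and the quality of the extracted text depends heavily on how the PDF was produced.

```perl
use strict;
use warnings;

# Return the extracted text of each page, in page order.
sub cam_pdf_page_texts {
    my ($pdf_file) = @_;
    require CAM::PDF;               # loaded lazily; must be installed from CPAN
    my $pdf = CAM::PDF->new($pdf_file)
        or die "CAM::PDF could not open $pdf_file";
    # scalar() keeps one entry per page even if a page yields nothing.
    return map { scalar $pdf->getPageText($_) } 1 .. $pdf->numPages;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    my $page = 0;
    for my $text (cam_pdf_page_texts($ARGV[0])) {
        $page++;
        print "--- page $page ---\n", (defined $text ? $text : ''), "\n";
    }
}
```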
Re: Parse PDF to text by Khen1950fx (Canon) on May 18, 2011 at 20:37 UTC
I usually get good results with Text::FromAny.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::FromAny;

my $tFromAny = Text::FromAny->new(file => '/root/Desktop/some.pdf');
print my $text = $tFromAny->text, "\n";
```
Re^2: Parse PDF to text by doubledecker (Scribe) on May 23, 2011 at 08:55 UTC
I tried pdftotext and got good results, but the output needs a lot of parsing. I'll give Text::FromAny a try and post an update.
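
As an illustration of the kind of post-processing meant here: layout-preserving output (for example from `pdftotext -layout`) tends to keep table columns separated by runs of spaces, so a first pass can split each line on two or more spaces. This is a sketch under that assumption only. The helper name and sample lines are invented, and real documents usually need per-document tuning.

```perl
use strict;
use warnings;

# Split one line of layout-preserved text into columns.
# Assumes columns are separated by at least two spaces;
# single spaces stay inside a field.
sub split_layout_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;        # trim leading and trailing whitespace
    return split /\s{2,}/, $line;
}

# Made-up sample resembling pdftotext -layout output.
my @lines = (
    "Item               Qty      Price",
    "Blue widget          2      19.90",
);
for my $line (@lines) {
    print join(' | ', split_layout_columns($line)), "\n";
}
```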

