Extract text from PDF (normal text)

noorullahe has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need to extract text from a PDF. I tried CAM::PDF and PDF::API2, but both return non-ASCII characters, turn lowercase letters into uppercase, and put spaces between all the letters. Any help would be appreciated.

Regards,

Noorullah

Re: Extract text from PDF (normal text) by Ratazong (Monsignor) on Oct 15, 2009 at 06:25 UTC
Hi! It's in the nature of PDF that text isn't stored as a sequence of letters; each letter may be positioned in the document separately, and the order of the letters and words inside the .pdf file need not bear any relation to the order in which the text appears on screen. This makes parsing .pdf files extremely difficult. I used a program called pdftext.exe, which extracts whole words quite well (at least in most cases), and post-processed the result with Perl. Maybe it's worth a try for you too... HTH, Rata
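A minimal sketch of the kind of Perl post-processing mentioned above, assuming the extracted text arrives with a single space between the letters of a word and a wider gap between words. The helper name collapse_letter_spacing is hypothetical, and the threshold of three or more single characters is a guess for illustration, not something taken from any PDF module.

```perl
use strict;
use warnings;

# Hypothetical helper: join runs of single characters separated by exactly
# one space ("H e l l o" becomes "Hello"). A run needs at least three
# characters, so an isolated pair such as "a b" is left untouched.
sub collapse_letter_spacing {
    my ($line) = @_;
    $line =~ s{(?<!\S)(\S(?: \S){2,})(?!\S)}{ (my $run = $1) =~ tr/ //d; $run }ge;
    return $line;
}

# Filter mode: perl collapse.pl < extracted.txt > cleaned.txt
if (!caller && !-t STDIN) {
    print collapse_letter_spacing($_) while <STDIN>;
}
```

If the words themselves are also separated by only one space, this heuristic glues them together, so the output still needs to be checked by eye.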
Re: Extract text from PDF (normal text) by leocharre (Priest) on Oct 15, 2009 at 20:01 UTC
Might want to try PDF::OCR2
Re: Extract text from PDF (normal text) by xbmy (Friar) on Jun 09, 2010 at 21:33 UTC
Try this, it works well for me, enjoy!

```perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::PageText;

my $infile  = "?.pdf";     # the PDF file you want to extract text from
my $outfile = "out.txt";   # the extracted text is appended to this file

open(my $out, '>>', $outfile) or die "cannot open $outfile: $!";

my $pdf = CAM::PDF->new($infile) || die "$CAM::PDF::errstr\n";
my $num = $pdf->numPages();

foreach my $p (1 .. $num) {    # $p is the page number
    my $str = $pdf->getPageText($p);
    CAM::PDF->asciify(\$str);  # convert the text to plain ASCII
    print {$out} "$str\n";     # write the page text to the file
}

close($out);
```
Re: Extract text from PDF (normal text) by LanX (Saint) on Jun 10, 2010 at 11:03 UTC
Maybe this helps (for Linux): Re: How to invoke pdftotext and extract first line of text from PDF file? Cheers, Rolf
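A rough sketch of calling pdftotext from Perl, assuming the pdftotext command-line tool (from xpdf or poppler) is installed and on the PATH. The function names pdftotext_command and extract_lines are made up for this example; the -layout option and the trailing "-" (write to standard output) are standard pdftotext arguments.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. "-layout" keeps the physical page
# layout; the trailing "-" sends the text to standard output, not a file.
sub pdftotext_command {
    my ($pdf_file) = @_;
    return ('pdftotext', '-layout', $pdf_file, '-');
}

# Run pdftotext and return the extracted text as a list of lines.
sub extract_lines {
    my ($pdf_file) = @_;
    # The list form of a pipe open bypasses the shell, so spaces or quotes
    # in the file name need no escaping.
    open(my $fh, '-|', pdftotext_command($pdf_file))
        or die "cannot run pdftotext: $!";
    my @lines = <$fh>;
    close($fh) or die "pdftotext failed on $pdf_file\n";
    return @lines;
}

# Run only when called as a script with a file name:
#   perl extract.pl input.pdf
if (!caller && @ARGV) {
    print extract_lines($ARGV[0]);
}
```

The returned lines can then be cleaned up with ordinary Perl string handling; the first element of the list is the first line of text in the PDF.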
