PerlMonks  

Extract text from PDF (normal text)

by noorullahe (Initiate)
on Oct 15, 2009 at 05:56 UTC ( [id://801280]=perlquestion )

noorullahe has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I need to extract text from a PDF. I tried CAM::PDF and PDF::API2, but both return non-ASCII characters, turn lowercase letters into uppercase, and put a space after each and every letter. Any help would be appreciated.

Regards,

Noorullah

Replies are listed 'Best First'.
Re: Extract text from PDF (normal text)
by Ratazong (Monsignor) on Oct 15, 2009 at 06:25 UTC
    Hi!
    It's in the nature of PDF that text isn't represented as a sequence of letters; each letter may be positioned in the document separately, and the order of the letters/words inside the .pdf file need not bear any relation to the order in which the text appears on the screen.
    This makes parsing .pdf files extremely difficult.
    I used a program called pdftext.exe, which works quite well at extracting whole words (at least in most cases), and post-processed the result with Perl.
    Maybe it's worth a try for you, too...
    HTH, Rata
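
To illustrate the point above about letters being positioned individually, here is a minimal Perl sketch of how reading order can be rebuilt from positions alone. Everything in it is an assumption made for illustration: the glyph records, the `reading_order` helper, and the line tolerance are hypothetical, and real PDF extraction also has to deal with fonts, encodings, and transformation matrices.

```perl
use strict;
use warnings;

# Hypothetical glyph records: x/y page coordinates (origin at bottom-left)
# and the character drawn there. The array order mimics the arbitrary order
# in which a PDF file might store them.
my @glyphs = (
    { x => 20, y => 700, ch => 'r' },
    { x => 0,  y => 680, ch => 'P' },
    { x => 0,  y => 700, ch => 'P' },
    { x => 10, y => 680, ch => 'D' },
    { x => 30, y => 700, ch => 'l' },
    { x => 10, y => 700, ch => 'e' },
    { x => 20, y => 680, ch => 'F' },
);

# Group glyphs into lines (y values within $tolerance of the first glyph of
# the line), order lines top to bottom, and glyphs left to right.
sub reading_order {
    my ($glyphs, $tolerance) = @_;
    my @sorted = sort { $b->{y} <=> $a->{y} || $a->{x} <=> $b->{x} } @$glyphs;

    my (@lines, $line_y);
    for my $g (@sorted) {
        if (!defined $line_y || $line_y - $g->{y} > $tolerance) {
            push @lines, [];          # start a new line of text
            $line_y = $g->{y};
        }
        push @{ $lines[-1] }, $g;
    }

    # Within each line, re-sort by x and join the characters.
    return map {
        join '', map { $_->{ch} } sort { $a->{x} <=> $b->{x} } @$_
    } @lines;
}

print "$_\n" for reading_order(\@glyphs, 2);   # prints "Perl" then "PDF"
```

Reading the records in stored order would give "rPPDleF"; sorting by position recovers the two words. Dedicated extraction tools do considerably more work than this, which is why they tend to give better results than a naive dump of the content stream.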
Re: Extract text from PDF (normal text)
by leocharre (Priest) on Oct 15, 2009 at 20:01 UTC
Re: Extract text from PDF (normal text)
by xbmy (Friar) on Jun 09, 2010 at 21:33 UTC

    Try this, it works well for me, enjoy!

    use strict;
    use warnings;
    use CAM::PDF;
    use CAM::PDF::PageText;

    my $infile  = "?.pdf";      # the PDF file you want to extract text from
    my $outfile = "out.txt";

    # Append the extracted text to the output file
    open(my $out, '>>', $outfile) or die "Cannot open $outfile: $!";

    my $pdf = CAM::PDF->new($infile) or die "$CAM::PDF::errstr\n";

    # $p is the page number (pages are numbered from 1)
    foreach my $p (1 .. $pdf->numPages()) {
        my $str = $pdf->getPageText($p);
        next unless defined $str;       # skip pages with no extractable text
        CAM::PDF->asciify(\$str);       # clean up some non-ASCII characters
        print {$out} "$str\n";          # write to file
    }

    close($out) or die "Cannot close $outfile: $!";
Re: Extract text from PDF (normal text)
by LanX (Saint) on Jun 10, 2010 at 11:03 UTC
