Re: parse content of PDF file


by marto (Cardinal)

on Aug 03, 2007 at 13:55 UTC ( [id://630513] )

in reply to parse content of PDF file

Had they been converted to PDF via Acrobat (or suchlike) rather than scanned as images, I would have suggested looking at CAM::PDF. However, I think you are going to have to OCR each page of each document, since IIRC there won't be any (meaningful) text to parse within the PDF. You may want to start by looking at PDF::OCR (which IIRC uses Tesseract), or some other OCR module from CPAN.
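If you are not sure whether a given file is a pure scan, a rough check along these lines might help decide which pages need OCR. This is only a sketch: it assumes CAM::PDF is installed, the sub names needs_ocr and pages_needing_ocr are made up for illustration, and the 20-character threshold is an arbitrary guess rather than anything CAM::PDF prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical heuristic: a page "needs OCR" when its extracted text
# holds fewer than $min_chars word characters (default 20, an assumption).
sub needs_ocr {
    my ($text, $min_chars) = @_;
    $min_chars = 20 unless defined $min_chars;
    return 1 unless defined $text;
    my $count = () = $text =~ /\w/g;    # count word characters
    return $count < $min_chars ? 1 : 0;
}

# Return the page numbers of $file that appear to have no usable text layer.
sub pages_needing_ocr {
    my ($file) = @_;
    require CAM::PDF;                   # loaded only when a PDF is actually checked
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file\n";
    my @pages;
    for my $n (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($n);
        push @pages, $n if needs_ocr($text);
    }
    return @pages;
}

# Check every PDF named on the command line.
for my $file (@ARGV) {
    my @pages = pages_needing_ocr($file);
    print "$file: ",
        (@pages ? "pages to OCR: @pages" : "text layer found on every page"),
        "\n";
}
```

Run it as `perl check_pdf.pl scan1.pdf scan2.pdf` (the script and file names are placeholders). Pages it reports would then go through whichever OCR route you settle on.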

Check out the code.google page for tesseract-ocr

Update: Added link to tesseract-ocr

Hope this helps

Martin

Re^2: parse content of PDF file by archfool (Monk) on Aug 03, 2007 at 14:07 UTC
Cool! There is some software out there for OCR! I'm going to check it out myself! :)
