PDF Text

bmac has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: PDF Text by leocharre (Priest) on Jun 12, 2008 at 18:38 UTC
Check out my PDF::OCR. If you send me any feedback or requests, I tend to revise quickly. I also have FileArchiveIndexer, which does exactly what you mention: indexing the content of PDF files via OCR or text extraction. I would really love to work with someone else to get it to production level. It works well, and it lets you sync various machines on one network to share the indexing, which you will need if you're doing OCR on 60k docs.
Re: PDF Text by MidLifeXis (Monsignor) on Jun 12, 2008 at 18:04 UTC
Do a search on CPAN to see if you find anything useful there. CAM::PDF seems to have a couple of functions that might work. Extracting the layout from a PDF file into a text file might still be problematic, and it certainly will be if the page does not contain text at all but a graphic image of a page instead. You would need to use some sort of OCR solution then. --MidLifeXis
Re: PDF Text by marto (Cardinal) on Jun 12, 2008 at 18:10 UTC
Welcome to the Monastery. Take a look at the PerlMonks FAQ and "How do I post a question effectively?". This question gets asked quite frequently here; check out "Extracting content text from PDFs" or use Super Search for further examples. Hope this helps. Martin
Re: PDF Text by TGI (Parson) on Jun 12, 2008 at 18:45 UTC
Why write your own when you could use something like SWISH-E or ht://dig? TGI says moo
Re: PDF Text by radiantmatrix (Parson) on Jun 12, 2008 at 20:36 UTC
Why would you write this at all? There are a number of pre-existing solutions to searching for information inside PDFs; Google's Search Appliances, for example. Most of these solutions allow you to search quickly inside many types of document. It's got to be cheaper to buy an appliance than to spend your time building a search engine... especially since you're new to Perl. Searching is harder than it looks: let someone with way more resources than you solve the problem, and just use their solution! <–radiant.matrix–> Ramblings and references “A positive attitude may not solve all your problems, but it will annoy enough people to make it worth the effort.” — Herm Albright I haven't found a problem yet that can't be solved by a well-placed trebuchet
Re^2: PDF Text by leocharre (Priest) on Jun 12, 2008 at 21:09 UTC
Indexing and searching should be attacked as very separate problems. For example, in my situation there's not much out there to turn a few gigs of raw paper document scans into a searchable database. So my focus is on hacking together the indexing (hence FileArchiveIndexer). The search is iffy, but it's wide open for someone to reach in and work with it. I agree completely, searching is hard as all heck, and there are a lot of ways to do it. You can't do a project like this thinking 'indexing and searching pdf files'; you'll go ape with the details. It sounds simple, but... oh boy oh boy :-) I wouldn't discourage writing things like these from scratch. I would advise against it if possible, but... shucks, maybe this hacker will come up with something interesting. Or at least be humbled out of the roll-your-own idea next time!
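To illustrate the split between indexing and searching described above, here is a minimal sketch in Python. It is not how FileArchiveIndexer works; the function names, the sample documents, and the simple inverted-index approach are all assumptions made for illustration. The indexing step only records which words occur in which document, and the searching step is a separate function that can be replaced without re-indexing.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexing step: map each word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Searching step: return names of documents containing every query word."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical extracted text, standing in for OCR or pdftotext output.
docs = {
    "invoice.pdf": "Invoice for scanned paper documents",
    "memo.pdf": "Memo about the scanner and OCR settings",
}
index = build_index(docs)
```

Calling search(index, "scanned documents") looks up both words and intersects the results. A real search layer would also need ranking, stemming, and phrase handling, which is where most of the difficulty mentioned in this thread comes from.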
Re: PDF Text by hesco (Deacon) on Jun 13, 2008 at 02:24 UTC
I've not used it, but will underscore the recommendation for swish-e, based on what I've heard about it. But to answer your specific question, I use pdftotext to extract the ASCII text from a compliant PDF file. It's a command-line tool which is distributed with the xpdf reader application in many Linux distributions. It won't work on scanned images (for which that PDF::OCR sounds particularly interesting; I'll have to check that out, ++ and thanks!). But for folks who export editable documents to PDF, it works like a charm (though it is challenged a bit by multi-column content). -- Hugh if( $lal && $lol ) { $life++; }
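A rough sketch of the pdftotext approach above, in Python, assuming the pdftotext binary from xpdf is on the PATH. The function names and the 20-character threshold for guessing that a file is a scanned image are illustrative assumptions, not part of pdftotext itself. The only pdftotext features used are the -layout option and "-" as the output file, which writes the text to stdout.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for pdftotext, writing the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical page layout.
        cmd.append("-layout")
    # A "-" in place of the output file name sends the text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def looks_scanned(text, min_chars=20):
    """Heuristic (assumed threshold): almost no text suggests a scanned image."""
    return len("".join(text.split())) < min_chars


def extract_text(pdf_path):
    """Return the text of pdf_path, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

If looks_scanned() returns True for the output of extract_text(), the file is a candidate for an OCR pass instead, such as the PDF::OCR module mentioned earlier in the thread.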
Re^2: PDF Text by leocharre (Priest) on Jun 13, 2008 at 13:38 UTC
Something really interesting happened at my office. We scan in a lot of documents. Now, the machines are able to encode OCR into the PDF document created. This makes indexing the documents relatively easy. BUT guess what! They don't want to use the scanner's OCR tech, because they say it slows down scanning! Well, for five pages, who cares. But for 200-page documents? They have a point. So I have my thing run at night, collecting info and so on. That's why I needed muscle.

