Comparison word against pdf

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Comparison word against pdf by LanX (Saint) on Apr 16, 2013 at 19:18 UTC
If (!) it's possible to extract text¹ from the PDF: I use `pdftohtml -xml` and then parse the resulting XML by text position. See Parsing PDFs by text position?
Cheers Rolf ( addicted to the Perl Programming Language)
¹) Some PDF formats use internal bitmaps for fonts, such that only OCR would help.
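To illustrate what "parsing by text position" can look like, here is a minimal sketch in Python (not Perl, and not Rolf's actual code). It assumes the XML has the usual `pdftohtml -xml` shape, with `<page>` elements containing `<text>` elements that carry `top` and `left` attributes. The sample document, the function name `page_lines` and the `tolerance` parameter are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like `pdftohtml -xml` output (an assumption,
# not taken from a real PDF): each <text> element carries its position.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="300" width="60" height="15" font="0">world</text>
    <text top="130" left="100" width="90" height="15" font="0">second line</text>
    <text top="101" left="100" width="50" height="15" font="0"><b>Hello</b></text>
  </page>
</pdf2xml>"""


def page_lines(xml_text, tolerance=3.0):
    """Return {page number: [line, ...]} with fragments in reading order.

    Fragments whose `top` values differ by at most `tolerance` from the
    first fragment of a line are treated as belonging to that line.
    """
    root = ET.fromstring(xml_text)
    pages = {}
    for page in root.iter("page"):
        fragments = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b>/<i> children.
            text = "".join(node.itertext()).strip()
            if text:
                top = float(node.get("top"))
                left = float(node.get("left"))
                fragments.append((top, left, text))
        fragments.sort()  # top to bottom, then left to right

        lines = []
        line_top = None
        for top, left, text in fragments:
            if line_top is not None and abs(top - line_top) <= tolerance:
                lines[-1].append((left, text))
            else:
                lines.append([(left, text)])
                line_top = top
        pages[int(page.get("number"))] = [
            " ".join(text for _, text in sorted(line)) for line in lines
        ]
    return pages


if __name__ == "__main__":
    for number, lines in page_lines(SAMPLE_XML).items():
        print(number, lines)
```

The tolerance is there because fragments on the same visual line often have slightly different `top` values. A suitable value depends on the font size and would need tuning per document.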
Re: Comparison word against pdf by thezip (Vicar) on Apr 16, 2013 at 18:59 UTC
I've done some rudimentary parsing of PDFs using CAM::PDF's getPageText() method, but I was only able to handle files in PDF v1.4 format (I couldn't parse v1.5 or v1.6). I have not done anything similar with Word, but there must be something around that performs a similar extraction. Once you've extracted the text from each file, you'd need to write the comparator function.
What can be asserted without proof can be dismissed without proof. - Christopher Hitchens, 1949-2011
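As a rough idea of what such a comparator function might look like, here is a minimal sketch in Python using the standard `difflib` module (a Perl version could follow the same outline with a diff module from CPAN). It assumes both documents have already been extracted to plain text. The function names `words` and `compare_texts` are invented for the example. Comparing word sequences rather than raw lines is a deliberate choice, since line breaks and spacing will rarely match between Word and PDF.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into word tokens, ignoring layout."""
    return re.findall(r"\w+", text.lower())


def compare_texts(word_text, pdf_text):
    """Return (similarity ratio, list of differing word runs).

    Each difference is a tuple (tag, words_from_word_doc, words_from_pdf),
    where tag is 'replace', 'delete' or 'insert' as reported by difflib.
    """
    a, b = words(word_text), words(pdf_text)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs


if __name__ == "__main__":
    doc = "The quick brown fox\njumps over the lazy dog."
    pdf = "The quick  brown fox jumps over the\nsleepy dog."
    ratio, diffs = compare_texts(doc, pdf)
    print(round(ratio, 3), diffs)
```

Lowercasing and dropping punctuation make the comparison forgiving. If case or punctuation matters for your documents, the tokenizer in `words` is the place to change that.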
Re^2: Comparison word against pdf by hdb (Monsignor) on Apr 16, 2013 at 19:04 UTC
I just posted this, Re: Your lack of ambition is troubling, to extract text from Word.
Re: Comparison word against pdf by rpnoble419 (Pilgrim) on Apr 16, 2013 at 19:18 UTC
Because of how text is generated in a PDF file, this will be a next-to-impossible task. What may look like a complete word in the PDF file may actually be a combination of many letters or groups of letters. Also, text does not flow in the same manner as in Word. You can improve your chances of success if you know exactly how the PDF files were created and by what application. If you have access to Adobe Illustrator, you can import the PDF files and see how each page is constructed, which may give you insight into how to read the PDF objects to extract the text.
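To make the fragment problem concrete, here is a small sketch in Python of one possible workaround: gluing fragments back together based on their horizontal positions. It assumes you already have, for one line of text, a list of `(left, width, text)` tuples from some extractor. The function name `join_fragments` and the `max_gap` threshold are hypothetical, and the right threshold would depend on the font and on how the PDF was produced.

```python
def join_fragments(fragments, max_gap=2.0):
    """Join (left, width, text) fragments from one text line into a string.

    Fragments that nearly touch (gap <= max_gap) are concatenated directly,
    so a word split into pieces is restored; larger gaps become a space.
    """
    pieces = []
    prev_right = None
    for left, width, text in sorted(fragments):
        if prev_right is not None and left - prev_right > max_gap:
            pieces.append(" ")
        pieces.append(text)
        prev_right = left + width
    return "".join(pieces)


if __name__ == "__main__":
    # Made-up coordinates: "Comparison" stored as three touching pieces,
    # followed by "word" after a visible gap.
    line = [(148, 12, "on"), (100, 18, "Com"), (170, 25, "word"), (118, 30, "paris")]
    print(join_fragments(line))
```

This only helps when the extractor reports reliable positions and widths. With kerned or justified text the gaps vary, so the result should be treated as a best guess rather than a faithful reconstruction.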

