PerlMonks  

Comparison word against pdf

by Anonymous Monk
on Apr 16, 2013 at 18:37 UTC ( [id://1028985]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am having a tough time doing a word-for-word check of a PDF against a Word document. I have not been able to figure out how to handle the PDF, as it is a highly complex binary format, and Word is an entirely different thing from PDF. Can someone please suggest how to do this? Thanks in advance.

Replies are listed 'Best First'.
Re: Comparison word against pdf
by LanX (Saint) on Apr 16, 2013 at 19:18 UTC
    If (!) it's possible to extract text¹ from the PDF, I use pdftohtml -xml and then parse the resulting XML by text position.

    see Parsing PDFs by text position?

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    ¹) Some PDFs use internal bitmaps for fonts, so that only OCR would help.
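
As a rough illustration of the pdftohtml -xml approach described above, here is a minimal sketch, not the poster's actual code. It assumes the XML contains `<text>` elements carrying `top` and `left` attributes, which is how pdftohtml normally lays out its XML output. The sample XML, the sub names `text_fragments` and `lines_by_position`, and the tolerance of 2 units are made up for the example. A regex stands in for a real XML parser to keep the sketch dependency-free, and XML entities are not decoded.

```perl
use strict;
use warnings;

# Hypothetical sample of `pdftohtml -xml` output; a real file would be read from disk.
my $xml = <<'XML';
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="140" left="200" width="60" height="16" font="0">world</text>
<text top="100" left="100" width="80" height="16" font="0">First line</text>
<text top="141" left="100" width="50" height="16" font="0">Hello</text>
</page>
XML

# Collect top, left and text for every <text> element.
# A regex is enough for this flat sample; a proper XML parser is safer for real files.
sub text_fragments {
    my ($xml) = @_;
    my @frags;
    while ($xml =~ m{<text\s+([^>]*)>(.*?)</text>}gs) {
        my ($attrs, $body) = ($1, $2);
        my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
        $body =~ s/<[^>]+>//g;    # drop inline markup such as <b> or <i>
        push @frags, { top => $attr{top}, left => $attr{left}, text => $body };
    }
    return @frags;
}

# Group fragments into lines: walk them top to bottom and start a new line
# whenever `top` is more than $tolerance units below the current line's start.
# Each line is then ordered left to right.
sub lines_by_position {
    my ($tolerance, @frags) = @_;
    my @sorted = sort { $a->{top} <=> $b->{top} } @frags;
    my (@lines, $line_top);
    for my $f (@sorted) {
        if (!@lines || $f->{top} - $line_top > $tolerance) {
            push @lines, [];
            $line_top = $f->{top};
        }
        push @{ $lines[-1] }, $f;
    }
    my @out;
    for my $line (@lines) {
        my @ordered = sort { $a->{left} <=> $b->{left} } @$line;
        push @out, join ' ', map { $_->{text} } @ordered;
    }
    return @out;
}

print "$_\n" for lines_by_position(2, text_fragments($xml));
# prints:
# First line
# Hello world
```

The tolerance matters because fragments on the same visual line often differ by a unit or two in `top`; the right value depends on the font size and would need tuning per document.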

Re: Comparison word against pdf
by thezip (Vicar) on Apr 16, 2013 at 18:59 UTC

    I've done some rudimentary parsing of PDFs using CAM::PDF's getPageText() method, but I was only able to deal with files in PDF v1.4 format (I couldn't parse v1.5 and v1.6).

    I have not done anything similar in Word, but there must be something around that performs a similar extraction function.

    Once you've extracted the text from each file, you'd need to write the comparator function.
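
One possible shape for such a comparator, as a sketch only: it assumes both documents have already been reduced to plain text strings, lower-cases and tokenizes them, and reports words that appear on only one side using a longest-common-subsequence table. The sub names `words` and `word_diff` and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Lower-case the text and split it into words, ignoring punctuation and line breaks.
sub words {
    my ($text) = @_;
    return map { lc $_ } $text =~ /\w+(?:['-]\w+)*/g;
}

# Compare two word lists. Returns "- word" for words only in the first list
# and "+ word" for words only in the second, in document order.
sub word_diff {
    my ($first, $second) = @_;
    my @x = @$first;
    my @y = @$second;

    # $len[$i][$j] is the LCS length of @x[$i..] and @y[$j..].
    my @len;
    $len[$_] = [ (0) x (@y + 1) ] for 0 .. scalar(@x);
    for my $i (reverse 0 .. $#x) {
        for my $j (reverse 0 .. $#y) {
            if ($x[$i] eq $y[$j]) {
                $len[$i][$j] = $len[$i + 1][$j + 1] + 1;
            }
            else {
                my ($down, $right) = ($len[$i + 1][$j], $len[$i][$j + 1]);
                $len[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start, emitting the words that fall outside the LCS.
    my ($i, $j) = (0, 0);
    my @report;
    while ($i < @x && $j < @y) {
        if ($x[$i] eq $y[$j]) { $i++; $j++; }
        elsif ($len[$i + 1][$j] >= $len[$i][$j + 1]) { push @report, "- $x[$i]"; $i++; }
        else { push @report, "+ $y[$j]"; $j++; }
    }
    push @report, "- $_" for @x[$i .. $#x];
    push @report, "+ $_" for @y[$j .. $#y];
    return @report;
}

my @from_word = words("The quick brown fox.");
my @from_pdf  = words("the quick\nred fox");
print "$_\n" for word_diff(\@from_word, \@from_pdf);
# prints:
# - brown
# + red
```

The table needs memory proportional to the product of the two word counts, so for long documents a CPAN module such as Algorithm::Diff would likely be a better fit than this hand-rolled version.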


    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens, 1949-2011
Re: Comparison word against pdf
by rpnoble419 (Pilgrim) on Apr 16, 2013 at 19:18 UTC

    Because of how text is generated in a PDF file, this will be a next-to-impossible task. What may look like a complete word in the PDF file may actually be a combination of many letters or groups of letters. Also, text does not flow in the same manner as in Word.

    You can improve your chances of success if you know exactly how the PDF files were created and by what application. If you have access to Adobe Illustrator, you can import the PDF files and see how each page is constructed, which may give you insight into how to read the PDF objects to extract the text.
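
To make the fragmented-word problem described above concrete, here is a small sketch of one heuristic for gluing pieces back together. It assumes each fragment on a line comes with a horizontal position and width (the `left`, `width` and `text` keys, the sub name `join_fragments`, the sample data and the gap threshold of 2 units are all invented for the example). Fragments separated by a small gap are treated as parts of one word; larger gaps become spaces.

```perl
use strict;
use warnings;

# Join the fragments of one text line into a string. Fragments whose
# horizontal gap is at most $max_gap units are concatenated without a space.
sub join_fragments {
    my ($max_gap, @frags) = @_;
    my @sorted = sort { $a->{left} <=> $b->{left} } @frags;
    my $text = '';
    my $prev_right;
    for my $f (@sorted) {
        if (defined $prev_right && $f->{left} - $prev_right > $max_gap) {
            $text .= ' ';
        }
        $text .= $f->{text};
        $prev_right = $f->{left} + $f->{width};
    }
    return $text;
}

# Hypothetical fragments: one word split into three pieces, then a second word.
my @line = (
    { left => 100, width => 18, text => 'Com' },
    { left => 118, width => 30, text => 'pari' },
    { left => 149, width => 16, text => 'son' },
    { left => 172, width => 14, text => 'of' },
);
print join_fragments(2, @line), "\n";    # prints "Comparison of"
```

A fixed threshold is only a starting point: kerning, justified text and differing font sizes all change what counts as a word gap, so the value would need to be tuned for the PDFs at hand.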
