http://qs321.pair.com?node_id=1211124

markong has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I need to build a book index (the book is in PDF format version 1.6) and I'm facing some problems related to the way encodings are "specified" in a PDF file. I've found CAM-PDF to be very useful for extracting the PDF content in textual form, so I'm following these steps to build the index:

  1. Extract each page's textual content (with CAM::PDF::getPageText()) saving in a single file.
  2. For each keyword, scan each file to see if there are any matches and save the results.
Despite the linearity of the solution, I've noticed that some characters aren't as expected when extracted: for instance the apostrophe character "'" is displayed as a question mark on my terminal (which uses UTF-8), and playing with binmode and some other encoding (Latin-1) did not help.
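
The two steps above can be sketched roughly as follows. This is a minimal sketch, not the poster's actual script: the helper names `extract_pages` and `keyword_pages` are made up for illustration, and CAM::PDF is loaded lazily so that the scanning part can be tried on plain strings without a PDF at hand.

```perl
use strict;
use warnings;

# Step 1: return one string per page of the PDF.
# CAM::PDF is only required when this sub is actually called.
sub extract_pages {
    my ($file) = @_;
    require CAM::PDF;
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    return map { $pdf->getPageText($_) // '' } 1 .. $pdf->numPages();
}

# Step 2: map each keyword to the list of 1-based page numbers
# on which it occurs as a whole word (case-insensitive).
sub keyword_pages {
    my ($pages, $keywords) = @_;
    my %index;
    for my $kw (@$keywords) {
        my $re = qr/\b\Q$kw\E\b/i;
        $index{$kw} = [ grep { $pages->[ $_ - 1 ] =~ $re } 1 .. @$pages ];
    }
    return \%index;
}
```

Note that the extracted text contains line breaks in the middle of sentences, so a multi-word keyword may be missed unless whitespace is normalized first.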

I've practically zero knowledge of the inner workings of the PDF format, and I have a tight schedule at the moment, so I have no time to wade through the 700+ pages of the PDF spec looking for how PDF stores "plain" text; despite this, knowing that PDFs are binary files, they "should" not encode text in any particular form, but they probably pack all the information in some sort of "structure". From what I've read about PDF files, they usually embed fonts and then map single glyphs to "bytes", usually resulting in some sort of custom encoding. This would explain why I see some characters (e.g. apostrophes and prolonged dashes) as gibberish. It would seem that the PDF at hand maps letters to ASCII while the rest of the ASCII chars are somehow mapped to custom bytes. For instance, the following is the hexdump of an extract of the file containing the sentence "The developer, on the other hand, feels like he’s interrupted several times a day for meetings, ":

00000000  54 68 65 20 64 65 76 65 6c 6f 70 65 72 2c 20 6f  |The developer, o|
00000010  6e 20 74 68 65 20 6f 74 68 65 72 20 68 61 6e 64  |n the other hand|
00000020  2c 20 66 65 65 6c 73 20 6c 69 6b 65 20 68 65 80  |, feels like he.|
00000030  73 20 69 6e 74 65 72 72 75 70 74 65 64 20 73 65  |s interrupted se|
00000040  76 65 72 61 6c 20 74 69 6d 65 73 20 61 20 64 61  |veral times a da|
00000050  79 20 66 6f 72 0a 6d 65 65 74 69 6e 67 73 2c 20  |y for.meetings, |
00000060  77 68 69 63                                      |whic|

and as you can see in the row at offset 00000020 (its last byte), the apostrophe is extracted by CAM::PDF as 0x80, which if I recall well is the EURO sign in ASCII.

My question is then: how can I solve the encoding thing? The keywords to index usually include only letters, but some could have dashes and anyway it feels a little dirty to match a text encoded in a custom/unknown format.

Do you know whether PDFs carry the encoding information somewhere, and of any GNU/Linux tool that can inspect a PDF and extract it (assuming the encoding is not custom)? I've seen the suggestion to open the file in Acrobat Reader (Win), but usually all that is shown is "Custom encoding".

Given the situation, I am thinking of scanning the extracted textual content for any byte which is *not* an ASCII letter and, if it is not an ASCII byte at all, replacing it with the proper ASCII byte value.
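
A first step for that plan could be an inventory of the bytes that fall outside 7-bit ASCII, to see how many distinct "custom" bytes the extraction actually produces. The sketch below is only an illustration; the function name `non_ascii_histogram` is made up, and real input should be read from the extracted file with the `:raw` layer so that Perl sees plain bytes.

```perl
use strict;
use warnings;

# Count every byte outside the 7-bit ASCII range in a raw byte string.
# Returns a hash ref: byte value (as a number) => number of occurrences.
sub non_ascii_histogram {
    my ($bytes) = @_;
    my %count;
    $count{ ord $1 }++ while $bytes =~ /([^\x00-\x7F])/g;
    return \%count;
}

# Sample taken from the hexdump above: the apostrophe arrives as 0x80.
my $sample = "feels like he\x80s interrupted";
my $hist   = non_ascii_histogram($sample);
printf "0x%02X seen %d time(s)\n", $_, $hist->{$_}
    for sort { $a <=> $b } keys %$hist;
```

If the list of distinct bytes turns out to be short, replacing each one by hand becomes realistic.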

As a side question: Text-Index is very helpful for the indexing phase, but it lacks a feature to weight the matches for each page (e.g. a given keyword matches 10 times on page 1 but only 2 times on page 10). Does anybody know if there's something on CPAN to help with this? I feel very lazy :).
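
For what it's worth, the weighting itself needs little more than a nested hash in core Perl. The following is a sketch under the assumption that the page texts are already available as an array of strings; `keyword_weights` and `ranked_pages` are hypothetical names, not part of Text-Index or any CPAN module.

```perl
use strict;
use warnings;

# Count case-insensitive whole-word matches of each keyword on each page.
# $pages is an array ref of page texts; the result is
# $counts->{keyword}{page_number} = number of matches (pages with 0 are omitted).
sub keyword_weights {
    my ($pages, $keywords) = @_;
    my %counts;
    for my $kw (@$keywords) {
        my $re = qr/\b\Q$kw\E\b/i;
        for my $n (1 .. @$pages) {
            my $hits = () = $pages->[ $n - 1 ] =~ /$re/g;
            $counts{$kw}{$n} = $hits if $hits;
        }
    }
    return \%counts;
}

# Page numbers for one keyword, most matches first (ties by page number).
sub ranked_pages {
    my ($counts, $kw) = @_;
    my $by_page = $counts->{$kw} || {};
    return sort { $by_page->{$b} <=> $by_page->{$a} || $a <=> $b }
           keys %$by_page;
}
```

Calling `ranked_pages($counts, 'keyword')` in list context then gives the pages in the order an index entry might list them.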

Replies are listed 'Best First'.
Re: Build a PDF book index
by thanos1983 (Parson) on Mar 17, 2018 at 12:11 UTC
Re: Build a PDF book index
by poj (Abbot) on Mar 17, 2018 at 11:58 UTC
        Thank you, this tool extracted the text contents successfully, with apostrophes and (prolonged) dashes encoded as Latin-1!
Re: Build a PDF book index
by LanX (Saint) on Mar 17, 2018 at 11:49 UTC
    Tl;dr, but

    >  I've noticed that some characters aren't as expected when extracted: 

    PDF allows embedding its own fonts, and the encoding of characters is then sometimes arbitrary.

    You can solve it for a specific PDF document only by scanning the affected font number and manually building a translation table into a hash.
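
A translation table of that kind could look like the sketch below. It is deliberately simplified: it ignores fonts and works on the extracted byte stream as a whole, and only the 0x80 entry is taken from the hexdump in the question. Any further entries would have to be found by comparing the extracted text with the rendered page. The names `%glyph_for` and `translate_bytes` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical table for one specific document:
# extracted byte => intended character.
my %glyph_for = (
    "\x80" => "'",    # the only mapping known from the hexdump
);

# Replace every non-ASCII byte that has an entry in the table;
# bytes without an entry are left untouched so they stay visible.
sub translate_bytes {
    my ($bytes, $table) = @_;
    (my $out = $bytes) =~
        s/([^\x00-\x7F])/exists $table->{$1} ? $table->{$1} : $1/ge;
    return $out;
}

print translate_bytes("feels like he\x80s interrupted", \%glyph_for), "\n";
```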

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: Build a PDF book index
by choroba (Cardinal) on Mar 17, 2018 at 11:50 UTC
    > 0x80, which if I recall well is the EURO sign in ASCII.

    ASCII is 7-bit, so 0x80 doesn't exist in it. Also, ASCII is much older than Euro (1963 versus 1996).

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      I meant Extended ASCII or ISO Latin‑1
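
To make the distinction in this subthread concrete, the core Encode module can show what the byte 0x80 means in each of the encodings mentioned. This is just an illustrative check; it says nothing about which encoding, if any, the PDF's font actually uses. It may also hint at why a UTF-8 terminal shows a question mark: a lone 0x80 is a continuation byte and not valid UTF-8 on its own.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\x80";

# Windows-1252 (often what is meant by "Extended ASCII") has the euro sign here.
printf "cp1252:     U+%04X\n", ord decode('cp1252', $byte);

# ISO-8859-1 maps each byte to the code point of the same value,
# which for 0x80 is a C1 control character, not a printable sign.
printf "iso-8859-1: U+%04X\n", ord decode('iso-8859-1', $byte);

# Strict ASCII decoding rejects the byte. A copy is passed because
# decode() may modify its input when a CHECK argument is given.
my $copy = $byte;
my $ok   = eval { decode('ascii', $copy, Encode::FB_CROAK); 1 };
print "ascii:      ", ($ok ? "decoded" : "not a valid ASCII byte"), "\n";
```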