Dear Monks,

I need to build a book index (the book is a PDF, format version 1.6) and I'm facing some problems related to the way encodings are "specified" in a PDF file. I've found CAM-PDF very useful for extracting the PDF content in textual form, so I'm following these steps to build the index:

  1. Extract each page's textual content (with CAM::PDF::getPageText()), saving each page in its own file.
  2. For each keyword, scan each file to see if there are any matches and save the results.
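
A minimal sketch of step 2, assuming the page text is already in memory. The %page_text hash and its sample sentences are hypothetical stand-ins for the per-page files; in the real script each value would come from getPageText(). Keywords are matched as word prefixes, so "meeting" also finds "meetings".

```perl
use strict;
use warnings;

# Hypothetical stand-in for step 1: page number => extracted text.
# In the real script each value would come from CAM::PDF's getPageText().
my %page_text = (
    1 => "The developer attends meetings. Meetings interrupt the developer.",
    2 => "A manager schedules one meeting per day.",
);

my @keywords = ('developer', 'meeting');

# Step 2: for each keyword, record the pages on which it matches.
my %index;
for my $kw (@keywords) {
    for my $page (sort { $a <=> $b } keys %page_text) {
        push @{ $index{$kw} }, $page
            if $page_text{$page} =~ /\b\Q$kw\E/i;
    }
}

# Print one line per keyword with its page list.
for my $kw (sort keys %index) {
    print "$kw: ", join(', ', @{ $index{$kw} }), "\n";
}
```

The \Q...\E pair quotes the keyword, so keywords containing dashes or dots are matched literally rather than as regex syntax.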
Despite the simplicity of this approach, I've noticed that some characters aren't as expected when extracted: for instance, the apostrophe character "'" is displayed as a question mark on my terminal (which uses UTF-8), and playing with binmode and some other encoding (Latin-1) did not help.

I have practically zero knowledge of the inner workings of the PDF format and a tight schedule at the moment, so I have no time to wade through the 700+ pages of the PDF spec looking for how PDF stores "plain" text. Still, since PDFs are binary files, they "should" not encode text in any particular form, but probably pack all the information into some sort of "structure". From what I've read about PDF files, they usually embed fonts and then map single glyphs to "bytes", which usually results in some sort of custom encoding. This would explain why some characters (e.g. apostrophes and long dashes) come out as gibberish. It would seem that the PDF at hand maps letters to their ASCII codes, while the remaining characters are somehow mapped to custom bytes. For instance, the following is the hexdump of an extract of the file containing the sentence "The developer, on the other hand, feels like he’s interrupted several times a day for meetings, "

00000000  54 68 65 20 64 65 76 65 6c 6f 70 65 72 2c 20 6f  |The developer, o|
00000010  6e 20 74 68 65 20 6f 74 68 65 72 20 68 61 6e 64  |n the other hand|
00000020  2c 20 66 65 65 6c 73 20 6c 69 6b 65 20 68 65 80  |, feels like he.|
00000030  73 20 69 6e 74 65 72 72 75 70 74 65 64 20 73 65  |s interrupted se|
00000040  76 65 72 61 6c 20 74 69 6d 65 73 20 61 20 64 61  |veral times a da|
00000050  79 20 66 6f 72 0a 6d 65 65 74 69 6e 67 73 2c 20  |y for.meetings, |
00000060  77 68 69 63                                      |whic|

As you can see in the row at offset 00000020 (last byte, offset 0x2f), CAM::PDF extracts the apostrophe as 0x80. That is not an ASCII code at all (ASCII stops at 0x7F); if I recall correctly, 0x80 is the Euro sign in Windows-1252.
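
To see which other bytes are affected, one option is to survey the extracted text for bytes outside 7-bit ASCII and print a little context for each. This is only a sketch: the $bytes sample reproduces the fragment from the hexdump, and in practice it would be the raw content of one of the extracted files.

```perl
use strict;
use warnings;

# Raw bytes as extracted; this sample reproduces the hexdump fragment.
my $bytes = "feels like he\x80s interrupted";

# Count every byte outside 7-bit ASCII and keep one context snippet for each.
my (%count, %context);
while ($bytes =~ /([\x80-\xFF])/g) {
    my $byte  = $1;
    my $pos   = pos($bytes) - 1;            # offset of the matched byte
    my $start = $pos >= 5 ? $pos - 5 : 0;   # up to 5 bytes of leading context
    $count{$byte}++;
    $context{$byte} //= substr($bytes, $start, 11);
}

# Report each byte value, replacing unprintable bytes in the snippet with '?'.
for my $byte (sort keys %count) {
    (my $shown = $context{$byte}) =~ s/[^\x20-\x7E]/?/g;
    printf "0x%02X seen %d time(s), e.g. \"%s\"\n",
        ord($byte), $count{$byte}, $shown;
}
```

The context snippet is what lets a human guess the intended character; the byte value alone says nothing if the font uses a custom encoding.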

My question is then: how can I solve the encoding problem? The keywords to index usually include only letters, but some could have dashes, and in any case it feels a little dirty to match against text encoded in a custom/unknown format.

Do you know whether PDFs carry the encoding information somewhere, and whether there is any GNU/Linux tool to inspect the PDF and extract it (assuming the encoding is not custom)? I've seen the suggestion to open the file in Acrobat Reader (Win), but usually all that is shown is "Custom encoding".

Given the situation, I am thinking of scanning the extracted textual content for any byte that is *not* in the ASCII range and replacing it with the proper ASCII byte value.
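
A rough sketch of that replacement idea. The %replacement table is an assumption: only the 0x80 => apostrophe entry is backed by the hexdump in this post, and any further entries would have to be found by inspecting the text. The helper name to_ascii is made up for illustration.

```perl
use strict;
use warnings;

# Byte => ASCII replacement. Only 0x80 => apostrophe comes from the hexdump;
# other entries would have to be added after inspecting the extracted text.
my %replacement = (
    "\x80" => "'",
);

# Hypothetical helper: map known non-ASCII bytes to ASCII and render unknown
# ones as a literal \xNN escape so they stay visible and can be mapped later.
sub to_ascii {
    my ($bytes) = @_;
    $bytes =~ s/([\x80-\xFF])/
        exists $replacement{$1} ? $replacement{$1}
                                : sprintf('\\x%02X', ord $1)
    /gex;
    return $bytes;
}

print to_ascii("feels like he\x80s interrupted"), "\n";
```

Leaving unmapped bytes visible instead of dropping them means a keyword match can never silently succeed on text that was altered by guesswork.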

As a side question: Text-Index is very helpful for the indexing phase, but it lacks a feature to weight the matches for each page (e.g. a given keyword matches 10 times on page 1 but only 2 times on page 10). Does anybody know if there's something on CPAN to help with this? I feel very lazy :).
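
For what it's worth, a hand-rolled version of the weighting could look like the sketch below. It does not use Text-Index at all; %page_text, its sample text, and the page_weights helper are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical page number => extracted text.
my %page_text = (
    1  => "index index index entry",
    10 => "an index of entries",
);

# Count whole-word matches of a keyword on every page; pages without
# a match are left out of the result.
sub page_weights {
    my ($keyword, $pages) = @_;
    my %weight;
    for my $page (keys %$pages) {
        # The empty-list assignment forces list context and yields the count.
        my $n = () = $pages->{$page} =~ /\b\Q$keyword\E\b/gi;
        $weight{$page} = $n if $n;
    }
    return \%weight;
}

my $w = page_weights('index', \%page_text);

# Pages with the most matches first, ties broken by page number.
for my $page (sort { $w->{$b} <=> $w->{$a} || $a <=> $b } keys %$w) {
    print "page $page: $w->{$page} match(es)\n";
}
```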


In reply to Build a PDF book index by markong
