
Parsing PDFs by text position?

by LanX (Saint)
on Mar 26, 2010 at 16:33 UTC (#831190=perlquestion)

LanX has asked for the wisdom of the Perl Monks concerning the following question:


I'm trying to parse PDFs of account balances.

At the moment I'm piping them through pdftotext -layout to get a text representation that respects the positions, since the fields are in different columns.

Unfortunately this is turning out hairier than I thought, and now I'm wondering if I'm reinventing a CPAN wheel I can't find ...

So are there modules to parse PDFs (or texts) by clipping positions?

And for texts, is there anything to reverse the effect of format?

Cheers Rolf


Actually I have two problems:

a) getting the precise word positions,

since pdftohtml -xml doesn't split the text at every whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12.    0036 Kartenverfüg                                                  39,75 -</text>

b) defining two-dimensional scan templates (reversing format)
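
For problem a), one rough workaround is to split the content of each <text> element on whitespace and interpolate a left offset for every token. The sketch below assumes that all characters in the element share one average width, which is only approximately true for proportional fonts. token_positions and its arguments are hypothetical names, not part of pdftohtml or any CPAN module.

```perl
use strict;
use warnings;

# Estimate where each whitespace-separated token starts inside one <text>
# element. ASSUMPTION: every character has the same (average) width.
# $content should be a decoded character string so that length() counts
# characters rather than bytes (think of the "ü" in the sample line).
sub token_positions {
    my ($left, $width, $content) = @_;
    return unless length $content;            # nothing to split
    my $char_w = $width / length $content;    # average width per character
    my @tokens;
    while ($content =~ /(\S+)/g) {
        push @tokens, {
            text => $1,
            left => $left + $-[1] * $char_w,  # $-[1] = offset of the match
        };
    }
    return @tokens;
}

# Usage with made-up values shaped like the left/width attributes above.
my @tokens = token_positions(33, 60, 'ab  cd');
printf "%-4s left=%.1f\n", $_->{text}, $_->{left} for @tokens;
```

The estimated offsets are only as good as the uniform-width assumption, so they are best used to assign tokens to coarse column ranges rather than as exact coordinates.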

I already got pretty far, but I was wondering if there is a recommended way to do it...
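
As for reversing format on the pdftotext -layout output, Perl's built-in unpack with "A" fields is one way to cut a line back into fixed-width columns. This is only a sketch: the field names and column widths are invented for illustration and would have to be measured from the real statements, and parse_line is a hypothetical helper.

```perl
use strict;
use warnings;

# HYPOTHETICAL column layout: field names and their widths in characters.
my @names  = qw(booked valued code text amount);
my @widths = (7, 10, 5, 20, 10);

# Build an unpack template such as "A7 A10 A5 A20 A10".
my $template = join ' ', map { "A$_" } @widths;

# Cut one layout line into named fields.
sub parse_line {
    my ($line) = @_;
    my @fields = unpack $template, $line;
    s/^\s+// for @fields;       # "A" strips trailing blanks, not leading ones
    my %rec;
    @rec{@names} = @fields;     # hash slice: pair names with values
    return \%rec;
}

# Usage: build a sample line with the same widths, then take it apart again.
my $line = sprintf '%-7s%-10s%-5s%-20s%10s',
    '28.12.', '28.12.', '0036', 'Kartenverfueg', '39,75 -';
my $rec = parse_line($line);
print "$_: $rec->{$_}\n" for @names;
```

A single template only covers one line shape; a two-dimensional scan template would need one such layout per line type plus some rule for choosing between them.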

Other threads about PDF parsing:

* How to parse PDF

* PDF Parsing

* Re: parse content of PDF file

BTW: It's not an OCR issue; I can get all the characters ...

Replies are listed 'Best First'.
Re: Parsing PDFs by text position?
by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
    > I'm trying to parse PDFs of account balances.

    Where did this crazy requirement come from?

      What does the PDF file look like when it is converted to text? If the fields are separated by tabs or conspicuous runs of spaces, you can write them out as an XLS sheet with SpreadSheet::Wright and then handle the data easily.

Node Type: perlquestion [id://831190]
Approved by marto
Front-paged by Old_Gray_Bear