Parsing PDFs by text position?

LanX has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse PDFs of account balances.

At the moment I'm piping them through pdftotext -layout to get a text representation that preserves the positions, since the fields are in different columns.

Unfortunately this is getting hairier than I expected, and now I'm wondering if I'm reinventing a CPAN wheel I can't find ...

So are there modules to parse PDFs (or text) by clipping positions?

And for text, is there anything to reverse the effect of Perl's format?

Cheers Rolf

Actually I have two problems:

a) to get the precise word positions,

since pdftohtml -xml doesn't split the text at every whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>

b) defining 2 dimensional scan templates (reversing format)
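
For problem a), one rough workaround is to split each `<text>` element at whitespace and interpolate a left position for every word from the element's `left` and `width` attributes. The sketch below is in Python for brevity (the idea ports directly to Perl) and rests on a loud assumption: all characters are treated as equally wide, which is only approximately true for proportional fonts. The function name `estimate_word_positions` is made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

def estimate_word_positions(text_xml):
    """Return (word, estimated_left) pairs for one <text> element.

    Assumption: every character is equally wide, so the positions
    are estimates, not exact coordinates.
    """
    elem = ET.fromstring(text_xml)
    # itertext() also collects text nested in child tags such as <b> or <i>.
    content = "".join(elem.itertext())
    if not content:
        return []
    left = int(elem.get("left"))
    width = int(elem.get("width"))
    char_width = width / len(content)
    return [(m.group(), round(left + m.start() * char_width))
            for m in re.finditer(r"\S+", content)]

line = ('<text top="239" left="33" width="491" height="7" font="2">'
        '28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>')
for word, x in estimate_word_positions(line):
    print(word, x)
```

The estimated positions are probably good enough to assign words to columns that are far apart, but not to tell apart columns that sit close together.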

I already got pretty far, but I was wondering if there is a recommended way to do it...
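
For problem b), a two-dimensional scan template can be sketched as a set of named rectangles that are clipped out of the `pdftotext -layout` output. This is a minimal Python illustration under stated assumptions, not a recommended module: the helper names `clip` and `apply_template`, the field names, and the row and column numbers are all hypothetical, and the page is assumed to be a list of text lines.

```python
def clip(lines, top, bottom, left, right):
    """Cut the rectangle rows [top, bottom) x columns [left, right) out of the page."""
    return [line[left:right].strip() for line in lines[top:bottom]]

def apply_template(lines, template):
    """Map each field name to the text found inside its rectangle."""
    return {name: clip(lines, *box) for name, box in template.items()}

# Hypothetical page layout: a title on row 0, a booking on row 2,
# with the description in columns 0-39 and the amount from column 40 on.
page = [
    "Account statement",
    "",
    "28.12. 28.12. 0036 Kartenverfüg".ljust(40) + "39,75 -",
]

# Each box is (top, bottom, left, right) in rows and character columns.
template = {
    "title":       (0, 1, 0, 40),
    "description": (2, 3, 0, 40),
    "amount":      (2, 3, 40, 50),
}

fields = apply_template(page, template)
print(fields["amount"])
```

Lines shorter than a rectangle simply yield empty strings, so ragged right margins do not need special handling.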

other threads about pdf parsing are:

BTW: It's not an OCR issue, I can get all characters ...


Re: Parsing PDFs by text position? by djp (Hermit) on Mar 28, 2010 at 11:02 UTC
> I'm trying to parse PDFs of account balances.

Where did this crazy requirement come from?
Re^2: Parsing PDFs by text position? by deep3101 (Acolyte) on Jun 01, 2011 at 02:05 UTC
What does the PDF look like once it is converted to text? If the fields are separated by tabs or conspicuous runs of spaces, you can use those separators to write the data out as an XLS sheet with Spreadsheet::Wright, and then handle it easily.
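
A minimal sketch of that separator-based approach, in Python for brevity: it splits each line at tabs or runs of two or more spaces and writes the cells out as CSV. CSV from the standard library is swapped in here for the spreadsheet module, and the function names `split_columns` and `rows_to_csv` are hypothetical. It assumes that single spaces only ever occur inside a field, which will not hold for every statement layout.

```python
import csv
import io
import re

def split_columns(line):
    """Split a line into cells at tabs or runs of two or more whitespace characters."""
    return re.split(r"\s{2,}|\t", line.strip())

def rows_to_csv(lines):
    """Convert non-empty text lines to CSV text, one row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for line in lines:
        if line.strip():
            writer.writerow(split_columns(line))
    return buffer.getvalue()

sample = ["28.12.  28.12.   0036 Kartenverfüg     39,75 -", ""]
print(rows_to_csv(sample))
```

Note that the amount contains a comma, so the CSV writer quotes that cell; any spreadsheet program should read it back as a single field.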