PerlMonks |
Hi
I'm trying to parse PDFs of account balances. ATM I'm piping them through pdftotext -layout to get a text representation that respects the positions, since the fields are in different columns. Unfortunately this is getting hairier than I thought, and now I'm wondering if I'm reinventing a CPAN wheel I can't find ... So are there modules to parse PDFs (or texts) by clipping positions? And for texts, is there anything to reverse the effect of format?
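To make the "clipping positions" idea concrete, here is a minimal sketch that cuts one line of pdftotext -layout output into named fields with substr. The column names, offsets and widths in @columns are made up for illustration (a real statement needs its own measurements), and clip_line is a hypothetical helper, not something from CPAN.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical column layout: [ name, start offset, width ], in characters.
# These numbers are assumptions for this example, not measured from a real PDF.
my @columns = (
    [ booked => 0,  8  ],
    [ valued => 8,  8  ],
    [ code   => 16, 6  ],
    [ text   => 22, 22 ],
    [ amount => 44, 10 ],
    [ sign   => 55, 1  ],
);

# Cut one text line into a hash of trimmed fields.
sub clip_line {
    my ($line) = @_;
    my %row;
    for my $col (@columns) {
        my ($name, $start, $width) = @$col;
        # Lines that end before a column yield an empty field
        # instead of a "substr outside of string" warning.
        my $field = $start < length $line ? substr($line, $start, $width) : '';
        $field =~ s/^\s+|\s+$//g;
        $row{$name} = $field;
    }
    return \%row;
}

# Build a sample line that matches the assumed layout above.
my $line = sprintf '%-8s%-8s%-6s%-22s%10s %s',
    '28.12.', '28.12.', '0036', 'Kartenverfüg', '39,75', '-';

my $row = clip_line($line);
binmode STDOUT, ':encoding(UTF-8)';
print "$_: $row->{$_}\n" for map { $_->[0] } @columns;
```

Keeping the layout in a data structure rather than in hard-coded substr calls means a second statement format only needs a second @columns table.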
Cheers Rolf

Actually I have two problems:

a) getting the precise word positions, since pdftohtml -xml doesn't break the text up at every whitespace:

<text top="239" left="33" width="491" height="7" font="2">28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>

b) defining 2-dimensional scan templates (reversing format)

I already got pretty far, but I was wondering if there is a recommended way to do it...

Other threads about PDF parsing are:

* Re: parse content of PDF file

BTW: It's not an OCR issue, I can get all characters ...

In reply to Parsing PDFs by text position? by LanX
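For problem a), one crude workaround is to estimate each word's left position from the attributes of the enclosing text element. The sketch below assumes every character in the element is equally wide, which only holds for monospaced fonts, and it ignores XML entities in the text. estimate_words is a hypothetical helper, and the regex only handles a single text element like the one quoted above.

```perl
use strict;
use warnings;
use utf8;

# Estimate per-word positions inside one <text> element of pdftohtml -xml output.
# ASSUMPTION: all characters have the same width (width / number of characters).
sub estimate_words {
    my ($element) = @_;
    my ($attrs, $text) = $element =~ m{<text\b([^>]*)>(.*?)</text>}s
        or return;
    my %attr = $attrs =~ /(\w+)="([^"]*)"/g;
    return unless length $text;

    my $char_width = $attr{width} / length $text;
    my @words;
    while ($text =~ /(\S+)/g) {
        push @words, {
            word => $1,
            top  => $attr{top},
            # $-[1] is the character offset at which the word starts.
            left => $attr{left} + $-[1] * $char_width,
        };
    }
    return @words;
}

my $element = '<text top="239" left="33" width="491" height="7" font="2">'
            . '28.12. 28.12. 0036 Kartenverfüg 39,75 -</text>';

binmode STDOUT, ':encoding(UTF-8)';
printf "%-14s top=%s left=%.1f\n", @$_{qw(word top left)}
    for estimate_words($element);
```

With a proportional font the estimates drift, so they are better suited to assigning words to coarse column ranges than to exact matching.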