PerlMonks  

Read table data from PDF

by perlmad (Sexton)
on May 11, 2016 at 09:38 UTC ( [id://1162723]=perlquestion )

perlmad has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,

I am new to Perl, and my current task involves a PDF file.

I have used the CAM::PDF module to read the PDF content, but I am unable to read the table data from the file.

The PDF contains some text and tables. I need to parse the table data and write it into a spreadsheet.

Here is my code:

  my $pdf = CAM::PDF->new("filename.pdf");
  my $page1 = $pdf->getPageText(1);
  print " page content is : $page1 \n\n\n";

Output

  page content is :

I need your help to solve this issue.

Replies are listed 'Best First'.
Re: Read table data from PDF
by hippo (Bishop) on May 11, 2016 at 09:54 UTC
Re: Read table data from PDF
by ateague (Monk) on May 11, 2016 at 13:50 UTC

    Do you have a (small) anonymized PDF sample we can look at? Is the PDF text even searchable and selectable in a PDF viewer?

    At $WORK I use pdftohtml with the following command line: pdftohtml.exe -xml -stdout -zoom 1.4 [PDF FILE]

    This will rip out all the text elements into an XML file with attributes for the font, x/y position on the page and text length. (-zoom 1.4 makes the positioning units 100 dpi, -stdout streams the output to STDOUT instead of writing it to a file).

    Here is an example of what I typically work with:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    <pdf2xml>
      <page number="1" position="absolute" top="0" left="0" height="1100" width="850">
        <fontspec id="0" size="17" family="Times" color="#000000"/>
        <text top="103" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="120" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
        <text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
        <text top="220" left="115" width="128" height="18" font="0">SORT GROUP:</text>
        <text top="220" left="265" width="152" height="18" font="0">Invoice Sort Group</text>
        <text top="286" left="115" width="260" height="18" font="0">OH_GOD_IT_BURNS 2013-12-20</text>
        <text top="286" left="415" width="71" height="18" font="0">23:53:04</text>
        <text top="286" left="545" width="108" height="18" font="0">FOOBAR</text>
        <text top="320" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
        <text top="336" left="115" width="602" height="18" font="0">XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX</text>
      </page>
    </pdf2xml>

    I can then use XML::Twig with XPath expressions to pull the exact xml nodes I want:

    use XML::Twig;

    open (my $XML, "-|", "pdftohtml.exe -xml -zoom 1.4 -stdout $PDF_FILE") or die "$!\n$^E";

    # We are only interested in the text for the "ROUTE TO:" and "SORT GROUP:" sections
    # Set the twig_handlers to extract the <text> nodes of interest; all other nodes will be ignored
    # XPath queries provide an extra 1/20 inch padding on all sides to take font and rendering variations into account
    my $t = XML::Twig->new(
        twig_handlers => {
            '//text[(@top >= 180 and @top <= 190) and (@left >= 100 and @left <= 111)]' => \&RouteTo,
            '//text[(@top >= 215 and @top <= 225) and (@left >= 260 and @left <= 270)]' => \&InvoiceSort,
        },
        comments   => 'drop',    # remove any comments
        empty_tags => 'normal',  # empty tags = <tag/>
    );
    $t->parse($XML);
    $t->purge;
    close $XML;
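
For a whole table rather than two fixed fields, the same XML can be flattened into a list of positioned text records first. The following is only a minimal sketch, not ateague's actual handlers: the function name text_nodes and the $sample string are made up for illustration, and the sample is a trimmed copy of the pdftohtml output shown in this reply.

```perl
use strict;
use warnings;
use XML::Twig;

# Collect every <text> element of pdftohtml -xml output as a hash holding
# its position and content. $xml is the XML document as a string.
# (text_nodes is a hypothetical helper name, not part of any module.)
sub text_nodes {
    my ($xml) = @_;
    my @nodes;
    my $twig = XML::Twig->new(
        twig_handlers => {
            text => sub {
                my ($t, $elt) = @_;
                push @nodes, {
                    top  => $elt->att('top'),
                    left => $elt->att('left'),
                    text => $elt->text,
                };
                $t->purge;    # release elements that have already been handled
            },
        },
    );
    $twig->parse($xml);
    return @nodes;
}

# Trimmed sample taken from the output quoted in this thread.
my $sample = <<'XML';
<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1100" width="850">
<text top="186" left="115" width="103" height="18" font="0">ROUTE TO:</text>
<text top="186" left="265" width="107" height="17" font="0">Audit Billing</text>
</page>
</pdf2xml>
XML

# Print one line per text fragment: top, left, content.
printf "%4d %4d %s\n", @{$_}{qw(top left text)} for text_nodes($sample);
```

Keeping the extraction separate from any row or column logic means the position thresholds can be tuned later without re-running pdftohtml.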
Re: Read table data from PDF
by Ratazong (Monsignor) on May 11, 2016 at 10:44 UTC

    Hi perlmad,

    I am afraid that is a non-trivial task. To know why, please read the following node by almut: Re: CAM::PDF did't extract all pdf's content

    I have had good experiences using an external pdf2txt converter and then parsing the output, but this of course depends on your input document.

    HTH, Rata

      I have had good experiences using an external pdf2txt converter and then parsing the output, but this of course depends on your input document.

      As a side note, if you go down this route, make absolutely certain that your external program will extract the text with some sort of X/Y position.

      Unless you have full and complete control over the PDF and its generation, parsing PDF text by fixed position row/column is pretty much guaranteed to end in failure, frustration, and an absolutely massive nest of exceptions and special parsing cases.
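
One way to avoid hard-coded row positions is to group fragments by their reported Y position with a tolerance. The following is a rough sketch under stated assumptions: each fragment is a hash with top, left and text keys (as an extractor that reports X/Y positions could supply), group_rows is a made-up name, the tolerance of 5 units is a guess, and the 1-unit jitter on one fragment is invented to show the tolerance at work. The other values are taken from the sample output earlier in this thread.

```perl
use strict;
use warnings;

# Group positioned text fragments into table rows.
# A fragment joins the current row when its "top" is within $tolerance
# of the first fragment of that row; otherwise it starts a new row.
# Within each row, cells are ordered left to right.
sub group_rows {
    my ($tolerance, @nodes) = @_;
    my @rows;
    for my $node (sort { $a->{top} <=> $b->{top} } @nodes) {
        if (@rows && $node->{top} - $rows[-1]{top} <= $tolerance) {
            push @{ $rows[-1]{cells} }, $node;
        }
        else {
            push @rows, { top => $node->{top}, cells => [$node] };
        }
    }
    my @table;
    for my $row (@rows) {
        my @cells = sort { $a->{left} <=> $b->{left} } @{ $row->{cells} };
        push @table, [ map { $_->{text} } @cells ];
    }
    return @table;
}

# Fragments deliberately listed out of order.
my @nodes = (
    { top => 220, left => 265, text => 'Invoice Sort Group' },
    { top => 187, left => 265, text => 'Audit Billing' },   # hypothetical 1-unit jitter
    { top => 186, left => 115, text => 'ROUTE TO:' },
    { top => 220, left => 115, text => 'SORT GROUP:' },
    { top => 286, left => 545, text => 'FOOBAR' },
    { top => 286, left => 115, text => 'OH_GOD_IT_BURNS 2013-12-20' },
    { top => 286, left => 415, text => '23:53:04' },
);

# One tab-separated line per detected row.
print join("\t", @$_), "\n" for group_rows(5, @nodes);
```

Tab-separated lines like these open directly in most spreadsheet programs. Real tables with wrapped cells or merged columns would likely need column boundaries derived from the left values as well.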

Re: Read table data from PDF
by marto (Cardinal) on May 11, 2016 at 09:55 UTC

      No marto, I don't know who he is.

      I have tried every way I could find to read the PDF file, but I couldn't get anything. I forgot to mention that the table in the PDF has a background color and colored text.

      Is it possible to read only the table data from the PDF?

Front-paged by Corion