Example Of Using CAM::PDF Like HTML::TokeParser

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
Typically when I need to extract data from a PDF, I just convert it to text and apply some regex fu on the text. This approach is not effective for my current project due to the page layout. I was hoping to use the traverse() method to create a node walker akin to HTML::TokeParser.

Consume a node
Determine node type
Determine current state of parse
Dispatch a handler for the node based on type and current state
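
The four steps above can be sketched as a small dispatch-table loop. This is only an illustration under stated assumptions: the token stream, the state names (`scan`, `capture`, `done`) and the handlers are all hypothetical stand-ins, not anything CAM::PDF or HTML::TokeParser produces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical token stream: [type, text] pairs standing in for parsed nodes.
my @tokens = (
    [ label => 'Name' ],     [ text => 'Alice' ],
    [ label => 'Password' ], [ text => 'hunter2' ],
    [ label => 'Note' ],     [ text => 'ignored' ],
);

my %wanted;
my $state = 'scan';    # current state of the parse

# Dispatch table keyed first by parse state, then by node type.
my %dispatch = (
    scan => {
        label => sub { my ($tok) = @_; $state = 'capture' if $tok->[1] eq 'Password' },
    },
    capture => {
        text => sub { my ($tok) = @_; $wanted{password} = $tok->[1]; $state = 'done' },
    },
);

while ( my $tok = shift @tokens ) {            # 1. consume a node
    my $type    = $tok->[0];                   # 2. determine node type
    my $handler = $dispatch{$state}{$type};    # 3. consult the current state
    $handler->($tok) if $handler;              # 4. dispatch the matching handler
    last if $state eq 'done';
}

print "password: $wanted{password}\n";
```

Keeping the handlers in a table means a new state or node type is one more hash entry rather than another branch in the loop.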

I have done a fair amount of searching and came across two hints of a solution at Stack Overflow by the author of CAM::PDF. I have also emailed the author though I imagine he is quite busy actually having a life.

Obviously, I am not looking for someone to write the parser for me, but does anyone have a more generic example of using traverse()? Below is an example of how I create a parser using HTML::TokeParser:

# Step 1:  Dump the entire document
while (my $tok = $p->get_token) {
    print Dumper($tok);
}

I then edit the dumped document searching for the piece of information I want to extract. Perhaps it is identified by a certain id or name tag. Then, I can start to construct my parser:

# HTML::TokeParser token layouts: ["S", $tag, $attr, ...] for a start tag, ["T", $text, ...] for text
use constant TYPE => 0;
use constant TEXT => 1;
use constant TAG  => 1;
use constant ATTR => 2;
while (my $tok = $p->get_token) {
    next if $tok->[TYPE] ne 'S' || $tok->[TAG] ne 'b' || ! $tok->[ATTR]{class};
    next if $tok->[ATTR]{class} ne 'secret';
    my $next = $p->get_token;
    $wanted{password} = trim($next->[TEXT]);
    last;
}

In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.

Cheers - L~R


Replies are listed 'Best First'.
Re: Example Of Using CAM::PDF Like HTML::TokeParser by Khen1950fx (Canon) on Oct 08, 2011 at 23:25 UTC
Would something like this help? I'm still trying to get a handle on it, and this is what I have so far.

#!/usr/bin/perl
use strict;
use warnings;
use Devel::SimpleTrace;
use CAM::PDF;
use CAM::PDF::Content;
use CAM::PDF::PageText;
use Data::Dumper::Concise;

my $file = '/root/Desktop/sample1.pdf';
binmode STDOUT, ":encoding(utf8)";

my $pdf = CAM::PDF->new($file);
for my $pagenum (1 .. $pdf->numPages) {
    my $contentTree = $pdf->getPageContentTree($pagenum) or next;
    $contentTree->validate() or die $@;
    print Dumper($contentTree->render('CAM::PDF::Renderer::Dump'));
    $pdf->setPageContent(2,$pagenum);
    last;
}
Re^2: Example Of Using CAM::PDF Like HTML::TokeParser by Limbic~Region (Chancellor) on Oct 11, 2011 at 00:42 UTC
Khen1950fx, In short, yes. I am still playing but this was a significant step in the right direction. Please let me know what else you come up with. Cheers - L~R
Re^3: Example Of Using CAM::PDF Like HTML::TokeParser by Khen1950fx (Canon) on Oct 12, 2011 at 06:36 UTC
Here's what I have now. I borrowed hdump from the examples directory of HTML::Parser. Then I used CAM::PDF::GS to make a gs log file.

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use Data::Dumper::Concise;
use base qw(CAM::PDF::GS::NoText);

my $file = shift @ARGV;
my $log  = '/root/Desktop/gs.log';
binmode STDOUT, ":encoding(utf8)";
open STDOUT, '>', $log;

my $pdf         = CAM::PDF->new($file);
my $contentTree = $pdf->getPageContentTree(5);
my $gs          = $contentTree->computeGS;
print Dumper($gs);
close STDOUT;

From the cmdline do `perl gscript.pl /path/to/pdf`

Then I used hdump to examine gs.log:

#!/usr/bin/perl -w
use strict;
use HTML::Parser;
use Data::Dumper::Concise;

$| = 1;

sub h {
    my ( $event, $line, $column, $text, $tagname, $attr ) = @_;
    my (@d) = uc( substr( $event, 0, 1 ) ) . " L$line C$column";
    substr( $text, 40 ) = "..." if length $text > 40;
    push @d, $text;
    push @d, $tagname if defined $tagname;
    push @d, $attr if $attr;
    print Dumper(@d);
}

my $p = HTML::Parser->new( api_version => 3 );
$p->handler( default => \&h, "event, line, column, text, tagname, attr" );
$p->parse_file( @ARGV ? shift : *STDIN );

From the cmdline: `perl hdump /path/to/gs.log`

I hope that it's useful for you.
Re: Example Of Using CAM::PDF Like HTML::TokeParser by pvaldes (Chaplain) on Oct 08, 2011 at 21:37 UTC
ok then,

$po->traverse(1, $a_node_name, $function, $somedata);

The first argument to traverse is 1 (traverse this node) or 0 (don't; treat this link as "dead"). The second is the node to apply it to. The third is the action to perform when you pass through this node; several callbacks provided with the module can be used here (e.g. \&_changeRefKeysCB, \&_abbrevInlineImageCB, \&_changeStringCB or \&_getRefListCB). The fourth is the data involved in this action (e.g. $im_a_list).

Hope this helps, bye
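
To make the four arguments described above concrete, here is a minimal pure-Perl sketch of the same calling pattern. It is not CAM::PDF's implementation: the `walk` function, the `node` helper, the `%objects` table and the `{ type, value }` hash shape are hypothetical stand-ins, assumed only for illustration, so the example runs without a PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in node constructor: { type => ..., value => ... } (assumed shape).
sub node { my ( $type, $value ) = @_; return { type => $type, value => $value } }

# Hypothetical object table so 'reference' nodes have something to point at.
my %objects = (
    7 => node( dictionary => { Title => node( string => 'Hello' ) } ),
);

# walk($follow_refs, $start_node, $callback, $callback_data)
sub walk {
    my ( $deref, $node, $callback, $data ) = @_;
    my @stack = ($node);
    my %seen;
    while (@stack) {
        my $n = shift @stack;
        $callback->( $n, $data );    # the action runs on every node visited
        my ( $type, $val ) = @{$n}{qw(type value)};
        if    ( $type eq 'dictionary' ) { push @stack, map { $val->{$_} } sort keys %$val }
        elsif ( $type eq 'array' )      { push @stack, @$val }
        elsif ( $type eq 'reference' && $deref && !$seen{$val}++ ) {
            push @stack, $objects{$val};    # follow the link only when asked to
        }
    }
    return;
}

my $root = node( array => [ node( string => 'A' ), node( reference => 7 ) ] );

# Callback: collect every string node into the array passed as callback data.
my $collect = sub { my ( $n, $out ) = @_; push @$out, $n->{value} if $n->{type} eq 'string' };

my @strings;
walk( 1, $root, $collect, \@strings );
print "following references: @strings\n";

my @shallow;
walk( 0, $root, $collect, \@shallow );
print "ignoring references:  @shallow\n";
```

With the first argument set to 0 the reference is visited but never followed, which is the "dead link" behaviour mentioned above.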
Re: Example Of Using CAM::PDF Like HTML::TokeParser by pvaldes (Chaplain) on Oct 08, 2011 at 16:12 UTC
If the PDF layout is the problem, you may want to consider pdftotext and play a little with the layout option:

pdftotext -layout file.pdf file.txt
pdftotext file.pdf second_file.txt

You can also extract only the desired pages of the PDF instead of the whole file, which makes the search easier.
Re^2: Example Of Using CAM::PDF Like HTML::TokeParser by Limbic~Region (Chancellor) on Oct 08, 2011 at 19:31 UTC
pvaldes, As I indicated in my original post, extracting the text didn't work. What I didn't indicate is that I tried every possible tool and variation I could think of, including commercial products. None of the text extractions produce a consistent enough format for me to get at what I need. I understand that what I want to do is neither ideal nor easy and may be futile. I would, however, like to try for myself. Cheers - L~R
Re: Example Of Using CAM::PDF Like HTML::TokeParser by Anonymous Monk on Oct 11, 2011 at 12:36 UTC
Can you use XPath expressions to zero in more directly on the particular nodes you're looking for? "Writing programmed logic" to navigate an XML or HTML tree is akin to writing a recursive-descent compiler by hand instead of using YACC.
Re^2: Example Of Using CAM::PDF Like HTML::TokeParser by Limbic~Region (Chancellor) on Oct 11, 2011 at 13:17 UTC
Anonymous Monk, If you are referring to the nonexistent PDF parser that this thread is about, then no. The internal structure of a PDF wouldn't lend itself to XPath diving. If you are referring to the way I go about creating a parser using HTML::TokeParser, then the answer is "it depends". Node traversal is usually the last tool in the box I reach for. I am not even opposed to using regular expressions (gasp) if the pages are consistent enough with one another. Cheers - L~R
Re: Example Of Using CAM::PDF Like HTML::TokeParser by thargas (Deacon) on Oct 11, 2011 at 18:53 UTC
You may want to look at CAM::PDF::Renderer::Text. Although I'm sure you're not interested in its output format, it might be interesting as an example of getting the basic text/location info. You could use that and wire in your own functions to figure out what you want.
