Example Of Using CAM::PDF Like HTML::TokeParser

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
Typically when I need to extract data from a PDF, I just convert it to text and apply some regex fu on the text. This approach is not effective for my current project due to the page layout. I was hoping to use the traverse() method to create a node walker akin to HTML::TokeParser.

Consume a node
Determine node type
Determine current state of parse
Dispatch a handler for the node based on type and current state
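
The four steps above could be sketched roughly as follows. This is only an illustration over a hypothetical in-memory node tree (hashes with `type` and `value` keys), not CAM::PDF's actual traverse() interface. The names `walk`, `%dispatch`, and the states `seek`, `capture`, and `done` are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical node tree: each node is { type => ..., value => ... }.
# It only mimics a nested document; it is not real CAM::PDF output.
my $tree = {
    type  => 'array',
    value => [
        { type => 'label',  value => 'Name' },
        { type => 'string', value => 'Alice' },
        { type => 'label',  value => 'Password' },
        { type => 'string', value => ' s3cret ' },
    ],
};

my %wanted;
my $state = 'seek';    # current state of the parse

# Handlers keyed first by node type, then by parse state.
my %dispatch = (
    label => {
        seek => sub {
            my ($node) = @_;
            $state = 'capture' if $node->{value} eq 'Password';
        },
    },
    string => {
        capture => sub {
            my ($node) = @_;
            (my $text = $node->{value}) =~ s/^\s+|\s+$//g;    # trim whitespace
            $wanted{password} = $text;
            $state = 'done';
        },
    },
);

# Consume a node, dispatch on its type and the current state, then recurse
# into any child nodes.
sub walk {
    my ($node) = @_;
    return if $state eq 'done';
    my $handler = $dispatch{ $node->{type} }{$state};
    $handler->($node) if $handler;
    if (ref $node->{value} eq 'ARRAY') {
        walk($_) for @{ $node->{value} };
    }
    return;
}

walk($tree);
print "password: $wanted{password}\n";
```

Keeping the handlers in a table keyed by type and state means a new node type or state only needs a new entry, and the walker itself stays unchanged.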

I have done a fair amount of searching and came across two hints of a solution at Stack Overflow by the author of CAM::PDF. I have also emailed the author though I imagine he is quite busy actually having a life.

Obviously, I am not looking for someone to write the parser for me, but does anyone have a generic example of using traverse()? Below is an example of how I create a parser using HTML::TokeParser:

use Data::Dumper;

# Step 1:  Dump the entire document
# ($p is an HTML::TokeParser object created earlier)
while (my $tok = $p->get_token) {
    print Dumper($tok);
}

I then edit the dumped document searching for the piece of information I want to extract. Perhaps it is identified by a certain id or name tag. Then, I can start to construct my parser:

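# Token layouts in HTML::TokeParser:
#   start tag: ['S', $tag, $attr, $attrseq, $text]
#   text:      ['T', $text, $is_data]
use constant TYPE => 0;
use constant TEXT => 1;   # text of a 'T' token
use constant TAG  => 1;   # tag name of an 'S' token
use constant ATTR => 2;   # attribute hash of an 'S' token
while (my $tok = $p->get_token) {
    next if $tok->[TYPE] ne 'S' || $tok->[TAG] ne 'b' || ! $tok->[ATTR]{class};
    next if $tok->[ATTR]{class} ne 'secret';
    my $next = $p->get_token;
    $wanted{password} = trim($next->[TEXT]);
    last;
}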

In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.

Cheers - L~R
