Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All,
Typically when I need to extract data from a PDF, I just convert it to text and apply some regex fu on the text. This approach is not effective for my current project due to the page layout. I was hoping to use the traverse() method to create a node walker akin to HTML::TokeParser.
- Consume a node
- Determine node type
- Determine current state of parse
- Dispatch a handler for the node based on type and current state
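The four steps above can be sketched as a state-keyed dispatch table. This is only a minimal illustration on a hand-built tree: the node shape (a hash with `type` and `value` keys), the state names, and the handlers are all assumptions made for the example, and it does not call CAM::PDF at all. With CAM::PDF the same dispatch logic would presumably live in the callback handed to traverse(), but that wiring is not shown because nothing in this post confirms the callback's exact signature.

```perl
use strict;
use warnings;

# Hypothetical stand-in tree. Each node is { type => ..., value => ... }.
# This shape is an assumption for illustration, not CAM::PDF's API.
my $tree = {
    type  => 'dictionary',
    value => {
        Title => { type => 'string', value => 'Invoice' },
        Kids  => {
            type  => 'array',
            value => [
                { type => 'string', value => 'Total' },
                { type => 'number', value => 42 },
            ],
        },
    },
};

my %wanted;
my $state = 'scanning';    # current state of the parse

# Handlers keyed by state, then by node type (all names are illustrative).
my %dispatch = (
    scanning => {
        string => sub {
            my ($node) = @_;
            # Seeing the label we care about changes the parse state.
            $state = 'expect_total' if $node->{value} eq 'Total';
        },
    },
    expect_total => {
        number => sub {
            my ($node) = @_;
            $wanted{total} = $node->{value};
            $state = 'done';
        },
    },
);

sub walk {
    my ($node) = @_;
    return if $state eq 'done';

    # Determine the node type and dispatch on (state, type) if a handler exists.
    my $handler = $dispatch{$state}{ $node->{type} };
    $handler->($node) if $handler;

    # Consume the children of container nodes, depth first.
    if ( $node->{type} eq 'dictionary' ) {
        walk( $node->{value}{$_} ) for sort keys %{ $node->{value} };
    }
    elsif ( $node->{type} eq 'array' ) {
        walk($_) for @{ $node->{value} };
    }
    return;
}

walk($tree);
print "total = $wanted{total}\n";
```

Keeping the handlers in a table rather than in a chain of if/elsif tests means a new state or node type can be supported by adding one entry, which is roughly how the HTML::TokeParser loops in this post grow as more of the document's structure is understood.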
I have done a fair amount of searching and came across two hints of a solution at Stack Overflow by the author of CAM::PDF. I have also emailed the author though I imagine he is quite busy actually having a life.
Obviously, I am not looking for someone to write the parser for me, but does anyone have a generic example of using traverse()? Below is an example of how I create a parser using HTML::TokeParser:
    # Step 1: Dump the entire document
    while (my $tok = $p->get_token) {
        print Dumper($tok);
    }
I then edit the dumped document searching for the piece of information I want to extract. Perhaps it is identified by a certain id or name tag. Then, I can start to construct my parser:
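    # Token layouts in HTML::TokeParser:
    #   start tag: ['S', $tag, $attr, $attrseq, $text]
    #   text:      ['T', $text, $is_data]
    use constant TYPE => 0;
    use constant TEXT => 1;
    use constant TAG  => 1;
    use constant ATTR => 2;

    while (my $tok = $p->get_token) {
        # Skip everything except <b class="secret">
        next if $tok->[TYPE] ne 'S' || $tok->[TAG] ne 'b' || ! $tok->[ATTR]{class};
        next if $tok->[ATTR]{class} ne 'secret';

        # The token after the start tag holds the text we want
        my $next = $p->get_token;
        $wanted{password} = trim($next->[TEXT]);    # trim() is a helper not shown here
        last;
    }

In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.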
Cheers - L~R
Replies are listed 'Best First'.
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by Khen1950fx (Canon) on Oct 08, 2011 at 23:25 UTC
by Limbic~Region (Chancellor) on Oct 11, 2011 at 00:42 UTC
by Khen1950fx (Canon) on Oct 12, 2011 at 06:36 UTC
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by pvaldes (Chaplain) on Oct 08, 2011 at 21:37 UTC
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by pvaldes (Chaplain) on Oct 08, 2011 at 16:12 UTC
by Limbic~Region (Chancellor) on Oct 08, 2011 at 19:31 UTC
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by Anonymous Monk on Oct 11, 2011 at 12:36 UTC
by Limbic~Region (Chancellor) on Oct 11, 2011 at 13:17 UTC
Re: Example Of Using CAM::PDF Like HTML::TokeParser
by thargas (Deacon) on Oct 11, 2011 at 18:53 UTC