perlquestion
Limbic~Region
All,
<br />
Typically, when I need to extract data from a PDF, I just convert it to text and apply some regex fu. That approach is not effective for my current project because of the page layout. I was hoping to use the <i>traverse()</i> method of [mod://CAM::PDF] to create a node walker akin to [mod://HTML::TokeParser]. The walker would:
<ul>
<li>Consume a node</li>
<li>Determine node type</li>
<li>Determine current state of parse</li>
<li>Dispatch a handler for the node based on type and current state</li>
</ul>
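<p>
To make the four steps above concrete, here is a minimal sketch of the kind of state-driven walker I have in mind. It does not use [mod://CAM::PDF] at all: the node tree, the <i>type</i>/<i>value</i> keys, the state names, and the <i>walk()</i> routine are all hypothetical stand-ins, so treat it as an illustration of the dispatch idea rather than of any real PDF structure.
</p>
<c>
use strict;
use warnings;

# Hypothetical node tree: each node is { type => ..., value => ... }.
# This is NOT CAM::PDF's real structure, just a stand-in for illustration.
my $tree = {
    type  => 'array',
    value => [
        { type => 'label',  value => 'Title' },
        { type => 'string', value => 'Quarterly Report' },
        { type => 'label',  value => 'Author' },
        { type => 'string', value => 'L~R' },
        { type => 'number', value => 42 },
    ],
};

my %wanted;
my $state = 'scan';    # current state of the parse

# Handlers keyed first by parse state, then by node type.
my %dispatch = (
    scan => {
        label => sub {
            my $node = shift;
            $state = 'want_title' if $node->{value} eq 'Title';
        },
    },
    want_title => {
        string => sub {
            my $node = shift;
            $wanted{title} = $node->{value};
            $state = 'scan';
        },
    },
);

sub walk {
    my @queue = @_;
    while (my $node = shift @queue) {                # consume a node
        my $type    = $node->{type};                 # determine node type
        my $handler = $dispatch{$state}{$type};      # look up by state and type
        $handler->($node) if $handler;               # dispatch, if a handler exists
        # Descend into container nodes, preserving document order
        unshift @queue, @{ $node->{value} } if $type eq 'array';
    }
    return;
}

walk($tree);
print "title: $wanted{title}\n";    # prints "title: Quarterly Report"
</c>
<p>
Keeping the handlers in a hash keyed by state and type means new cases can be added without touching the loop itself, which is what I find convenient about the [mod://HTML::TokeParser] style.
</p>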
<p>
I have done a fair amount of searching and have come across two hints of a solution on [http://stackoverflow.com/questions/745138/how-do-i-get-text-orientation-of-a-text-string-in-a-pdf-page-using-campdf|Stack] [http://stackoverflow.com/questions/641427/how-do-i-know-if-pdf-pages-are-color-or-black-and-white|Overflow], both by the author of [mod://CAM::PDF]. I have also emailed the author, though I imagine he is quite busy actually having a life.
</p>
<p>
Obviously, I am not looking for someone to write the parser for me, but does anyone have a generic example of using <i>traverse()</i>? Below is an example of how I create a parser using [mod://HTML::TokeParser].
</p>
<c>
# Step 1: Dump the entire document
# ($p is an HTML::TokeParser object for the page being examined)
use Data::Dumper;
while (my $tok = $p->get_token) {
    print Dumper($tok);
}
</c>
<p>
I then open the dumped output in an editor and search for the piece of information I want to extract. Perhaps it is identified by a certain id or name attribute. Then I can start to construct my parser:
</p>
<c>
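# Token layouts: start tag is ['S', $tag, $attr, ...], text is ['T', $text, ...]
use constant TYPE => 0;
use constant TAG  => 1;   # tag name in a start ('S') token
use constant TEXT => 1;   # text in a text ('T') token
use constant ATTR => 2;   # attribute hashref in a start token
sub trim { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s }
my %wanted;
while (my $tok = $p->get_token) {
    # Skip everything except a <b> start tag with class="secret"
    next if $tok->[TYPE] ne 'S' || $tok->[TAG] ne 'b' || ! $tok->[ATTR]{class};
    next if $tok->[ATTR]{class} ne 'secret';
    my $next = $p->get_token;   # the token right after the tag, expected to be text
    $wanted{password} = trim($next->[TEXT]);
    last;
}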
</c>
<p>
In other words, once I understand the internal structure of the HTML document, I can find the data I am looking for.
</p>
<div class="pmsig"><div class="pmsig-180961">
<p>
Cheers - [Limbic~Region|L~R]
</p>
</div></div>