http://qs321.pair.com?node_id=478157

Oh Wise Ones,

In a sudden burst of sanity I created a module that can extract data from a bunch of generated or otherwise similar HTML pages. It's a bit like a template engine, but run in the opposite direction.

To use it, one has to take one of the pages as a "template" file and replace every value they want with [% name_of_the_value %]. The module then compares the template with a second document and returns a ref to a hash that maps each of these names to the corresponding value found in that second document.
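
To make the placeholder syntax concrete, here is a minimal sketch that lists the names used in a template string. The helper name placeholder_names is made up for this illustration and is not part of the module; the regex is the same placeholder pattern the module uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of ExtractDiff): return every placeholder
# name of the form [% name %] found in a template string, in order.
sub placeholder_names {
    my ($template) = @_;
    return $template =~ /\[%\s*(.+?)\s*%\]/g;
}

my $template = '<title>[% title %]</title> <h2 id="[% myidentifier %]">[% animal %]</h2>';
print "$_\n" for placeholder_names($template);
# prints:
# title
# myidentifier
# animal
```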

Edit: I just found out about Template::Extract so this can be moved to /dev/null


Example:
Let's say I have a bunch of HTML documents that all look roughly like this:
<html>
  <head>
    <title>Mammals</title>
  </head>
  <body>
    <h1>Mammals</h1>
    <h2 id="1">Monkeys</h2>
  </body>
</html>
Now I want to extract certain values from that HTML document.
From the HTML document I create a template that looks like this:
<html>
  <head>
    <title>[% title %]</title>
  </head>
  <body>
    <h1>Mammals</h1>
    <h2 id="[% myidentifier %]">[% animal %]</h2>
  </body>
</html>
Now this piece of code:
#!/usr/bin/perl
use strict;
use warnings;
use ExtractDiff;
use File::Slurp;

my $template = read_file('template.html');
my $document = read_file('document.html');

my $resultRef = ExtractDiff::getValues(\$template, \$document);

foreach (keys %$resultRef) {
    print "$_: $$resultRef{$_}\n";
}
Would produce this:
myidentifier: 1
animal: Monkeys
title: Mammals
The actual code is this:
package ExtractDiff;

use strict;
use warnings;
use Algorithm::Diff qw(sdiff);
use Data::Dumper;

sub getValues {
    my $template = shift;
    my $document = shift;
    my %result;

    foreach my $item (sdiff(splitFile($template), splitFile($document))) {
        # Only changed ('c') pairs whose template side holds a placeholder matter
        if (($item->[0] eq 'c') && ($item->[1] =~ m/\[\%\s*(.+?)\s*\%\]/)) {
            my $name = $1;
            my $templateString = $item->[1];
            my $documentString = $item->[2];
            # Text before and after the placeholder must match the document literally
            if ($templateString =~ m/^(.*?)\[\%.*?\%\](.*?)$/) {
                my $prefix = $1;
                my $postfix = $2;
                if ($documentString =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/) {
                    #print "$name: $1\n";
                    $result{$name} = $1;
                }
            }
        }
    }
    return \%result;
}

sub splitFile {
    my $ref = shift;
    my @file;
    # Split on tags and keep them; drop empty pieces (but keep a literal "0")
    push (@file, grep { length } split(/\s*(<.+?>)\s*/, $$ref));
    return \@file;
}

1;
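
To show the two core steps in isolation, here is a rough sketch that uses only core Perl. The helper names tokens and extract_pair are invented for this illustration. It also assumes the template and the document split into the same number of pieces, so they can be paired by position; the module itself leaves the pairing to sdiff from Algorithm::Diff.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same idea as splitFile: split on tags, keep the tags, drop empty pieces.
sub tokens {
    my ($text) = @_;
    return grep { length } split /\s*(<.+?>)\s*/, $text;
}

# Hypothetical helper: given a template token and the document token paired
# with it, return (name, value), or an empty list if there is no match.
sub extract_pair {
    my ($tpl, $doc) = @_;
    return unless $tpl =~ /^(.*?)\[%\s*(.+?)\s*%\](.*)$/;
    my ($prefix, $name, $postfix) = ($1, $2, $3);
    return unless $doc =~ /^\Q$prefix\E(.*)\Q$postfix\E$/;
    return ($name, $1);
}

my @t = tokens('<h2 id="[% myidentifier %]">[% animal %]</h2>');
my @d = tokens('<h2 id="1">Monkeys</h2>');

# Assumption: both lists line up one-to-one (sdiff does this job in the module).
for my $i (0 .. $#t) {
    my ($name, $value) = extract_pair($t[$i], $d[$i]) or next;
    print "$name: $value\n";
}
# prints:
# myidentifier: 1
# animal: Monkeys
```

Only one placeholder per piece is handled here, which mirrors the prefix/postfix approach in getValues.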
Does anybody have any comments on this? Is it handy enough to put on CPAN? What would be a good name?