RFC: Module for extracting data from generated HTML pages

Oh Wise Ones,

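In a sudden burst of sanity I created a module that can extract data from a set of generated or otherwise similar HTML pages. It works like a template system, but in the opposite direction: instead of filling values into a template, it reads them back out of a finished page.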

To use it, you edit a copy of one page (the "template") and replace each value you want with [% name_of_the_value %]. The module then returns a reference to a hash that maps these names to the corresponding values found in a second document.

Edit: I just found out about Template::Extract, so this can be moved to /dev/null.

Example:
Let's say I have a bunch of HTML documents that all look roughly like this:

<html>
<head>
<title>Mammals</title>
</head>
<body>
<h1>Mammals</h1>
<h2 id="1">Monkeys</h2>
</body>
</html>

Now I want to extract certain values from that HTML document.

From the HTML document I create a template that looks like this:

<html>
<head>
<title>[% title %]</title>
</head>
<body>
<h1>Mammals</h1>
<h2 id="[% myidentifier %]">[% animal %]</h2>
</body>
</html>

Now this piece of code:

#!/usr/bin/perl

use strict;
use warnings;

use ExtractDiff;
use File::Slurp;

my $template = read_file('template.html');
my $document = read_file('document.html');
my $resultRef = ExtractDiff::getValues(\$template, \$document);
foreach (keys %$resultRef)
{
        print "$_: $$resultRef{$_}\n";
}

Would produce this:

myidentifier: 1
animal: Monkeys
title: Mammals
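Perl does not guarantee the iteration order of hash keys, so the three lines above can come out in any order. A minimal sketch of printing them in a stable order; the hash ref here is a hand-written stand-in for what getValues returns, not real module output:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the hash ref returned by ExtractDiff::getValues.
my $resultRef = {
    title        => 'Mammals',
    myidentifier => 1,
    animal       => 'Monkeys',
};

# Sorting the keys makes the output order reproducible between runs.
my @lines = map { "$_: $resultRef->{$_}" } sort keys %$resultRef;
print "$_\n" for @lines;
```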

The actual code is this:

package ExtractDiff;

use strict;
use warnings;
use Algorithm::Diff qw(sdiff);
use Data::Dumper;

sub getValues
{
        my $template = shift;
        my $document = shift;
        my %result;
        foreach my $item (sdiff(splitFile($template), splitFile($document)))
        {
                        if (($item->[0] eq 'c') && ($item->[1] =~ m/\[\%\s*(.+?)\s*\%\]/))
                        {
                                my $name = $1;
                                my $templateString = $item->[1];
                                my $documentString = $item->[2];
                                if ($templateString =~ m/^(.*?)\[\%.*?\%\](.*?)$/)
                                {
                                        my $prefix = $1;
                                        my $postfix = $2;
                                        if ($documentString =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/)
                                        {
                                                #print "$name: $1\n";
                                                $result{$name} = $1;
                                        }
                                }
                        }
        }
        return \%result;
}

sub splitFile
{
        my $ref = shift;
        my @file;
        # Keep every non-empty token; a plain truth test would also drop a literal "0".
        push (@file, grep { length $_ } split(/\s*(<.+?>)\s*/, $$ref));
        return \@file;
}

1;
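The two steps the module combines, tokenizing the markup and recovering a value from the text around a placeholder, can be tried in isolation without Algorithm::Diff. The following is only a sketch: tokens and match_pair are hypothetical helper names, and the tokens are paired by position, which works only because both inputs here have the same number of tokens. The module itself relies on sdiff for that alignment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split markup into tag and text tokens, dropping empty strings.
# Same regex as splitFile, but a literal "0" is kept.
sub tokens {
    my ($text) = @_;
    return grep { length $_ } split /\s*(<.+?>)\s*/, $text;
}

# Given a template token holding one placeholder and the document token
# paired with it, return (name, value), or an empty list if they do not fit.
sub match_pair {
    my ($tpl, $doc) = @_;
    my ($prefix, $name, $postfix) =
        $tpl =~ /^(.*?)\[%\s*(.+?)\s*%\](.*)$/s or return;
    my ($value) = $doc =~ /^\Q$prefix\E(.*)\Q$postfix\E$/s or return;
    return ($name, $value);
}

my @tpl = tokens('<h2 id="[% myidentifier %]">[% animal %]</h2>');
my @doc = tokens('<h2 id="1">Monkeys</h2>');

# Pair tokens by position (an assumption of this sketch, see above).
my %result;
for my $i (0 .. $#tpl) {
    next if $tpl[$i] eq $doc[$i];
    my ($name, $value) = match_pair($tpl[$i], $doc[$i]) or next;
    $result{$name} = $value;
}

print "$_: $result{$_}\n" for sort keys %result;
```

Like the module, match_pair handles only one placeholder per token; a token such as "[% a %] and [% b %]" would need a different approach.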

Does anybody have any comments on this? Is it handy enough to put on CPAN? What would be a good name?

Re: RFC: Module for extracting data from generated HTML pages by gellyfish (Monsignor) on Jul 26, 2005 at 13:03 UTC
To be honest, in the first instance I would suggest that you discuss this with the author of Template::Extract, to see whether the features you find lacking in that module, and are trying to provide in yours, could be added there. You may even want to provide a set of patches that implement them. /J\