Re: Re: text extract

As far as my project goes, I have made a flowchart of what I need to do and have started development. Because HTML formats are dynamic and ever-changing, I have narrowed my focus to around 10 educational web pages and to extracting info from them. However, being new to Perl, I am confused about how to use the modules effectively. I have read about and understood their functionality and can perform the basic functions, but combining regular expressions with HTML::Parser etc. to extract only CERTAIN parts of the text from a page is where I keep getting stuck. Please help or advise what I should do. The code you gave works fine with the page I mentioned. Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page; how do I rework the code? These small hiccups are what keep me from moving on quickly. In the end I also need a GUI, which I think I can manage as I've worked on one before. Thanks...

Re: Re: Re: text extract
by Roger (Parson) on Feb 03, 2004 at 13:10 UTC

Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page; how do I rework the code?

use strict;
use warnings;
use Data::Dumper;

# build a hash of known patterns for each known web site
my %patterns = (
    'www.foo.com' => {
        start  => "<h3><font[^>]*><b><!--KEY--></b>",
        finish => "(?<!</font>\n)<br>",
    },

    'www.bar.com' => {
        start  => "...",
        finish => "...",
    },
);

my $html = do { local $/; <DATA> };

print ExtractSection($html, 'www.foo.com', 'Section 2'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 1'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 3'), "\n\n";

# -----------------------------------------------------

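sub ExtractSection
{
    my ($html, $site, $section) = @_;

    # look up the start/finish patterns for this site; give up if unknown
    my $p = $patterns{$site} or return;
    my ($ps, $pf) = @{$p}{qw(start finish)};

    # drop in the section name, escaped so it is matched literally
    $ps =~ s/<!--KEY-->/\Q$section\E/;
    $pf =~ s/<!--KEY-->/\Q$section\E/;

    # non-greedy match from the section header to the closing <br>
    my ($text) = $html =~ /($ps.*?$pf)/s;
    return $text;
}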

__DATA__
<HTML>
<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>
<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>
<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
</HTML>
Output:

<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>

<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>

<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
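
For the generic case asked about (keywords like "publications" and "interests" on an arbitrary page), here is one possible sketch. It assumes each section is introduced by a heading tag (h1 to h6) that contains the keyword and runs until the next heading or the end of the page; plenty of real pages will not be laid out that way. The helper name extract_by_keyword and the sample page are made up for illustration, and a real parser such as HTML::Parser would likely be more robust than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): return the text from
# the first heading that contains $keyword up to the next heading or the
# end of the page. Assumes sections are introduced by <h1>..<h6> tags.
sub extract_by_keyword
{
    my ($html, $keyword) = @_;

    my ($text) = $html =~ m{
        ( <h[1-6][^>]*>            # opening heading tag
          (?:(?!</h[1-6]>).)*?     # heading text before the keyword
          \Q$keyword\E             # the keyword itself, matched literally
          .*?                      # rest of the heading plus the section body
        )
        (?= <h[1-6][^>]*> | \z )   # stop before the next heading or at end of input
    }xis;

    return $text;                  # undef when no heading contains the keyword
}

# made-up sample page, for illustration only
my $page = <<'END';
<html>
<h2>Research Interests</h2>
<ul><li>Parsing</li><li>Text mining</li></ul>
<h2>Publications</h2>
<ul><li>Paper A</li><li>Paper B</li></ul>
<h2>Contact</h2>
<p>someone@example.org</p>
</html>
END

for my $keyword (qw(publications interests)) {
    my $section = extract_by_keyword($page, $keyword);
    print defined $section ? $section : "no '$keyword' section found\n";
    print "\n";
}
```

The match is case-insensitive (/i), so "publications" also finds a heading spelled "Publications". If a site does not use heading tags, the per-site %patterns table above is probably still the better fit.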


Just another Perl shrine
	PerlMonks