Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page. How do I rework the code?

That's no moon, that's a space station -- Obiwan Kenobi.

Text extraction based on a known pattern is easy if you know what the start and finish of a section look like in general. However, if you are looking for a generic algorithm for logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very, very difficult. Difficult, but not impossible. But that's way beyond me; besides, I don't want to lose too many brain cells over this. ;-)

I will only cover the easy way, i.e., (deterministic) text extraction based on a set of known patterns...

use strict;
use warnings;
use Data::Dumper;

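# build a hash of known patterns for each known web site;
# <!--KEY--> is a placeholder that ExtractSection replaces with the section name
my %patterns = (
    'www.foo.com' => {
        start  => "<h3><font[^>]*><b><!--KEY--></b>",
        # stop at the first <br> that does not directly follow the heading's </font>
        finish => "(?<!</font>\n)<br>",
    },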

    'www.bar.com' => {
        start  => "...",
        finish => "...",
    },
);

my $html = do { local $/; <DATA> };

print ExtractSection($html, 'www.foo.com', 'Section 2'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 1'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 3'), "\n\n";

# -----------------------------------------------------

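sub ExtractSection
{
    my ($html, $site, $section) = @_;

    # look up the start/finish patterns; give up on an unknown site
    my $p = $patterns{$site} or return;
    my $ps = $p->{start};
    my $pf = $p->{finish};

    # substitute the section name for the placeholder, escaping any
    # regex metacharacters it may contain
    $ps =~ s/<!--KEY-->/\Q$section\E/;
    $pf =~ s/<!--KEY-->/\Q$section\E/;

    # /s lets . match newlines, so a section can span several lines
    my ($text) = $html =~ /($ps.*?$pf)/s;
    return $text;
}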

__DATA__
<HTML>
<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>
<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>
<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
</HTML>

And the output:

<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>

<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>

<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
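
For the "publications" and "interests" keywords from the question, the same start/finish idea can be loosened a little so that the heading only has to contain the keyword, ignoring case. This is just a rough sketch under assumptions: the sample page, the heading text, and the helper name extract_by_keyword are all made up for illustration, and it only works on pages that follow the same heading layout as the www.foo.com sample. It is still deterministic pattern matching, not the generic engine mentioned earlier.

```perl
use strict;
use warnings;

# Hypothetical page that follows the same layout as the www.foo.com sample.
my $html = <<'END_HTML';
<HTML>
<h3><font size=+1><b>Research Interests</b></font>
<br>
<li>Parsing
<li>Text mining
<br>
<h3><font size=+1><b>Publications</b></font>
<br>
<li>Paper A
<li>Paper B
<br>
</HTML>
END_HTML

# extract_by_keyword is an illustrative helper, not part of the code above.
# The heading only has to contain the keyword, and case is ignored.
sub extract_by_keyword {
    my ($html, $keyword) = @_;

    # [^<]* keeps the match inside a single heading; \Q...\E escapes the keyword
    my $start  = qr{<h3><font[^>]*><b>[^<]*\Q$keyword\E[^<]*</b>}i;

    # same finish rule as before: a <br> that does not follow the heading
    my $finish = qr{(?<!</font>\n)<br>};

    my ($text) = $html =~ /($start.*?$finish)/s;
    return $text;    # undef when no heading contains the keyword
}

for my $keyword (qw(publications interests)) {
    my $text = extract_by_keyword($html, $keyword);
    print defined $text ? "$text\n\n" : "No section found for '$keyword'\n\n";
}
```

If a page uses a different heading layout, the start and finish patterns have to change with it, which is why the per-site hash of patterns is still the safer design.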

In reply to Re: Re: Re: text extract by Roger
in thread text extract by shu

	PerlMonks