Re: Retreive, modify, & display webpage

I'll re-post this here, as the other post is marked as a duplicate.

What you'll probably want to do is walk through the bulleted list and for each bullet:

Pull off the first line of text (the name)
Then get the link from the link(s?) that comes after that bullet, but before the next.

It may be difficult to do with TokeParser, since the generated page doesn't close its list-element (<li>) tags and I don't know what TokeParser can or can't handle. If that doesn't work, and as much as it's usually unwise to advocate it, since you're working with a "known format" it would be possible to parse this page with regular expressions:

my @document = split /\n/, $document;

my $entry = '';

foreach ( @document ) {

    m|^<li>(.*?)</strong>| and do { $entry = $1; next };

    m|<a href=(.*?)>(.*?)</a>| and do {

        my $url = $1;
        $url =~ s/CMD=TABLES/CMD=RET/;

        my $text = $2;
        if ($text eq "STF1A" || $text eq "STF3A") {
            print OUTPUT "<a href=$url/FMT=HTML/T=P1>$entry $text</a><br />\n";
        }

        next;
    };
}

Re: Re: Retreive, modify, & display webpage by Sang (Acolyte) on Jan 03, 2002 at 23:33 UTC
Aidan: Thanks for the reply. I'm currently trying the regex approach and ran into a little bump. The pattern for grabbing the link's text description, `m|<a href=(.*?)>(.*?)</a>|`, will grab everything but "STF3A" and "STF1A". Given:

Browse Tiger <a href="http://tiger.census.gov/cgi-bin/mapbrowse-tbl?lat=36.12000
&lon=-95.94135&wid=0.75&ht=0.75&mlat=36.12000&mlon=-95.94135&msym=redpin&off=CIT
IES&mlabel=Tulsa+County,+OK">Map</a> of area.<br>

$text will hold "Map", but when given:

Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C9
0STF1A/F0=FIPS.STATE/F1=FIPS.COUNTY90/F2=STUB.GEO/LEV=COUNTY90/SEL=40,143,Tulsa+
County>STF1A</a>

$text is empty... I've tried tweaking the pattern, but I'm even more of a newbie with regex than I am with perl. Any suggestions?
Re: Re: Re: Retreive, modify, & display webpage by AidanLee (Chaplain) on Jan 04, 2002 at 00:21 UTC
If STF1A and STF3A are the only two strings you'll ever want to match, you might consider changing it to this: `m{<a href=(.*?)>(STF1A|STF3A)</a>}` (braces as delimiters, so the alternation `|` isn't taken for the end of the pattern). But that won't necessarily address why it isn't matching. If the urls you're parsing are broken on multiple lines like that, you'll need to add the 's' modifier so that the .*? will match newlines as well: `m{<a href=(.*?)>(STF1A|STF3A)</a>}s` HTH
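To illustrate the 's' modifier point, here is a minimal sketch. The sample string and the example.com URL are made up for the demonstration. It also assumes the pattern is applied to the whole document in one string: in the loop from the original post the text is split on "\n" first, so no single element contains a newline and 's' would make no difference there.

```perl
use strict;
use warnings;

# Hypothetical input: an href value that is broken across two lines.
my $html = "<a href=http://example.com/lookup/DB=C9\n0STF1A>STF1A</a>";

# Without /s, "." stops at the newline, so the match fails.
my $plain  = $html =~ m{<a href=(.*?)>(STF1A|STF3A)</a>}  ? 1 : 0;

# With /s, "." also matches "\n", so the match succeeds.
my $with_s = $html =~ m{<a href=(.*?)>(STF1A|STF3A)</a>}s ? 1 : 0;

print "without /s: $plain, with /s: $with_s\n";
```

Note that with /s the captured URL still contains the newline, so it would probably need to be stripped (for example with `tr/\n//d`) before the link is printed.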
Re: Re: Re: Re: Retreive, modify, & display webpage by Sang (Acolyte) on Jan 04, 2002 at 00:45 UTC
I figured it out actually. I don't understand it, but I figured it out. If you move the `$url =~ s/CMD=TABLES/CMD=RET/;` line into the if block it works fine; if you don't, it will only grab the "Map" links.
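A likely explanation for why moving that line helps: $1 and $2 always refer to the last successful match, and a successful s/// counts as one. For the STF1A and STF3A links the substitution succeeds, and since it has no capture groups, $2 is undefined by the time it is copied into $text. For the "Map" link the substitution fails, so $2 still holds "Map". The sketch below shows the effect and one way around it, which is copying both captures before substituting. The input line and variable names are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical one-line input modelled on the STF1A link in the thread.
my $line = '<a href=http://example.com/CMD=TABLES/DB=C90STF1A>STF1A</a>';

my ($bad_text, $good_url, $good_text);

if ($line =~ m{<a href=(.*?)>(.*?)</a>}) {
    my $url = $1;
    $url =~ s/CMD=TABLES/CMD=RET/;   # succeeds, so $1 and $2 are reset
    $bad_text = $2;                  # undef: the substitution had no groups
}

if ($line =~ m{<a href=(.*?)>(.*?)</a>}) {
    ($good_url, $good_text) = ($1, $2);   # copy both captures first
    $good_url =~ s/CMD=TABLES/CMD=RET/;   # safe: the copies are unaffected
}

print defined $bad_text ? "bad: $bad_text\n" : "bad: (undef)\n";
print "good: $good_url $good_text\n";
```

Moving the substitution into the if block works for the same reason: by then $2 has already been copied into $text.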

