PerlMonks  

Re: Retreive, modify, & display webpage

by AidanLee (Chaplain)
on Jan 03, 2002 at 19:57 UTC ( [id://135995]=note )


in reply to Retreive, modify, & display webpage

I'll re-post this here, as the other post is marked as the duplicate.

What you'll probably want to do is walk through the bulleted list and, for each bullet:

  1. Pull off the first line of text (the name)
  2. Then get the URL from the link(s?) that come after that bullet, but before the next.

It may be difficult to do with TokeParser, since the generated page doesn't close its list-element (<li>) tags and I don't know what it can or can't handle. If that does not work, then, as much as it's usually unwise to advocate it, since you're working with a "known format" it would be possible to parse this page with regular expressions:

my @document = split /\n/, $document;
my $entry = '';
foreach ( @document ) {
    m|^<li>(.*?)</strong>| and do { $entry = $1; next };
    m|<a href=(.*?)>(.*?)</a>| and do {
        my $url = $1;
        $url =~ s/CMD=TABLES/CMD=RET/;
        my $text = $2;
        if ($text eq "STF1A" || $text eq "STF3A") {
            print OUTPUT "<a href=$url/FMT=HTML/T=P1>$entry $text</a><br />\n";
        }
        next;
    };
}

Re: Re: Retreive, modify, & display webpage
by Sang (Acolyte) on Jan 03, 2002 at 23:33 UTC
    Aidan: Thanks for the reply, I'm currently trying the regex approach and ran into a little bump. The pattern for grabbing the link's text description...
    m|<a href=(.*?)>(.*?)</a>|
    ...will grab everything but "STF3A" and "STF1A". Given:
    Browse Tiger <a href="http://tiger.census.gov/cgi-bin/mapbrowse-tbl?lat=36.12000 &lon=-95.94135&wid=0.75&ht=0.75&mlat=36.12000&mlon=-95.94135&msym=redpin&off=CIT IES&mlabel=Tulsa+County,+OK">Map</a> of area.<br>
    $text will hold "Map" but when given:
    Lookup 1990 Census <a href=http://venus.census.gov/cdrom/lookup/CMD=TABLES/DB=C9 0STF1A/F0=FIPS.STATE/F1=FIPS.COUNTY90/F2=STUB.GEO/LEV=COUNTY90/SEL=40,143,Tulsa+ County>STF1A</a>
    $text is empty. I've tried tweaking the pattern, but I'm even more of a newbie with regex than I am with Perl. Any suggestions?
      If STF1A and STF3A are the only two strings you'll ever want to match you might consider changing it to this:
      m{<a href=(.*?)>(STF1A|STF3A)</a>}
      But that won't necessarily address why it isn't matching. If the URLs you're parsing are broken on multiple lines like that you'll need to add the 's' modifier so that the .*? will match newlines as well:
      m{<a href=(.*?)>(STF1A|STF3A)</a>}s
      HTH
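
To illustrate the 's' modifier point, here is a minimal sketch. The sample link and the `example.invalid` URL are made up for the demonstration, and `match_link` is a hypothetical helper, not something from the original script. Braces are used as the pattern delimiters so that the `|` inside the group is read as alternation rather than as the end of the pattern.

```perl
use strict;
use warnings;

# Hypothetical sample: one link whose unquoted URL is broken across two lines.
my $wrapped = "<a href=http://example.invalid/lookup/CMD=TABLES/\nDB=C90STF1A>STF1A</a>";

# Same pattern twice; only the second has the 's' modifier.
my $link_re_single = qr{<a href=(.*?)>(STF1A|STF3A)</a>};
my $link_re_multi  = qr{<a href=(.*?)>(STF1A|STF3A)</a>}s;

# Hypothetical helper: returns (url, text) on a match, or an empty list.
sub match_link {
    my ($html, $re) = @_;
    return $html =~ $re;
}

my @without_s = match_link($wrapped, $link_re_single);  # '.' stops at the newline
my @with_s    = match_link($wrapped, $link_re_multi);   # '.' may cross the newline
print scalar(@without_s), " captures without /s, ", scalar(@with_s), " with /s\n";
```

Note that this only matters when the pattern is applied to text that still contains newlines. A loop that first splits the document on `\n` never sees a link that spans two lines, so in that case the modifier alone would not help.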
        I figured it out actually, I don't understand it, but I figured it out. If you move the
        $url =~ s/CMD=TABLES/CMD=RET/;
        line into the if block, it works fine; if you don't, it will only grab the "Map" links.
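
A likely explanation for the behaviour described above: in Perl, `$1` and `$2` always refer to the most recent successful match. When `s/CMD=TABLES/CMD=RET/` succeeds, it replaces the captures from the link pattern, and since it has no groups of its own, `$2` becomes undefined. The "Map" URL contains no `CMD=TABLES`, so the substitution fails there and `$2` survives. The sketch below reproduces this with two hypothetical helper subs and made-up `example.invalid` URLs; it is an illustration, not the original script.

```perl
use strict;
use warnings;

# Hypothetical sample lines modelled on the two links quoted in the thread.
my $map_line = '<a href="http://example.invalid/map?lat=36.12">Map</a>';
my $stf_line = '<a href=http://example.invalid/lookup/CMD=TABLES/DB=C90STF1A>STF1A</a>';

# Reads $2 after the substitution, in the same order as the original loop.
sub text_after_substitution {
    my ($line) = @_;
    return unless $line =~ m|<a href=(.*?)>(.*?)</a>|;
    my $url = $1;
    $url =~ s/CMD=TABLES/CMD=RET/;  # if this succeeds, $1 and $2 are reset
    my $text = $2;                  # undef whenever the substitution matched
    return $text;
}

# Copies both captures before running any other match.
sub text_before_substitution {
    my ($line) = @_;
    return unless $line =~ m|<a href=(.*?)>(.*?)</a>|;
    my ($url, $text) = ($1, $2);    # save the captures first
    $url =~ s/CMD=TABLES/CMD=RET/;
    return $text;
}

for my $line ($map_line, $stf_line) {
    my $late  = text_after_substitution($line);
    my $early = text_before_substitution($line);
    printf "late=%s early=%s\n", defined $late ? $late : '(undef)', $early;
}
```

Copying the captures into lexical variables immediately after the match, as in the second sub, avoids the problem regardless of where the substitution ends up.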
