http://qs321.pair.com?node_id=252312

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.

How do I avoid fetching "duplicate" links, which are actually fragments at the end of a valid page in my queue? Is there a way to tell that the following three links are the same page, minus the fragment?

http://www.foo.bar/index.html
http://www.foo.bar/index.html#foo
http://www.foo.bar/index.html#bar

I pass these into my @links array, and remove the dupes with the following:

my @pri = grep { !$seen{$_}++ } @links;

The problem is that "uniqueness" is tested at the string level, not at the URI level, so URLs that differ only in their fragment are seen as unique. I'd much rather not fetch the same page 20 times for a link that appears once with 20 different fragments.

Should I split on the '#' there, and fetch everything to the left of it?

What if someone decides to put the '#' in a query string? Is that possible?

•Re: Avoid "duplicate" fetching with LWP
by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Re: Avoid "duplicate" fetching with LWP
by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
    I would do just what you propose - split on the '#' and fetch everything to the left. That is, split on the '#' receiving into a list, and then fetch on the 1st element of the list, like:
        for my $link (@links) {
            # keep only the part before the first '#'
            my @link_tokens = split /#/, $link;
            push @pri, $link_tokens[0];
        }
    That should also handle those rare cases where someone puts a '#' sign in the query string. I think(?) you only care about the part of the link before the first '#' sign, right?

    HTH.
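
The split-and-dedupe idea above can be combined into one loop. This is only a minimal sketch using core Perl: the sample links (including the one with an escaped %23 in its query string) are hypothetical, and it assumes the first '#' in a link always starts the fragment.

```perl
use strict;
use warnings;

# Hypothetical sample links; the first three differ only in their fragment.
my @links = qw(
    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html#bar
    http://www.foo.bar/search?q=a%23b#results
);

my %seen;
my @pri;
for my $link (@links) {
    # A limit of 2 splits only at the first '#'; the rest is the fragment.
    my ($base) = split /#/, $link, 2;

    # Keep the first occurrence of each base URL, preserving order.
    push @pri, $base unless $seen{$base}++;
}

print "$_\n" for @pri;
```

Because only the part after the first '#' is dropped, the query string of the last link survives untouched, escaped %23 included.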
Re: Avoid "duplicate" fetching with LWP
by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC

    How about:

        use Data::Dumper;

        my @uris = qw(
            http://www.foo.bar/index.html
            http://www.foo.bar/index.html#foo
            http://www.foo.bar/index.html#bar
        );

        my %seen;
        my @unique_uris = grep !$seen{$_}++, map /^([^?#]+)/, @uris;
        print Dumper(\@unique_uris);

    That keeps everything to the left of the first # (fragment) or ? (query) character, if there is one, which I believe is what you want.

    Hope that helps.
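
If the query string should be kept and only the fragment dropped, another option is to compare the links at the URI level rather than as raw strings. The following is a hedged sketch, not a definitive implementation: it assumes the URI module from CPAN is installed (LWP depends on it), and the sample links are hypothetical.

```perl
use strict;
use warnings;
use URI;

# Hypothetical sample links: fragments, a mixed-case host, and a query string.
my @links = qw(
    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://WWW.FOO.BAR/index.html#bar
    http://www.foo.bar/index.html?page=2#top
);

my %seen;
my @pri;
for my $link (@links) {
    # canonical() normalizes things such as the case of the host name.
    my $uri = URI->new($link)->canonical;

    # Setting the fragment to undef removes it, '#' included.
    $uri->fragment(undef);

    my $key = $uri->as_string;
    push @pri, $key unless $seen{$key}++;
}

print "$_\n" for @pri;
```

With these sample links, two entries should remain: the plain index.html and the ?page=2 variant, since a different query string can mean a different page.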