Avoid "duplicate" fetching with LWP

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.

How do I avoid fetching "duplicate" links that are really a page already in my queue with a fragment appended? Is there a way to tell that the following three links are the same page, minus the fragment?

   http://www.foo.bar/index.html
   http://www.foo.bar/index.html#foo
   http://www.foo.bar/index.html#bar

I pass these into my @links array, and remove the dupes with the following:

   my @pri = grep {!$seen{$_} ++} @links;

The problem is that uniqueness is tested on the stringified link, not on the URI itself, so links that differ only in their fragment are treated as unique. I'd much rather not fetch the same page 20 times because one link appears with 20 different fragments.

Should I split on the '#' there, and fetch everything to the left of it?

What if someone decides to put the '#' in a query string? Is that possible?


Replies are listed 'Best First'.
Re: Avoid "duplicate" fetching with LWP by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Parse the strings with URI, set the fragment to empty, then save the result as a string. I have an example of that in a column of mine.

-- Randal L. Schwartz, Perl hacker
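
For illustration, the approach described in this reply might look roughly like the sketch below. It is not the code from the column mentioned above. The helper name `strip_fragment` and the fourth sample URL are made up for the example, and it assumes the URI module from CPAN is installed. The call to `canonical` is an extra normalisation step that the reply does not ask for. On the '#' question: in a well-formed URL a literal '#' inside the query has to be written as `%23`, so the first literal '#' always starts the fragment.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not from the thread): return the URL as a
# canonical string with any fragment removed.
sub strip_fragment {
    my ($url) = @_;
    my $uri = URI->new($url);
    $uri->fragment(undef);    # undef drops the fragment and its '#'
    return $uri->canonical->as_string;
}

my @links = qw(
    http://www.foo.bar/index.html
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html#bar
    http://www.foo.bar/search?q=a%23b#top
);

# Same %seen idiom as in the question, applied after normalising.
my %seen;
my @pri = grep { !$seen{$_}++ } map { strip_fragment($_) } @links;

print "$_\n" for @pri;
```

With this sample list, @pri should end up with two entries: the index page once, and the search URL with its query (including the encoded `%23`) intact.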
Re: Avoid "duplicate" fetching with LWP by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
I would do just what you propose: split on the '#' and fetch everything to the left of it. That is, split each link on '#' into a list and keep the first element of that list:

   for my $link (@links) {
       # Everything before the first '#' is the page URL
       my @link_tokens = split /#/, $link;
       # @pri still needs the %seen de-duplication afterwards
       push @pri, $link_tokens[0];
   }

That should also handle those rare cases where someone puts a '#' sign in the query string. I think(?) you only care about the part of the link before the first '#' sign, right? HTH.
Re: Avoid "duplicate" fetching with LWP by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC
How about:

   use Data::Dumper;

   my @uris = qw(
       http://www.foo.bar/index.html
       http://www.foo.bar/index.html#foo
       http://www.foo.bar/index.html#bar
   );

   my %seen;
   # Keep only the text before the first '?' or '#', then drop repeats
   my @unique_uris = grep !$seen{$_}++, map /^([^?#]+)/, @uris;
   print Dumper(\@unique_uris);

That would catch everything to the left of a '#' (fragment) or '?' (query) character, if there is one, which I believe is what you want. Hope that helps.
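
One thing to weigh with the pattern in this reply: because it also stops at '?', two links that differ only in their query string collapse into one. A small sketch comparing it with a fragment-only variant may help decide which behaviour is wanted. The `view.cgi` URLs are invented for the example and the variable names are placeholders.

```perl
use strict;
use warnings;

my @uris = qw(
    http://www.foo.bar/view.cgi?id=1
    http://www.foo.bar/view.cgi?id=2#notes
    http://www.foo.bar/index.html#foo
    http://www.foo.bar/index.html
);

# Pattern from the reply: keep text before the first '?' or '#'.
my %seen_a;
my @no_query = grep { !$seen_a{$_}++ }
               map  { /^([^?#]+)/ ? $1 : () } @uris;

# Fragment-only variant: keep text before the first '#', query included.
my %seen_b;
my @keep_query = grep { !$seen_b{$_}++ }
                 map  { /^([^#]+)/ ? $1 : () } @uris;

print "strip query and fragment:\n", map { "  $_\n" } @no_query;
print "strip fragment only:\n",      map { "  $_\n" } @keep_query;
```

Here the first list has two entries (both `view.cgi` links merge), while the second has three, since `id=1` and `id=2` stay distinct. If the query string selects different content on the target site, the fragment-only form is probably the safer choice.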
