Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I'm fetching links from a webpage via LWP, LWP::UserAgent, and HTML::LinkExtor, and I've run into something I can't figure out.
How do I avoid fetching "duplicate" links, which are actually fragments at the end of a valid page in my queue? Is there a way to tell that the following three links are the same page, minus the fragment?
http://www.foo.bar/index.html
http://www.foo.bar/index.html#foo
http://www.foo.bar/index.html#bar

I pass these into my @links array, and remove the dupes with the following:
my %seen;
my @pri = grep { !$seen{$_}++ } @links;
The problem is that the "uniqueness" test works at the string level, not at the URI level, so URLs that differ only in their fragment are treated as unique. I'd much rather not fetch the same page 20 times for a link that appears once with 20 different fragments.
Should I split on the '#' there, and fetch everything to the left of it?
What if someone decides to put the '#' in a query string? Is that possible?
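To answer the last question first: per RFC 3986, a literal '#' inside a query string must be percent-encoded as %23, so the first unencoded '#' in a URL always begins the fragment. That makes splitting on '#' safe. A minimal sketch (using made-up example URLs and a hypothetical strip_fragment helper) that dedupes on the fragment-free form while keeping the first URL seen:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Drop the fragment: per RFC 3986, an unencoded '#' always
# starts the fragment, so everything from the first '#' on
# can be discarded safely.
sub strip_fragment {
    my ($url) = @_;
    $url =~ s/#.*\z//s;
    return $url;
}

my @links = (
    'http://www.foo.bar/index.html',
    'http://www.foo.bar/index.html#foo',
    'http://www.foo.bar/index.html#bar',
);

# Dedupe on the fragment-free form, not the raw string.
my %seen;
my @pri = grep { !$seen{ strip_fragment($_) }++ } @links;

print "$_\n" for @pri;    # only the first variant survives
```

If the URI module is available, $u->fragment(undef) removes the fragment on a URI object and $u->canonical normalizes the rest of the URL as well, which is more robust than string surgery.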
Replies are listed 'Best First'.
•Re: Avoid "duplicate" fetching with LWP
by merlyn (Sage) on Apr 22, 2003 at 16:52 UTC
Re: Avoid "duplicate" fetching with LWP
by hmerrill (Friar) on Apr 22, 2003 at 17:05 UTC
Re: Avoid "duplicate" fetching with LWP
by perlguy (Deacon) on Apr 22, 2003 at 18:08 UTC