PerlMonks  

•Re: Eliminating "duplicate" domains from a hash/array

by merlyn (Sage)
on Mar 30, 2003 at 22:06 UTC [id://246793]


in reply to Eliminating "duplicate" domains from a hash/array

You can't do it precisely programmatically. You have to determine that two pages are "close enough" when you hit them. I know Google manages to do that, but I've run into other web walkers that don't.

I found that out by putting a link on my webserver's root page to "-", and symlinking "-" to "." in my document root. So every page on my website was also accessible with any number of /-/-/-/-/- prefix segments before the real URL. Google immediately figured it out, but I had other webcrawlers visiting (and indexing!) my entire web site some 15 or 20 levels deep before giving up.
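Catching such duplicates "when you hit them" can be sketched with a content fingerprint: whatever URL a page was reached by, hash its normalized body and skip it if that hash has been seen. This is a minimal sketch (not merlyn's actual code) that only catches exact duplicates after whitespace normalization; genuinely "close enough" near-duplicate detection needs fuzzier comparison, which this does not attempt.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;    # fingerprint => times seen

sub is_duplicate {
    my ($body) = @_;
    $body =~ s/\s+/ /g;       # collapse whitespace runs before hashing
    $body =~ s/^ //;          # trim leading space
    $body =~ s/ $//;          # trim trailing space
    my $fp = md5_hex($body);
    return $seen{$fp}++ ? 1 : 0;    # 0 on first sighting, 1 thereafter
}

print is_duplicate("<html>hello</html>"),   "\n";   # 0 - new page
print is_duplicate("<html>hello</html>\n"), "\n";   # 1 - same after normalization
print is_duplicate("<html>bye</html>"),     "\n";   # 0 - distinct content
```

A real spider would fetch the body with LWP and fingerprint before following the page's links, so each page's links get walked only once no matter how many URLs reach it.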

If you are spidering your own site, you can add code in your spider to canonicalize your URLs before fetching. I did that in a few of my columns.
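Canonicalizing before fetching might look like the sketch below. This is not the code from the columns, just an illustration: it collapses "." and ".." path segments, and also treats "-" as an alias for "." to match the symlink setup described above (that equivalence is specific to that server, not a general URL rule). It assumes absolute paths.

```perl
use strict;
use warnings;

# Collapse redundant path segments so that /-/-/foo, /./foo, and /foo
# all map to the same key before fetching.  Treating "-" like "." is an
# assumption specific to the symlink trick described above.
sub canonicalize_path {
    my ($path) = @_;
    my @out;
    for my $seg (split m{/}, $path, -1) {
        next if $seg eq '.' or $seg eq '-';   # current-directory aliases
        if ($seg eq '..') {                   # step up one level
            pop @out if @out;
            next;
        }
        push @out, $seg;
    }
    my $canon = join '/', @out;
    $canon = "/$canon" unless $canon =~ m{^/};
    return $canon;
}

print canonicalize_path('/-/-/-/index.html'), "\n";   # /index.html
print canonicalize_path('/a/./b/../c'),       "\n";   # /a/c
```

In a spider you would apply this (plus lowercasing the host, stripping default ports, and so on) to every extracted link before checking your %seen hash, so equivalent URLs are fetched once.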

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.

Replies are listed 'Best First'.
Re: •Re: Eliminating "duplicate" domains from a hash/array
by pg (Canon) on Mar 31, 2003 at 02:59 UTC
    As for the '-' to '.' symlink, that is obviously the kind of problem that can be resolved precisely programmatically. The fact that Google can resolve it clearly shows it is resolvable; the fact that others cannot deliver the same thing only means their programs are not smart enough.

    We have to clearly distinguish what is logically doable from what is not. Something that nobody handles, or that somebody handles badly, is not necessarily logically unresolvable.

    The actual difficulty of comparing URLs has really nothing to do with this kind of small trick, which is obviously resolvable both logically and programmatically.

    The real problem is that the solution to this kind of issue depends largely on the internal structure of each particular site, which is not regulated by any standard and can differ greatly from site to site.

    We also have to remember that no search engine is just a set of programs; it is a set of programs plus MANUALLY MAINTAINED information. Without that manually maintained information, there would be no Google or any other search engine.
Re: •Re: Eliminating "duplicate" domains from a hash/array
by bsb (Priest) on Mar 31, 2003 at 11:21 UTC
    I'm really curious: why were you doing this?

    Brad
