You could look at
to write your own robot.
They use a program called a spider that looks for a file called "robots.txt" in the root directory of the specified (registered) domain. So they try to get "http://www.perlmonks.org/robots.txt". This file specifies what the spider may read and which directories it may search. But even then, access must still be granted.
You can read about this in the help sections of these search engines, which describe how such a file must (and can) look and how it works.
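For illustration, a minimal robots.txt could look like this (the paths and the bot name below are made up):

    # every spider must stay out of these directories
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

    # this particular spider is banned from the whole site
    User-agent: EvilBot
    Disallow: /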
CPAN has modules for it.
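For example, LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and honors robots.txt for you; a minimal sketch (the bot name and contact address are placeholders):

    use strict;
    use warnings;
    use LWP::RobotUA;

    # agent name and contact address are required; both are made up here
    my $ua = LWP::RobotUA->new('MyBot/1.0', 'mybot@example.com');
    $ua->delay(1);    # wait at least 1 minute between requests to a host

    my $res = $ua->get('http://www.perlmonks.org/');
    if ($res->is_success) {
        print "allowed and fetched OK\n";
    } else {
        # URLs forbidden by robots.txt come back as 403 errors
        print "request failed: ", $res->status_line, "\n";
    }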
Have a nice day
The decision is left to your taste.
... and what a spider does is look at a specified page and find all the links on it. It then follows all of those links (at least the ones local to the site) and downloads them as well. You keep repeating this process until you have all the pages (mind you, you have to keep track of which pages you have already visited to prevent an infinite loop). But there are already modules for that (though sometimes it's nice to know what your modules are doing).
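A bare-bones version of that loop, using LWP::UserAgent and HTML::LinkExtor (the start URL and agent name are placeholders, and a %seen hash does the loop prevention):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $start = 'http://www.example.com/';    # made-up start page
    my $host  = URI->new($start)->host;
    my $ua    = LWP::UserAgent->new(agent => 'MySpider/0.1');

    my %seen;                  # URLs already fetched -- prevents infinite loops
    my @queue = ($start);

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success and $res->content_type eq 'text/html';

        # collect every <a href="..."> on the page
        my @links;
        my $p = HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            push @links, $attr{href} if $tag eq 'a' and $attr{href};
        });
        $p->parse($res->decoded_content);

        for my $link (@links) {
            my $abs = URI->new_abs($link, $url);
            $abs->fragment(undef);    # drop #fragments so pages aren't re-queued
            # follow only links local to the site
            push @queue, $abs->as_string
                if $abs->scheme =~ /^https?$/ and $abs->host eq $host;
        }
        print "fetched $url\n";
    }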
- Ant