technical with IPs

by Anonymous Monk
on Oct 24, 2006 at 17:50 UTC ( [id://580342] )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This isn't completely OT...

I have a picture site built in CGI that is so popular that a lot of people use their bots to surf through the pages and steal all the pics.

I am not much of a technical person and won't pretend to know much about anything, but the question is: can I SSI a Perl script that checks the IP address of each page caller and, if two IPs arrive with the same first three octets (but different final octets), blocks them from the site for 15 minutes?

AFAIK, the chance of two visitors sharing the same first three IP octets actually being two unique people would be about 1 in a million, so I assume I can rest assured it's someone's bot trying to scrape my pages.

Or can't I do this? Would I potentially be declining traffic from legit sources?
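For concreteness, the check being described might look like the sketch below. It is hypothetical: the subroutine name is made up, and %seen would have to live in a shared store (file, DBM, database) because each CGI request runs in a fresh process. As the replies explain, the assumption behind it is flawed.

    # flag a hit when two different final octets show up under the same
    # /24 prefix within 15 minutes
    use strict;
    use warnings;

    my %seen;   # "a.b.c" prefix => { final octet => last-seen epoch seconds }

    sub looks_like_scraper {
        my ($ip) = @_;
        my ($prefix, $last) = $ip =~ /^(\d+\.\d+\.\d+)\.(\d+)$/
            or return 0;                  # ignore anything not dotted-quad
        my $now    = time;
        my $octets = $seen{$prefix} ||= {};

        # forget final octets not seen within the 15-minute window
        delete @{$octets}{ grep { $now - $octets->{$_} > 15 * 60 } keys %$octets };

        $octets->{$last} = $now;
        return keys(%$octets) > 1;        # two different final octets seen
    }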

Replies are listed 'Best First'.
Re: technical with IPs
by samtregar (Abbot) on Oct 24, 2006 at 18:12 UTC
    AFAIK, the chance of two visitors sharing the same first three IP octets actually being two unique people would be about 1 in a million, so I assume I can rest assured it's someone's bot trying to scrape my pages.

    This is an invalid assumption. Any two people from the same ISP (AOL, Time Warner, etc.) will be quite likely to have the same first three octets in their IP. I'm not sure why you're trying to determine if two people from the same IP-block are on your site at the same time - this seems unrelated to bots scanning your site.

    Instead, I think you should look into more generic rate-limiting techniques. For example, if you're using CGI::Application you can use CGI::Application::Plugin::RateLimit to limit how fast people can access your site.

    -sam
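
    A minimal sketch of samtregar's CGI::Application::Plugin::RateLimit suggestion, written from memory of the plugin's synopsis -- the run-mode names, thresholds, and SQLite handle are illustrative, so verify the exact interface against the CPAN documentation:

        package MyGallery;
        use strict;
        use warnings;
        use base 'CGI::Application';
        use CGI::Application::Plugin::RateLimit;
        use DBI;

        sub setup {
            my $self = shift;
            $self->start_mode('show_pic');
            $self->run_modes([qw(show_pic slow_down)]);

            # the plugin logs hits in a database table and checks them
            # against per-run-mode limits
            my $rate_limit = $self->rate_limit;
            $rate_limit->dbh(DBI->connect('dbi:SQLite:dbname=hits.db'));
            $rate_limit->protected_modes(
                show_pic => { timeframe => '60s', max_hits => 30 },
            );
            $rate_limit->violation_mode('slow_down');  # run mode shown on violation
        }

        sub show_pic  { return "...the gallery page..." }  # placeholder
        sub slow_down { return "Too many requests -- please wait a bit." }

        1;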

Re: technical with IPs
by Callum (Chaplain) on Oct 24, 2006 at 18:17 UTC
    If you want to block IPs then block them at the webserver level rather than as the page is served.

    Even if you block a specific IP address rather than a range you're potentially blocking "legit sources", though obviously you're more likely to block legit traffic if you're blocking a range.

    Your assumption that two people sharing the first three octets of their IP address must be the same person (to 1 in a million) is highly flawed -- most people's IP addresses come from their ISP, company, university etc, and many will be coming through a proxy server -- blocking even a single IP is potentially going to hit "innocent" users.
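
    To illustrate the webserver-level approach: with Apache of this era, mod_access directives in .htaccess (or the server config) do the blocking before any CGI runs. The addresses below are documentation-range examples, and the partial-address form shows how a range ban sweeps up innocent users on the same ISP:

        # .htaccess -- Apache 2.2-era mod_access syntax
        Order Allow,Deny
        Allow from all
        # block one specific address
        Deny from 203.0.113.25
        # a partial address blocks the whole range -- and any "innocent"
        # users behind the same ISP
        Deny from 203.0.113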

Re: technical with IPs
by blue_cowdawg (Monsignor) on Oct 24, 2006 at 18:24 UTC
        AFAIK, the chance of two visitors sharing the same first three IP octets actually being two unique people would be about 1 in a million, so I assume I can rest assured it's someone's bot trying to scrape my pages.

    As samtregar points out, that's a very bad assumption. I'm thinking of NAT'ed systems behind a firewall as well. If you had two people coming from the same university, for instance, there's a great chance they would have the same IP address, never mind the same first three octets.


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
      This is very common in the business world as well. Each of the last three companies I worked for had one, two, or three proxy servers that were the source of all the HTTP requests; one company had over five thousand users behind a single HTTP proxy address.
Re: technical with IPs
by chargrill (Parson) on Oct 24, 2006 at 18:42 UTC

    For a little more discussion (that also has some apache configuration ideas) see: blocking site scrapers



    --chargrill
    s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
Re: technical with IPs
by ikegami (Patriarch) on Oct 24, 2006 at 18:38 UTC

    nimdokk provided very good advice.

    Additionally or independently, using a honeypot is simple and effective. In your pages, place a link no user will ever click on (or even see). Anyone who follows that link is a robot. Any further request from that session can be redirected to an error page.

    In case the user uses a web accelerator that prefetched the honeypot, the error page should provide the means for the user to validate himself as a person. Captchas provide such a means.
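
    A minimal sketch of the honeypot, assuming a plain CGI setup; the trap URL, log path, and verification page are invented for illustration:

        #!/usr/bin/perl
        # trap.cgi -- target of a link no human should see or follow,
        # emitted in every gallery page as something like:
        #   <a href="/cgi-bin/trap.cgi" style="display:none" rel="nofollow"></a>
        # Anything requesting it is almost certainly a robot: log the caller,
        # then redirect to a page where a prefetching human can clear
        # themselves with a captcha.
        use strict;
        use warnings;
        use Fcntl qw(:flock);

        my $ip = $ENV{REMOTE_ADDR} || 'unknown';

        open my $log, '>>', '/var/tmp/honeypot.log' or die "open: $!";
        flock $log, LOCK_EX;
        print {$log} join("\t", scalar localtime, $ip), "\n";
        close $log;

        # hand the caller to the (hypothetical) captcha page
        print "Status: 302 Found\r\n",
              "Location: /verify.html\r\n",
              "\r\n";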

      As noted in the other posts, you don't want to block based on IP. One bad user behind a proxy (or NAT'd) shouldn't block all other users coming through the same proxy.

      I do like the idea of a honeypot link though :)

Re: technical with IPs
by nimdokk (Vicar) on Oct 24, 2006 at 18:26 UTC
    What you might look at is something that analyzes the traffic coming in: if you get 20 hits in a matter of seconds, perhaps temporarily block that particular IP address for a minute. I have no idea where you'd start with something like that, but a bot will (likely) be doing something systematic and (reasonably) predictable. It might help to try to limit (but not ban) that activity so you don't penalize legitimate usage (a rough sketch follows below).

    Just my 2 bits :-)
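
    A rough sketch of this idea for a plain CGI environment. The thresholds and state-file path are arbitrary, and Storable plus flock stand in for whatever shared store you prefer -- each CGI request is a fresh process, so the counts can't live in memory. (Note chargrill's caveat below: pick the threshold with the per-page image count in mind.)

        use strict;
        use warnings;
        use Fcntl qw(:flock O_RDWR O_CREAT);
        use Storable qw(fd_retrieve store_fd);

        my $STATE_FILE = '/var/tmp/hit-counts.stor';
        my $WINDOW     = 10;   # seconds per counting window
        my $MAX_HITS   = 20;   # hits allowed inside one window
        my $BAN_FOR    = 60;   # seconds to refuse after a violation

        # returns true if this IP should be temporarily refused
        sub too_fast {
            my ($ip) = @_;
            my $now = time;

            sysopen my $fh, $STATE_FILE, O_RDWR | O_CREAT or die "open: $!";
            flock $fh, LOCK_EX;
            my $state = -s $fh ? fd_retrieve($fh) : {};

            my $rec = $state->{$ip} ||= { hits => [], banned_until => 0 };
            unless ($rec->{banned_until} > $now) {
                @{ $rec->{hits} } = grep { $now - $_ < $WINDOW } @{ $rec->{hits} };
                push @{ $rec->{hits} }, $now;
                $rec->{banned_until} = $now + $BAN_FOR
                    if @{ $rec->{hits} } > $MAX_HITS;
            }

            seek $fh, 0, 0;
            truncate $fh, 0;
            store_fd($state, $fh);
            close $fh;

            return $rec->{banned_until} > $now;
        }

        # in the CGI, before serving anything:
        # if (too_fast($ENV{REMOTE_ADDR} || '')) {
        #     print "Status: 503 Service Unavailable\r\n\r\nSlow down.\r\n";
        #     exit;
        # }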

      You'd have to be careful with this - if it's a picture gallery, a single user will appear to "hit" the site 1+N (number of pictures) times within a very short period of time.



      --chargrill
      s**lil*; $*=join'',sort split q**; s;.*;grr; &&s+(.(.)).+$2$1+; $; = qq-$_-;s,.*,ahc,;$,.=chop for split q,,,reverse;print for($,,$;,$*,$/)
        Very true.
