Re: Stream of Consciousness

by gjb (Vicar)
on Jun 28, 2003 at 20:06 UTC


in reply to Stream of Consciousness

Uhm, you may want to be careful with this. Google doesn't like automated user agents.

A while ago, when I worked for a company, we had a prototype running that used Google's web search capabilities. Since it was just a prototype, we didn't bother to contact Google about it until we were sure we'd start using it in production.

One morning I got to work and my colleagues were complaining about the network: they couldn't reach Google, so apparently something was wrong somewhere. I got a little worried, started running some tests, and found out pretty soon that only Google was unreachable.

Yeah, right. They'd noticed that some automated user agent was submitting queries and had blocked that IP address. Unfortunately, all the company's internet traffic was routed through one and the same proxy, so its IP address was blacklisted.

I had to write a very humble letter to Google asking them to please remove that address from their blacklist. Fortunately Google is not mission-critical for that company, and I was backed by my boss, who knew quite well what I was doing, but it earned me a certain reputation nevertheless ;-)

To return to the facts: I think you should have a look at their "terms of use" document; it clearly states that they don't like what your script is doing.

Best regards, -gjb-

Re: Re: Stream of Consciousness
by beretboy (Chaplain) on Jun 29, 2003 at 17:37 UTC
    How can they tell if it's an automated request?

    "Sanity is the playground of the unimaginative" -Unknown
      By pattern. A human being probably won't submit requests at the exact same interval over a long series, nor submit requests for 48 hours straight, or every 2 ms. In other words, human requests should show less regularity and lower volume.
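
      Purely as an illustration, the sort of heuristic meant here might look like the sketch below (the thresholds are invented for the example, not anything Google has published):

          use strict;
          use warnings;
          use List::Util qw(sum);

          # Flag a client whose inter-request gaps are suspiciously regular
          # or whose request volume is suspiciously high.
          sub looks_automated {
              my @t = sort { $a <=> $b } @_;        # request times, epoch seconds
              return 0 if @t < 10;                  # too few samples to judge

              my @gaps = map { $t[$_] - $t[$_ - 1] } 1 .. $#t;
              my $mean = sum(@gaps) / @gaps;
              my $sd   = sqrt( sum(map { ($_ - $mean) ** 2 } @gaps) / @gaps );

              my $too_regular = $mean > 0 && $sd / $mean < 0.1;   # near-constant spacing
              my $too_many    = @t > 1_000;                       # sheer volume
              return $too_regular || $too_many;
          }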

      It was probably mostly the volume, since I used a random number generator for the intervals (not to disguise anything, just out of courtesy). It was not exactly a lot of traffic (compared to what they receive, anyway), but it must have gone on too long and (still) been too regular.
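
      Nothing fancy was involved; a minimal sketch of that kind of randomized spacing (the queries and the delay range here are placeholders, not the prototype's actual values):

          use strict;
          use warnings;
          use LWP::UserAgent;
          use URI::Escape qw(uri_escape);

          my $ua = LWP::UserAgent->new;
          my @queries = ('perl lwp', 'perlmonks');    # placeholder queries

          for my $q (@queries) {
              my $response =
                  $ua->get('http://www.google.com/search?q=' . uri_escape($q));
              # ... do something with $response->content here ...

              sleep(30 + int(rand(60)));              # wait 30-89 seconds between requests
          }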

      Oh well, best regards, -gjb-

        Automated requests may lack browser identification as well (if the request goes to a web page rather than a web service).

        Of course, the best randomized disguise would be to randomize across internet domains, but that won't be practical for most (law-abiding) people. Temporal randomization might not mean much if the time interval is too short, especially when it is shorter than the sampling intervals over which the request series is analyzed.
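
        To illustrate the browser-identification point: LWP announces itself with a "libwww-perl/..." agent string by default, which is a dead giveaway. A minimal sketch follows (the browser string is just an example; overriding it to disguise a script is precisely what the terms of use frown upon):

            use strict;
            use warnings;
            use LWP::UserAgent;

            my $ua = LWP::UserAgent->new;
            print $ua->agent, "\n";    # default is something like "libwww-perl/5.69"

            # Making the request look like a browser's -- shown only to illustrate
            # how scripts give themselves away, not as a recommendation.
            $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');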
