PerlMonks  

Fetching data from a corporate website using LWP

by Poblachtach32 (Acolyte)
on Aug 02, 2002 at 08:57 UTC ( [id://187023]=perlmeditation )

I was considering using the LWP function "get" to retrieve information from a corporate website. I was planning on accessing the page every 2 hours, which is 12 accesses a day. Is this wrong ethically? The information I plan to access is "public" information, in that it can be viewed with a browser anyway. But I want to use Perl to correlate this data. Would you guys be disgusted with someone who would use this technique? Or is it "OK" as long as I'm not hogging the server by accessing the page every few minutes? Also, is it "wrong" for my script to pretend it's Netscape in order to receive the page in a Netscape-compatible form?

Replies are listed 'Best First'.
Re: Fetching data from a corporate website using LWP
by Abigail-II (Bishop) on Aug 02, 2002 at 12:25 UTC
    It's hard to say. Did you ask the owners of the site? You also shouldn't forget that nowadays almost all sites carry advertising, which is either the main reason for the site or a way to help the site sustain itself.

    Sure, your 12 hits a day won't have much impact. But what if it becomes commonplace? What if the majority of the people here used LWP to access perlmonks, decimating the ad hits?

    I've written my share of LWP scripts, and I've written ad busting proxies. But I'm not convinced everything I did was ethical. I'd say it depends on the site, and the views of the owners of the site.

    I don't think it's wrong for your script to pretend it's Netscape, but I do think it's wrong to ignore a robots.txt file. If the site has a policy against automated harvesting, bypassing that policy would certainly be "wrong".

    Abigail
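
A minimal sketch of the robots.txt check mentioned above, using the WWW::RobotRules module from CPAN. The robot name, the site URL, and the robots.txt contents are all made up for illustration; the rules are inlined so the sketch runs without touching the network. LWP::RobotUA applies the same kind of rules automatically when it fetches pages, so it may be the more convenient choice in a real script.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robot name and site; substitute your own.
my $agent = 'CorrelateBot/0.1';
my $site  = 'http://www.example.com';

# A made-up robots.txt, inlined so the sketch needs no network access.
# A real script would fetch "$site/robots.txt" before anything else.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

my $rules = WWW::RobotRules->new($agent);
$rules->parse("$site/robots.txt", $robots_txt);

# Ask before each fetch whether the URL is open to robots.
for my $path ('/prices.html', '/private/report.html') {
    my $ok = $rules->allowed("$site$path") > 0;
    print "$path: ", ($ok ? 'may fetch' : 'must skip'), "\n";
}
```

A script that gets "must skip" for the page it wants should stop there and ask the site owner instead.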

Re: Fetching data from a corporate website using LWP
by Ryszard (Priest) on Aug 02, 2002 at 09:12 UTC
    If you have a valid business requirement for this data, this is (IMO) a dirty method of obtaining it.

    In an ideal world, you'd ask the developers of the website to make the source of the data available to you so you can get it yourself.

    If this is not viable, then I see no major problem with it. In fact, the frequency of your "get" is entirely dependent on your requirement.

    Again, if it's a valid business requirement, why not get it from the source?

    Update: My reply is from the POV of you being internal to the corporate site.

Re: Fetching data from a corporate website using LWP
by George_Sherston (Vicar) on Aug 02, 2002 at 14:54 UTC
    On the rare occasions I've done something like this I've wrestled with my conscience a bit and then salved it by making my script strip out the addresses behind any banners on the site, and then use LWP to get the data from these addresses. Obviously, although this is nice for the site owner I'm now ripping off the advertisers, who think my LWP clickthrough is something to celebrate, and I'm not sure this is any less reprehensible; but it pushes the reprehensibility a little further away. Perhaps what you really need to do is write a script that clicks through the advertisers and then randomly buys stuff from them. But that way madness lies.

    § George Sherston
Re: Fetching data from a corporate website using LWP
by mojotoad (Monsignor) on Aug 02, 2002 at 19:45 UTC
    I think the key here is the use to which you put the extracted information. If it's for personal use then I wouldn't worry too much about it.

    I wrestled a bit with a similar question -- I was distributing some modules that yanked information off of sites (historical stock quotes, to be precise). After some constructive conversations with pjf, I came to realize that scripts such as these are nothing more than a browser. The terms of service for a site apply to the user of the browser, not the author. So in this sense I passed the buck -- here's a tool, read the TOS of each site involved and see if it applies to *you* -- the TOS for the site does not apply to the tool in hand.

    After all, what if you use Mozilla and banish images from certain advert servers? That's not the fault of the browser's authors -- they merely provide a useful tool. The TOS of the site applies to the user of the browser.

    As the user of your tool, you will have to examine how you are using the data you are fetching. If you're repackaging it or selling it as is, that's a problem ethically and potentially financially as well, if the information source comes after you. If, on the other hand, you are selling analytical work derived from the data, well, that's not so cut and dried, since you're adding value. Even so, as several people have pointed out, you should cut out the middleman and buy the information directly. Do this not merely to salve your conscience but to limit your legal liability.

    But if it's for personal use then I think you're just fine and I wouldn't worry about it. You're using a modified browser, end of story.

    Matt

Re: Fetching data from a corporate website using LWP
by andreychek (Parson) on Aug 02, 2002 at 17:07 UTC
    As many are saying, I definitely believe it depends on the site you are retrieving the data from. If they don't have advertisements on their site, you hitting it every few hours with LWP is less stressful on their server than somebody hitting it more frequently with an interactive browser. Even better, you could set up your script to only work at night, when few people would be there.

    However, many sites have policies on this, which can often be found at the bottom of their site. For example, WhoWhere.com states the following in their terms of service:

    (You agree not to) Sell, distribute, or make any commercial use of data obtained from any Lycos database or make any other use of data from any Lycos database in a manner which could be expected to offend the person for whom the data is relevant

    -and-

    Use automated means, including spiders, robots, crawlers, or the like to download data from any Lycos Network database.

    Also, the terms of service for people.yahoo.com state:

    You agree not to reproduce, duplicate, copy, sell, resell or exploit for any commercial purposes, any portion of the Service, use of the Service, or access to the Service.

    The above statements make it sound like retrieving any data from either of those sites for any commercial purpose may be breaking their terms of service. So, I'd just make sure you read the terms of service and such for the site you're looking into. You may want to email them, and explicitly ask their permission -- they may let you do it, particularly if you tell them it'd only be once an hour throughout the night.
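
One way to keep a script to the "once an hour throughout the night" promise is to have it check the clock before it fetches anything. The sketch below is only an illustration: the 22:00 to 06:00 window, the one-hour minimum gap, and the names is_off_peak and may_fetch are assumptions made up for this example, not anything a site's terms specify.

```perl
use strict;
use warnings;

# Assumed policy, not taken from any site's terms: fetch only between
# 22:00 and 06:00 local time, and no more than once per hour.
my $NIGHT_START  = 22;
my $NIGHT_END    = 6;
my $MIN_INTERVAL = 60 * 60;    # seconds

# True if $hour (0-23) falls in the overnight window, which wraps midnight.
sub is_off_peak {
    my ($hour) = @_;
    return ($hour >= $NIGHT_START || $hour < $NIGHT_END) ? 1 : 0;
}

# True if a fetch is permitted at epoch time $now, given the epoch time
# of the previous fetch (undef if there has been none yet).
sub may_fetch {
    my ($now, $last_fetch) = @_;
    my $hour = (localtime $now)[2];
    return 0 unless is_off_peak($hour);
    return 1 unless defined $last_fetch;
    return ($now - $last_fetch) >= $MIN_INTERVAL ? 1 : 0;
}

print may_fetch(time, undef)
    ? "OK to fetch now\n"
    : "Wait for the overnight window\n";
```

The script would call may_fetch just before its LWP request and record the time of each successful fetch, so a cron job that fires too often still stays within the agreed limits.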

    Good luck!
    -Eric

    --
    Lucy: "What happens if you practice the piano for 20 years and then end up not being rich and famous?"
    Schroeder: "The joy is in the playing."
