
Thwarting Screen Scrapers

by kschwab (Vicar)
on Jul 18, 2002 at 13:24 UTC ( #182794=perlquestion )

kschwab has asked for the wisdom of the Perl Monks concerning the following question:

I'm curious if anyone has any experience protecting a web-based interface from being "front-ended" by others for their own gain.

Say I spent a good amount of time and money to develop a site that sells "ProductX". Someone then reverse engineers my HTML form->submit process and creates their own front-end, secretly adding an upcharge to the customer. They also change all references to ProductX to ProductY, so I may not be able to manually search for and identify who's screen scraping.

What I've thought of:

  • Dynamically changing form element names, perhaps tied to a Digest::MD5 hash of the session key. Might help, but they could still guess some of that based on the provided values of the form elements.
  • Having the user type in text that matches what's displayed in a .gif image (see this pretty cool node from jcwren on defeating this sort of thing). This bothers me, because I'm making the customer jump through hoops to buy something.
  • Analyze web logs to find people taking an odd path through the site (skipping intro pages, for example). Turns out this isn't useful, since there's so much client-side caching.
Any other ideas? Any modules that might help me?
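The first idea above (deriving form field names from a Digest::MD5 hash of the session key) might be sketched like this; the logical field names and session key here are purely illustrative:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Derive a per-session form field name from a logical field name and the
# session key. The server can always recover the mapping because it knows
# the session key; a scraper sees different names every session.
sub field_name {
    my ($session_key, $logical_name) = @_;
    # Leading letter keeps the result a valid HTML 'name' attribute.
    return 'f' . substr(md5_hex("$session_key:$logical_name"), 0, 12);
}

# Hypothetical usage: build the name map for one session.
my $session_key = 'deadbeefcafe';    # illustrative session key
my %names = map { $_ => field_name($session_key, $_) }
            qw(product quantity address);
```

The server regenerates the same map on submission, so no extra state is needed beyond the session key itself.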

Replies are listed 'Best First'.
Re: Thwarting Screen Scrapers
by ignatz (Vicar) on Jul 18, 2002 at 14:33 UTC
    Back in the dot com boom I spent a few months working for a company that scraped sites that had logins so that one could store all of them in one place and only have to register once. Many sites welcomed it because it got them new members. Some didn't and took counter-measures. Changing the form elements, moving the locations of the forms or changing the required cookies all played havoc on our application. The most effective weapon was sites that simply blocked our IP address.

    As for cookies and HTTP_REFERERs and the like: just because a check can be hacked doesn't mean you should assume they have hacked it and skip the check. Skipping it gives them the luxury of not even having to hack it in the first place.

    Generally, what these guys are doing isn't rocket science. Changing things even a little bit will throw a big spanner into their works. Making sure that your form validator confirms that EVERYTHING is as it should be will also be a big plus.

Re: Thwarting Screen Scrapers
by dws (Chancellor) on Jul 18, 2002 at 16:25 UTC
    I'm curious if anyone has any experience protecting a web-based interface from being "front-ended" by others for their own gain.

    I've spent a fair amount of time on the other end of this problem, dealing with issues around how to co-navigate web pages that are in some way protected. (This was for a customer service application: customers' support organizations had to work around roadblocks, of the type you're looking to set up, that had been erected internally by the web side of their own organizations.)

    If you're willing to put some work in on the back end, one way of throwing a spanner in the works of anyone who is hijacking your form submission process without your noticing is to do the following:

    • When you generate the form, allocate an ID and record it on the backend (e.g., in a database, along with a timestamp that you can use to time the form out).
    • Generate an MD5 hash based on the ID and some secret key known only to your application. Add this hash as an argument to the form action URL [1] (i.e., add "?key=$hash").
    • Put the ID into the form in a hidden field.

    When a form is submitted, it's a simple matter to

    • Check to see if the ID has been used already. This prevents them from grabbing one legitimate key/ID pair and reusing it.
    • Check to see if the ID has expired (if you care)
    • Generate a new hash based on the ID and your secret key, and compare to the one in param('key')
    This leaves the exploiters in the position of either having to come to your site to get a form, or trying to guess your secret key. If they have to come to your site to get a form, you can track and ban them. If your submission form is framed, you can do an automated check using your weblogs for form submissions that aren't matched to a fetch of the framing page. This isn't 100% accurate, but the repeat abuser is who you're looking for.
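A minimal sketch of the scheme above; an in-memory hash stands in for the backend database, and the secret is illustrative:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical secret known only to the application.
my $SECRET = 'some-long-random-string';

# Backend store of issued IDs => issue time. A real application would
# use a database table here instead of an in-memory hash.
my %issued;

# On form generation: allocate an ID, record it with a timestamp, and
# compute the hash to append to the form action URL.
sub issue_form_id {
    my $id = sprintf '%s-%s', time(), int rand 1_000_000;
    $issued{$id} = time();
    my $hash = md5_hex($id . $SECRET);
    return ($id, $hash);    # ID goes in a hidden field, hash on the URL
}

# On submission: reject replays, stale forms, and forged keys.
sub verify_submission {
    my ($id, $key, $max_age) = @_;
    my $issued_at = delete $issued{$id};        # one-shot: prevents reuse
    return 0 unless defined $issued_at;         # unknown or already used
    return 0 if time() - $issued_at > $max_age; # expired
    return md5_hex($id . $SECRET) eq $key;      # recompute and compare
}
```

Because the ID is deleted on first use, grabbing one legitimate key/ID pair and replaying it fails on the second attempt.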

    [1] You could put the hash into a hidden field instead. I recall there being some reason why having it as part of the URL was advantageous, but don't remember specifics. It might have had to do with getting it into the weblogs for later processing.

      Thanks...this is the kind of input I was looking for.

      Obviously any type of measure has a countermeasure, and if it works in a browser, it would work in LWP (or some other interface).

      The addition of a timestamp into the hash calculation is an interesting one.

      We've already worked out a method of using dynamically generated form field names from a hash of the session key. Adding the timestamp perturbs it a bit, and keeps someone from keeping a session alive over a long period of time.

      dws++...thanks again.

        The addition of a timestamp into the hash calculation is an interesting one.

        Interesting, but not what I intended to suggest. Using a timestamp when generating the hash needlessly complicates verification.

        What I meant to suggest was that you save a timestamp when you record generated IDs. This gives you an easy way to "time out" forms, and flush abandoned forms out of your back-end database. It also sets you up for doing some analysis on things like average submit time (the gap between your generating the form, and a user submitting it). A really low submit time is an indication that there's a bot on the other end of the line.
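A sketch of that bookkeeping; the thresholds are illustrative, and a hash again stands in for the backend database:

```perl
use strict;
use warnings;

# Record when each form ID was generated; use the gap between generation
# and submission to time out stale forms and to flag suspiciously fast
# (bot-like) submissions. Thresholds here are illustrative only.
my %generated_at;    # stands in for a database table

sub record_form {
    my ($id) = @_;
    $generated_at{$id} = time();
}

sub check_submission {
    my ($id, $now, $max_age, $min_human) = @_;
    my $t0 = $generated_at{$id};
    return 'unknown' unless defined $t0;
    return 'expired' if $now - $t0 > $max_age;    # abandoned form
    return 'bot?'    if $now - $t0 < $min_human;  # inhumanly fast
    return 'ok';
}
```

Averaging the generation-to-submission gap over many forms gives the "average submit time" analysis described above.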

(jcwren) Re: Thwarting Screen Scrapers
by jcwren (Prior) on Jul 18, 2002 at 16:16 UTC

    I know this sounds evil, but for the moment, the best way to prevent screen scraping may be to use Flash. Flash now supports forms, submissions, authentication, yada yada yada.

    This comment was not intended to address usability, nor the ability of non-Flash-enabled browsers to purchase your product. But Flash is becoming pervasive, and is available on the platform that's most likely to be your customer's browser of choice.

    This will change, no doubt, as client-side Flash tools become as pervasive as various CPAN modules like LWP, etc. But for now, it seems to be the method least likely to suffer from the issue you're concerned about.


    e-mail jcwren

Re: Thwarting Screen Scrapers
by kschwab (Vicar) on Jul 18, 2002 at 14:35 UTC
    I had hoped to limit this to the technical rather than philosophical points, but it looks like the replies are headed elsewhere.

    How about an example close to home?

    Say I create a site, call it "perldudes". Instead of developing my own community, I front-end perlmonks, taking the inbound HTTP requests, pulling nodes from perlmonks, and substituting text as needed (s/perlmonks/perldudes/g, etc.). I also put in my own advertisements and content, and maybe the interface is really crappy.

    As for what this has to do with Perl: it's obviously a bit off-topic. I am interested, however, in how any technique might be implemented in Perl, and which modules might help me along.

    I am aware that there is no way to completely stop this sort of thing. I'm looking for the best ideas to slow it down, or at least stop the simplistic attempts.

    Abigail: I understand your points, but... if someone else can sell my product, but controls the whole customer selling experience, while I have to create, ship and support the product, how is that okay? The "scrapers" go to great lengths to make sure the customer doesn't see that there are two parties involved. They also dish off support, etc., by front-ending the feedback forms.

      I still don't see the big problem. If a person would go to your site and order something, you will have to create, ship and support the product. Just like you have to do when they go to someone else's site. Of course, if you don't want to create, ship and support a product, why do you have it?

      I do assume you are getting paid for creating, shipping and supporting the product. If not, and it's a burden to you, perhaps you should stop. ;-)

      What interests me is how they manage to get in the middle when it comes to paying. How are they getting their share? If they take a credit card number, take their share from the account, then pass the number to you so you take your part, the customers will frown, and someone will think "fraud".


        Part of the retail process is trying to get the customer to come back and give you more of their money. Hard to do that if they don't know who you are.
        It's not one situation, but many.

        Indeed, some of them do make their own charge on the credit card, and I end up handling the resulting mess.

        There are several variations on the theme; some of them actually call out the correct name for the product, but act like they are some sort of authorized reseller.

        Other ones have a relationship with vendors of similar products, and get paid for those purchases. They include my product only for completeness, and make no money on the transaction.

        They do, however, get the Customer eyeballs, and create confusion. My product gets tied in with their advertisements, or perhaps their interface keeps crapping out, and I get that feedback.

      Find the IP it's coming from and block it. Unless it's client-side scripting, it's going back to a central computer (or series of them) somewhere.
      Block it.
      If you want to be really cool: when you find the IP, catch the POST or GET, do a GET of some other random site yourself, and hand that back to their request.
      The bad part about that is you would be doing what they are doing, but it would confuse the mess out of them for a moment ;o)
Re: Thwarting Screen Scrapers
by tjh (Curate) on Jul 18, 2002 at 14:41 UTC
    From the subject line I expected a conversation on hijacking content, possibly RSS or other news feed issues, copyright arguments, maybe even allusions to the U.S. entertainment industry trying, with vehement avarice, to technologically block any re-recording of anything (lol), and other things... :|

    Instead, I can't tell if you are a merchant that is somehow being disintermediated by your own reseller or what - even though you're still making the sale. I'm confused. If you're still making the sale and collecting the payment, I don't get it. Has someone pre-empted your front end? Why would they do that? If you're being targeted and your site hijacked that's different.

    If you have soft content, news or other written content, that someone is scraping and calling their own either by redisplaying on their own site, this is a different matter - a legal one without good Perl-specific solutions.

    Did you state your problem exactly - or is this a drill?

    Update: just read your follow up.

    The tech tactics are being listed by others (dynamic session IDs per page call, dynamic field names, etc.). In an ideal world all session mgmt and user authentication would be application level with high granularity - down to each page or function call from the client, every time a request arrives. I know of no current solution, Perl or otherwise, that solves this completely. Would love to see one though.

    On the other front from your example, I have had this exact experience 2 times. All the technology solutions in the world won't stop someone who relentlessly intends this fraud. You have to detect them, copy the fraudulent material, get witnesses - do whatever your lawyer tells you to do about the copyright violation (and hope it's domestic). In one of my experiences, a simple email solved it. The other got a little warmer...

      In an ideal world all session mgmt and user authentication would be application level with high granularity - down to each page or function call from the client, every time a request arrives. I know of no current solution, Perl or otherwise, that solves this completely. Would love to see one though.
      Yes, this level of authentication can certainly be done. I'm currently involved in a large project where this is being done, and we even go further. Unfortunately, I can't tell you more.

      It's not simple, and it takes large investments. The question isn't "can this level of authentication be done", the question is "how much are you willing to pay?" (pay in a broad sense - mostly costs to hire people).


      You're right, I haven't included all the details. I was trying to keep this generic enough to apply in more than one situation.

      Basically, I'm selling something direct via a website. I have no resellers. A set of people I don't know at all have created their own websites, but they are nothing more than a shell around my website. They make money by adding a "service charge" and billing it to the customer (without adding any apparent value).

      They take all the http and https requests from the Customer, via their own forms, and then take the data and make simulated browser requests to my site to make the purchase. Other areas, such as feedback, etc, are directed to my site as well.

      They obviously feel they are doing something wrong, since they hide behind unprotected web proxy servers and use other "stealth" techniques to make stopping them difficult.

      If it were just one party, a legal approach would work. Unfortunately, this situation happens over and over again, with a different set of front-enders, sometimes with an offshore website.

        I see (I think). They're processing their own forms (order and payment) themselves, then, in turn, mapping the same sequence on your site. Does this mean that every time an order is made and paid on their site that they cause the same on yours? Are you getting the original customer name, addy, etc., or would you know?

        Real-time detection is possibly the first goal. Unless there is something unique you can detect in the incoming 'ghost' client that you can block with, maybe you can work to detect duplicate payments, shipping addresses etc on the tail of the transaction - which assumes that your new 'partners' are ordering from you then re-shipping to their customer.

        If they are taking the customer data from their own forms and re-submitting it to you, including payment (CC#?) info to you - with a markup - how are they collecting their markup? If they are collecting their full payment using the customer's payment data, THEN resending that same payment data to you, effectively double-billing the buyer, this is a much different type of problem and you should be contacting law enforcement.

        From the looks of your other responses in this thread - methinks you need to do both - tech and legal. If you have a product that is inspiring so much theft/fraud, you need to protect it immediately - but not so protected that it can't be sold at all... :)

Re: Thwarting Screen Scrapers
by mojotoad (Monsignor) on Jul 18, 2002 at 14:30 UTC
    Aside from the various comments above, I would add that if the scenario is as you describe, make sure that you show up in any cost-comparison meta-sites. You're guaranteed to be the lower price.


Re: Thwarting Screen Scrapers
by fireartist (Chaplain) on Jul 18, 2002 at 15:24 UTC
    How does your billing backend work, and do you store cc numbers?

    Why do I ask?
    I presume that if they are charging the customer extra and keeping the profit, they are charging the customer's credit card themselves, and then sending their own payment details to you to make the purchase from you.

    The only way they could get around this would be if they charged the customer's card a small fee themselves, and then sent the card number to you to charge the rest
    - and I hope that anybody would find this very suspicious if they saw it on their statement.

    So, I can see two possible solutions to counter this.
    If you store the cc numbers, check whether the same number is being used multiple times for the same product.
    Also check the customer's address against the cardholder's address to see if they're different.
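The duplicate-card check could be sketched as follows; to avoid keeping raw numbers around, this version stores only a salted one-way digest (the salt and the in-memory store are illustrative, and a real system would use a database):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Count how often the same card has been used, without storing the raw
# number: keep only a salted one-way digest of it. %card_seen stands in
# for a database table; $SALT is an illustrative application secret.
my $SALT = 'per-application-secret';
my %card_seen;

sub record_card_use {
    my ($cc_number) = @_;
    my $digest = md5_hex($SALT . $cc_number);
    return ++$card_seen{$digest};    # how many times seen so far
}
```

A count above some threshold for the same product would flag a likely middleman funneling many customers through one card.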

      do you store cc numbers?

      Wouldn't that be an incredibly bad practice? I have worked on a number of e-commerce projects, but none of them stored the credit card number. Ever.

      If you store card numbers in your database and your server gets cracked then the cracker can get all the card numbers. My legal knowledge is small but I'd have thought a system design like that would leave you open to criminal negligence suits. If you don't store the card numbers there is no exposure.

        I know, I was going to add a disclaimer, but didn't bother.

        I said "do you?", because I know that some do it.
        - Amazon, for example, records my cc number.

        I have read about methods of storing cc numbers on a machine behind a firewall, one that the CGI server can access but that can't itself be reached directly from the internet.
        I don't know all the implications/applications of this, so that's why I didn't go into it
        (and don't really want to still ;)
Re: Thwarting Screen Scrapers
by Sifmole (Chaplain) on Jul 19, 2002 at 11:46 UTC
    I don't see your problem.

    If you charge $100 for one unit of ProductX (for packaging, shipping, product, and support) and the other guy charges $120 for one unit of ProductY (aka ProductX), then pays you $100 while you package, ship, manufacture, and support it, you get paid the same either way. All you got is someone out there doing free marketing for you.

    If the problem is brand-name recognition, well then... just go to your local Kinko's and print up some package inserts:

    If you bought this product from anywhere other than our site, you may have paid too much. Please visit us in the future for lower prices on this wonderful do-ma-higgy. Thanks for your patronage.

Re: Thwarting Screen Scrapers
by Abstraction (Friar) on Jul 18, 2002 at 13:35 UTC
    This is just an idea, and I'm sure someone will have a way around it. But when you display the form, set a cookie with a known, difficult-to-guess value. When you process the form, check for the existence of that cookie. If someone is posting from another domain, they won't have that cookie.

    You can also check the referring URL, but I think that can be spoofed.
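A minimal sketch of the cookie idea, assuming the token is derived from a server-side secret and the session ID (both names are illustrative):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# When displaying the form, set a cookie holding a hard-to-guess token
# derived from a server-side secret; when processing the submission,
# require that same token to come back. $SECRET is illustrative.
my $SECRET = 'server-side-secret';

sub form_token {
    my ($session_id) = @_;
    return md5_hex($SECRET . $session_id);
}

# On form display (e.g. in a CGI script), one might emit:
#   print "Set-Cookie: formtok=" . form_token($sid) . "; path=/\r\n";

# On submission, compare the cookie value the client sent back.
sub cookie_ok {
    my ($session_id, $cookie_value) = @_;
    return defined $cookie_value
        && $cookie_value eq form_token($session_id);
}
```

As the follow-up below notes, the other side can simply fetch your form first and collect the cookie, so this only raises the bar; it doesn't close the door.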

      Nothing, of course, prevents the other side from making a query to your site and getting a cookie.


Re: Thwarting Screen Scrapers
by neilwatson (Priest) on Jul 18, 2002 at 13:41 UTC
    As Abigail says, so what?

    The whole point of the web is open standards. HTML is not hidden; if it were, the web would not be as popular. Now you want to hide so that you can sell your ProductX better than someone else can sell ProductY?

    A product should sell on the merits of its performance and quality. Not on how slick your website is.

    Neil Watson

      That's missing the point. I don't care about honest competition. I just don't think someone should be able to leverage my infrastructure to re-sell my product.

      This leaves me no control over the selling process. The front-end does whatever it likes.

      Suppose they make a claim that the product has awesome feature xyz. They then take the money, hit my website, I take the order and ship it. The customer opens the box, finds out feature xyz doesn't exist, then finds the support contact info in the box. They call me and ask about feature xyz.


      Update neilwatson: Yes, in some cases it is fraud, and legal action is taken. The reason for the post was to find ideas to discourage it in the first place. I'd rather make it hard to do than wait for it to happen and then take legal action.

        So we are talking about fraud? There's not really a ProductY at all; it is ProductX purchased from you, marked up, and resold in a fraudulent manner. Surely there is a way for the product to be traced to whom it was sold (the "scraper")? Product serial numbers?

        Perhaps find the sites for these "scrapers" and bring legal action against them.

        Neil Watson

Re: Thwarting Screen Scrapers
by Rhose (Priest) on Jul 18, 2002 at 17:35 UTC
    First off, let me say I have no experience in this area, but while reading this thread I had a thought: could you use a technique similar to the ones used to combat votebots? Since you want your form to interface with a human and not a script, how about generating a confirmation image whose text must be re-entered by the purchaser? I can't imagine the other site will want to hire people to sit around re-entering images as your product is purchased.

    merlyn has some details here (Even though jcwren was able to get around it; see A little fun with merlyn *Smiles*)


    Silly me, got a snack and realized they will just pass the image to the end user, capture the entered input, and forward it back to your form...


Node Type: perlquestion [id://182794]
Approved by earthboundmisfit