How to stop web interface bypassing?

by advait (Beadle)
on Mar 25, 2008 at 14:41 UTC ( [id://676127] )

advait has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
I have made a website which takes user input from various form elements on an HTML page. Now I have realized that some users are not using the web interface to enter the information; instead they are submitting it directly through scripts. This is causing problems because
1. the server is getting invalid data to process
2. it is increasing the server load

Please help me and suggest how I can deal with such nasty users.
Thank you

Replies are listed 'Best First'.
Re: How to stop web interface bypassing?
by philcrow (Priest) on Mar 25, 2008 at 15:03 UTC
    On behalf of everyone who has needed to interface automatically with a browser-only web service, let me urge you to at least consider letting people use their own tools to hit your service. This is especially important if there is some business-to-business relationship involved. Please do not think that your business partners should hire staff to surf your site. That just forces those of us who must do it automatically, because we cannot afford the staff, to fool you.

    Rather, think about the problems and address them. It is never safe to assume that the client in a web interaction is feeding you safe data. You must validate it on the server, even if you have client side validation for the benefit of manual users. If certain people are overloading your site, protect it from them in some way. Perhaps simply by dumping anyone who feeds invalid data.

    Every system you use to try to force people to use a browser manually can, and will, be spoofed, since the protocols are fixed and the browsers are well known. You'll have to protect yourself in some other way anyway. This is not an easy problem as you can see from all the captchas and other schemes people try to use to limit spam bots. If the users in question are genuine I would try to accommodate them, not ban them.

    Phil

    The Gantry Web Framework Book is now available.
Re: How to stop web interface bypassing?
by moritz (Cardinal) on Mar 25, 2008 at 14:51 UTC
    You can't prevent it. No way.

    So you need to 1) validate all input data on the server side and 2) forbid or restrict the script in your robots.txt.

      I don't have much knowledge of this... can you tell me something about robots.txt?
        I could, but thousands of others have done the same before: robots.txt
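
    As moritz and the reply above say, robots.txt is purely advisory: well-behaved robots will honor it, abusive ones will ignore it. A minimal sketch, placed at http://www.example.com/robots.txt (the paths and numbers here are made-up examples):

        User-agent: *
        Disallow: /cgi-bin/
        Disallow: /submit
        # Crawl-delay is nonstandard, but some crawlers honor it
        Crawl-delay: 5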
Re: How to stop web interface bypassing?
by Limbic~Region (Chancellor) on Mar 25, 2008 at 15:29 UTC
    advait,
    I am surprised (and happy) to see no one mentioned checking the HTTP referer. It, like every other technique to reduce the amount of undesired traffic, is not a complete solution. That is the nature of the beast.

    What surprises me is that no one mentioned defining a Terms of Service agreement with the folks using the site (though robots.txt is a good start). Explain exactly in what ways it is acceptable to interface (API, how frequent requests may be, etc.). This tells people what you expect and what you will do if they don't abide by those rules.

    Then all you can do is protect yourself, identify abusers, and take other steps to block them.

    Cheers - L~R

      HTTP Referer checking plus a hidden form field can get rid of 90% of the nasty users.

      For the remaining 10%, you may consider a double cookie (as used by Blogger), a compulsory (forced) cookie, or a session.

      You may be interested in the Apache::* and *::Session modules on CPAN.
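
      A minimal sketch of the Referer-plus-hidden-field idea above, using CGI.pm and an HMAC token from Digest::SHA. The domain, field names, and secret are invented for illustration, and both checks can be forged by a determined client, so they only raise the bar; the real validation still has to happen afterwards.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;
        use Digest::SHA qw(hmac_sha256_hex);

        my $q      = CGI->new;
        my $secret = 'change-me';   # hypothetical server-side secret

        # 1. Referer check: only accept posts that claim to come from our
        #    own form page.  Trivial to forge, so it is a speed bump only.
        my $referer = $ENV{HTTP_REFERER} || '';
        reject('bad referer')
            unless $referer =~ m{^https?://www\.example\.com/}i;

        # 2. Hidden-field token: when generating the form, embed
        #    <input type="hidden" name="token" value="..."> where the value
        #    is an HMAC of the user's session cookie.  On submission,
        #    recompute it and compare.
        my $session_id = $q->cookie('session') || '';
        my $expected   = hmac_sha256_hex($session_id, $secret);
        reject('bad token') unless ($q->param('token') || '') eq $expected;

        # ... proceed with normal (still validated!) processing here ...
        print $q->header('text/plain'), "ok\n";

        sub reject {
            my ($why) = @_;
            print $q->header(-status => '403 Forbidden', -type => 'text/plain');
            print "Request rejected ($why)\n";
            exit;
        }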

Re: How to stop web interface bypassing?
by dragonchild (Archbishop) on Mar 25, 2008 at 16:03 UTC
    A browser is a script that has some manual bits to it. You should never assume that requests to your site are coming in any order, based on any information you gave in the last response, or, frankly, with any structure. Not only will this make your application capable of handling scripts, but it will also make your application more secure.

    The first thing a cracker tries to do is find out what your application responds to by reading your HTML. Then, they try different variations of the parameters. Sound familiar?


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: How to stop web interface bypassing?
by samtregar (Abbot) on Mar 25, 2008 at 17:10 UTC
    To deal with #1 you need to validate all data before processing. I like to use Data::FormValidator for this.

    Dealing with #2 is harder. Doing validation early can help, since you won't be doing expensive processing on invalid data. But if someone is really out to get you they can flood you with valid traffic too. I wrote CGI::Application::Plugin::RateLimit to deal with this problem for a CGI::App, but it's only as useful as your ability to distinguish one client from another. This problem can also be dealt with at the network level by your firewall or by something like Apache's mod_throttle.

    -sam
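
    A minimal sketch of the Data::FormValidator approach samtregar describes; the form fields and constraints here are invented for illustration, so check the module's documentation for the full profile syntax.

        use strict;
        use warnings;
        use CGI;
        use Data::FormValidator;

        my $q = CGI->new;

        # Hypothetical profile for a simple contact form.
        my $profile = {
            required    => [qw(name email message)],
            optional    => [qw(phone)],
            constraints => {
                email => 'email',                    # built-in constraint
                phone => qr/^[0-9 ()+-]{7,20}$/,
            },
        };

        my $results = Data::FormValidator->check($q, $profile);

        if ($results->has_missing || $results->has_invalid) {
            print $q->header(-status => '400 Bad Request', -type => 'text/plain');
            print "Missing: ", join(', ', $results->missing), "\n";
            print "Invalid: ", join(', ', $results->invalid), "\n";
            exit;
        }

        my $clean = $results->valid;   # hashref of fields that passed
        # ... only use $clean->{name}, $clean->{email}, ... from here on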

Re: How to stop web interface bypassing?
by dsheroh (Monsignor) on Mar 25, 2008 at 17:16 UTC
    As others have alluded to, there are basically only two things you can do to prevent people from programmatically accessing your site:

    1) Ask them not to. Your ToS is a means of asking the user not to do this and robots.txt asks the program itself to pass over (parts of) your site. But you have no guarantee that the user will read the ToS nor that the program will read robots.txt and, even if they are read, they may be ignored.

    2) Threaten to sue anyone who doesn't use the site in the way that you prefer. While this can be very effective at preventing automated use of your site (if you have the money to spend on lawyers), it is generally more effective at preventing manual use, as many of us prefer not to deal with litigation-happy sites.

    Automated use is not the actual problem you're facing anyhow, so you would do better to just accept that and deal with the real problems, namely invalid data and high server load.

    For the first, you must, must, must validate the data received from the client. Rule one of designing a networked application is to assume that the other end of the connection may be lying to you and clean up, sanity-check, and otherwise validate all received data. This isn't even just for websites - there's a long history of networked games, from Doom and Quake to the latest MMOs, which have had massive problems with cheating because they foolishly trusted the user's software, probably in the belief that they had a "secret" protocol which only one other program (the game client) knew how to talk. HTTP is very simple and very well-known. Writing a dishonest HTTP client is trivial.

    There are several options for dealing with load issues, ranging from limiting the number of server processes allowed to spawn at any one time, to caching results and returning them as static pages instead of reprocessing data on every request, to blocking IP addresses that issue too many requests too quickly, with many other things in between. Or you could just ask the users who are writing robots for your site to please configure the robots to issue no more than, say, one request every 5 seconds. Which option is best for you is highly situation-dependent.
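
    As an illustration of the "too many requests too quickly" idea, here is a very small in-memory throttle sketch. It assumes a persistent process (mod_perl, FastCGI); plain CGI would need to keep the counts in a file or database instead, and the limits are arbitrary example numbers.

        use strict;
        use warnings;

        # Track request timestamps per client IP.
        my %hits;
        my $window   = 60;   # seconds
        my $max_hits = 30;   # allowed requests per window per IP

        sub too_many_requests {
            my ($ip) = @_;
            my $now  = time;
            my $list = $hits{$ip} ||= [];

            # Drop timestamps that have fallen out of the window.
            @$list = grep { $_ > $now - $window } @$list;

            push @$list, $now;
            return @$list > $max_hits;
        }

        # In the request handler:
        # if ( too_many_requests($ENV{REMOTE_ADDR}) ) {
        #     ... send "503 Service Unavailable" and stop ...
        # }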

Re: How to stop web interface bypassing?
by polettix (Vicar) on Mar 26, 2008 at 08:19 UTC
    After having done your homework in validating input data and throttling accesses, you can also consider introducing some CAPTCHA to (try to) ensure there's a human on the other end of the wire.

    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Io ho capito... ma tu che hai detto? (I understood... but what did you say?)
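
    One CPAN option for this is Authen::Captcha; a rough sketch follows. The paths are placeholders, and the exact return codes of check_code are documented in the module.

        use strict;
        use warnings;
        use CGI;
        use Authen::Captcha;

        my $q = CGI->new;

        my $captcha = Authen::Captcha->new(
            data_folder   => '/var/lib/myapp/captcha',    # hypothetical paths
            output_folder => '/var/www/htdocs/captcha',
        );

        if ( $q->param('captcha_code') ) {
            # Form submission: compare what the user typed against the
            # token that was embedded in the form when it was generated.
            my $result = $captcha->check_code(
                scalar $q->param('captcha_code'),
                scalar $q->param('captcha_token'),
            );
            if ( $result == 1 ) {
                # passed - go on to validate and process the rest of the form
            }
            else {
                # 0 or negative: expired, unknown, or wrong code - reject
            }
        }
        else {
            # Form generation: make a 5-character captcha; the returned
            # token goes into a hidden field, and the image file (named
            # after the token in the output folder) is shown next to a
            # text input for the code.
            my $token = $captcha->generate_code(5);
        }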
Re: How to stop web interface bypassing?
by dwm042 (Priest) on Mar 26, 2008 at 14:43 UTC
    advait, I may not be reading your question the same way as others, but if you are using something like FormMail to do forms processing, I would suggest switching to a secure replacement that can block data by domain or IP. That way you can force the use of your web interface as the data entry point. The nms FormMail replacement allows you to tell it which sites can send data to your form processing code.

    Now, at this point, whether you block mechanization is another issue entirely.
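
    For reference, the restriction dwm042 mentions lives in the configuration section at the top of the nms FormMail script; something along these lines, where the values are examples and the variable names should be checked against the copy you download:

        # In the USER CONFIGURATION section of nms FormMail:
        @referers      = qw(www.example.com example.com 203.0.113.7);
        @allow_mail_to = qw(webmaster@example.com);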
