Fast efficient webpage section monitoring

by Marcool (Acolyte)
on Apr 02, 2016 at 14:38 UTC ( [id://1159366] )

Marcool has asked for the wisdom of the Perl Monks concerning the following question:

Hi All

The task I have at hand is as follows:
There is a webpage that changes periodically, and when it does, it displays a button that I need to click as quickly as possible (basically before anybody else does).
The way I have implemented this so far is a script that uses curl to download the page, greps it for the "button" part, and, if it finds nothing, downloads again, repeating until a change is found.
When a change is found I also use curl to press the button, which works well enough as far as I'm concerned (although no doubt it could be improved).
The part I am not pleased with is the one that monitors the webpage:
- Firstly, it uses a lot of bandwidth, comparatively, since it is endlessly downloading the same page over and over again.
- Secondly, it is relatively resource-hungry.
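
For concreteness, the current approach amounts to something like the sketch below; the URL, the marker string, and the POST fields are placeholders rather than the real site's details:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $url    = 'https://example.com/jobs';        # placeholder
    my $marker = 'class="accept-button"';           # whatever identifies the button

    # Poll: download the whole page with curl until the marker shows up.
    my $page;
    do {
        $page = `curl -s '$url'`;
    } until ( index( $page, $marker ) >= 0 );

    # "Press" the button with a second curl call (the form fields are guesses).
    system( 'curl', '-s', '-X', 'POST', '-d', 'job_id=123', "$url/accept" );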

What I would like help with is deciding whether LWP, HTTP::Monitor, HTTP::Tiny, or any similar module would let me improve on this, or whether I should perhaps simply switch to wget. I am quite ignorant of how these tools perform for this kind of job.

Another thing I was wondering is whether there is a way to save time when the webpage actually does change, for instance by interrupting the download early. I know LWP provides some kind of callback as the download proceeds, but I don't quite know how I could implement that.
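
The callback mentioned here is LWP::UserAgent's :content_cb option: it hands you each chunk as it arrives, and die()-ing inside it aborts the rest of the transfer. A minimal sketch, with the URL and marker string invented for illustration:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua     = LWP::UserAgent->new;
    my $marker = 'class="accept-button"';   # placeholder
    my $buf    = '';

    my $res = $ua->get(
        'https://example.com/jobs',         # placeholder URL
        ':read_size_hint' => 8192,
        ':content_cb'     => sub {
            my ($chunk) = @_;
            $buf .= $chunk;
            # Dying here aborts the rest of the transfer; LWP records the
            # message in the X-Died response header.
            die "marker found\n" if index( $buf, $marker ) >= 0;
        },
    );

    if ( ( $res->header('X-Died') // '' ) =~ /marker found/ ) {
        print "button appeared - stop downloading and go press it\n";
    }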

I'm sorry if this question is out of scope, or too wide, or not well asked. Any help will be appreciated.

Thank you!
Mark.


Re: Fast efficient webpage section monitoring
by Your Mother (Archbishop) on Apr 02, 2016 at 16:38 UTC

    This sounds like a JavaScript question. It also sounds like you’re trying to hack an auction, contest, or reservation system. All of which would certainly be against the ToS of any such site and, depending on the locality and service domain, maybe illegal as well.

    More details would be necessary to help you if I’m wrong. Probably no amount of details would get any monk to help you if I’m right. :P

      Hi,
      I should have thought to say: this is not an auction or such, it is a website that I perform translation on, and basically, translations are handed out on a "first come first served" basis. There is no mention of automation in the ToS (which is public, located at legal/translator-agreement on their website : gengo.com if you care to verify).
      Now, I understand if you consider it foul play to automate the accepting of translations, but the way I see it, is that I am located in Europe, and the servers seem to be in the US (according to geomaplookup.net) which I would expect gives people there a technological edge (to transfer the page it takes my browser 4.4 seconds) not very different from the one scripting the response would give.
      The reason I took to writing this in perl is because there are other conditions to verify (several translations might appear and I want to chose the "best" one in that case, I am using HTML::Treebuilder to work through the html), and having tried in javascript I found it was too far out of the little I know about the language. I feel more comfortable in perl although I am by no means experienced with it.
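      That selection step looks roughly like the sketch below; the tag name, the class, and the "pick the best" rule are all invented for illustration, not taken from the real page:

          use strict;
          use warnings;
          use HTML::TreeBuilder;

          my $page_html = do { local $/; <STDIN> };   # page source from wherever it was fetched

          my $tree = HTML::TreeBuilder->new_from_content( $page_html );

          # Hypothetical: each available job sits in a <div class="job ...">
          my @jobs = $tree->look_down( _tag => 'div', class => qr/\bjob\b/ );

          # Hypothetical "best" rule: take the job with the longest description
          my ($best) = sort { length( $b->as_text ) <=> length( $a->as_text ) } @jobs;

          print $best->as_text, "\n" if $best;
          $tree->delete;                              # free the parse tree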
      As I said, I understand if people are not happy helping with this. It is not a very big deal for me; I am using this as a "project" to practice my programming as much as anything else, although I do believe it is perfectly legal and fair.
      Thank you for your input.

        I toured the site a bit. It seems to be a pay-for-play service. How are you finding free translations there?

Re: Fast efficient webpage section monitoring
by Marshall (Canon) on Apr 02, 2016 at 21:44 UTC
    I took a look at this website. Interesting. There is a whole library of programs/tools, including Perl APIs, oriented towards clients submitting jobs automatically and checking on the results. There are special APIs for this. I didn't immediately see an API for the translators, but there may indeed be such a thing?

    Have you thought about talking with these folks and explaining your situation and the fairness issue of your being overseas? If these guys are smart they will set up an API just for you folks (the translators). Anybody who uses that API will have a huge advantage over somebody continuously fetching and interpreting webpages as you are doing (it will be much faster than 4 seconds).

    I suppose the company will set up some sort of algorithm to decide how the jobs get distributed, probably not based upon sub-second response times. That seems inevitable if there are as many translators competing as your post indicates. If you get in there and ask about this, you have a chance of influencing the algorithms to your benefit, perhaps based on the accuracy of your translations and customer satisfaction, or whatever. Basically you want to get this onto an allocation basis that doesn't depend upon light-speed internet access, where your connection puts you at a disadvantage.

    If you continuously load this page and this causes performance problems for the company, they will figure it out.

    It just seems odd to me that the company has spent so much effort on multi-language APIs for clients, yet there isn't already an API for the translators.

      It just seems odd to me that the company has spent so much effort on multi-language APIs for clients, yet there isn't already an API for the translators.

      The clients pay. The translators are paid, and it is apparently a buyer’s market, so… It would certainly be nice for the translators to have an API, but it's not surprising if it's an afterthought or not a priority.

Re: Fast efficient webpage section monitoring
by BrowserUk (Patriarch) on Apr 02, 2016 at 18:05 UTC

    Leaving the legality to your own assessment/conscience: I would have thought that doing a HEAD request and checking the headers for expiry information would reduce your overhead, though I'm not sure how that plays with in-page changes.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Are you referring to the If-Modified-Since option in curl? Unless I'm mistaken, the site doesn't seem to be issuing the corresponding response header (the "Last-Modified" line is missing from the output of curl -I http://...)
        the "Last-Modified" line is missing from curl

        That's one I was thinking about; there are others that might be useful if the server provides them:

        1. Age: The age the object has been in a proxy cache, in seconds. E.g. Age: 12
        2. Content-MD5: A Base64-encoded binary MD5 sum of the content of the response. E.g. Content-MD5: Q2hlY2sgSW50ZWdyaXR5IQ==
        3. Content-Length: The length of the response body in octets (8-bit bytes). E.g. Content-Length: 348
        4. ETag: An identifier for a specific version of a resource, often a message digest. E.g. ETag: "737060cd8c284d8af7ad3082f209582d"

        Basically, look at what headers are returned, compare them across consecutive requests, and look for anything that doesn't change per request but does change when the content changes.

        Just a thought.
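
        As a sketch of that idea with HTTP::Tiny (the URL is a placeholder, and whether the server actually sends any of these headers is exactly what has to be checked first):

            use strict;
            use warnings;
            use HTTP::Tiny;

            my $url  = 'https://example.com/jobs';      # placeholder
            my $http = HTTP::Tiny->new;

            my $last = '';
            while (1) {
                my $res = $http->head( $url );          # headers only, no body
                if ( $res->{success} ) {
                    # HTTP::Tiny lower-cases header names in the response hash
                    my $fingerprint = join '|',
                        map { $res->{headers}{$_} // '' }
                            qw(etag content-length last-modified);
                    if ( $last ne '' && $fingerprint ne $last ) {
                        print "headers changed - fetch the full page now\n";
                        last;
                    }
                    $last = $fingerprint;
                }
                sleep 1;                                # polite polling interval
            }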


Re: Fast efficient webpage section monitoring
by Marcool (Acolyte) on Apr 03, 2016 at 10:08 UTC
    Thank you all for your input, and for the words of wisdom on the ethics of all this. It has me thinking, for sure.
    From the top down:

    Your Mother, despite your obvious disagreement with any automation of this process (and I do see your point, although this is not quite a matter of cheating for money but for work, which is rather different), you've come forward with a very interesting proposal in Mechanize::Firefox, and I want to thank you for your open-mindedness.
    Now that I think of it, the webpage with the available translations does change when a new one becomes available, with no reload required (nor periodic polling, as far as I can tell), so by looking into how the site achieves this I should be able to find a better monitoring option, am I correct? And that will no doubt involve JavaScript, as you said from the start.

    Marshall, I did know there were APIs for clients, since translators get specific instructions not to attempt to message clients who use the API, but I hadn't considered that this might mean the company could also provide APIs for translators. As Your Mother pointed out, the system probably works just fine for them without one: whereas some clients might not use the service without an API, translators will accept whatever is available, I suppose. The only options they do provide are an RSS feed, which updates more slowly than the webpage (that doesn't seem to make sense to me, but I checked, and sometimes a new translation is gone before the RSS feed even shows it, whereas the webpage does), and an e-mail system, which as you can imagine is even slower than the RSS.

    BrowserUk, thank you for all the valuable leads concerning the HEAD request; I never knew there were so many potential items in an HTTP header! Unfortunately this one is disappointingly bare: response code, Content-Type, Date (which is almost always the request date, as the page is dynamic), Location, Server, and that's it!

    flexvault, thank you for the idea, but I'm not sure how to balance the timer against the requirement to get the information from the site as fast as possible. Any "wait" is basically a hole in which an update could be missed, right? But I do understand that if resources were really getting eaten up, I would have to introduce a timer to give the system time to "relax".

    Let me thank you all again, you are very helpful, and I'd like to say also that I do hear and respect your objections. They're not lost on me.

    Regards
    Mark.

      Like Marshall, I am also impressed by your reply and attitude.

      I am a bit of a WWW::Mechanize expert, but not so with WWW::Mechanize::Firefox, which I am not sure I have ever even used. I have reached for WWW::Selenium, but it's been probably four years since I did much with it, and my impression is that WMF will be more direct and maybe easier; Corion is likely to help you here if you get stuck on some point. That said, I'm reminded that the Selenium IDE will record your interactions and write them into scripts for you, though apparently they have split Perl off and you have to download it separately now. :( http://www.seleniumhq.org/download/

      In any case, this code won't be easy unless your Perl, your JS, and your HTML/HTTP chops are solid. Like most web programming, no single part of it is hard but the coagulation of a thousand points of failure makes it so.

      All modern browsers have excellent developer tools panels to help you see cause and effect while you whittle the problem to a solution. Stack Overflow is usually an excellent place to get JavaScript answers; though if they are at least slightly related to a Perl issue, they're usually welcome here.

      Update: fixed a link.

      Great post. Glad to hear that you are seriously considering what is being said.

      I don't know much about AJAX, but I found a link with an explanation of what goes on. AJAX is "Asynchronous JavaScript and XML". Click on their demo button to see a dynamic graphic loaded into an existing page without a complete page reload.

      I've only done very simple playing with Mechanize::Firefox, but I was able to talk to Firefox from Perl. The idea seems to be to let the browser run the JavaScript and then monitor, through that interface, what has happened. I think you can get a callback when the part of the page you are interested in changes. That way you don't have to poll Firefox, just wait for something to happen.

      If you are just watching what Firefox is doing while it displays the page, then you aren't adding any more traffic than the webpage generates on its own. A short AJAX message that updates part of the page will be considerably faster than a complete page reload, which I think is what you are doing now. So if done right, you should get faster answers while not generating excessive traffic to the site.

      I will defer to the Mechanize::Firefox experts, but I think this is possible. It sounds like you will have to understand the JavaScript in the page, but it's not clear to me how much JavaScript you will have to write yourself. Firefox does the "heavy lifting" and you just watch what it is doing.

      I do suspect that these folks will implement procedures, and possibly APIs, to help them manage a process that protects their brand from bad translations or undependable translators, etc. I worked for a while for a German company and all of the engineers spoke English to one degree or another. The professional translators did an amazing job on the documents. The proper English translation wound up being about 30% shorter. A computer cannot do that; it's just too complicated. And the translator has to be a native English speaker.

Re: Fast efficient webpage section monitoring
by flexvault (Monsignor) on Apr 02, 2016 at 20:04 UTC

    Marcool,

    Also, as BrowserUk has stated, '...Leaving the legality to your own assessment/conscience...': if you added a small delay to your loop, your script might not be as greedy with system resources. You would have to test this to see if it helps. Try 250 ms to start!
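
    A minimal illustration of such a pause, assuming the rest of the loop stays as it is; Time::HiRes gives the sub-second resolution that plain sleep() lacks:

        use strict;
        use warnings;
        use Time::HiRes qw(usleep gettimeofday tv_interval);

        my $t0 = [ gettimeofday ];
        usleep( 250_000 );                  # 250 ms breather between polls
        printf "slept %.3f s\n", tv_interval( $t0 );

    Dropping that usleep() call at the bottom of the polling loop keeps it from spinning flat out while still checking roughly four times a second.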

    Good Luck!

    Regards...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Fast efficient webpage section monitoring
by Marcool (Acolyte) on Apr 05, 2016 at 18:55 UTC

    Evening to you all.

    Just to give this thread a bit of a conclusion: I do think the right way to go is to look at the AJAX; that is definitely the only way to get away from reloading the page. I have been digging into the code, and it does indeed contain a lot of XMLHttpRequest calls.

    As to the best way to implement watching of this particular page, the key - I humbly think - lies in the fact that the page watches itself, and that it - probably - does it in a rather reliable way. So the code in the end should just look something like:

    use WWW::Mechanize::Firefox;

    my $mech = WWW::Mechanize::Firefox->new();
    $mech->autoclose_tab( 0 );      # leave the tab open when the script exits
    $mech->get( "$URL" );

    while (1) {
        # wait for the page's own change notification
        $mech->events( "document.database_changed_function()" );   # not got this part working yet...
    }

    I agree that one interesting avenue would have been Selenium, and I now remember that at the very onset of all this, before I figured out that the "button" was a simple POST request hidden under a thousand layers of JavaScript, I had looked at Selenium for its "imitation" capability (scripting what you do). I never went too far down that avenue, though.

    What I am confronted with now is the sheer complexity of the JS in this page. I am having a million issues with scope and such whenever I try to get Mechanize::Firefox to interact with the page. It's written with AngularJS (angularjs.org), which in and of itself is super cool, but insanely complex as far as the JS goes (no doubt the JS gurus would contradict me there, but hey, I'm not a JavaScript guy!).

    So here I am, trying to figure out which function is supposed to keep a watch on the database of translations, and what scope all this is happening in...

    Maybe I'll figure it out (preferably before the site migrates to Angular 2 and the whole thing changes, haha), maybe not, but in any case I'm learning a lot of JS, and Perl too for that matter; the Mech::FF module is quite instructive.
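
    If hooking the page's own change notification stays out of reach, one fallback, sketched below on the assumption that the MozRepl setup WWW::Mechanize::Firefox needs is already in place and that the new-job link has some recognizable CSS class (which is a guess), is to let Firefox keep running the site's JavaScript and simply poll the rendered DOM from Perl; that adds no extra traffic towards the server:

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;
        use Time::HiRes qw(usleep);

        my $mech = WWW::Mechanize::Firefox->new();
        $mech->get( 'https://example.com/jobs' );       # placeholder URL

        while (1) {
            # Query the live DOM inside Firefox; the selector is made up.
            my @buttons = $mech->selector( 'a.accept-job' );
            if (@buttons) {
                print "a new job appeared\n";
                last;
            }
            usleep( 500_000 );      # check the browser twice a second, no network cost
        }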

    Once again thanks for your help and for hearing me out. Appreciate all the kind comments.

    Will be back soon with some more questions no doubt.

    All the best to you all,
    Regards, Mark.

      There may be no "public" or named function that does the job. A very common idiom in JavaScript land is to use nested anonymous functions to get work done. What you are looking for may be an anonymous function in a nested chain hooked to an event and essentially impossible to get at from outside.

      Premature optimization is the root of all job security
        Now that would make a lot of sense... I see lines and lines of nested functions, and can't seem to figure out when the call to the one containing the XMLHttpRequest is actually made! I really have some JavaScript brushing up to do! Thanks for the tip :)
