
WWW::Mechanize timeout problem

by Marshall (Canon)
on Aug 25, 2022 at 07:19 UTC

Marshall has asked for the wisdom of the Perl Monks concerning the following question:

I have an LWP-type program using Mechanize that has been running every hour for the past 7 years without problems. All of a sudden, I got an error report that I traced to a corrupted DB: a duplicate value appeared in a column where all the values should be unique. The code is specifically designed to prevent this. However, I theorized that if 2 instances of the program were running, they could "fight" and cause this problem. But I didn't see how that could possibly happen, until I saw this in the log file:
    2022-08-25 04:21:18|Good record fetched here
    2022-08-25 04:21:23|Error: Retry Attempt 1 of 3
    2022-08-25 05:45:16|Error: Retry Attempt 2 of 3
    2022-08-25 05:45:53|Next good page fetched here
Retry #1 took about 90 minutes!! WOW!

Here is what the ancient code does:

    my $success = 0;
    my $tries   = 0;
    while (!$success and $tries++ < 3) {
        eval { $m2->get($fullurl); };
        if (!$@) {
            $success = 1;
        }
        else {
            print STDERR cur_gmt() . "Error: Retry Attempt $tries of 3\n";
            print LOG cur_gmt() . "Error: Retry Attempt $tries of 3\n";
            sleep(2);
        }
    }
    # Note: $@ (the eval error), not $! (errno), holds the failure reason here.
    die "aborted Web Site Error: $@" unless $success;    # ultimate failure!! PROGRAM ABORT !!!!
The retry ultimately succeeds but the long wait time has the effect of pushing the run time into the next hour's timeslot.

This is a normal HTTP (not HTTPS) URL. Over the years, a lot of things could have changed at the website's end. I have no idea. The default timeout for Mechanize is supposed to be about 3 minutes - it really doesn't matter to me as long as it is not measured in hours! I don't know how often this super long request problem happens. A retry historically happens about every 2-3K requests with this particular site, and a couple seconds later, all is well. I have no idea what is actually causing the hang.

Thoughts and ideas are welcome.

Replies are listed 'Best First'.
Re: WWW::Mechanize timeout problem
by hippo (Bishop) on Aug 25, 2022 at 08:47 UTC
    The default timeout for Mechanize is supposed to be about 3 minutes

    WWW::Mechanize inherits the timeout from LWP::UserAgent. The default timeout in LWP::UserAgent is precisely 3 minutes. However, that's just the default. You could always include the value of $ua->timeout in your log message.
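
    For example (a minimal sketch; $m2 is the Mechanize object from the code above, and 180 seconds is just the documented default made explicit):

        # Set the timeout explicitly rather than relying on the default,
        # and log the value in effect alongside the retry messages.
        $m2->timeout(180);    # seconds; get/set method inherited from LWP::UserAgent
        print LOG cur_gmt() . "Timeout in effect: " . $m2->timeout . "s\n";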

    Moreover, the docs for the timeout method say:

    The request is aborted if no activity on the connection to the server is observed for timeout seconds. This means that the time it takes for the complete transaction and the request() method to actually return might be longer.

    So if the server (or client) is heavily throttled this could easily extend into hours. If you want an absolute wall-clock cut-off then you will need to implement that yourself.
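
    One way to do that is the classic alarm()/SIGALRM pattern from perlipc - a sketch only, wrapping the get() from the question ($m2, $fullurl, and the 300-second budget are taken or assumed from the original code); note the Windows caveat raised further down the thread:

        my $page;
        eval {
            local $SIG{ALRM} = sub { die "wall-clock timeout\n" };
            alarm(300);               # hard 5-minute ceiling on the whole attempt
            $page = $m2->get($fullurl);
            alarm(0);                 # cancel the pending alarm on success
        };
        alarm(0);                     # also clear it if the eval died early
        if ($@ && $@ eq "wall-clock timeout\n") {
            # treat this as a failed attempt: log it and retry
        }
        elsif ($@) {
            die $@;                   # some other error from get()
        }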


    🦛

      So, if I understand this correctly, with the default setting, "something" has to happen between the client and the server every 3 minutes or the request will be aborted. That then implies that with 90 minutes there must have been at least 30 "somethings" on the connection. These pages are not "big" (max maybe 2K bytes) and are straight HTML, no JS to bloat things. I saw this using my home Windows machine as the client - no throttling going on at my end once a request is initiated. I do have some conscious throttling to reduce the rate of page requests from me - this thing is designed to "be nice" to the target website. I don't really understand how many transmissions back and forth (or "over's") are needed to transfer a page that is a lot smaller than the Perl Monks page I am typing this into. I guess I am pretty much stunned - I was thinking maybe it takes 10 "over's"; that would add at most 30 minutes and wouldn't be a problem. Obviously my thinking was too primitive!

      I am not sure about implementing my own timeout. The only way I know how to do that would be with SIGALRM. There is only one of those and if LWP is using it, then I am worried about conflicts. Suggestions welcome.

      One approach that I am considering is implementing a lockfile. When the Windows Scheduler wants to "go", a bat file would check whether the lockfile exists and, if so, abort that run and let whatever is running just keep running. When the bat file sees the exit from my software, it removes the lockfile. I think the net effect would be that I occasionally miss an hourly update. That is acceptable to me as long as it doesn't "happen too often", with the definition of "too often" TBD.
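
      A sketch of that idea in Perl itself rather than in the bat file, using flock so a crashed run cannot leave a stale lockfile behind (the lockfile path is made up for illustration; cur_gmt() is the logging helper from the original code):

          use Fcntl qw(LOCK_EX LOCK_NB);

          # At program start: take an exclusive, non-blocking lock.
          # If the previous hourly run is still alive, skip this run.
          open(my $lock, '>', 'C:/myapp/hourly.lock') or die "lockfile: $!";
          unless (flock($lock, LOCK_EX | LOCK_NB)) {
              print STDERR cur_gmt() . "Previous run still active - skipping this hour\n";
              exit 0;
          }
          # ... normal processing; the OS releases the lock when the
          #     process exits, even if it crashes.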

      From the log file that my startup bat file will make, I can look back and see how often and at what times of day/week this is happening. I suspect that it is not conscious throttling at the other end, but rather a glitch in the server's software that occasionally causes a barf. The sysop may not even be aware that this is happening. Normal run time for my software is just a minute or two - max is about 5 min. Right now I am making a humongous run because I am trying to recreate the problem. But in normal operation where the error report came from, this software is very "low key". And most hours, it doesn't do much of anything.

      Update: I got some more data from my overnight stress run. Fetched ~154K pages over about 13 hours. This resulted in 4 retry sequences being initiated. The max elapsed times in the 4 retry sequences: 1 sec, 90 min, 30 sec, 30 sec. During the stress run a typical second has 3-4 requests, but in normal operation a typical hour only has about 15. It's looking like the lockfile approach will work. I will think some more about how to bulletproof it so that this thing won't hang for a long time without my knowing about it. It is clear that the time to complete a request can be much longer than the 3 minute timeout value.

        For Windows, the simplest solution is a parent monitor process that kills the child worker after a timeout. See Proc::Background. You can even write it in a generic way that adds timeouts to any script you might launch through it. By the way, SIGALRM doesn't actually exist on Windows; it'll be a perl emulation which might not behave the same way. I actually don't know if I've tried it on Windows before. Hopefully LWP::UserAgent is written with select() rather than SIGALRM, but I don't know that either.
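
        A sketch of that parent-monitor idea using Proc::Background's object interface (the worker script path and the 10-minute budget are made up for illustration):

            use Proc::Background;

            # Launch the worker and enforce a wall-clock budget from outside.
            my $proc     = Proc::Background->new('perl', 'C:/myapp/fetch_hourly.pl');
            my $deadline = time + 600;       # 10 minutes
            while ($proc->alive) {
                if (time > $deadline) {
                    $proc->die;              # kill the runaway worker
                    last;
                }
                sleep 5;
            }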

        You could also try an event library like Mojo::IOLoop or AnyEvent or IO::Async with a matching event-based user agent like Mojo::UserAgent, AnyEvent::UserAgent, or Net::Async::HTTP. These involve re-writing your script significantly, but then you have all the benefits of event-driven programming at your fingertips, and a timeout is super-easy.
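
        For instance, with Mojo::UserAgent the wall-clock cut-off is a single attribute - a sketch, with a placeholder URL:

            use Mojo::UserAgent;

            # request_timeout caps the whole transaction (connect + request
            # + response) in wall-clock seconds; inactivity_timeout is the
            # LWP-style "no activity on the connection" timeout.
            my $ua = Mojo::UserAgent->new(
                request_timeout    => 300,   # hard 5-minute ceiling
                inactivity_timeout => 20,
            );
            my $res = $ua->get('http://example.com/page')->result;
            print $res->body if $res->is_success;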
