http://qs321.pair.com?node_id=748937

jlongino has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that has run without incident for 2+ years but now routinely hangs at random points during the data fetch cycle. It is a fairly simple program that fetches around 16000 records from an Oracle database on a different server and then builds a file from the results.

Unfortunately, no error or timeout occurs in the Perl program. This is the immediate problem since it is a cron job that kicks off at 7am and 11:30pm each day.

Since the Perl box is currently running Perl version 5.8.0, using $SIG{ALRM} is not an option. I'm looking at Sys::SigAction (there is a writeup on it here: http://search.cpan.org/~lbaxter/Sys-SigAction/dbd-oracle-timeout.POD) instead but was wondering if any knowledgeable Perlmonks might know of a different or preferred approach.

Replies are listed 'Best First'.
Re: Fetching Madness
by moritz (Cardinal) on Mar 06, 2009 at 20:47 UTC
    It might be interesting to know how it hangs - does the CPU usage go up to 100%? or is it waiting for some IO? Maybe network or DB connection troubles?

    You could try to start it with strace (or whatever the equivalent is on your platform) or attach a debugger to it to gather some more informations.

      CPU usage is minimal, it appear to be waiting for more data from the Oracle box. We are working with the administrators to determine what is going on from their side. I believe it is somehow connected to firewall issues.

      I've used truss and it shows that the last thing issued was a read. The point where it freezes looks like this:

      time() = 1236374108 write(2, " [ F r i M a r 6 ".., 70) = 70 write(8, "02 T\0\006\0\0\0\0\003 ^".., 596) = 596 read(8, 0x003A17C6, 2064) (sleeping...) read(8, "07DB\0\006\0\0\0\0\010\0".., 2064) = 1380 read(8, "\0\0\0\006 H a r r i s\0".., 631) = 631 brk(0x003AC710) = 0 brk(0x003B2710) = 0 read(8, "07DB\0\006\0\0\0\0\0\0FF".., 2064) = 2011 read(8, 0x003A17C6, 2064) (sleeping...)

      The '(sleeping...)' line is always the last thing printed.

      We've ruled out the IPS (Intrusion Prevention System) because we shut it completely off with no apparent change in behavior. We've also snooped the Perl box and it appears to confirm that the program is waiting on read results.

      Although I am worried about the cause, I'm more concerned now with finding a methodology that will terminate the program gracefully and alert us to the problem. Even if we do determine the source of the problem (I'm fairly certain we will), we'll still have the threat of similar problems in the future.

Re: Fetching Madness
by Marshall (Canon) on Mar 07, 2009 at 00:29 UTC
    Perl 5.8 should support ALRM even on Windows. To verify that, here is some simple code: This should just print nonsense for one second (could be a lot of lines if your processor is fast), but should die with an error message if $SIG{ALRM} is working on your platform. This program works on a Linux Perl 5.8 system and Windows Perl 5.10. I don't have a Windows Perl 5.8 system to test with.
    !/usr/bin/perl -w use strict; $| =1; #turns off output buffering #without this the "die" message might not be #seen due to STDOUT buffering $SIG{ALRM} = 'alarmhander'; alarm(1); while (1) { print "blah...anything..lot's of lines\n"; } sub alarmhander { die "ooops bailing out with die\n" }
      The author (Lincoln A Baxter) of the article I linked above explains that beginning with Perl 5.8.0, the $SIG{ALRM} doesn't work as it previously did, that the code just hangs. At least as it relates to connects and long running statements attempting to communicate with Oracle.

      I have not had the time to test his assertion, but have given him the benefit of the doubt. I also hesitate to quote directly from his article, though I imagine since it is on the cpan.org site, it might not be an issue.

      Since the code you posted is minimal, I will test it and relay the results here.

      It looks as though Lincoln A Baxter was right about the Perl 5.8.0 SIG ALRT situation. I modified my code to time out, and it did--until I realized it only times out correctly if the program hasn't "hung" yet. After setting the timeout to 60 seconds for example, the program hangs before the 60 second limit and the timeout won't occur.

      Next I'll test the Sys::SigAction solution to see if it works.

      Success on two fronts! The systems people discovered that the problem was due to a change in the firewall code. Something to do with deep packet inspection and differing buffer sizes.

      I had already gotten far enough along with the Sys::SigAction testing to find that it does work. I was about to post some code but it is practically verbatim to the examples in the link included in my first post above. I'm going to implement Sys:SigAction in my programs from now on to prevent the occurrence of perpetual hangs.

        Fantastic! I did some looking at this also. The problem is with deferred action on some signals and appears to be Perl version specific., eg Perl 5.8.2 will work differently than 5.8.0 and Perl 5.8.0 is different than Perl 5.6.x. The POSIX way >=Perl 5.8.2 is much better than the 5.6 way since you can do something other than just "blow up". There appears to be a way to "have your cake and eat it too" on more recent versions of Perl (apparently you can set the "safe" flag so that Perl checks ALRM often enough that the program won't hang). On the other hand, "blow up" is often the correct answer to something like "cannot connect to the DB".

        Anyway programming is often pragmatic and you have do just do what works! If that means reverting to 5.6 style signal handling to cover all versions of Perl you work with, so be it!

Re: Fetching Madness
by grinder (Bishop) on Mar 06, 2009 at 21:11 UTC

    Sounds like a db lock to me. Are you sure there are no new processes that have been introduced into your workflow recently that could be causing contention?

    You should log the database activity using Enterprise Manager and pore (paw?) through the results looking for inserts and/or updates that could be occurring at the same time.

    If you can't find out yourself what changed (because remember, your program didn't change, its environment did), then ask around in your workplace. It just might be that you run into someone who says that they moved a job back from 8am to 7am because they need to run some reports afterwards that have to be ready at 7:30.

    Good luck.

    • another intruder with the mooring in the heart of the Perl

      That didn't occur to me. I'll have the Oracle sysadmins look into that possibility. The Oracle database in question is our University's Banner system. Its workload changes constantly and is accessed by multiple systems/applications/users. A difficult task.

      So far, most of our debugging has been from the Perl side and through firewall and IPS monitoring tools. Maybe when we get some feedback from the Oracle sysadmins, we'll get more useful info. That's why I'm trying to concentrate on being able to recognize the problem and gracefully terminate the program.