http://qs321.pair.com?node_id=482736

jimbus has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I've got a handful of scripts running from cron on a Sun box that basically process log files and insert the digested data into MySQL. Things appear to work flawlessly when I run them by hand, but when left alone as cron jobs, I end up with 90-some processes and issues accessing the data in my web app because all the connections are used up... Apparently, I'm not all that and a bag of chips :(

Have you guys got any pointers for DBI or cron script best practices?

Thanks,

Jimbus


Never moon a werewolf!

Replies are listed 'Best First'.
Re: cron script best practices
by 5mi11er (Deacon) on Aug 10, 2005 at 20:29 UTC
    Starting from the thing that popped into my head first, why are you getting so many processes? Do you start with one or a few, and then they start replicating? If so, I'd guess you've got a case of the *'s syndrome. The five stars "* * * * *" at the beginning of a crontab entry basically tell cron to kick off your script(s) every minute.

    If you're not doing that, then you really need to figure out why you get so many processes, because something is pretty obviously broken.

    Other potential traps to avoid include

    • Jobs run from cron have a very limited environment compared to most user environments. If the scripts need a better environment, you need to ensure they get that environment.
    • Paths are part of the environment discussion above, either fully specify them, or make sure the scripts compensate appropriately.

    That's all I can think of for now; hope this helps your situation.

    -Scott

      Here is my crontab:
      reports@clarkkent/home/reports(7): crontab -l
      00 06 * * * /home/reports/ftp/SMSC0/loadData.pl
      00 06 * * * /home/reports/ftp/MMSC1/loadData.pl
      20,35,50,05 * * * * /home/reports/ftp/YTSMSC50/loadData.pl
      20,35,50,05 * * * * /home/reports/ftp/FDSMSC/loadData.pl
      00 06 * * * /home/reports/ftp/proptima/ftp.pl
      

      loadData.pl is the script I'm checking on ("ps -ef | grep loadData | wc -l"). The first two should be running once a day at 6am and the second two every fifteen minutes... which is 96 times each per 24-hour period. I'm assuming the issue is with the second two, which are identical but run against different boxes.

      These scripts digest a log file that is a series of reports from about 12 nodes; each one has between 50 and 225 key/value pairs, one per line. I loop through the nodes, building a hash of the keys and values, then build a huge insert from them... with up to 225 columns, the insert is built dynamically.
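      Building that insert dynamically is a good place to use DBI placeholders rather than interpolating values into the SQL string, since it keeps quoting correct and lets the statement handle be reused. A minimal sketch of the idea (the table name and the sample keys are hypothetical, not from the original script):

```perl
use strict;
use warnings;

# Build a parameterized INSERT from a key/value hash. Sorting the
# keys gives a stable column order, so the same prepared statement
# can be reused for every node that reports the same set of keys.
sub build_insert {
    my ($table, $row) = @_;
    my @cols = sort keys %$row;
    my $sql  = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
                       $table,
                       join(',', @cols),
                       join(',', ('?') x @cols);
    return ($sql, @{$row}{@cols});
}

# Hypothetical usage with DBI:
#   my ($sql, @binds) = build_insert('node_stats', \%pairs);
#   my $sth = $dbh->prepare_cached($sql);
#   $sth->execute(@binds);
#   ...
#   $dbh->disconnect;   # one connect/disconnect pair per run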

      I have filled /usr a couple times, once recently. I thought things would recover, but I end up with all these processes and mysqld running at 60-70% of cpu.

      I guess the real thing is that I'm resource-strapped and Perl-inexperienced, getting a bit overwhelmed by the amount of data being chucked at me, and was hoping to find someone who had documented what it takes to write mature cron/logging scripts :) With Perl, and with JDBC for JSP, I find all kinds of simplest-case stuff on the web, but not a lot on what I would think would be typical usage patterns.

      Thanks,

      Jimbus

        20,35,50,05 * * * * /home/reports/ftp/YTSMSC50/loadData.pl
        20,35,50,05 * * * * /home/reports/ftp/FDSMSC/loadData.pl

        Do these scripts need to run simultaneously? You could immediately reduce the number of connections to your DB if one script ran to completion and exited, and only then was the other fired off.

        i.e.:

        20,35,50,05 * * * * /home/reports/ftp/YTSMSC50/loadData.pl; /home/reports/ftp/FDSMSC/loadData.pl

        I've got to ask - do any of your inserts work at all? As mentioned in another response, if your script works fine by hand but not from cron, it may be an environment issue. I'd modify my cron to something like

        20,35,50,05 * * * * /bin/env > /tmp/env.output; /home/reports/ftp/YTSMSC50/loadData.pl; /home/reports/ftp/FDSMSC/loadData.pl

        and then check the contents of /tmp/env.output and compare them to the output of env when you run it at a command line for any important differences. You could then set those env variables in your Perl script.

        Some other obvious things would be to make sure that you're disconnecting from the db. And if it's not running properly from cron on a regular basis, then run it only once from cron and debug that and ensure that it does run fine from cron, before filling up your cron with multiple runs each hour.

        Finally, why is your /usr filling up on a regular basis as a result of this script?

        Some thoughts:
        • Do you really need 225 columns in one table? Maybe you could split data over different tables, keeping columns together based on what they represent - these "classes" shouldn't be difficult to spot among 225 possible keys of a log from an SMSC.
        • It seems that you basically replicate the script inside many directories - I hope this is done via (hard|sym)linking rather than plain copies. You could instead add an input parameter to the script and keep it in a single known location; that would improve maintainability.
        • As others said, you should avoid having them run concurrently. That could mean avoiding cron entirely: I was once bitten by a similar problem (collection and processing of data from RNCs or from provisioning nodes) and I eventually resorted to a single scheduling script that runs the jobs *sequentially* instead of in parallel. OTOH, if you need to stick with cron, try timing the execution of the different processes, and strive to separate their start times by at least those execution times (for the repetitive tasks it would probably be wise to use 05,20,35,50 for one and 12,27,42,57 for the other).
        • If you fill your disks... you need bigger ones. Probably some monitoring script with some alarm capabilities would help too.
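        A "single scheduling script" along those lines can be very small: a wrapper that runs each loader in turn and reports any non-zero exits. A sketch (the wrapper filename is made up; the job paths are the ones from this thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run each command in turn, never in parallel; returns the list of
# exit statuses (-1 if a command could not be started at all).
sub run_jobs {
    my @status;
    for my $job (@_) {
        system($job);
        push @status, $? == -1 ? -1 : ($? >> 8);
    }
    return @status;
}

# Cron then needs only one entry, e.g.:
#   20,35,50,05 * * * * /home/reports/ftp/run_loaders.pl
# where run_loaders.pl calls:
#   my @st = run_jobs('/home/reports/ftp/YTSMSC50/loadData.pl',
#                     '/home/reports/ftp/FDSMSC/loadData.pl');
```

        Because the second loader only starts after the first one exits, the two runs can never hold database connections at the same time.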

        Flavio
        perl -ple'$_=reverse' <<<ti.xittelop@oivalf

        Don't fool yourself.
        Questions, remarks:
        • How come the script is filling up /usr? Where is it writing, with whose permission, and why? Ideally, the size of /usr should only change when installing patches, or upgrading your Operating Environment.
        • Do your scripts actually do what they are supposed to do? Do they connect to the database, or are they just hanging there, trying to log on?
        • How fast do your scripts run "by hand"? If it takes 20 minutes by hand, and you start one every 15 minutes, you will run into problems.
        • To avoid having too many instances running, if I write cron jobs making database connections that fire every 15 minutes, I use a lock file to keep multiple instances from running. Policies can vary: the one failing to get the lock exits, the one failing to get the lock kills the one holding it, or a combination of the two (exit if the lock is held by a process that started less than $X minutes ago; otherwise kill the one holding the lock). Waiting for the lock usually isn't a good idea.
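        The simplest "exit if the lock is held" policy can be done with flock rather than a PID in a file; the kernel drops the lock when the process exits, so a crashed run can never leave a stale lock behind. A sketch (the lockfile path is made up):

```perl
use strict;
use warnings;
use Fcntl qw(LOCK_EX LOCK_NB);

# Returns an open handle holding an exclusive lock, or undef if
# another instance already holds it. Keep the handle in scope for
# the whole run: the OS releases the lock when the process exits.
sub acquire_lock {
    my ($path) = @_;
    open my $fh, '>>', $path or die "cannot open $path: $!";
    flock($fh, LOCK_EX | LOCK_NB) or return undef;
    return $fh;
}

# e.g. at the top of loadData.pl:
#   my $lock = acquire_lock('/tmp/loadData.lock')
#       or exit 0;   # previous run still active; go away quietly
```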
Re: cron script best practices
by IOrdy (Friar) on Aug 10, 2005 at 23:34 UTC
    If your script runs often and is prone to overlap where the previous process is still running but cron fires up another one anyway you could use something like File::Pid. If your script sees that the last instance is still running just have it exit clean and wait for the next time cron fires it up.
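    If installing CPAN modules on the Sun box is awkward, roughly the same check File::Pid performs can be done in core Perl: record your PID in a file at startup, and on the next run see whether that PID is still alive. A sketch (pidfile path is made up):

```perl
use strict;
use warnings;

# Return the PID of a live previous instance recorded in $pidfile,
# or 0 if there is none (no file, unreadable, or the PID is dead).
sub instance_running {
    my ($pidfile) = @_;
    return 0 unless -e $pidfile;
    open my $fh, '<', $pidfile or return 0;
    my $pid = <$fh>;
    close $fh;
    $pid = '' unless defined $pid;
    chomp $pid;
    return ( $pid =~ /^\d+$/ && kill(0, $pid) ) ? $pid : 0;
}

sub write_pidfile {
    my ($pidfile) = @_;
    open my $fh, '>', $pidfile or die "cannot write $pidfile: $!";
    print {$fh} "$$\n";
    close $fh;
}

# At the top of the script:
#   exit 0 if instance_running('/tmp/loadData.pid');
#   write_pidfile('/tmp/loadData.pid');
```

    Note that kill(0, $pid) only tests whether the process exists; it sends no signal.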
Re: cron script best practices
by chanio (Priest) on Aug 11, 2005 at 04:11 UTC
    I'm not an expert, but you should be telling us what your log files show.

    If you build good log files (updated at least whenever you're in doubt about what your cron jobs are doing), you will know exactly what is not working. You could even add a Perl script to the crontab that checks your log files and emails you about any trouble.
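    Such a watchdog can be tiny. A sketch of the log-scanning half (what counts as "trouble" is an assumption here; the pattern and the mail command in the comment would need tuning to the real scripts' output):

```perl
use strict;
use warnings;

# Return the lines of a log that look like trouble. Adjust the
# pattern to whatever your scripts actually write on failure.
sub find_trouble {
    my ($log_text) = @_;
    return grep { /\b(?:error|fail(?:ed|ure)?|died)\b/i }
           split /\n/, $log_text;
}

# A cron watchdog could then do something like (illustrative only):
#   my @bad = find_trouble($slurped_log);
#   system('mailx', '-s', 'loadData trouble', 'you@example.com')
#       if @bad;
```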

    Reading a few lines of your log files makes it much easier to know what is not working; the rest is just guessing.

    { \ ( ' v ' ) / }
    ( \ _ / ) _ _ _ _ ` ( ) ' _ _ _ _
    ( = ( ^ Y ^ ) = ( _ _ ^ ^ ^ ^
    _ _ _ _ \ _ ( m _ _ _ m ) _ _ _ _ _ _ _ _ _ ) c h i a n o , a l b e r t o
    Wherever I lay my KNOPPIX disk, a new FREE LINUX nation could be established
      The point of my post was that I obviously need to revisit and perhaps redesign my scripts, and I wanted to see if any best practices were documented for this common pattern. That's why I didn't include code or logging output.

      I'd like to think I'm a reasonable, perhaps even decent debugger... maybe a decent coder with more than 15 years' experience. Unfortunately, most of that was as a consultant stuck in situations where I only learned enough of anything to bodge together what I needed to integrate for that customer that week or month.

      Where I'm having an issue is that when I try to embrace a technology, most of the material available is about 1 + 1 = 2, which I already know, albeit in another language, when I really want to learn calculus. So I go about reinventing calculus... poorly. For example, cron scripts digesting logs and spewing reports have been around 30+ years and Perl has been around for almost 20; this pattern should be well documented.

      Another example is using JDBC in JSP. I can find ten billion examples of how to do a basic query, but that's not the "correct" way to do things anymore. You're supposed to put the business logic and query code in beans and the presentation layer in the JSP page. Can I find any design patterns or tutorials on this? No, not unless I want to spend 500 pages learning a product like Struts.

      Anyhow, Chiano, I didn't mean to vent at you and don't think you deserved it... it just happened :). Thanks everyone for your input, you've given me a lot of ideas and I have a few of my own... I'll keep you posted.

      Thanks, :)

      Jimbus

      PS: I should have said I'm filling up /home. I'd thought /home was an alias for /usr/home like in freebsd, but I was mistaken
        If you are working with Java, you must be earning a lot of money (why complain?). You might be interested in some of these books (more than 500 pages) and personal seminars: Bruce Eckel's books. He is the best of his kind and is never mentioned at Sun :) .

        What I find disappointing about Java is that there is so much to learn, and you are never sure it won't all change in the next release. And so many megabytes to download!

        Besides, you are right: Perl is MacGyver-like, all resourceful. You could record the output of every Perl script in your crontab in a provisional log file, then compare it with the output you get after launching each script from the command line. Is it the same?

        Perl lets us return to our basic way of solving everything just by trying things out and perhaps reinventing several wheels. But you don't have to ask for the rules any longer. You create your own rules...
