PerlMonks  

I cancelled my Christmas vacation

by pg (Canon)
on Dec 22, 2003 at 01:04 UTC ( [id://316281]=perlmeditation )

How would you feel if your application behaved randomly, passing all testing today and failing every single test case tomorrow? Especially when the application is already in production.

I cancelled my vacation, or to be more precise, my company asked me to cancel my vacation and carry it over to next year.

So this is a big deal. One month ago, someone made a small change to our schedule system (not the kind that schedules applications, but the one that schedules trucks and deliveries to our stores). It was all good for one whole month, but at the beginning of last week two orders were mis-scheduled, and goods got delivered to stores two days later than they should have been. That is indeed a big problem, considering that Xmas is the biggest time for my company.

So people started to worry and started to re-test in our UAT (User Acceptance Testing) system. Then the strangest thing happened. All test cases failed. Everyone had a long face, especially those who did the UAT testing before the change was moved to the production system, as they clearly remembered that all test cases passed at that time. So everyone freaked out...some people started to look into the code.

People continued to test and investigate the code, but on the second day all test cases passed, with no code change. When I heard that, I cautioned everyone that this was in no way good news but very, very bad news: passing the tests no longer meant anything, because the results were random.

It continued into the third day, and again, all test cases passed. They even repeated the production problem in UAT, and it passed! Everybody was confused, and all kinds of theories started to fly...the mysterious randomness of the system...

Now, as the lucky guy, I was caught in the middle of this. The system is written in C, and I am the C guru, at least in others' perception. I started with little hope, and I actually cautioned everyone not to bet too much on me, as three people had already been banging their heads on this code for three days with absolutely no progress.

Once I started to look at it, I had to agree with the others that the code was really ugly. Nobody in this company wrote it; it was bought from a bankrupt UK company. I wish they had gone bankrupt before they sold this garbage to other people.

Ten long hours passed, and I still had no idea why it behaved randomly. During those 10 hours I chatted with the programmer who made the changes, and all the changes looked fine to me. In the 11th hour, a set of qsort and bsearch calls (they had not been touched, but...) suddenly caught my eye. I stared at them for a good 30 seconds and told myself: yes, that's it!

The qsort and bsearch calls used a set of fields as a unique key. With the recent change, that key lost its uniqueness; to restore it, a new field needs to be added to the key. With a non-unique key, the result of bsearch is simply unpredictable: it returns whichever matching row it happens to hit first. The result can change based on the position of the records and the length of the table being searched. A small 10-line code change will resolve this entire mystery.
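To make the failure mode concrete, here is a minimal sketch in C. It is not the real code: the `order` struct, its field names, and the helper functions are all invented for illustration. It assumes the old key was two fields and that one extra field (here called `route`) restores uniqueness. The C standard leaves it unspecified which element bsearch returns when several compare equal, so the old comparator can hand back either duplicate.

```c
#include <stdlib.h>

/* Hypothetical record layout; the real system's fields are not known. */
struct order {
    int store;  /* part of the original key */
    int day;    /* part of the original key */
    int route;  /* assumed new field that makes the key unique again */
    int truck;  /* payload we want to look up */
};

/* Old comparator: compares (store, day) only. */
int cmp_old_key(const void *a, const void *b)
{
    const struct order *x = a;
    const struct order *y = b;
    if (x->store != y->store)
        return (x->store > y->store) - (x->store < y->store);
    return (x->day > y->day) - (x->day < y->day);
}

/* Fixed comparator: (store, day, route). */
int cmp_full_key(const void *a, const void *b)
{
    const struct order *x = a;
    const struct order *y = b;
    int c = cmp_old_key(a, b);
    if (c != 0)
        return c;
    return (x->route > y->route) - (x->route < y->route);
}

/* Sort the table in place, then search it with the same comparator. */
struct order *find_order(struct order *table, size_t n,
                         const struct order *key,
                         int (*cmp)(const void *, const void *))
{
    qsort(table, n, sizeof *table, cmp);
    return bsearch(key, table, n, sizeof *table, cmp);
}

/* Returns 1 if two adjacent rows of a sorted table compare equal,
 * i.e. the key used by cmp is not unique for this data. */
int has_duplicate_keys(const struct order *table, size_t n,
                       int (*cmp)(const void *, const void *))
{
    size_t i;
    for (i = 1; i < n; i++)
        if (cmp(&table[i - 1], &table[i]) == 0)
            return 1;
    return 0;
}
```

With `cmp_old_key`, a lookup for store 7, day 3 may return either of two rows that share that key, and which one depends on where they land in the table. With `cmp_full_key`, the lookup is unambiguous. A check like `has_duplicate_keys` run after the sort would have flagged the problem as soon as the data stopped being unique.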

I still cannot go back to my vacation, as on Monday I have to implement this tiny change and support the UAT testing. Everybody wants it before Xmas and New Year!

Merry Christmas and Happy New Year to every fellow monk!!!

Replies are listed 'Best First'.
Re: I cancelled my Christmas vacation
by tachyon (Chancellor) on Dec 22, 2003 at 03:35 UTC

    Sorry you are missing your vacation. We have had a couple of horrors like that. One was detailed at The 10 stages of Bug hunting and the sweet smell of success.

    If that doesn't cheer you up, I'll tell you how I spent the first few days of my vacation last week. So I leave the UK with everything humming. Off the plane in Australia at 0130, switch on the phone:

    SMS devel3.xxxxxxx.org has a problem. Please check your email.

    Well, "a problem" was not quite an accurate description. Chernobyl is more like it. Before I left, we noted that our new development server had only half the expected HDD space. It is a RAID V SCSI setup and should have had 143GB available but was only showing 70-odd. This mattered to us because we stream all our production backups to this box. So we asked the datacenter to INVESTIGATE why. They decided they had found the problem and decided to redo the containers (of course they took no backups first, nor had we authorised them to change anything). The short story is that the change to the RAID config led to the system rebuilding the containers, which apparently crashed at 94%, and the machine refused to boot. In fact it was completely wiped, as you would expect when you bugger up the RAID.

    No great problem, I think; I'll just restore it from image. But you know Murphy. As luck would have it, the machine had been killed while it was writing the daily image to remote backup. We only keep a single full image of this machine due to size constraints and the fact that the data is so volatile (yes, that policy is about to change!). And that was now >corrupt<

    Net result: a hand rebuild from source followed by restoration of the devel environment. Although we didn't have a full image, we did have everything needed in the sequenced backups, but it all had to be done by hand. Just what you feel like with a serious case of jet lag on your first days off in a year.

    cheers

    tachyon

      Roger, revdiablo and tachyon, thank you so much for trying to cheer me up. As tachyon's reply is the latest one at this time, I thought I would just attach my reply here.

      Not being able to take my vacation is not 100% negative:

      • Now I can continue hanging out with you guys. If I go on vacation, I will 100% stop using computers and just pretend that I have never heard of the concept of a computer.
      • I can have a long vacation next year, with the addition of what I carried over.
      • Luckily I got it resolved, so the bug actually created joy for me. Thanks.

      It was actually an interesting experience, as the place where the bug was found had not been modified, so one could easily miss it while debugging. In this case, nobody added a piece of buggy code; rather, someone missed a place that required modification. Basically, an existing piece of perfectly correct code was turned into a bug. The challenge was to find the bug in such an unlikely place, and the code was (is) very messy.

      I talked with the others who had looked at the code. The fact that the code was messy greatly reduced their ability to debug it. The moment they opened the code they started to hate it, and I believe a kind of fear started to grow. Code that is less maintainable is not just technically less maintainable; it is psychologically less maintainable too.

      Well indent your code! Code that is not nicely indented can scare people away! This is definitely not a small thing.

      Thanks again, guys!

        Is missing documentation perhaps part of the problem? It looks to me that somewhere someone should have pointed out that parts of the program relied on some fields being unique?

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      i love how that stuff always happens when you're on vacation. last christmas, i was off on vacation and the backup hard drive (just an extra big drive that we used for doing additional quick and dirty backups to) in the webserver died and brought the machine down. fixing it was a simple matter of removing the drive and booting in single user mode to manually change some of the services that were hanging because they couldn't find that drive. not a big deal, but a little more difficult when you have to walk a unix-illiterate coworker through the process on the phone.

      the worst part was that right before i left for vacation, we'd gotten a new server. i'd just gotten it configured but didn't dare deploy it right before leaving for vacation.

      then the last time i went on vacation, the entire northeast US had a blackout. we only ever have downtime or hardware problems when i leave. i think the servers miss me when i'm gone and are just acting out in anger.

      i'm really scared because i'm planning on going to japan for christmas and new year's this year.

Re: I cancelled my Christmas vacation
by revdiablo (Prior) on Dec 22, 2003 at 03:30 UTC

    Good to hear you were able to solve the problem (well, in theory, as you mentioned you haven't actually put it into production yet. But you sound confident...). This should cement your status as guru extraordinaire in the eyes of your coworkers. I have gained a reputation somewhat like this too. Even though I know it's completely untrue, my supervisor likes to say "you can do anything! there's nothing you haven't been able to do!" Unfortunately this comes at a cost -- the projects that you are asked to do, the ones that "you can't fail," will get increasingly difficult. So on that note, I wish you good luck in the future. Hopefully you and I both will be able to keep solving the problems dropped on our plates.

    Merry Christmas, Happy New Year, and Good Luck. 8^)

Re: I cancelled my Christmas vacation
by Roger (Parson) on Dec 22, 2003 at 02:05 UTC
    Good luck pg! Hope that you can still have your vacation before Christmas, now that you have found the source of the problem. Merry Christmas and Happy New Year to you too. ;-)

      I too hope you'll be able to use and enjoy your vacation, pg. You're by far one of the most helpful monks here, to beginners, intermediates, and experts alike; and we all greatly appreciate the wisdom you take the time to share.

Re: I cancelled my Christmas vacation
by SpritusMaximus (Sexton) on Dec 22, 2003 at 16:13 UTC
    I had this happen a while back, though I didn't have to cancel vacation. I was working on a C-based, MS-DOS-based POS system, adding a feature to allow employees to check their timesheet from the terminal. I added the function and went through two weeks of testing with helpdesk and store personnel - it was flawless. Or so I thought. Approximately 2 1/2 months later, on a Monday morning, our helpdesk was flooded with calls. Whenever someone went to print a timecard, the system locked up and rebooted. It was consistent, and the error showed up even in our labs (and on my dev box). I tried setting the system clock back to the date of the previous test, and the error was still there. After two weeks of hunting, I traced the error to a function, in a library I was using, which was reading some memory after it had been free()'d. Why the symptoms of this error had chosen to crop up two months after release, I don't know...
