Re: Optimize my code with Hashes

by jbert (Priest)
on Aug 27, 2008 at 10:46 UTC (#707127)

in reply to Optimize my code with Hashes

Your process takes 18 hours to run. You want it to be less.

You need to work out what the computer spends most of its time doing during those 18 hours, so that you can reduce the most costly elements first and get the biggest improvements.

Prime possibilities are:

  • Waiting for a network response (latency). A lot of network protocols are request/response. If you need to do 50k requests and a single request takes 1s to process remotely, you'll end up waiting 50,000 seconds, or about 14 hours.
  • Waiting to swap stuff in and out. If your app's working set is bigger than the available RAM, performance will plummet while it waits on the disk that backs virtual memory.
  • CPU. If your app isn't waiting for anything else, it's trying to crunch through code. This is the case where profiling can help the most.
  • Other disk I/O. If your process is reading from and writing to storage, that can be slow, particularly if you are on a slow device and/or are doing a lot of flushing after small writes (though sometimes that is what you need to do).
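The latency arithmetic in the first bullet can be written down as a tiny sketch. The helper name serial_wait_hours is made up for illustration, and the inputs are the round numbers from the bullet, not measurements:

```perl
use strict;
use warnings;

# Estimate the total time spent waiting when requests are issued one at a
# time and each one blocks until the remote side answers.
# (Hypothetical helper; the inputs below are illustrative round numbers.)
sub serial_wait_hours {
    my ($requests, $secs_per_request) = @_;
    return $requests * $secs_per_request / 3600;
}

# 50k requests at 1s each
printf "%.1f hours\n", serial_wait_hours(50_000, 1);
```

The point is only that per-request latency multiplies by the request count when nothing runs in parallel.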

So you need to run some monitoring tools ('top' is a good start, but Solaris has many others - check out vmstat and iostat too). These will tell you which of the above issues is the problem. (Well, if it's not CPU, swapping or other disk I/O, then it's probably network latency.)

And after all that about measuring first?

My best guess is that you're doing 50K LDAP operations and network latency is killing you (i.e. your box isn't busy when you're doing this, but perhaps your LDAP server (or network) is).

Look for a bulk import/export tool for your LDAP server and use that instead.
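Bulk tools for LDAP servers commonly read LDIF, so one possible route is to write the extract out as LDIF and load it in one go. Below is a minimal sketch under loud assumptions: that field 0 of each record is the employee number and field 1 the name, and that the base DN, object class and attribute names (all placeholders here) suit your directory. It also does not base64-encode values, which LDIF requires for some characters.

```perl
use strict;
use warnings;

# Turn one pipe-delimited record into an LDIF "add" entry.
# ASSUMPTIONS: field 0 is the employee number and field 1 the name; the
# base DN, objectClass and attribute names are placeholders, not taken
# from the original code.
sub record_to_ldif {
    my ($record, $base_dn) = @_;
    my ($en, $name) = split /\|/, $record;
    my ($surname)   = split /,\s*/, $name;
    return join "\n",
        "dn: uid=$en,$base_dn",
        "changetype: add",
        "objectClass: inetOrgPerson",
        "uid: $en",
        "cn: $name",
        "sn: $surname",
        "";
}

# Entries in an LDIF file are separated by a blank line.
my @records = ('069836|Henion,David|A|Active');
print record_to_ldif($_, 'ou=people,dc=example,dc=com'), "\n" for @records;
```

Check your server's documentation for the exact import tool and the schema it expects before relying on anything like this.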

Re^2: Optimize my code with Hashes
by sukhicool (Initiate) on Aug 27, 2008 at 11:05 UTC
    The whole script took 68258 secs.

    The LDAP update/add loop alone took:

    the code took:67358 wallclock secs (29935.47 usr 60.38 sys + 0.00 cusr 0.00 csys = 29995.85 CPU) in considering each entry from PeopleFirst extract ...

    1. Do older versions of Perl have bad algorithms for handling hashes?
    2. Will it help if we use arrays instead of hashes?

    If upgrading the Perl version will help, I will try to convince the management, if you can help me locate a URL which says so.
      the code took:67358 wallclock secs (29935.47 usr 60.38 sys + 0.00 cusr 0.00 csys = 29995.85 CPU) in considering each entry from PeopleFirst extract ..

      That's interesting. So 29935/67358=44% of your time was spent on user CPU. That is significant, and you might want to look into profiling the app's CPU usage (using Devel::Profile or Devel::DProf).

      Of course, it also means that 56% of your time is spent doing other things. If that is net latency then you'd do well to look at bulk import/export instead.
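To make that split explicit, here is a minimal sketch that turns the Benchmark figures quoted above into fractions of wallclock time. time_breakdown is a made-up helper name, and it counts usr + sys as CPU, so its result differs slightly from the user-only figure.

```perl
use strict;
use warnings;

# Split a run into CPU time and "everything else" as fractions of the
# wallclock time. (Hypothetical helper; arguments are in seconds.)
sub time_breakdown {
    my ($wall, $usr, $sys) = @_;
    my $cpu = ($usr + $sys) / $wall;
    return (cpu => $cpu, other => 1 - $cpu);
}

# Figures from the timing line: 67358 wallclock secs, 29935.47 usr, 60.38 sys
my %share = time_breakdown(67358, 29935.47, 60.38);
printf "CPU: %.1f%%  other: %.1f%%\n", 100 * $share{cpu}, 100 * $share{other};
```

The "other" share is whatever the process spent blocked: network, disk, swap, or waiting to be scheduled. The numbers alone do not say which.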

      Your timestamp logging appeared to show ~6 secs for one request, is that right? That can't be representative: as noted elsewhere in this thread, 50k updates at 6s each would take about 83 hours, so you'd never manage them in 18.

      Lastly, if you do profile the app, then it will probably benefit you to produce a cut-down version which runs more quickly. This is useful because the profilers slow things down and generate large amounts of data - they'll probably break on such a big run.

      Also, having a more quickly repeatable test case (e.g. ~10mins) will greatly accelerate your ability to test ideas on code and algorithm changes.

      However, the hard part is knowing if your cut-down test case has the same performance profile as your main job run.
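One simple way to get a cut-down run is to process only a fixed subset of the input. The sketch below is an assumption about how that could look, not the original code: sample_entries is a made-up name and the data is invented. Taking the first N keys is the crudest possible sample, so it may well not have the same performance profile as the full run.

```perl
use strict;
use warnings;

# Return a reference to a hash holding at most $limit entries of the input.
# Keys are sorted first so that repeated runs pick the same subset.
# (Hypothetical helper for building a repeatable cut-down test case.)
sub sample_entries {
    my ($data, $limit) = @_;
    my @keys = sort keys %$data;
    $#keys = $limit - 1 if @keys > $limit;   # truncate the key list
    my %subset;
    @subset{@keys} = @{$data}{@keys};        # hash slice copies the chosen entries
    return \%subset;
}

# Invented stand-in data; a real run would use the extract itself.
my %all   = map { sprintf('%06d', $_) => "record $_" } 1 .. 1000;
my $small = sample_entries(\%all, 100);
printf "%d of %d entries kept\n", scalar(keys %$small), scalar(keys %all);

# A profiled run of the cut-down script (script name is a placeholder):
#   perl -d:DProf cutdown.pl && dprofpp
```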

      Another thought: if the 'missing' 56% of your time falls overnight, you might be sharing a network with a backup job, or something else which saturates the net and makes your network response times very slow.

      You are asking the wrong questions. You already blame the hashes without really knowing what wastes all this time. Everyone above told you to do some profiling first and that is a really good idea. Often it is the algorithm used that kills the time.

      For example: if your program often makes a copy of your hash, that will really thrash your memory and cost time. Switching to an array won't make that any faster.
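To illustrate the copying point, a minimal sketch with invented data and made-up sub names: passing a hash by value flattens it and rebuilds a full copy inside the sub on every call, while passing a reference copies nothing.

```perl
use strict;
use warnings;

# Copies the whole hash on every call: each key and value is duplicated.
sub count_active_copy {
    my (%info) = @_;
    return scalar grep { /\|Active\b/ } values %info;
}

# Receives a reference: no copy is made, however large the hash is.
sub count_active_ref {
    my ($info) = @_;
    return scalar grep { /\|Active\b/ } values %$info;
}

# Invented sample records in the same pipe-delimited style.
my %pfinfo = (
    '000001' => '000001|Example,One|A|Active',
    '000002' => '000002|Example,Two|I|Inactive',
);

print count_active_copy(%pfinfo), "\n";   # hash is flattened and copied
print count_active_ref(\%pfinfo), "\n";   # only a reference is passed
```

Both calls give the same answer. The difference only shows up in memory use and time once the hash is large and the sub is called often.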

      Also, you didn't show us what the code does with the hash you blame. How should we be able to tell you whether an array is better?

      So either post some code here or do some profiling or both.

        Sorry if anyone is hurt by my wrong questions.

        Let me show some code. The following hash is generated by a subroutine:

            %pfinfo = (
                '069836' => '069836|Henion,David|A|Active|010474|HAWKEY,Michael G|SC3798|...',
                '025939' => '025939|Picard, Stephane|A|Active|010101|LEPINE,Thibault|SG8778|...',
                ...
            );

            my $timee0 = Benchmark->new;
            foreach my $en (keys %pfinfo) {
                logAndSkip(\*LOG, "Considering the entry from PeopleFirst extract: $en...") if ($log);

                # Get the PF information
                my @pfi = split /\|/, $pfinfo{$en};

                # If the employee number does not exist in ED, it looks like a creation
                if (!exists $ed_en{$en}) {
                    createEDentry(\*LOG, \@pfi, \%used_dn, \%en2dn);
                }
                # Looks like an ED entry update
                else {
                    updateEDentry(\*LOG, $en2dn{$en}, \@pfi, \%en2dn);
                }
            } # End foreach
