http://qs321.pair.com?node_id=688185

chexmix has asked for the wisdom of the Perl Monks concerning the following question:

Good pre-long-weekend, Monks -

I recently reached a point with my 2nd big Perl project where it seemed to work okay, so I ran it on the whole dataset I needed to chew through.

It took 19 minutes to complete.

Now, since the scientist I'm working with at one point said "It can run all night if it needs to" I figure 19 minutes may be acceptable - BUT: 19 minutes is pretty much forever anyway, so I decided to look into profiling and optimizing it, since I didn't know much about either.

The code for this piece is a few hundred lines long so I don't want to post the whole thing - I hope it will be sufficient for purposes here to DESCRIBE it, and give a sample of some of the profiling output I'm seeing.

I am interacting with a huge database of astronomical observation info. My program takes as input a list of 'observation numbers.'

For each distinct observation number, there many be 1-7 individual 1024x1024 px CCD detectors involved in the observation. So the first thing the program does is query the database to see which detectors are involved.

Then, for each detector that IS involved, it creates a 'virtual CCD' or histogram of the detector in memory by once again querying the database and getting information about each 'hit' on the detector for that observation number.

Whew ... THEN the fun part begins (or so I thought). The program then crunches over the 'virtual CCDs' looking for anomalous 'hit pileups,' #s of 'hits' that go over a certain predefined threshold and _may_ indicate some sort of technical problem with that detector (but which at least deserve closer scrutiny). It does this by passing a sliding window a certain # of px wide over the virtual CCD, in both the x and y axes, and reporting back the # of hits within the window, comparing the count, then, to the established threshold.

Now for the (to me anyway) mystery. When I initially took this on it seemed that what was wanted was something that would hit the db again and again and again with a sliding window to build up the histogram/virtual CCD ... when it was demonstrated to me that this would put an enormous strain on the database, I decided to build the picture of 'hits' in memory first and then iterate over it (e.g. putting the burden on Perl as opposed to the database). I was still hitting the database multiple times for each observation number, but nowhere near as often as I would have done.

Cough. Okay. So I ran the program on a small set (13) of observation numbers, with Devel::Profile and took a look at the results. I was surprised. Here are the first few lines of prof.out:

time elapsed (wall): 23.7359 time running program: 23.7131 (99.90%) time profiling (est.): 0.0227 (0.10%) number of calls: 1572 %Time Sec. #calls sec/call F name 81.65 19.3624 51 0.379655 DBI::st::execute 6.88 1.6309 1 1.630945 Term::ReadKey::ReadLine 5.15 1.2221 63 0.019399 main::process_x 4.79 1.1370 60 0.018949 main::process_y 0.27 0.0629 0 0.062879 * <other> 0.19 0.0446 1 0.044613 <anon>:sliding_bin:23

... so it is STILL the db work that is making the program slow. Is this just in the nature of the beast, or do these prof.out lines indicate I have dones something horribly wrong (or are these top lines not suff to show that)?

Note, the entire dataset is over 500 observation numbers long. So that is a lot of db calls, and a lot of processing of the results of each call.