http://qs321.pair.com?node_id=367076


in reply to Iteration speed

Describing your problem in terms that only another biochemist will understand means that most of us here will only be able to guess at what your program needs to do.

The best way to speed up iteration is to avoid iterating. Lookups are fast.
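
For instance, instead of scanning a list of records for every query, index them in a hash once and look them up directly. (A generic sketch -- the record layout and IDs below are invented, since we haven't seen the real data.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical records -- stand-ins for whatever the parsed data holds.
    my @records = (
        { id => 'seq001', value => 42 },
        { id => 'seq002', value => 17 },
        { id => 'seq003', value => 99 },
    );

    # Build the index once (a single pass) ...
    my %by_id = map { $_->{id} => $_ } @records;

    # ... then every subsequent lookup is a hash access, not another scan.
    my $rec = $by_id{'seq002'};
    print "$rec->{id} => $rec->{value}\n" if $rec;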

If your dataset is too large to fit in memory, forcing you to re-read files, then the first pass I would make is to avoid having to re-parse the files each time. A pre-processing step that parses your files into convenient data structures and then writes these to disk in packed or Storable binary format would probably speed up the loading considerably.
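
Something along these lines (the file names and the tab-separated record format are invented purely for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Storable qw( store retrieve );

    # One-off pre-processing pass: parse the raw text file into a hash
    # and freeze it to disk.
    my %data;
    open my $in, '<', 'rawdata.txt' or die "rawdata.txt: $!";
    while (<$in>) {
        chomp;
        my ( $key, @fields ) = split /\t/;
        $data{$key} = \@fields;
    }
    close $in;
    store \%data, 'rawdata.stor';

    # Every later run skips the parsing entirely and just thaws the structure:
    my $href = retrieve 'rawdata.stor';
    print scalar( keys %$href ), " records loaded\n";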


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon

Re^2: Iteration speed
by jepri (Parson) on Jun 16, 2004 at 13:05 UTC
    Oh, there's a few of us around :)

    The problem, as noted by others, is that we can't see his code to make suggestions. Shrug. Can't help much there. He doesn't even say if he's using the Perl bioinformatics modules or if he's rolled his own.

    In any case though, this is a problem that is begging for a parallel processing solution. In general, I'd recommend he break up the dataset and run it on all the machines in the lab. I doubt that there are many algorithmic improvements that can beat adding another 5 CPUs to the task.
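
    Something as crude as round-robin splitting of the input would do for a first cut, assuming the records can be processed independently (which may not hold if the job is all-against-all comparisons). The file names here are made up:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split one big input file into N independent chunks, one per CPU/machine.
    my $chunks = 5;
    my @out = map {
        open my $fh, '>', "chunk$_.txt" or die "chunk$_.txt: $!";
        $fh;
    } 1 .. $chunks;

    open my $in, '<', 'dataset.txt' or die "dataset.txt: $!";
    my $i = 0;
    while (<$in>) {
        print { $out[ $i++ % $chunks ] } $_;
    }
    close $_ for $in, @out;

    # Then run the existing script on each chunk in parallel:
    #   perl process.pl chunk1.txt &   (or ship the chunks to the other lab boxes)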

    ___________________
    Jeremy
    I didn't believe in evil until I dated it.

      I know there are a few of you guys around, but the description left me (and, judging by the responses, a few others) completely cold :)

      Belatedly, I have begun to think that this problem is related to a previous thread. If that is the case, I think that an algorithmic approach similar to the one I outlined at Re: Re: Re: Processing data with lot of math... could cut the processing times to a fraction of a brute-force iteration. As I mentioned in that post, my crude testing showed that by limiting the comparisons to a fraction of the possibles using an intelligent search, I can process 100,000 coordinates and find 19,000 matching pairs in around 4 minutes without trying very hard to optimise.
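
      For the curious, the general idea looks something like this: bucket the coordinates into a coarse grid so that only points in the same or neighbouring cells ever get compared. This is a generic spatial-binning sketch with invented random data and cutoff, not the code from the linked thread:

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $cutoff = 5;                      # pair up points closer than this
      my @points = map { [ map { rand 100 } 1 .. 3 ] } 1 .. 10_000;

      # Bucket every point into a cell of side $cutoff; a matching pair can
      # only span the same or adjacent cells, so the all-against-all scan is avoided.
      my %cell;
      for my $i ( 0 .. $#points ) {
          my $key = join ',', map { int( $_ / $cutoff ) } @{ $points[$i] };
          push @{ $cell{$key} }, $i;
      }

      my @pairs;
      for my $i ( 0 .. $#points ) {
          my ( $cx, $cy, $cz ) = map { int( $_ / $cutoff ) } @{ $points[$i] };
          for my $dx ( -1 .. 1 ) {
              for my $dy ( -1 .. 1 ) {
                  for my $dz ( -1 .. 1 ) {
                      my $key = join ',', $cx + $dx, $cy + $dy, $cz + $dz;
                      for my $j ( @{ $cell{$key} || [] } ) {
                          next unless $j > $i;          # count each pair once
                          my $d2 = 0;
                          $d2 += ( $points[$i][$_] - $points[$j][$_] )**2 for 0 .. 2;
                          push @pairs, [ $i, $j ] if $d2 < $cutoff**2;
                      }
                  }
              }
          }
      }
      printf "%d pairs within %d units\n", scalar @pairs, $cutoff;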

      I agree that a distributed search would perform the same task more quickly, but the additional complexity of setting up that kind of system is best avoided if it can be. And if this is the same problem, avoiding it is easily possible: my test code from the previous thread is under 60 lines including the benchmarking.

      What stops me offering that code here is the lack of: 1) a clear description of the problem, and 2) some real test data.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon