
Up for Critique

by biograd (Friar)
on Mar 22, 2002 at 20:31 UTC

biograd has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks-

I've been working on a project for my degree, and it's about time to be done with the thing. This script is just one part of it. (See my bio for more info.) Anyway, I feel like I've never had my code looked over by someone who knows what they're doing, so I posted the db-loading program on my scratchpad, along with a sample of the gene data input. If you feel like you have some spare time, I'd appreciate some constructive criticism.

(I didn't use a generic module because I knew that wouldn't be in the spirit of the project as my advisor saw it. If you'd like to advocate your favorite module, I'll try to study it before the next time I have to do something like this. :) )

My main concern is that the program takes too long to run (14-17 hours for 26,000+ records). There is indexing on several of the fields. I've heard that if I didn't index while inserting data, everything might go faster. I've also heard that the indexing makes the multiple db searches needed during the program go faster, so any gain from indexing at the end is lost to the extra search time within the script. Erg. As for the script, I did pass array references to the subroutines instead of copying arrays, but what are some other practical ways that you would optimize it for speed? It blazes on the first thousand records or so, then gets slower as it goes on. I expect that to some degree, since it has to search ever-increasing tables as it progresses, but is 14-17 hours realistic? I am seeing some stuff on optimizing the many regexes in Programming Perl (pp 594-9), but I'm not sure of the best way to apply it. I have read that tr// is faster than s//, which I could use in a couple of places.

Thanks for any comments you can make. If you have questions or need more information, I'll oblige.

-Jenn

Replies are listed 'Best First'.
Re: Up for Critique
by dragonchild (Archbishop) on Mar 22, 2002 at 21:15 UTC
    First things first - the script is very readable. I'm impressed! :-) Also, there is almost enough commenting. *winks*

    To answer your question - yes, there are a number of places you could optimize for speed (and maintainability!).

    1. The biggest one (for both aspects) is to declare your variables when you use them. This improves maintainability because you're scoping variables to only where you need them. This improves speed because you're not having to clear the arrays each time.
      ### FILL ACHROMS TABLE ###########################################
      # Take in whole records separated by five dashes instead of a
      # newline character.
      $/ = "-----";

      while($record = <INFILE>) {
          # Print out a counter to show how far the filling has progressed.
          # print "$n\n";
          # $n++;

          @newrecord = ();

          # Send the record to have any blank lines removed.
          @newrecord = CleanData( $record );

      -----

      ### FILL ACHROMS TABLE ###########################################
      # Take in whole records separated by five dashes instead of a
      # newline character.
      $/ = "-----";

      while(my $record = <INFILE>) {
          # Print out a counter to show how far the filling has progressed.
          # print "$n\n";
          # $n++;

          # Send the record to have any blank lines removed.
          my @newrecord = CleanData( $record );
    2. The second is to pass around references, not the whole array itself. For example:
      @newrecord = CleanData( $record );

      ----

      my $newrecord = CleanData($record);
      There, $newrecord is a reference to an array. You'd get your stuff by using $newrecord->[5] instead of $newrecord[5]. If your arrays are large, this is a significant savings.
    3. CleanData() should be using grep. Something like:
      sub CleanData {
          my ($record, @cleanrecord, $line, @record);
          $record = $_[0];

          ### Separate the record by newlines.
          @record = split /\n/, $record;

          ### Remove empty lines if they exist.
          foreach $line ( @record ) {
              chomp( $line );
              next if ( $line =~ m/^\s*$/ );
              push @cleanrecord, $line;
          }
          return ( @cleanrecord );
      }

      ----

      sub CleanData {
          my $record      = shift;
          my @cleanrecord = grep { /\S/ } split $/, $record;
          return \@cleanrecord;
      }
      That is much faster and makes it much clearer what you're doing. The $/ is the input record separator (a newline by default), so splitting on it keeps your script portable to Windows or Mac (for example).
    4. Use array slices. For example:
      $sth_update_exon -> execute( @exondata[1,2,3,4], $geneid, $exondata[0] );
      They're faster as well as more readable.
    5. Don't use fetchrow_array(). Use fetch() with bind_columns() instead. This is how the DBI documentation recommends to do things the fastest. In addition, it seems like you're just using fetchrow_array() to find out if anything matched at all. Using fetch() is definitely quicker in that instance. (There's probably an even faster way just to check for existence, but I don't know it.)
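      For illustration, one way that might look (a hedged sketch only - the table and column names come from elsewhere in this thread, and $dbh/$ac_id are assumed to be the script's existing handle and key):

      my $sth = $dbh->prepare('SELECT ac_id FROM achroms WHERE ac_id = ?');
      $sth->execute($ac_id);

      my $found_id;
      $sth->bind_columns(\$found_id);   # bind output column 1 to $found_id

      my $exists = 0;
      $exists = 1 while $sth->fetch;    # fetch() just fills the bound variable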
    6. In ParseAChroms(), you're doing a regex against the same variable over and over. In addition, most of those regexes look to be identical. This makes me suspicious. Try something like:
      sub ParseAChroms {
          my $achromref = shift;

          my $matches = join '|', (
              'BAC:',
              'LENGTH:',
              'OSOME:',
              'CHR_START:',
              'CHR_END:',
              'BAC_START:',
              'BAC_END:',
              'ORIENTATION:',
          );

          my @achrom;
          my $line = join '', @$achromref;

          push @achrom, $line =~ /\b([BY]AC)\b/;
          push @achrom, $line =~ /(?:$matches)\s*(.*?)\s*/g;

          return \@achrom;
      }
      Now, this assumes that all your stuff is going to be in the right order. If you can't guarantee that, don't use an array - use a hash:
      sub ParseAChroms {
          my $achromref = shift;

          my $matches = join '|', (
              'BAC:',
              'LENGTH:',
              'OSOME:',
              'CHR_START:',
              'CHR_END:',
              'BAC_START:',
              'BAC_END:',
              'ORIENTATION:',
          );

          my $line = join '', @$achromref;

          my %achrom = $line =~ /($matches)\s*(.*?)\s*/og;
          ($achrom{TYPE}) = $line =~ /\b([BY]AC)\b/o;

          return \%achrom;
      }
      Now, what's even cooler is that you can access stuff by name, not array index. This is important if you ever plan on changing your input file stuff, maybe adding a new thing you want to track?

      You could even make that regex ahead of time by using qr//, but I'll leave that as an exercise to the reader. And, yes, ParseGenes(), ParseExon(), and ParseSubExon() should be treated the same way.
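      As a quick hedged illustration of the qr// idea (again assuming the field names used in the sketch above), you could compile the alternation once and reuse it:

      my $matches_re = qr/BAC:|LENGTH:|OSOME:|CHR_START:|CHR_END:|BAC_START:|BAC_END:|ORIENTATION:/;

      # later, inside ParseAChroms():
      my %achrom = $line =~ /($matches_re)\s*(.*?)\s*/g;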

    7. A few style notes:
      • Your variable names could be improved with either capitalization (like your subroutine names) or underscores. Doesn't matter which one, just be consistent.
      • When you override $/ (or any similar variable), use local and do it within a block. That makes for good maintainable code.
      • Instead of making a variable for each sth, make a hash of sths and access the one you need. Even better would be to have a subroutine somewhere that does this for you (see the sketch after this list).
      • Don't use a variable named temp. Ever. You can find a better name. Also, when you do, it looks like you're just using it as a boolean test. If so, then just do the thing in the boolean test. (if and while take boolean tests, in case you're wondering.)
      • Don't use the same variable name for both a scalar and an array (such as the aforementioned temp). It's confusing and leads to (sometimes subtle) bugs.
      • Use heredocs with your SQL statements for prepare(). It's more readable. And, put it into a subroutine (preferably in some other module!) Something like:
        my $match_achrom_statement = <<"STH_MATCH";
            SELECT ac_id
            FROM   achroms
            WHERE  ac_id = ?
        STH_MATCH

        my $sth_match_achrom = $dbh->prepare($match_achrom_statement);
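      And, since it was mentioned in the list above, here is a rough sketch of the hash-of-statement-handles idea (the SQL and hash keys here are invented, not taken from your script):

        sub prepare_statements {
            my ($dbh) = @_;
            my %sql = (
                match_achrom  => 'SELECT ac_id FROM achroms WHERE ac_id = ?',
                insert_achrom => 'INSERT INTO achroms (ac_id, length, osome) VALUES (?, ?, ?)',
            );
            my %sth;
            $sth{$_} = $dbh->prepare($sql{$_}) for keys %sql;
            return \%sth;
        }

        my $sth = prepare_statements($dbh);
        $sth->{match_achrom}->execute($ac_id);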

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Heh...yeah, Dragonchild, they want MORE documentation than this. (!) That's OK. I'm basically a language oriented sort of person anyway. Documenting doesn't bother me much.

      You'll notice the date on the script is from last year. There are some things I do now that I didn't do then, such as initializing the locally scoped variables ("my") where they are used instead of way up at the top. Most of the suggestions I see in your note and others I just wouldn't have thought of yet though. So, thanks to all for the ton of good suggestions. I'm not sure how many I'll have time to implement before my deadline, but this is a great help for improving my skill level.

      -Jenn

Re: Up for Critique
by tadman (Prior) on Mar 22, 2002 at 21:07 UTC
    Here's just something that came to mind. Instead of using your massive if structure in ParseAChroms, you could try doing something a lot simpler, such as declaring what you're looking for in a sort of table:
    my %achrom_prefix = (
        BAC         => '(.*)\s*',
        LENGTH      => '(.*)\s*',
        OSOME       => '(.*)\s*',
        CHR_START   => '(.*)\s*',
        BAC_START   => '(.*)\s*',
        BAC_END     => '(.*)\s*',
        ORIENTATION => '(\w)\w*\s*',
    );
    Now, instead of using an "array" of variables, use a hash. This simplifies things massively:
    sub ParseAChroms {
        my ($achromref) = @_;
        my (%value);

        foreach (@$achromref) {
            my ($prefix, $data) = /^(\S+):\s+(.*)/;
            if (my $pattern = $achrom_prefix{$prefix}) {
                ($value{lc $prefix}) = $data =~ /^$pattern$/;
            }
        }

        return @value{qw[ type ac_id length chr_num chr_start
                          chr_end bac_start bac_end orient ]};
    }
    Now, I've just typed this into the editor here, so this is really just theory and not tested, but the idea is that this can be applied to a few other areas where you do the same thing.

    Further, instead of passing arguments back and forth in a particular fixed order, why not return a HASH-ref? This makes it easy to add new data to the return value without breaking other functions which use it.
      There were a couple reasons I went for the wholesome all-array approach...one of them is I have just never been as comfortable with hashes as I have been with arrays. This is no excuse, of course, and I've been meaning to use them more. I just always see the array algorithm first. (Can you tell I'm new???) Second, I'd read in Mastering Algorithms with Perl about the speed of arrays over hashes, so I felt justified. (Webadept noted this below too.)

      However, I can see the utility of hashes from many of these examples, yours and others below, so I know I'll have to bite this bullet soon. Thanks for the example, tested or not. :)

      -Jenn

        Speed vs. Maintainability/Scalability

        If the difference in speed is small, I say drop the arrays and use hashes ... but I am lazy. I have this aversion to something called 'typing'. ;)

        But, arrays are not always faster than hashes. It all depends upon context - if you have to scan the entire array to find something, a hash implementation will probably be faster. But if you have to iterate over all elements - then an array is probably the better choice. Consider the following Benchmark code:

        use strict;
        use Benchmark;

        my %h;
        my @a = ('a'..'zzz');
        @h{@a} = (1) x @a;

        timethese('-10', {
            find_array    => sub { return (grep /^ccc$/, @a) ? 1 : 0 },
            find_hash     => sub { return $h{'ccc'} },
            iterate_array => sub { do {} for @a },
            iterate_hash  => sub { do {} for keys %h },
        });

        __END__
        yields on my dual proc 400 linux box: (YMWV)

        find_array:    10 wallclock secs (10.11 usr + 0.04 sys = 10.15 CPU) @ 45.62/s (n=463)
        find_hash:     13 wallclock secs (11.41 usr + 0.07 sys = 11.48 CPU) @ 1183591.72/s (n=13587633)
        iterate_array: 11 wallclock secs (10.98 usr + 0.09 sys = 11.07 CPU) @ 70.28/s (n=778)
        iterate_hash:  11 wallclock secs (10.78 usr + 0.05 sys = 10.83 CPU) @ 16.53/s (n=179)

        jeffa

        L-LL-L--L-LL-L--L-LL-L--
        -R--R-RR-R--R-RR-R--R-RR
        B--B--B--B--B--B--B--B--
        H---H---H---H---H---H---
        (the triplet paradiddle with high-hat)
        
Re: Up for Critique
by scain (Curate) on Mar 22, 2002 at 21:21 UTC
    biograd,

    This is just a few quick comments after looking at your code:

    • Your program works: congrats; all else is gravy.
    • There are several .* 's in your regexes; be sure to take a careful look at those. Could they be more carefully defined?
    • The fact that your script zips through the first several records indicates that there is a database problem more than a perl problem. Certainly the queries are getting slower as the database grows. Try turning the indexing off; I understand that you will lose some speed with your selects. The only way to know which way is better is to try it.
    • Carefully consider if you really want to do all of those selects while you are inserting your data. Could some of the information be stored in a hash or array? That would really boost your speed.

    Anyway, it largely looks like good work. You should be very irritated with your university for not letting you take database design classes. Your advisor should be going to bat for you here. If you want some back up from someone in industry, /msg me.

    Scott

Re: Up for Critique
by derby (Abbot) on Mar 22, 2002 at 21:16 UTC
    For starters drop the while( $record = <INFILE>) and try the more idiomatic while( <INFILE> ) then use the implicit $_ instead of $record.

    Use references to pass into and receive back from your subroutines. Don't bother initializing them in the calling block:

    @newrecord = ();
    # Send the record to have any blank lines removed.
    @newrecord = CleanData( $record );

    -vs-

    $newrecord = CleanData( $record );

    sub CleanData {
        my( $record ) = $_[0];
        my( $cleanrecord );

        ### Separate the record by newlines.
        my @lines = split /\n/, $record;

        ### Remove empty lines if they exist.
        foreach ( @lines ) {
            chomp;
            next if /^\s*$/;
            push @$cleanrecord, $_;
        }

        $cleanrecord;
    }

    Also, take a look at your regexes. Try to reduce those to a more reasonable number (and slap the o option on the end to optimize them).

    if ($line =~ m/CHROMOSOME:\s*(\d*)\s*$/) {
        $chrome_num = $1;
    }

    # Get TIGR model's reference id
    if ($line =~ m/MODEL_NAME:\s*(.*)$/) {
        $tigr_id = $1;
    }

    # Get the locus on the bac where the gene was cloned.
    if ($line =~ m/BAC_LOCUS:\s*(.*)\s*$/) {
        $bac_locus = $1;
    }
    ....

    might go better as

    if( $line =~ m/(.*?):\s+(\S+)\s+/o ) {
        $type = $1;
        $info = $2;
    }

    if( $type eq "CHROMOSOME" ) {
        # do whatever
    }
    elsif( $type eq "NUM_FOO" ) {
        # do num foo
    }

    You get the idea. Running multiple regexes against a single line can take forever. Try to reduce the regexes into simpler one(s) that will run once against the line and also ensure you use the 'o' flag so you're not spinning your wheels always compiling the regexes.

    But I think all of those type changes are going to have a minimal impact. You really need to concentrate on your db's performance and what's going on there. I'm not sure about MySQL, but some databases will only commit a "transaction" after so many inserts (does the mysql/innodb table type do that?). You can configure that number to be larger and reduce the number of commits.

    There may be other stuff, but that's just one person's opinion. It's kinda hard to do without being able to run the stuff. Have you tried profiling the script to see where you're spending all your time? At a minimum, you could liberally sprinkle Benchmark timings before and after sections of the code and then use timediff to see how long the snippet took.
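    For example, one hedged way to do that Benchmark sprinkling (nothing here is from your script, it's just the shape):

    use Benchmark;

    my $t0 = Benchmark->new;
    # ... the section you suspect, e.g. the parsing or the inserts ...
    my $t1 = Benchmark->new;

    print "section took: ", timestr( timediff($t1, $t0) ), "\n";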

    Good luck. ++ for an interesting post.

    -derby

    update: ferrency is right about the optimizations. sorry for the mislead.

      derby wrote: and also ensure you use the 'o' flag so you're not spinning your wheels always compiling the regexes.

      AFAIK, the 'o' flag is only useful if your regex contains variables. If your regex is a constant at compile time, it will only be compiled once in any case.

      biograd coded:

      # Get reference number.
      if ($line =~ m/BAC:\s*(.*)\s*$/) {
          $ac_id = $1;
      }

      This regex may not do exactly what you want it to. Specifically, it's not going to remove trailing spaces from the line. The (.*) is greedy, and the following \s* allows a match of 0 whitespace characters, so the whitespace will always end up being matched in the ()'s.

      There are many ways to fix this, but the easiest might be to simply match (\S*) instead of (.*) - some people believe .* should be avoided at all costs anyway. However, I like derby's solution: replace all of the regexes with one more general regular expression.

      Update: Looking more closely at your code, your biggest bottleneck is almost definitely in the database and not in the perl. If you can avoid interspersing selects and inserts by keeping a copy of data you've already inserted, you'd be able to remove the table indices until it's built, and gain a lot of speed. Alternately, 26k records aren't really that many... you might consider preprocessing the records in memory into a list or hash, and then insert them all at once instead of processing the data in the order it is in the input file.
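      To make that concrete, a very rough sketch of the "preprocess first, insert later" idea - parse_record() and $sth_insert_achrom are hypothetical stand-ins here, not names from the actual script:

      my %seen;    # parsed records keyed on their id

      while (my $record = <INFILE>) {
          my ($ac_id, @fields) = parse_record($record);   # hypothetical parser
          $seen{$ac_id} = \@fields;                       # duplicates resolved in memory
      }

      # indexes could be dropped here and rebuilt after the bulk load
      while (my ($ac_id, $fields) = each %seen) {
          $sth_insert_achrom->execute($ac_id, @$fields);
      }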

      Alan

        ferrency- I agree the biggest problem is probably the db calls. My techie fiancee also read your update and said he'd thought of that the whole time...

        If I get a chance to do a major re-work I will almost surely sort everything into memory first, then do a flurry of inserts/updates. I graduate this May, but I'm interested in getting this thing as fast as possible. I leave it in the hands of my advisor and his future grads to maintain and use (maybe as a cron job, or whenever TIGR sends out an update) and I'm having pity on them. ;)

        -Jenn

      Thanks for the ++, derby. I like the regex squashing you did.

      The MySQL dbms doesn't do transactions in the version we have. I think the newest version is incorporating them. Tables are of type "MYISAM".

      For any interested persons... Version info:

      mysql Ver 10.12 Distrib 3.23.25-beta, for pc-linux-gnu (i686)

      -Jenn

Re: Up for Critique
by webadept (Pilgrim) on Mar 23, 2002 at 03:39 UTC
    This has mainly been a Perl discussion, as it should be on the Monks, but there have been a few suggestions here about turning off indexing and whatnot. So I thought I would throw my two bits in on that subject.

    I'm going to use a simple data structure here.
    CREATE TABLE whois_e (
        id_num int(11) DEFAULT '0' NOT NULL,
        email  int,
        url    int,
        query  char(128),
        sent   int(11) DEFAULT '0' NOT NULL
    );
    Now if you create indexes on id_num, email, url and sent like this
    create index whosis_1 on whois_e(id_num);
    create index whosis_2 on whois_e(email);
    create index whosis_3 on whois_e(url);
    create index whosis_4 on whois_e(sent);
    and do a query something like this
    select * from whois_e where email = '$email' and url = '$url' and sent = '1'
    then you aren't going to see a significant speed gain; in fact things will probably be a bit slower.

    Indexes should be designed sparingly and to the application's needs. If this is a query I'm going to use often, the index should be set up more like this:
    create index whosis_5 on whois_e(email, url, sent);
    This will improve this query by a noticeable amount. Also, any table should have a primary key; I do this whether it needs it or not.
    CREATE TABLE whois_e (
        id_num int(11) unsigned DEFAULT '0' NOT NULL auto_increment primary key,
        email  int unsigned not null,
        url    int unsigned not null,
        query  char(128),
        sent   int(11) DEFAULT '0' NOT NULL
    );
    Indexing will speed up a database incredibly, but it needs to be done right and sparingly. I've seen some programs with indexes on every field, and that just makes two data tables - what's the sense in that? :-)

    A table without an index of some sort is just a collection of unordered rows. To find a row or set of rows from such a table, the engine has to examine every single row in the table. Even if it finds it at the top, unless you've put a limit on your query, it's going to search through the rest of them. This is wasted time and CPU power. With an index, however, the values are sorted and keyed to the table. The engine knows to stop looking once it finds the answer, because it knows anything that follows will not match the query set. That's a huge advantage.

    Looking at your queries and finding the common keys in the WHERE clauses will help you build suitable indexes for those tables.

    If you are going to use a join, any type of join, those fields should always be indexed; otherwise, even on small datasets your program's going to crawl.

    Now, the disadvantages show up when writing to tables with indexes. Here is where indexes work against you, because they have to be inserted and updated as well. Some hints on indexing and loading tables follow, in no particular order, just off the top of my head.

    Bulk loading is faster and better on indexed tables. Use LOAD DATA instead of INSERT whenever you can. Index flushing happens less often and the server needs to parse and interpret one statement, not several.

    If you must use INSERT then use the form that allows multiple rows to be specified in a single insert statement.
    INSERT into whois_e VALUES(...), (...),(...), (...).....;
    The more rows you can specify in the statement the better.
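    In DBI terms, a batched insert might be shaped something like this (a sketch only - the table, columns and batch size are made up, and $dbh is assumed to be your existing handle):

    my @pending;
    my $batch_size = 500;    # arbitrary

    sub flush_batch {
        return unless @pending;
        my $placeholders = join ', ', ('(?, ?, ?)') x @pending;
        $dbh->do("INSERT INTO achroms (ac_id, length, osome) VALUES $placeholders",
                 undef, map { @$_ } @pending);
        @pending = ();
    }

    # inside the record loop:
    push @pending, [ $ac_id, $length, $osome ];
    flush_batch() if @pending >= $batch_size;

    # after the loop:
    flush_batch();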

    One trick is to create the table, load all your data, and then add the indexes; this comes in handy. If you can't do that, drop the indexes, load the data and then recreate them. Do this only if you can separate your program into a single load phase followed by a query phase. Since your program is structured the way it is, this could be an option; on other programs it may not be, and probably won't be, because other users are on the database at the same time, and you'll get a lot of phone calls from angry accountants and beavers. The main question to ask yourself is "can I do this only once?" If the answer is no, then don't do it - it will slow your program down.

    Hope this is of some use to you... there's a lot of good stuff in this node and some of it I'm writing down and playing with, so thanks to all the rest of you.

    That's my two bits... what?... I'm over extended?.. damn.

    ---UPDATE --

    Something I just read seemed applicable, so I thought I would post it here. According to the O'Reilly Mastering Algorithms with Perl book (page 26): "Efficiency Tip: Hashes Versus Arrays: It's about 30% faster to store data in an array than in a hash. It's about 20% faster to retrieve data from an array than from a hash." -- end update --

    Glenn H.
Re: Up for Critique
by admiralh (Sexton) on Mar 23, 2002 at 04:40 UTC
    The first thing I would suggest is running 'monitor' while the program is running. This should give you an idea whether the time is being spent in your Perl script or in the database, which would help you focus your optimization effort.

    I worked with bioinformatic data professionally myself, and the way I would typically approach these types of problems is to pull all the data I could into memory first (hashes work wonders), process the results in memory, and then do the table inserts and updates (updates first). You want to do multiple updates and inserts in a single transaction block, but be sure to commit every so often so your rollback space doesn't fill up.

    Here are a few coding comments I have:

    1. Use $INPUT_RECORD_SEPARATOR instead of $/
    2. We can eliminate a lot of lexical scoping in the parse routines, since m// in list context returns the substrings matched by the capturing parentheses. If there is no match, the list item will be undef.
      # Get type.  Right now, all are BAC's.
      if ($line =~ m/\b((B|Y)AC)\b/) {
          $type = $1;
      }

      # Get reference number.
      if ($line =~ m/BAC:\s*(.*)\s*$/) {
          $ac_id = $1;
      }

      # Get whether forward or reverse
      if ($line =~ m/ORIENTATION:\s*(\w)_(\d)\s*$/) {
          $orient = $1;
          $read_frame = $2;
      }
      becomes
      # Get type.  Right now, all are BAC's.
      ($type) = $line =~ m/\b([BY]AC)\b/;                             # regex change

      # Get reference number.
      ($ac_id) = $line =~ m/BAC:\s*(.*)\s*$/;

      # Get whether forward or reverse
      ($orient, $read_frame) = $line =~ m/ORIENTATION:\s*(\w)_(\d)/;  # regex change
    3. Watch all the *'s (especially .*'s) in regexes; they may not all be necessary.
    I think everything else has been said already. I'd like to reiterate the utter stupidity of the CS department not allowing you to take their classes.
      1. Use $INPUT_RECORD_SEPARATOR instead of $/

      I was wondering why you thought this was a good idea. IMO the aliases from English.pm can end up making your code harder to follow - purely because they are used so rarely and most people are far more used to using the punctuation variables.

      Also there used to be a performance problem with use English. It made Perl think that you'd used $`, $& and $'. This would slow down all regex matches in your code. I think this has been fixed in recent Perls, but I'm not sure exactly when (or to what extent).

      --
      <http://www.dave.org.uk>

      "The first rule of Perl club is you do not talk about Perl club."
      -- Chip Salzenberg

        Because I was not familiar with $/ and had to look it up. :-)

        Seriously, since this code was written to be looked at by other (non-Perl) types, I figured the long name would help.

        Now if it seriously causes performance problems, then by all means use $/, but comment what you are doing. And using local to scope the change is a good idea.

        BTW, that was my first post to Perl Monks.

      Thanks for the comments admiralh. I must have missed the part about $INPUT_RECORD_SEPARATOR in Programming Perl. It is, after all, described on one of the most evil of pages (3rd ed.). ;)

      Question: Is 'monitor' similar to the *nix shell command 'time' ? I just discovered that one.

      -Jenn

        Yes, monitor is similar to time. If you want to get really blood-and-guts, try strace, which traces system calls. Often when there is a bottleneck it can aid detection (quickly).
Re: Up for Critique
by Anonymous Monk on Mar 23, 2002 at 03:12 UTC

    One wild suggestion ;-): Don't underestimate the speed of tied DB hashes. As an entry example: $hash{"$nodename"}=$data, $hash{"$nodename".'description'}=$description, $hash{"$nodename".'date'}=$date (if you don't need to split upon retrieval), $hash{"$nodename".'date'.'month'}=$datemonth (saves a split upon retrieval), etc. By implementing a pseudo Java-like namespace scheme you can make a very fast data entry and retrieval scheme. Not tested on this amount of data, but it certainly tests faster on around 10,000 records with many entries (scaled down from your figures). There are, as ever, downsides, such as having to write your own retrieval routines and then making them interoperable, or file locking problems.
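    For what it's worth, a minimal sketch of a tied on-disk hash with DB_File, using the flattened "namespace" keys described above (the file name, keys and values here are invented for illustration):

    use Fcntl;
    use DB_File;

    tie my %gene, 'DB_File', 'genes.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie genes.db: $!";

    my ($nodename, $description, $datemonth) = ('AT1G01010', 'hypothetical gene', '03');

    $gene{$nodename}                      = 'raw record data';
    $gene{$nodename . 'description'}      = $description;
    $gene{$nodename . 'date' . 'month'}   = $datemonth;   # saves a split upon retrieval

    untie %gene;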

    I believe newer versions of the Berkeley DB (BSD-DB) have transaction support (if your comp dept supports them - they are not free). However, you can use AnyDBM_File if multiple platform coverage is essential.

    Plus all of the above (and possibly what comes below too).

    HLA-B?? rocks!

Re: Up for Critique
by davorg (Chancellor) on Mar 23, 2002 at 17:01 UTC

    In a couple of places you change the value of $/ to make it easier to read in your input records. In general when changing any of Perl's "special" variables, it's a good idea to get into the habit of localising that change as much as possible. For example:

    {   # start a new code block
        local $/ = "-----";

        # do whatever processing you need to do with the
        # changed value in $/

    }   # End of block, $/ reverts to previous value

    With the change localised like that you don't run the risk of forgetting what value is in the variable later on in your program. This is particularly important when you're writing modules for other people to use, as you don't want to be changing the value of these variables in someone else's code :)

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

When in doubt, sort
by Thelonius (Priest) on Mar 23, 2002 at 03:46 UTC
    A generally good strategy for a load is to parse the data, writing the data out into one file for each table. Then sort the data in each file on the field(s) that you are doing your SELECT on.

    If the database is empty before your run (it's not clear from your description), then you don't need to index until you are done. Any duplicates will be in consecutive records after you sort, so you will know when there are duplicates.
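    As a rough sketch of that duplicate handling (the file name and key position here are invented; the real per-table files would come from your parse step):

    open my $in, '<', 'achroms.sorted' or die "Cannot open achroms.sorted: $!";

    my $prev_key = '';
    while (my $line = <$in>) {
        chomp $line;
        my ($key) = split /\t/, $line;   # assume the key is the first tab-separated field
        next if $key eq $prev_key;       # consecutive duplicate: skip it
        $prev_key = $key;
        # ... write the row to the bulk-load file, or INSERT it, here ...
    }
    close $in;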

Re: Up for Critique
by dws (Chancellor) on Mar 23, 2002 at 20:43 UTC
    A quick back-of-the-envelope shows that you're averaging about half a record per second - roughly two seconds to process each record.

    I agree that a big chunk of time is going to database overhead. Assuming you have fields correctly indexed, I can think of two things that might help:

    1. MySQL provides a DELAYED option for INSERT (documented in the MySQL manual), which makes the INSERT faster by delaying disk writes. This can be risky. If your script is the only one adding to the database, and you're reasonably assured that the machine won't crash mid-run, try it.

    2. Look for opportunities to preload data for testing, rather than querying for data at runtime. Tests like   SELECT ac_id FROM achroms WHERE ac_id = ? can be done faster in memory. Your start-up time will take a hit, but you can amortize that hit over lots of records, and eventually you'll be ahead of the game.
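      A hedged sketch of that preloading (the SELECT is the one quoted above; %seen_ac_id, $sth_insert_achrom and the extra fields are invented stand-ins):

      my $ids = $dbh->selectcol_arrayref('SELECT ac_id FROM achroms');
      my %seen_ac_id = map { $_ => 1 } @$ids;

      # later, inside the record loop:
      unless ($seen_ac_id{$ac_id}) {
          $sth_insert_achrom->execute($ac_id, @other_achrom_fields);
          $seen_ac_id{$ac_id} = 1;    # catch repeats within the input file too
      }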

Re: Up for Critique
by jpm (Novice) on Mar 23, 2002 at 17:16 UTC
    I'm not a perl expert, so please forgive me (you and the Monks) if this is a silly suggestion, but in your position I'd consider doing it all in memory. You said 26,000 genes, so how many nucleotides per gene - could it be around 1,000, or is that way too low? If 1000, this gives only 26M nucleotides.

    If we use a character per nucleotide, this is only 26M bytes. If these figures are low, we could use 2 bits per nucleotide (as there are 4 different ones), a factor-of-4 saving over using whole bytes (you'd have to write pack and unpack functions).
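    If you went the 2-bits-per-nucleotide route, vec() keeps the pack/unpack functions fairly small. A sketch (the encoding table is arbitrary):

    my %code   = (A => 0, C => 1, G => 2, T => 3);
    my @decode = qw(A C G T);

    sub pack_seq {
        my ($seq)  = @_;
        my $packed = '';
        my @bases  = split //, uc $seq;
        vec($packed, $_, 2) = $code{ $bases[$_] } for 0 .. $#bases;
        return $packed;
    }

    sub unpack_seq {
        my ($packed, $len) = @_;
        return join '', map { $decode[ vec($packed, $_, 2) ] } 0 .. $len - 1;
    }

    print unpack_seq( pack_seq('ACGTTGCA'), 8 ), "\n";   # prints ACGTTGCA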

    So I'd download the data from the database(s), do my processing in memory, store any results back, and display the results on some kind of report.

    Please contact me if you want to get into details (non-perl related!).

Experiment!
by Thelonius (Priest) on Mar 23, 2002 at 23:20 UTC
    The other thing I was going to say last night, but was too tired to remember, was that you can do some simple experiments.

    First, make a copy of your program without any database calls and run it. It should run in a matter of minutes. That will confirm whether it is really the database part that is slowing things down.

    Another quick experiment is to create a database without indexes, and create a version of your program that does only INSERTs, no SELECTs or UPDATEs. If this runs fast, then you should go ahead and use my sort suggestion, above.

    Besides the sorting idea, there are some MySQL-specific things you can look into. There's a (non-standard) REPLACE statement that I think you can use instead of using SELECT followed by UPDATE or INSERT.

    Depending on version, there is also a LOAD DATA INFILE statement.
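    For the record, a sketch of what those two look like through DBI (table and column names assumed, not from the actual schema):

    # REPLACE acts like INSERT but first deletes any existing row with the
    # same unique key, so it can stand in for SELECT-then-UPDATE-or-INSERT.
    $dbh->do('REPLACE INTO achroms (ac_id, length, osome) VALUES (?, ?, ?)',
             undef, $ac_id, $length, $osome);

    # LOAD DATA INFILE bulk-loads a tab-separated file in one statement.
    $dbh->do(q{LOAD DATA INFILE '/tmp/achroms.tab' INTO TABLE achroms});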

Re: Up for Critique
by Anonymous Monk on Mar 23, 2002 at 05:06 UTC
    What sort of machine are you running it on? You may get a significant performance increase just by spending $100 on some more RAM or a faster processor.

    It's often a lot cheaper, if you control the server, to upgrade the machine than to spend weeks squeezing better performance out of an older one.
