DBI speed up needed on MySQL
http://qs321.pair.com?node_id=454953

jacques has asked for the wisdom of the Perl Monks concerning the following question:

This DBI statement is taking too long and I need to speed things up:
my $sth = $dbh->prepare("select * from campaigns
                         where mw_export='yes' and last_update > ?");
$sth->execute(last_transfer{'campaigns'});
while (my $result = $sth->fetchrow_hashref()) {
    print "It takes me over 30 minutes to print this!!!\n";
}
I have about 40,000 rows with each row containing about 200 fields. My program hangs for a loooooong time. So I know I have a few options:

1. Use indexes.
2. Use fetchrow_arrayref instead of fetchrow_hashref.

I created two indexes but didn't notice any speed improvement:

alter table campaigns add index mw_export (mw_export);
alter table campaigns add index last_update (last_update);
Those are the SQL statements that I used. Should I create a different type of index?

The other option is to use fetchrow_arrayref. But I *need* to have $result as a hash ref. Would it be okay to convert the array ref to a hash ref? What would be the best way to do that?

I heard that fetchrow_arrayref is way faster than its hashref counterpart.

Any other things that I could try?

Re: DBI speed up needed on MySQL
by dbwiz (Curate) on May 08, 2005 at 06:34 UTC

    There are two Tutorials that you should look at.

    • DBI Recipes tells you what you can do, and gives you some interesting tricks about using hashes efficiently.
    • Speeding up the DBI tells you how to find which part of your program needs improvement, and several ways of doing it.

    HTH

Re: DBI speed up needed on MySQL
by Ovid (Cardinal) on May 08, 2005 at 16:03 UTC

    There's not enough information to properly fix your problem, but here are a few thoughts. I guess the most obvious question is about your use of "SELECT *". Do you really need to select all of those fields? That's 8 million fields that you are selecting. (Personally, I'm rather suspicious of any table that requires 200 fields; it suggests to me that it's not designed properly.) And what are you printing to? Are you sending the data to a socket, to STDOUT, to a file? Those can all have significant performance issues. If you're printing to STDOUT, try printing to a file and see how much faster it is.

    And yes, fetchrow_arrayref is faster. Perhaps you can use that and a hash slice? I haven't benchmarked that, so I can't comment on the performance aspect. However, you say that you need to have $result as a hashref. Why?
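    If you do go that route, here's a minimal sketch of the idea (DBI's $sth->{NAME_lc} attribute supplies the column names; the loop body is a placeholder):

        my $names = $sth->{NAME_lc};            # arrayref of lowercased column names
        while ( my $row = $sth->fetchrow_arrayref ) {
            my %result;
            @result{@$names} = @$row;           # hash slice: names => values
            # ... use %result as you would the old hashref ...
        }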

    Last question: are you sure those indexes are useful? The SQL statements that you use can suggest indexes you might need, but the data is important, too. Not every field will benefit from an index. Also, have you used a profiler to find out where your SQL is really bottlenecking? (I assume MySQL will let you see query plans.) The statement itself seems pretty simple, but whether you're IO-bound or CPU-bound could make a difference. Talk to a MySQL DBA about this issue. A good DBA is worth their weight in gold.

    Cheers,
    Ovid

    New address of my CGI Course.

      To build on what Ovid said, you might want to look to see how much of the time taken was from MySQL. Dump the SQL that Perl is using for its query, and then run that in the mysql client. If that takes a significant time, you might take a look at EXPLAIN. It's also necessary to ANALYZE your tables when they change significantly, so that the database can determine if the indexes are going to be a benefit or not. (It can be faster not to use an index unless you're throwing away the majority of the records, as you require more table reads... for Oracle, I've heard that number is between 5 and 20% of the records kept... so if you're throwing away less than 4/5 of the records, a full table scan may be the way to go.)
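      For example, here's a quick way to eyeball the plan from Perl rather than the mysql client (a sketch; $since stands in for your bind value):

          my $plan = $dbh->selectall_arrayref(
              "EXPLAIN SELECT * FROM campaigns
                WHERE mw_export = 'yes' AND last_update > ?",
              { Slice => {} },
              $since,
          );
          for my $step (@$plan) {
              printf "table=%s key=%s rows=%s\n",
                  $step->{table}, ($step->{key} || 'NONE'), $step->{rows};
          }

      If the key column comes back NONE, your indexes aren't being used at all.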

      Memory sizing of the system can also be important -- out of the box, mysql uses very little memory (in my opinion). If you have a 2GB dedicated server, it does no good if mysql is told to only use 256MB. (it might be 384MB as the default...still rather small for some people).

      Anyway, if you find that the perl process is taking 30 minutes, but it's only taking 5 minutes for mysql to do its work, then you'll want to look at the Perl side of things. If it's taking 25 minutes in MySQL, then you'll want to look at adjusting the SQL statement, or tuning the database. If it's somewhere in between, well, pick one, and if that doesn't get enough time out to be acceptable, tune the other side, and keep going back and forth 'till it meets your goal time.
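      A rough sketch of how you might split that measurement from the Perl side, using Time::HiRes ($since again stands in for the bind value; the fetch loop deliberately does no processing):

          use Time::HiRes qw(gettimeofday tv_interval);

          my $t0 = [gettimeofday];
          $sth->execute($since);
          my $t1 = [gettimeofday];
          1 while $sth->fetchrow_arrayref;      # drain the rows, no processing
          my $t2 = [gettimeofday];
          printf "execute: %.2fs  fetch: %.2fs\n",
              tv_interval($t0, $t1), tv_interval($t1, $t2);

      If nearly all of the time lands in the execute (or the first fetch), dig on the database side first.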

Re: DBI speed up needed on MySQL
by thor (Priest) on May 08, 2005 at 03:51 UTC
    Do you really need all 200 columns when you select? If not, you could select only those that you do need and save yourself some bandwidth. Also, I don't know if it's faster than fetchrow_hashref, but I like to use fetch like so when I need access to the columns by name:

        my @fields = qw(foo bar baz);
        my %results;
        $sth->execute();
        $sth->bind_columns(\@results{@fields});
        while ( $sth->fetch() ) {
            # your results will be available in %results
        }
    Maybe I should benchmark it...

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

      Do you really need all 200 columns when you select?

      Yes, I do. :(

        Well, then you'll probably want to get the list of column names via a separate query, store those in an array (e.g. "@colnames"), then use the array to define the data query and then load the hash:
        my $sqlstr = "select " . join(",", @colnames) . " from sometable ...";
        my $sth = $dbh->prepare( $sqlstr );
        while ( $rowref = $sth->fetchrow_arrayref ) {
            @rowhash{@colnames} = @$rowref;
            ...
        }
        (untested, but it should give you a good-enough idea...; um, especially after the update to use "@$rowref" in the hash slice assignment instead of "$$rowref" which was wrong.)

        Since you're using mysql, you could get the list of column names by a query like "describe sometable" -- the column name will be the first column in each returned row. (There are other ways, but this is pretty simple.)
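        For instance (untested sketch, using the "sometable" name from above; the first column of each DESCRIBE row is the field name):

            my @colnames = map { $_->[0] }
                @{ $dbh->selectall_arrayref("describe sometable") };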

        Can you give a hint of how you can possibly come up with 200 columns in a design? I can't think of anything in life that has 200 aspects that can't be somehow factored out into some other subaspect hierarchy.

        For example, I recall the word "campaign" earlier in this thread (or maybe hallucinated it). If that's the case, and some of these column names are perhaps the names of each state, then you can turn the table sideways and create 50 separate rows per event in a one-to-many relationship instead of having 50 sets of columns (maybe 4 per thing?) to record the same thing.

        I dunno, but my gut tells me that ending up with 200 columns in an SQL table is a bit like ending up with 200 Perl scalar variables instead of an array. A symptom of that is a lot of variables with similar names differing only by some data-like aspect, like a serial number (1, 2, 3, 4) or a state (WA, OR, CA, ID, ...).

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.


        update: As I was chatting with a friend about this problem a bit later, I realized that the problem like this usually results from improperly taking "3 dimensional data" (values depend on 3 inputs) and mapping it to a 2 dimensional table naively: primary key column is one dimension, other columns represent a cross of the other two dimensions. The proper solution is to have two key columns represent two of the three dimensions, and the column names be the third (and hopefully the most diverse and consistent).
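        A hypothetical illustration of that fix, with all names invented: two key columns carry two of the dimensions, and ordinary columns carry the third, instead of one wide row with a column for every state/measure pair:

            $dbh->do(q{
                CREATE TABLE campaign_states (
                    campaign_id INT     NOT NULL,
                    state       CHAR(2) NOT NULL,
                    cost        DECIMAL(10,2),
                    votes       INT,
                    PRIMARY KEY (campaign_id, state)
                )
            });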
Re: DBI speed up needed on MySQL
by JamesNC (Chaplain) on May 08, 2005 at 12:54 UTC
    Yes, I would definitely use a different index. You need an index which has both columns. Try one of these. The second should be faster, but you are only allowed to have one clustered index per table, so use the clustered option on the index you use the most. Run one of these from the mysql command-line interface:
    CREATE INDEX exp_update ON campaigns ( mw_export, last_update )
    -- or this
    CREATE clustered INDEX cl_exp_update ON campaigns ( mw_export, last_update )

    JamesNC
      Thanks. I know that I should be using indexes and I was surprised that the ones I created didn't help out much.
Re: DBI speed up needed on MySQL
by CountZero (Bishop) on May 08, 2005 at 08:53 UTC
    print "It takes me over 30 minutes to print this!!!\n";

    Are you printing 40_000 records of 200 fields each to your screen? Or are you doing something else to them? Perhaps that's where the long processing time is?

    Do you save all the (processed) data in some variables/arrays/hashes? That could eat up your memory very fast and then your system starts swapping to disk and performance degrades.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: DBI speed up needed on MySQL
by bart (Canon) on May 08, 2005 at 09:50 UTC
    This being MySQL, I think you might have a memory issue. IIRC, MySQL is notorious for fetching all rows for a query before it even starts to return the first one. If that's the case, 40,000 rows of 200 columns is a lot of memory. So maybe your computer is just thrashing.

    But perhaps they fixed it in a newer Mysql version. Or maybe there's a setting to tweak Mysql, so that it doesn't do that.
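    There is, in fact, such a knob in DBD::mysql, if your version has it: the mysql_use_result attribute tells the driver to stream rows from the server instead of buffering the whole result set on the client. A sketch:

        my $sth = $dbh->prepare(
            "select * from campaigns where mw_export='yes' and last_update > ?",
            { mysql_use_result => 1 },
        );

    (The trade-off is that the connection and tables stay busy on the server until you've fetched the last row.)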

Re: DBI speed up needed on MySQL
by thor (Priest) on May 08, 2005 at 11:10 UTC
    As I was falling asleep last night, I thought of something else. My original reply would explain slowness for fetching each row, but not necessarily the first. The issue that you're seeing implies that the database engine is having a hard time locating the rows. However, with 40,000 rows, it should be able to tablescan the table in a few seconds, so there's something else going on here. There's another reply in this node that suggests that MySQL fetches the entire result set before returning the first row. If you can confirm that, then that might be your issue: you are returning a lot of data...

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

      MySQL fetches the entire result set before returning the first row

      That's what I suspect is happening, too.

Re: DBI speed up needed on MySQL
by Perimus (Novice) on May 09, 2005 at 09:14 UTC
    A hard issue to address with so little information, but I can offer a few tips that may help:

    * If you can, create a second table in your database that contains just your key column (I am assuming you have one) and the two fields you're basing your query on. You can even create this on the fly as a temporary table, which is never written to disk, and then query it. Use your array of key values to fetch all 200 fields one record at a time based on your key value, which should be indexed. (See the sketch after this list.)

    * Don't bother creating an index on mw_export, that will probably slow down your table's performance, especially your write performance (because on write/update the table updates all affected indexes). Indexes don't help on columns that just contain one or two possible values... like yes/no.

    * You didn't specify, but perhaps you're loading a big data file from another source, like a CSV, into a table before processing the records, and that's why the table design is so horrid? If this is the case, don't create the indexes on the table until after you've loaded the data into it. This way, it generates its indexes once, not every time you insert a record into the table.

    * When you *have* to build a table like this, it gets big. When a table is this size, with this many fields and records, very small optimizations make a large difference over time. Review all of your column types and make sure you are being as efficient as possible. Easy things to check are field sizes compared to data sizes; setting NOT NULL on columns that will not contain null values saves you storage and processing each time that value is accessed. Using CHAR instead of VARCHAR saves you a byte of storage and requires less processing by the server to pull that value; CHAR vs. VARCHAR depends a lot on your data and the engine you're using on that table. You can pick up a good comprehensive MySQL reference like "MySQL" by Paul DuBois, ISBN 0-7357-0921-1 (available online if you're a member of safari.oreilly.com). Find as many ways to optimize your columns and data as possible.

    * As you process your records one at a time, be sure to retain as little data as possible from the rest of the records in memory. If you're taking each record, storing its hash reference for later access, and moving on, you're forcing your script to keep the entire record set in memory. You want to load one record at a time and, when you're done with it, make sure you're letting the garbage collector ditch the data instead of holding onto it by keeping a reference to its hash defined somewhere.
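
    A sketch of the temporary-table idea from the first tip (the id key, the hot_campaigns name, and the $since bind value are all assumed; adjust to your schema):

        # Build a small, fast-to-scan list of matching keys first...
        $dbh->do(q{
            CREATE TEMPORARY TABLE hot_campaigns
            SELECT id FROM campaigns
             WHERE mw_export = 'yes' AND last_update > ?
        }, undef, $since);

        # ...then pull the full 200-column rows one key at a time.
        my $ids = $dbh->selectcol_arrayref("SELECT id FROM hot_campaigns");
        my $get = $dbh->prepare("SELECT * FROM campaigns WHERE id = ?");
        for my $id (@$ids) {
            $get->execute($id);
            my $row = $get->fetchrow_hashref;
            # ... process $row, then let it go out of scope ...
        }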

    good luck!
Re: DBI speed up needed on MySQL
by Jaap (Curate) on May 09, 2005 at 09:46 UTC
    What does last_transfer do?
    $sth->execute(last_transfer{'campaigns'});
    Or is that just missing a $?
Re: DBI speed up needed on MySQL
by erix (Prior) on May 09, 2005 at 19:41 UTC

    In case the problem is memory consumption plus thrashing, it may be worthwhile to batch-process your rows with a LIMIT clause in your SQL statement (and repeated execution, of course). Just get n rows at a time and process those before going on with the next batch. I guess n should be 1000 or 2500 or so (try and err ;)
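    Something along these lines, say (a sketch: the batch size, the id column used for stable ordering, and the process() handler are all placeholders):

        my $batch  = 1000;
        my $offset = 0;
        while (1) {
            my $rows = $dbh->selectall_arrayref(
                "select * from campaigns
                  where mw_export = 'yes' and last_update > ?
                  order by id
                  limit $offset, $batch",
                { Slice => {} },
                $since,
            );
            last unless @$rows;
            process($_) for @$rows;    # hypothetical per-row handler
            $offset += $batch;
        }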

    Or use fetchrow_arrayref() and get your row processing out of the way immediately.

    (Btw, have you found out anything yet about where the bottleneck is?)

      That's what I ended up doing. Just sending stuff a little at a time with the limit clause, but this is not ideal. Hopefully, under normal circumstances, my script would not handle tens of thousands of rows at once. For important testing purposes, we decided to send over all the rows (and that's when we see the huge slowdown). So far, the bottleneck appears to be what thor suggested: mysql fetches all of the data before returning the first row. The more I increase the limit, the longer it takes to process the first row. ...
Re: DBI speed up needed on MySQL
by stonecolddevin (Parson) on May 09, 2005 at 15:44 UTC
    May I recommend Class::DBI? It's object oriented, fast, relatively easy to use, and great for production purposes.
    meh.
      Perhaps I'm missing something, but how will adding another layer of abstraction to the equation help his speed problems? He's as close to the database as he can be without dropping into XS, and he's still having problems. Surely putting another module in between won't help...

      thor

      Feel the white light, the light within
      Be your own disciple, fan the sparks of will
      For all of us waiting, your kingdom will come

      CDBI is great. CDBI also slows apps down a great deal, usually, imho.
Re: DBI speed up needed on MySQL
by zgrim (Friar) on May 09, 2005 at 20:28 UTC
    On the iron and adminspotting side, I would recommend some reading on MySQL server tuning. Hints: key_buffer_size in the server config file, regular OPTIMIZE TABLE maintenance, memory resources, etc.
    On the coding side, I don't really think using arrayref over hashref would be much of an optimization, or this kind of trick. Still, using select * from table is kind of a PHP thing (sorry, I couldn't resist). :)
    Proper indexes and LIMIT statements will prolly speed things up if it is really too late for a proper database redesign.

    ... imho :)