http://qs321.pair.com?node_id=401728

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Monadelphous Monks,

In my quest for speed in data retrieval, I'm experimenting with both common and uncommon structures and routines, with surprising results. This has allowed an interesting comparison of different approaches using DBI, or not. I would appreciate critiques/ideas.

Background

I have a datafile of multiple-choice questionnaire results, in the form:

respondent_id: response_x, response_y, ...
There are ~20,000 respondents and 3,000 possible responses, each with its own ID, although the average respondent chose only about 1/5 of the possible responses. So the raw data looks like this:
respondent_1: 123, 456, 789, 23, 1574, ...
respondent_2: 65, 1893, 2853, 753, ...
etc.
The data is compiled off-line. Once compiled, it never changes, so the speed of inserts, updates, deletions, etc are of absolutely no concern.

What the data is used for is to make cross-referenced queries of the form

count all respondents who gave response x and response y
There may be hundreds of such queries to produce a single report. All possible values of x and y are equally likely to be searched on, and there are millions of possible permutations and combinations, so there's no opportunity to pre-generate reports - they have to be generated on the fly. The speed of SELECTs, therefore, is of paramount importance. Hence these experiments.

The Experiments

I've tried various mysql schemas, populated them with the same test data, and then benchmarked standard queries. The test data is the full 20,000 respondents x average 600 responses per respondent, so it accurately reflects the real application.

db version structure
DB 0

the obvious respondent-per-row, column-per-response_id format

CREATE TABLE theTable (respondent, response_1, response_2, ... response_3000)

where all response columns are tinyint(1) and filled with 1 or 0, produces a 20,000 x 3,000 table.

   
DB 1

A single two-column table with compound index

CREATE TABLE `theTable` (
  `response`   smallint(5) unsigned NOT NULL default '0',
  `respondent` smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (`response`, `respondent`)
)

requiring a row for each response of each respondent = ~ 12-million rows

DB 2

DB 1 with keys added to each column

CREATE TABLE `theTable` (
  `respondent` smallint(5) unsigned NOT NULL default '0',
  `response`   smallint(5) unsigned NOT NULL default '0',
  PRIMARY KEY  (`respondent`, `response`),
  KEY `respondent` (`respondent`),
  KEY `response`   (`response`)
)
DB 3

Getting creative - trying to minimize the amount of data sifting per query - tried one table per response id as an inverted index. That is, each table is a list of respondents who gave that response:

for $i (1 .. $num_responses) {
    $sql = "CREATE TABLE r".$i." (respondent smallint unsigned, PRIMARY KEY (respondent))";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
    $sth->execute() or die("Could not execute!" . $sth->errstr);
}
DB 4

Took DB 3 a step further. Wondering about all this database and DBI overhead, but still interested in minimizing the data that has to be loaded and manipulated per query, I fell back on really old basics: forgot about mysql and just created one text file for each table in DB 3. The idea is to load two small text files into arrays and find their intersection. So the "create" routine looks like:

for $i (1 .. $num_responses) {
    $file = ">response_".$i;
    open(FILE, $file) or dienice("cannot open file : $file $!");
}

Benchmarking

I put all the mysql databases through two tests:

  1. A standard SQL query benchmark using JOINS where necessary, like this:
    $sql = "appropriate select statement";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
    $start_2 = new Benchmark;
    for $i (0 .. $iterations) {
        $sth->execute() or die("Could not execute 1!" . $dbh->errstr);
    }
    $count  = $sth->fetchrow_array();
    $end_2  = new Benchmark;
    $diff_2 = timediff($end_2, $start_2);
    Where the Selects were as follows:
    db version   query
    DB 0         SELECT COUNT(*) FROM theTable WHERE (r_x = 1 AND r_y = 1)
    DB 1         SELECT COUNT(*) FROM theTable A JOIN theTable B ON (A.respondent = B.respondent) WHERE A.response = ? AND B.response = ?
    DB 2         Same as DB 1
    DB 3         SELECT COUNT(*) FROM t_x JOIN t_y ON (t_x.respondent = t_y.respondent)
  2. Separate queries for each response_id, then finding the intersection of the results in Perl, like this:
    db version Perl Intersection Method
    DB 0
    $sql = "appropriate select for response_id_1";
    $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
    $sql = "appropriate select for response_id_2";
    $stm = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
    $start_1 = new Benchmark;
    for $i (0 .. $iterations) {
        $sth->execute() or die("Could not execute 1!" . $dbh->errstr);
        while ($data = $sth->fetchrow_array()) {
            $union{$data} = undef;
        }
        $stm->execute() or die("Could not execute 2!" . $dbh->errstr);
        while ($data = $stm->fetchrow_array()) {
            $isect{$data} = undef if exists $union{$data};
        }
        @isect = keys %isect;
    }
    $end_1 = new Benchmark;
    $diff  = timediff($end_1, $start_1);
    DB 1
    DB 2
    DB 3

And I put the simple text-file "database", DB 4, through a similar array-intersection benchmark routine:

$start_1 = new Benchmark;
for $i (0 .. $iterations) {
    $table = "<".$db_dir."/r".$response_id;
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $union{$data} = undef;
    }
    close(FILE);
    $table = "<".$db_dir."/r6000";
    open(FILE, $table) or dienice("cannot open file : $table $!");
    while ($data = <FILE>) {
        $isect{$data} = undef if exists $union{$data};
    }
    close(FILE);
    @isect = keys %isect;
    $response_id++;
}
$end_1 = new Benchmark;
$diff  = timediff($end_1, $start_1);

Note especially the "$response_id++;" to ensure that I'm retrieving a new file each iteration to be sure I'm not just working in cache.

Here are the benchmark results. These are the average times, in seconds, for executing the select plus the join/intersection needed to produce the sought-after count, after repeating the benchmarks ~20 times over two days with iterations varying from 50 to 100:

db version SQL Method Perl Intersection Method
DB 0 0.25 1.0
DB 1 0.13 0.14
DB 2 0.25 0.25
DB 3 0.08 0.12
DB 4 N/A 0.03

So here's my conundrum. If I've done everything right, then for this specific application a plain old text-file system is outperforming every version of mysql w/ DBI that I tried - by a factor of 10 compared to the "standard" structures.

I don't want to abandon mysql & DBI, but for this specific application, where select speed is everything, it's tempting. I guess what I'm looking for from my fellow monks - those intrepid enough to have read down this far - is this: am I totally out to lunch? Have I done something really stupid to arrive at these results? Or do these results make sense, and is it just accepted that we all knowingly give up speed every day in favour of the convenience of SQL and DBI?

Janitored by Arunbear - added readmore tags, as per Monastery guidelines

Replies are listed 'Best First'.
Re: Basic Perl trumps DBI? Or my poor DB design?
by thospel (Hermit) on Oct 22, 2004 at 23:58 UTC
    Why are you surprised? Databases never promised to be fast...

    The point of a database is manageability of the data and maintainability. Things like ACID properties, remote access, simultaneous access, backups, etc. All the sophisticated stuff like memory caching, writeback, indices, logfiles etc. has already been done for you.

    But in the end, there is *no magic*. A database also gets its data from files (or cached memory) and then does a data exchange with you over some form of IPC. For any specific case you can almost certainly hand-roll a version that's faster, because you can skip many steps needed for the powerful generic version. Even with a relatively slow language like Perl it can often be done faster. Even for rather sophisticated setups you can often beat an SQL database by simply using DBM cleverly.
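    Purely as an illustration of that last point (nothing here is from the original post - the file name, the packed-list format, and the helper are all invented), a DBM version of the response lookup might be sketched like this:

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # One DB_File hash, keyed by response id, whose value is a packed list of
    # the respondent ids who gave that response.
    tie my %by_response, 'DB_File', 'responses.db', O_RDONLY, 0644, $DB_HASH
        or die "responses.db: $!";

    # Count respondents who gave both response $x and response $y.
    sub count_both {
        my ($x, $y) = @_;
        my %seen = map { $_ => 1 } unpack 'N*', ($by_response{$x} || '');
        return scalar grep { $seen{$_} } unpack 'N*', ($by_response{$y} || '');
    }

    print count_both(123, 456), "\n";
    untie %by_response;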

    But then, speed was never the point. Avoiding one off solutions for everything is.

    (None of which implies that either the DBI version or the pure Perl version couldn't be improved further.)

Re: Basic Perl trumps DBI? Or my poor DB design?
by BrowserUk (Patriarch) on Oct 23, 2004 at 00:25 UTC

    For a single-tasking, (ie. no concurrency issues) application, a well-written, non-DB (RDBMS) solution will always be able to out-perform an RDBMS solution for any given task.

    Simple logic. Whatever steps are required to produce a given set of results, all those steps still need to happen regardless of whether you code them yourself, or whether they are performed by the RDBMS on your behalf.

    When you code them yourself, you can optimise those steps specifically to your requirements.

    When the equivalent steps are performed by the DB, they will be optimised for the general case. In addition, the RDBMS will also have the overhead of:

    1. Translating SQL (or another data manipulation language) into its internal form
    2. Optimising the query.
    3. Communications
    4. Locking
    5. Security

    However, performance is not the primary reason for storing your data in an RDBMS--the things you get in exchange for all of those overheads are.

    Once you store your data in an RDBMS, when a second application comes along that needs to access the same data in a different way, or combine it with other data:

    1. You no longer have to write the code to store, access, search, format or update that data.
    2. You no longer have to optimise that code.
    3. You no longer have to implement, test, debug your own bespoke communications protocols to retrieve your data from remote locations.
    4. You no longer have to implement, test, debug your own locking to achieve concurrency.
    5. You no longer have to implement, test, debug or administer your own security protocols.

    These are all very good reasons for using an RDBMS. It's not all sweetness and light. SQL is a pain, normalising data to allow for efficient access for conflicting uses is a (black) art form. Database optimisation is hard, simply because you do not have full control, and there are many conflicting issues that need to be considered.

    The very worst reason for considering moving data into an RDBMS is ultimate performance. That's not to say that achieving a reasonable performance isn't possible with an RDBMS. Nor even that, given the amount of effort expended by generally very clever people, over long periods of time, to develop, refine and tune RDBMS performance, that it won't do a better job, more quickly, than most of us will be able to achieve at our first (or even second or third) attempts at a given problem.

    It just means that you cannot have all of the additional benefits that you get from using them, without paying some price; and that price is a reduction in ultimate performance.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      For a single-tasking, (ie. no concurrency issues) application, a well-written, non-DB (RDBMS) solution will always be able to out-perform an RDBMS solution for any given task.

      Not necessarily. In oversimplified theory you would be correct, but what if you have terabytes or exabytes of data to search?

      You make the inaccurate assumption that DBMS == RDBMS, which is not true. It may be that 90% or more of all databases are built around the relational model, but there are plenty that are not. Consider data warehousing, an often misunderstood concept: RDBMS solutions rarely fit the needs of data warehousing. What about an ODBMS (object database management system) such as Poet?

      Jason L. Froebe

      No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

        No over simplified theory. The volume of data is irrelevant. A "well-written" application wouldn't search, it would index.

        That is no inaccurate assumption, it is a deliberate limiting of scope of my assertion.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Basic Perl trumps DBI? Or my poor DB design?
by tilly (Archbishop) on Oct 23, 2004 at 00:30 UTC
    You should add an index on the response column of DB 2. Alternately, try reversing the primary key, making it response/respondent.

    Think about how the database has to execute the query. It needs to select all rows with one response and match them up, by respondent, with all rows with another response. How can the database search for all rows with one response? It can either walk the whole table, or walk the index looking at every respondent to see if it has that response. Both involve essentially searching the whole table for each response.

    I don't think that it will get you to where you want to be. But it should get you from DB 2's current performance to something close to DB 3's performance with a lot less mess.
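    In case it helps, the two alternatives above might look something like this - hypothetical ALTER statements against the DB 2 schema, with names chosen for illustration only:

    # assuming $dbh is the connected handle from the benchmarks above

    # add a standalone index on response ...
    $dbh->do('ALTER TABLE theTable ADD INDEX idx_response (response)');

    # ... or rebuild the compound primary key the other way around
    $dbh->do('ALTER TABLE theTable DROP PRIMARY KEY, ADD PRIMARY KEY (response, respondent)');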

    You should also be able to find some MySQL tuning parameters (look for things like how much shared memory it uses) that improve its performance more.

    But still thospel remains correct. Relational databases involve a lot of overhead, and can be beaten on performance. What they win on is simplicity, maintainability, and the fact that it is easier to extend and modify a complex query than a complex piece of carefully optimized code. If those wins are not important to you, and (after careful benchmarking) you find that you can beat the database, the database is not always the best choice.

Re: Basic Perl trumps DBI? Or my poor DB design?
by perrin (Chancellor) on Oct 23, 2004 at 01:58 UTC
    DB 2 should be many times faster than DB 1. If it isn't, you're doing something wrong. For one thing, I see you put placeholders in the SQL but don't seem to be passing it any values. Also, the SQL is very different between these: for DB 1 and DB 2 you are supplying values for response, but on the others you just join on respondent.

    I strongly recommend that you learn how to use the EXPLAIN command in MySQL to determine if your indexes are correct. A correctly indexed table is the best way to approach this problem.
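    As a sketch (the bind values here are arbitrary, and the query is the DB 1/DB 2 join from above), EXPLAIN can be run through DBI like any other statement:

    # assuming $dbh is the connected handle from the benchmarks above
    my $plan = $dbh->selectall_arrayref(
        'EXPLAIN SELECT COUNT(*)
           FROM theTable A JOIN theTable B ON (A.respondent = B.respondent)
          WHERE A.response = ? AND B.response = ?',
        undef, 123, 456,
    );
    # Watch the "key" and "rows" columns: if key is NULL, or rows is close to
    # the table size, the index is not being used the way you hoped.
    print join("\t", map { defined $_ ? $_ : 'NULL' } @$_), "\n" for @$plan;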

      "For one thing, I see you put in placeholders for the SQL but don't seem to be passing it any values."

      This is quite strange, actually. I have never used MySQL, but I would expect it to throw an SQL error, just as Oracle does, when not all variables are bound. I think we are not getting all the facts; the actual code used for testing might be a little bit different.

        Yes, sorry 'bout that - the basic benchmark routine was adjusted for the needs of each sql statement, but I thought my post was long-winded enough. Can always count on Monks' eye for detail....
Re: Basic Perl trumps DBI? Or my poor DB design?
by lhoward (Vicar) on Oct 23, 2004 at 04:49 UTC
    If you have enough memory, you could write a daemon process that would hold the entire dataset in memory. Other processes could communicate with the daemon process to ask questions. Have the daemon process periodically re-read the data from disk to bring in the changes. That way you could get even higher performance than the custom flat-file approaches by avoiding all disk IO.

    L

      If you have enough memory, you could write a daemon process that would hold the entire dataset in memory.

      Ok, I'll bite. How can you do this? Is there a basic template/framework that just opens a socket and talks back and forth between whoever's on the other side? I used to do this in C decades ago (sheesh), but it's alien to me in perl. (Well, I'd rather use a package that facilitates this, if one exists, rather than write it using raw code from scratch.)

      FWIW, my use of this would be extremely similar to the example posed by the originator of this thread. In my case, I have keywords associated with all my photos on my site, and I just want to optimize the way users search for images based on those keywords. I have the code all written and it works fine, technically, but I hate the fact that the entire textfile-DB has to be loaded and parsed every time, just to spew out answers and then close up shop... and to do this for EVERY search. It seems that the best thing to do is have a daemon do the open, and then perform searches (based on Perl hashes) for each query. Any potential drawback with this?

        In your case I'd use something like a SQL database with DBIx::TextIndex or Plucene. The main reason I recommended a daemon for him is because it could be highly optimized for the types of relationships he's modeling.

        But if you're going down the persistent network daemon route, I'd start with something like Net::Daemon. It's a very nice framework to start building on.
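        Just to show the shape of the idea - this is not the Net::Daemon API, but a bare IO::Socket::INET sketch, and the port, the one-query-per-line protocol and the loader stub are all invented:

        use strict;
        use warnings;
        use IO::Socket::INET;

        my %respondents;    # response_id => { respondent_id => 1, ... }
        load_data();        # slurp the flat files into memory once, at startup

        my $server = IO::Socket::INET->new(
            LocalPort => 9999,
            Listen    => 5,
            ReuseAddr => 1,
        ) or die "listen: $!";

        while (my $client = $server->accept) {
            while (my $line = <$client>) {
                my ($x, $y) = split ' ', $line;     # "response_x response_y\n"
                my $count = grep { $respondents{$y}{$_} }
                            keys %{ $respondents{$x} || {} };
                print $client "$count\n";
            }
            close $client;
        }

        sub load_data {
            # stub: read each "r<N>" file into $respondents{N}, as in the DB 4 layout
        }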

        L

Re: Basic Perl trumps DBI? Or my poor DB design?
by castaway (Parson) on Oct 23, 2004 at 07:20 UTC
    I'd also advocate, before you come to the conclusion that databases are ultimately slow (and they are, for reasons other people have mentioned in this thread), that you try a database other than MySQL. Oracle and DB2 both have downloadable versions, for example. One reason at least: MySQL keeps its tables in files, one (or more?) per table; DB2, OTOH, keeps whole groups of tables (tablespaces) in one file. There are probably other reasons too. Maybe even trying SQLite will get you different results.

    C.

Re: Basic Perl trumps DBI? (Forget DBs!(for this application))
by BrowserUk (Patriarch) on Oct 24, 2004 at 13:22 UTC

    With your answers being simply yes or no, and the size of your dataset, the whole process can be performed entirely within memory and in a constant time per question.

    For your dataset 20,000 x 3,000, that requires 60 MB of ram and will determine the answer to any question in an average of about 250 milliseconds.

    If both the number of respondents and the number of questions doubled, 40,000 x 6,000, this would rise to just 240 MB and 1/2 a second.

    In terms of scalability, a 2GB memory machine should be able to handle 200,000 respondents x 10,000 questions, or well over half a million respondents at 3,000 questions apiece, and process each question (on my 2GHz machine) at the rate of 250/millisecond, or a quarter of a million responses per second! (See 402390 for my excuses for this mistake.)

    Please note this is for any question (within the limited framework I have code for), no matter how complex. Eg.

    • 01. How many respondents answered question 0
    • 02. How many respondents did not answer question 0
    • 03. How many respondents answered question 10
    • 04. How many respondents answered question 100
    • 06. How many respondents answered 1000, 2000 but not 2500?
    • 07. How many respondents answered every 10th question?
    • 08. How many respondents answered question 3000?
    • 09. How many respondents answered 123, 456, or 789 but not 135, 246, 357 or 468?
    • 10. How many respondents answered all of 1000, 1010, 1020 but none of 2000, 2010, 2020?

    To do this, I encode each respondent's answers positionally in a string: if they answered yes to question 65, I set byte 64 (counting from zero) of a 3,000-character string of 0's to 1, and the same for all the others. The response strings are stored in a simple array. 20,000 strings of 3,000 bytes requires almost exactly 60 MB.

    The question is then encoded in a similar way.

    I've coded for questions of the form

    any( these ) and not any( those )
    all( these ) and not any( those )
    any( these ) and not all( those )
    all( these ) and not all( those )

    Where 'these' and 'those' can be a list of any number of question numbers.

    An empty all( ) for the 'these' part of the question effectively passes everything, and the count becomes entirely dependent upon the second (negative) part of the question, as with question 2.

    An empty any( ) for the second part makes the question entirely dependent on the first part, as with Qs 1, 3, 4 and 8.

    This could easily be extended, but is surprisingly flexible as is, allowing all the above questions to be answered.

    Uncommenting the line inside the loop allows you to obtain the IDs of the respondents matching the query, in addition to the counts. Building the array slows it down slightly.

    The code:

    #! perl -slw
    use strict;
    use Benchmark::Timer;

    my $T = new Benchmark::Timer;
    $|=1;

    our $RESPONDENTS ||= 20_000;
    our $QUESTIONS   ||= 3_000;

    sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

    sub makeMask {
        my $compiled = '0' x $QUESTIONS;
        substr $compiled, $_, 1, '1' for @_;
        return $compiled;
    }

    sub any {   ## ( set, mask ) = @_;
        return ( $_[ 0 ] & $_[ 1 ] ) != 0 ? 1 : 0;
    }

    sub all {   ## ( set, mask ) = @_;
        return ( $_[ 0 ] & $_[ 1 ] ) eq $_[ 1 ] ? 1 : 0;
    }

    my %func;
    @func{ 'any', 'all' } = ( \&any, \&all );

    my @data;
    push @data, rndStr( $QUESTIONS, (0)x1, 1 ) for 1 .. $RESPONDENTS;

    my $re_parse = qr[^ \s* ( (?ix-sm: any | all ) ) \( \s* ( [^)]+ ) \s* \) \s* $]x;

    my( %counts, %respondants );

    while( my $question = <DATA> ) {
        my( $iMethod, $iSet ) = <DATA> =~ $re_parse;
        my $inclusive = $1 ? makeMask( split ' ', $iSet ) : undef;

        my( $xMethod, $xSet ) = <DATA> =~ $re_parse;
        my $exclusive = $1 ? makeMask( split ' ', $xSet ) : undef;

        $T->start( $question );

        $func{ $iMethod }->( $_, $inclusive )
            and not $func{ $xMethod }->( $_, $exclusive )
            and $counts{ $question }++
    #       and push @{ $respondants{ $question } }, $_
            for @data;

        $T->stop( $question );
    }

    $T->report;

    print "$_: $counts{ $_ }" for sort keys %counts;

    __DATA__
    01. How many respodents answered question 0
    all( 0 )
    any( )
    02. How many respondents did not answer question 0
    all( )
    all(0)
    03. How many respodents answered question 10
    all( 10 )
    any( )
    04. How many respodents answered question 100
    all( 100 )
    any( )
    04. How many respondant answered questions 14, 57, 103?
    all( 14 57 103 )
    all( )
    06. How many respondent answered 1000, 2000 but not 2500?
    all( 1000 2000 )
    all( 2500 )
    07. How many respondents answered every 10th question?
    all( 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400 410 420 430 440 450 460 470 480 490 500 510 520 530 540 550 560 570 580 590 600 610 620 630 640 650 660 670 680 690 700 710 720 730 740 750 760 770 780 790 800 810 820 830 840 850 860 870 880 890 900 910 920 930 940 950 960 970 980 990 1000 1010 1020 1030 1040 1050 1060 1070 1080 1090 1100 1110 1120 1130 1140 1150 1160 1170 1180 1190 1200 1210 1220 1230 1240 1250 1260 1270 1280 1290 1300 1310 1320 1330 1340 1350 1360 1370 1380 1390 1400 1410 1420 1430 1440 1450 1460 1470 1480 1490 1500 1510 1520 1530 1540 1550 1560 1570 1580 1590 1600 1610 1620 1630 1640 1650 1660 1670 1680 1690 1700 1710 1720 1730 1740 1750 1760 1770 1780 1790 1800 1810 1820 1830 1840 1850 1860 1870 1880 1890 1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 2010 2020 2030 2040 2050 2060 2070 2080 2090 2100 2110 2120 2130 2140 2150 2160 2170 2180 2190 2200 2210 2220 2230 2240 2250 2260 2270 2280 2290 2300 2310 2320 2330 2340 2350 2360 2370 2380 2390 2400 2410 2420 2430 2440 2450 2460 2470 2480 2490 2500 2510 2520 2530 2540 2550 2560 2570 2580 2590 2600 2610 2620 2630 2640 2650 2660 2670 2680 2690 2700 2710 2720 2730 2740 2750 2760 2770 2780 2790 2800 2810 2820 2830 2840 2850 2860 2870 2880 2890 2900 2910 2920 2930 2940 2950 2960 2970 2980 2990 3000 )
    any( )
    08. How many respondants answered question 3000?
    all( 2999 )
    any( )
    09. How many respondants answered 123, 456, or 789 but not 135, 246, 357 or 468?
    any( 123 456 789 )
    any( 135 246 468 )
    10. How many respondants answered all of 1000, 1010, 1020 but none of 2000, 2010, 2020?
    all( 1000 1010 1020 )
    any( 2000 2010 2020 )

    The results, applied to a simulated dataset matching the size described:

    P:\test>401728
    1 trial of 01. How many respodents answered question 0  ( 234.375ms total), 234.375ms/trial
    1 trial of 02. How many respondents did not answer question 0  ( 187.500ms total), 187.500ms/trial
    1 trial of 03. How many respodents answered question 10  ( 250ms total), 250ms/trial
    1 trial of 04. How many respodents answered question 100  ( 234.375ms total), 234.375ms/trial
    1 trial of 04. How many respondant answered questions 14, 57, 103?  ( 93.750ms total), 93.750ms/trial
    1 trial of 06. How many respondent answered 1000, 2000 but not 2500?  ( 125ms total), 125ms/trial
    1 trial of 07. How many respondents answered every 10th question?  ( 62.500ms total), 62.500ms/trial
    1 trial of 08. How many respondants answered question 3000?  ( 250ms total), 250ms/trial
    1 trial of 09. How many respondants answered 123, 456, or 789 but not 135, 246, 357 or 468?  ( 703.125ms total), 703.125ms/trial
    1 trial of 10. How many respondants answered all of 1000, 1010, 1020 but none of 2000, 2010, 2020?  ( 156.250ms total), 156.250ms/trial
    01. How many respodents answered question 0 : 10067
    02. How many respondents did not answer question 0 : 9933
    03. How many respodents answered question 10 : 9797
    04. How many respodents answered question 100 : 10068
    06. How many respondent answered 1000, 2000 but not 2500? : 2561
    08. How many respondants answered question 3000? : 9969
    09. How many respondants answered 123, 456, or 789 but not 135, 246, 357 or 468? : 1121
    10. How many respondants answered all of 1000, 1010, 1020 but none of 2000, 2010, 2020? : 293

    Note: Question 7 produces no results as no records matched.

    The entire dataset can be stored pre-encoded in a single 60 MB file. In the event that your dataset grew beyond the capacity of memory (approx. 3/4 of a million respondents with 3,000 answers on a machine with 2GB of RAM), then simply switching to Tie::File and processing the data in a single pass from disk would still easily outperform an (R|SQL)DB(MS).
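    A minimal sketch of that fallback, assuming one respondent's pre-encoded 3,000-character 0/1 string per line in a file called survey.dat (the file name and the two byte positions are made up for illustration):

    use strict;
    use warnings;
    use Tie::File;

    tie my @records, 'Tie::File', 'survey.dat' or die "survey.dat: $!";

    # Count respondents who answered question 64 but not question 2499 (zero-based
    # positions, as above), touching one record at a time instead of holding 60 MB in RAM.
    my $count = 0;
    for my $row (@records) {
        $count++ if substr($row, 64, 1) && !substr($row, 2499, 1);
    }
    print "$count\n";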


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Basic Perl trumps DBI? Or my poor DB design?
by pg (Canon) on Oct 23, 2004 at 08:11 UTC

    As perrin said, learn how to use EXPLAIN. Your database decides which index (access path) to use; the access path that seems best to you might not be the one the database considers optimal. EXPLAIN tells you exactly which access path the database actually uses.

    Also, I am not sure whether MySQL lets you run an analyze the way Oracle does. In Oracle, once in a while you have to run an analyze so that Oracle gathers the latest statistics about your database. Without the right statistics, the database can easily choose the wrong access path - sometimes it even decides not to use an index at all.
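    For what it's worth, MySQL's rough equivalent is ANALYZE TABLE, which refreshes the key-distribution statistics the optimizer uses; whether it changes the plan for this particular query is something you would have to test:

    # assuming $dbh is the connected handle used elsewhere in this thread
    $dbh->do('ANALYZE TABLE theTable');
    # then re-run EXPLAIN on the query to see whether the chosen access path changed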

Re: Basic Perl trumps DBI? Or my poor DB design?
by demerphq (Chancellor) on Oct 24, 2004 at 09:43 UTC

    This is an interesting question. I think it's related to a bunch of other problems, like the recipe problem and the scrabble/crossword problem. You're probably right that this isn't suited to in-DB work, except for one aspect of what you said.

    The data is compiled off-line. Once compiled, it never changes, so the speed of inserts, updates, deletions, etc are of absolutely no concern.

    Going from this, and assuming that H/W assets like disk size aren't a big problem for you, I would find out the maximum number of indexes for the tables you are using and then create enough tables, and enough indexes, that every single column is indexed. Then you would need a "precompile" stage before actually running your reports, which would convert the actual requirements into the correct SQL select statement to find the data (i.e., if they want responseX you need to know that it's field N of table T). Since each of the columns would be indexed and you would only be joining the columns you need, it would be something like:

    select count(resp_id)
      from table1, table5, table32, table63
     where table1.resp_id  = table5.resp_id
       and table5.resp_id  = table32.resp_id
       and table32.resp_id = table63.resp_id
       and table1.field3   = 1
       and table1.field8   = 1
       and table5.field16  = 1
       and table32.field7  = 1
       and table63.resp_id = 1

    Since it would stay totally in index I should think the performance would be quite good for the tradeoff of using vast disk space. :-)

    Anyway, no matter how you actually build your reports I would probably be looking at some kind of in DB representation as well. Having both can be really useful, especially if you can go from one to the other. :-)

    BTW, a last thought: try not to think MySQL = RDBMS or DBMS. It's a common mistake in these parts that isn't always helpful.

    Oh, um, also: why do you reply to yourself and not to the replies? It seems like some of your replies are incorrectly parented, and I can't think why. You should make sure you are replying to the correct thing so people can follow the conversation.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi

      Flux8


Re: Basic Perl trumps DBI? Or my poor DB design?
by TedPride (Priest) on Oct 23, 2004 at 06:26 UTC
    SQL and DBI are meant to provide an easy-to-use interface that saves you coding time. Most database applications require far more time to write by hand than is taken up by all the queries conducted over a year or more. The big question is: do you need that extra speed here? Is maintaining and updating 3,000 files worth the extra coding time? In this case, probably yes, since you're already doing the primary database work off-line and just want efficient searching of the data. You could, however, put all the data in one file, with a secondary file recording the start and end locations of each question's set (see the sketch below). This would leave you most of your speed without requiring thousands of individual files.
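    A rough sketch of that layout - the file names, the pack format and the record layout are all invented here, not anything specified above:

    use strict;
    use warnings;

    # index.dat holds, for each response id, a packed (offset, length) pair
    # pointing into data.dat, which holds the space-separated respondent lists.
    open my $idx,  '<', 'index.dat' or die "index.dat: $!";
    open my $data, '<', 'data.dat'  or die "data.dat: $!";
    binmode $idx;

    sub respondents_for {
        my ($response_id) = @_;
        seek $idx, ($response_id - 1) * 8, 0 or die "seek: $!";
        read($idx, my $pair, 8) == 8 or die "short index read";
        my ($offset, $length) = unpack 'NN', $pair;
        seek $data, $offset, 0 or die "seek: $!";
        read($data, my $chunk, $length) == $length or die "short data read";
        return split ' ', $chunk;
    }

    # the same intersection as the DB 4 benchmark, but from one data file
    my %seen = map { $_ => 1 } respondents_for(123);
    my $both = grep { $seen{$_} } respondents_for(456);
    print "$both respondents gave both responses\n";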
Re: Basic Perl trumps DBI? Or my poor DB design?
by Anonymous Monk on Oct 23, 2004 at 11:08 UTC

    You came to PerlMonks three weeks ago, and started asking questions like:

    Then, one week later, all of a sudden you have become a database and benchmarking expert and started blowing your trumpet (Basic Perl array intersection faster than mysql query join.), until today you released the full wisdom of your findings.

    I don't buy it. To me it appears that the thing you have understood better so far (Perl) is better than the thing you haven't fully understood (the database).

    If you want to propose a public comparison, you have to publish the test data you were using. If the data is classified and can't be released, then generate some dummy data and release either the generation script or the data itself (perhaps by means of an external link). If you do that, then people here could do better than guessing at what lies behind your reasoning.

    As for me, I don't believe your benchmarking for a simple reason: databases are definitely better than Perl at crunching tabular data, provided that you use them correctly.

    This is actually a challenge. Show us the data, and then we'll see which method is really the best one.

      Hmmm - hardly an expert - more like someone ignorant enough to ask whether what he thinks he's understood is in fact correct. The kind patience of those who've helped tells me they've understood it that way, as was intended. And my one way of giving back is to summarize the lessons learned for others' future reference.

      As for me, I don't believe your benchmarking for a simple reason: databases are definitely better than Perl at crunching tabular data, provided that you use them correctly.

      Darn - and I had so enjoyed the unanimity of the "yes, that's what you should expect" crowd so far.

      Anyway - no matter. And no problem with the challenge. I've been using vanilla test data anyway. Here it is:

      DB 0

      $num_respondents = 20000;
      $offset = 0;
      for $respondent (1 .. $num_respondents) {
          $sql = "INSERT INTO theTable VALUES ($respondent";
          for $response (1 .. $offset) {
              $sql = $sql.", 0";
          }
          $response = $offset + 1;
          while ($response < $num_responses+1) {
              $sql = $sql.", 1";
              if (($num_responses - $response) > 4) {
                  $sql = $sql.", 0, 0, 0, 0";
              }
              else {
                  for $response ($response+1 .. $num_responses) {
                      $sql = $sql.", 0";
                  }
              }
              $response = $response + 5;
          }
          if ($respondent % 2) {$sql = $sql.", 1, 0";}
          else                 {$sql = $sql.", 0, 1";}
          $sql = $sql.")";
          $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
          print "<p>$sql\n";
          $sth->execute() or die("Could not execute!" . $dbh->errstr);
          if ($offset < 4) {$offset++;}
          else             {$offset = 0;}
      }
      This produces rows that look like:
      respondent_1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0...
      respondent_2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0...
      respondent_3, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0...
      so that each respondent gave every fifth response and each response was given by every fifth respondent, except for the last two which represent gender - each of those was given by 1/2 the respondents and every benchmark query uses one of the gender columns.

      DB 1 & DB 2

      $sql = "INSERT INTO theTable (response, respondent) VALUES (?,?)";
      $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
      $i = 1;
      for $respondent (1 .. 20000) {
          for $r (0 .. 600) {
              $response = $i + ($r * 5);
              $sth->execute($response, $respondent) or die("Could not execute!" . $dbh->errstr);
          }
          if ($respondent % 2) {
              $response = 9999;
              $sth->execute($response, $respondent) or die("Could not execute!" . $dbh->errstr);
          }
          else {
              $response = 6666;
              $sth->execute($response, $respondent) or die("Could not execute!" . $dbh->errstr);
          }
          if ($i < 5) {$i++;}
          else        {$i = 1;}
      }
      This produces:
      1, 1   6, 1   11, 1  ...
      2, 2   7, 2   12, 2  ...
      3, 3   8, 3   13, 3  ...
      so that, again, each respondent gave every fifth response and each response was given by every fifth respondent, except for the gender codes which are half-half.

      DB 3

      $respondents_by_5 = 4000;
      $sql = "SHOW TABLES";
      $sth = $dbh->prepare($sql) or die("Could not prepare!" . $dbh->errstr);
      $sth->execute() or die("Could not execute!" . $dbh->errstr);
      $j = 1;
      while ($table = $sth->fetchrow_array()) {
          if ($table ne 't6000' && $table ne 't9000') {
              $sql = "INSERT INTO ".$table." VALUES (?)";
              $std = $dbh->prepare($sql) or die("Could not prepare! At table = $table because " . $dbh->errstr);
              for $i (0 .. $respondents_by_5) {
                  $respondent = $j + ($i * 5);
                  $std->execute($respondent) or die("Could not execute!" . $dbh->errstr);
              }
              if ($j < 5) {$j++;}
              else        {$j = 1;}
          }
      }
      $sql = "INSERT INTO t9999 VALUES (?)";
      $stf = $dbh->prepare($sql) or die("Could not prepare! At table = $table because " . $dbh->errstr);
      $sql = "INSERT INTO t6666 VALUES (?)";
      $stm = $dbh->prepare($sql) or die("Could not prepare! At table = $table because " . $dbh->errstr);
      for $respondent (1 .. 20000) {
          if ($respondent % 2) {
              $stm->execute($respondent) or die("Could not execute!" . $dbh->errstr);
          }
          else {
              $stf->execute($respondent) or die("Could not execute!" . $dbh->errstr);
          }
      }
      This produces tables that look like:
      Table_1: 1, 6, 11, 16, ..., 19996
      Table_2: 2, 7, 12, 17, ..., 19997
      Table_3: 3, 8, 13, 18, ..., 19998
      Table_4: 4, 9, 14, 19, ..., 19999
      Table_5: 5, 10, 15, 20, ..., 20000
      Table_6: 1, 6, 11, 16, ..., 19996
      etc.
      one row per table entry.

      DB 4

      $j = 1;
      for $r (1 .. $num_responses) {
          @lines = ();
          $table = ">".$db_dir."/r".$r;
          for $i (0 .. $respondents_by_5) {
              $respondent = $j + ($i * 5);
              push (@lines, $respondent);
          }
          &write_lines_to_file(1, $table, @lines);
          if ($j < 5) {$j++;}
          else        {$j = 1;}
      }
      for $respondent (1 .. 20000) {
          if ($respondent % 2) {
              push(@lines_6666, $respondent);
          }
          else {
              push(@lines_9999, $respondent);
          }
      }
      $table = ">".$db_dir."/r6666";
      &write_lines_to_file(1, $table, @lines_6666);
      $table = ">".$db_dir."/r9999";
      &write_lines_to_file(1, $table, @lines_9999);
      where "write_lines_to_file" is
      sub write_lines_to_file {
          my $add_line_return = shift;
          my $file            = shift;
          my @lines           = @_;
          open(FILE, ">$file") or dienice("cannot open file : $!");
          $STDOUT_Handle = select(FILE);
          for my $i (0 .. $#lines) {
              print "$lines[$i]";
              if ($add_line_return) {print "\n";}
          }
          close(FILE);
          select($STDOUT_Handle);
      }
      This produces the same tables as DB3, but each in its own text file.

      If you can see something that alters my understanding, I'm all ears.

Re: Basic Perl trumps DBI? Or my poor DB design?
by punch_card_don (Curate) on Oct 23, 2004 at 02:09 UTC
    Thanks everyone. OK, lesson learned the hard way about database versus custom-rolled-basic-Perl. Honestly, for this application, it'll be a judgement call, because speed really is very very important.

    Perrin - I thought this would be a good summary of these investigations that could serve as a future reference for someone else some day.

    And I'm very interested in getting DB2 to work 'much faster' than DB1 - but something is clearly eluding me. Will do more research over the weekend.......

Re: Basic Perl trumps DBI? Or my poor DB design?
by punch_card_don (Curate) on Oct 23, 2004 at 00:18 UTC
    P.S. - Many thanks to all those here at PM who provided guidance in developing some of these structures and tests.
Re: Basic Perl trumps DBI? Or my poor DB design?
by punch_card_don (Curate) on Oct 23, 2004 at 00:20 UTC
    And now I'm really wondering what I could accomplish with PDL!
      Here is a small example using PDL. I read 7,500,000 bytes into a flat array and use slice indexing to find the right bytes to mask against.
      use strict;
      use IO::File;
      use PDL;
      use PDL::IO::FlexRaw;
      use PDL::NiceSlice;

      my $fn = "survey.dat";
      my $fh = new IO::File($fn, "r");

      my $width  = 3000/8;
      my $height = 20000;
      my $total  = $width * $height;

      # Read it all in as one flat file for simplicity.
      my $pdl = readflex $fh, [{Dims=>[$total], Type=>'byte'}];

      # For example, find which items have bit 4 set in the
      # 7th byte and bit 3 set in the 20th byte.
      my $m1 = 1 << 4;
      my $m2 = 1 << 3;
      my $b1 = 7;
      my $b2 = 20;

      # Which record numbers contain the given one-bits?
      # Take slices with a width of 3000/8 = 375.
      $total = $total - 1;
      my $s1 = $pdl($b1:$total:$width);
      my $s2 = $pdl($b2:$total:$width);
      my $index = which((($s1 & $m1) > 0) & (($s2 & $m2) > 0));
      print $index;
Re: Basic Perl trumps DBI? Or my poor DB design?
by dragonchild (Archbishop) on Oct 25, 2004 at 00:49 UTC
    This isn't a Perl question so far as it's a relational database design question. So, I'll treat it as such.

    You have two different tables:

    • respondent, where you have info about the respondent - name, address, etc
    • responses, where you list the responses each respondent took

    So, given the info you have, I would design tables as such:

    CREATE TABLE respondent (
        id   int unsigned not null auto_increment primary key
       ,name varchar(20)
       # , etc ...
    );

    CREATE TABLE response (
        id   int unsigned not null auto_increment primary key
       ,name varchar(20)
    );

    CREATE TABLE responses (
        respondent int unsigned
       ,response   int unsigned
       ,PRIMARY KEY (respondent, response)
    );

    Then, I would create the query as such:

    SELECT A.respondent AS respondent
      FROM responses A
      JOIN responses B ON (A.respondent = B.respondent)
     WHERE A.response = ?
       AND B.response = ?

    I would call it as such:

    my $sth = $dbh->prepare_cached( $sql );
    $sth->execute( $response_x, $response_y );
    $sth->bind_columns( \my ($respondent) );

    my @respondents;
    while ($sth->fetch) {
        push @respondents, $respondent;
    }
    $sth->finish;

    Depending on the power of your database server, this should run in under 3-5 seconds, given the numbers you mentioned. I should warn you that my experience is primarily with MySQL 4.1.x, so I don't have a lot of knowledge of the 3.x series.

    I would also turn on query_caching, which will help a bunch with the reporting. I personally have found a 90% improvement with certain kinds of reporting when I turned it on.
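    For reference, the MySQL 4.x server variables involved look something like this in my.cnf - the sizes here are arbitrary examples, not recommendations:

    [mysqld]
    query_cache_type = 1      # cache all cacheable SELECTs
    query_cache_size = 64M    # memory set aside for cached result sets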

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Basic Perl trumps DBI? Or my poor DB design?
by radiantmatrix (Parson) on Oct 24, 2004 at 23:27 UTC

    It's entirely possible that for a given application file access is faster than DBI -- after all, an RDBMS stores its data in files anyway. The real strength of an RDBMS is in the 'R': Relational.

    While linear selects are probably faster if you can hand-optimize your file access with a Perl script, an RDBMS shines with non-linear data. With proper indexes, an RDBMS can return results for fairly complicated relationship queries incredibly quickly.

    Of course, you could code that functionality for your specific queries yourself; indexing and optimizing for a specific set of queries isn't terribly difficult. But the RDBMS can be maintained much more easily -- its optimizations won't fall apart when you add a new query or change an existing one.

    If speed truly is everything, then you probably are best off hand-coding a select routine; but if anything else is a consideration, you're probably best to see what optimization you can do with your RDBMS.

    radiantmatrix
    require General::Disclaimer;
    "Users are evil. All users are evil. Do not trust them. Perl specifically offers the -T switch because it knows users are evil." - japhy
Re: Basic Perl trumps DBI? Or my poor DB design?
by zakzebrowski (Curate) on Oct 24, 2004 at 23:57 UTC
Re: Basic Perl trumps DBI? Or my poor DB design?
by punch_card_don (Curate) on Oct 25, 2004 at 22:28 UTC
    Holy cow - thanks all for the input. This is proving to be extraordinarily interesting.

    OK, I've pretty much digested the common thread of the replies - that so long as the data fits in memory, there should be no surprise that a custom-coded Perl solution beats an overhead-laden RDBMS solution. The question is a strategic one - trade flexibility, robustness and standardization for speed, or not?

    demerphq's idea appears to me to closely resemble my DB 3 - splitting the data up into enough mysql tables to have every column indexed - except that I went to the extreme and had one column per table. We got the result he expected: DB 3 was the fastest of all the mysql-based queries. The mere fact that he suggested doing something like this makes me think I should keep this solution in the back of my mind - maybe sometimes odd design can justify itself.

    BrowserUK - you are a coding machine! But I'm not clear on how we went from talking about 250 milliseconds-per-query in paragraph 2 to 250 queries-per-millisecond in paragraph 4. I do very much like your string-row solution. Again, it's giving me ideas about reading up on piddles. Do I understand correctly that the trial times are for a single execution? If so, it looks like your strategy applied to 3,000 individual text tables is still holding the speed record. Thanks.

    I think then what I have in DB 4 is a baseline. It is as fast as you could possibly hope for, so the aim is to get as close to that as possible within a mysql solution. At least, with a baseline for comparison, we know how 'good' a solution is. I wonder whether a hand-coded Perl query system might make a good standard practice in database development, when the data lends itself to it, so that developers know how fast is fast in a particular situation?

    The various treatments of DB1 & 2, including dragonchild's, make me really want to see that schema approach the baseline. I think it might be the best compromise.

    SO, I'm embarking on a matrix of tests on that schema. The variables are:

    • order of declaration of columns
    • order of declaration of compound primary key
    • order of declaration of KEYs
    • order of population of series (which for-loop goes inside which)
    I count a total of 24 permutations, including ones with no KEYs. This will take a day or two to find time for - - - then I'll start a new thread.

    Then we'll make a strategic choice.

    Thanks all for your participation.

      But I'm not clear on how we went from talking about 250 milliseconds-per-query in paragraph 2, then 250 queries-per-millisecond in paragraph 4

      You're right. The post was written in two stages. Originally it was based on a few lines of code that I threw together to test the idea out: no subroutines (or their overhead), only positive match detection, much smaller datasets. It worked, and I started writing the post on that basis. Then I realised that it was way too limited in the types of questions it could answer, and the hard-coded scale of the test was limiting, so I went back and improved things.

      The numbers in paragraph 4 are leftovers from the original, artificially simpler, but faster tests. I will update the node.

      As an aside, the same technique can be applied even to datasets where the answers are not yes/no, provided the range of answers can be reduced to a reasonable range of discrete values. Ie. multiple choice as you are doing.

      All too often you see applications storing things like dates, ages & other forms of continuously variable numeric values, when all that is really required for the application is "Under 18 | over 18 and under 65 | over 65" and similar, that can be easily replaced by an enumeration. Many DBs can handle these quite efficiently.

      Unfortunately, they also tend to apply arbitrary limits to the size of an enumeration: 32 or 64, etc. The shame is that in many cases the limits on the number of indexes that may be applied to a given table (MySQL: 32, DB2: (was) 255) coincide. In many cases, the use of large enumeration types could substitute for large numbers of indexes with considerably better efficiency. They can also be used to remove the need for foreign keys in some cases, for another hike in efficiency.
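      Purely as an illustration of the kind of thing being described (the column and member names are invented; in MySQL a SET column tops out at 64 named members, which is exactly the sort of limit being lamented):

      CREATE TABLE respondent_demo (
          respondent INT UNSIGNED NOT NULL PRIMARY KEY,
          age_band   ENUM('under_18', '18_to_65', 'over_65'),
          traits     SET('smoker', 'driver', 'homeowner')    # up to 64 members per SET
      );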


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      SO, I'm embarking on a matrix of tests on that schema

      All you really need to do is learn how to use EXPLAIN. It will tell you how the database processes the query so you can adjust things to make it use correct indexes.

Re: Basic Perl trumps DBI? Or my poor DB design?
by Anonymous Monk on Oct 25, 2004 at 21:21 UTC
    To help you wrap your head around this result, here's a bit of perspective:

    The DBMS is good at two things - finding something in a file with the minimum number of probes, and whipping up results that are larger than memory. It generates programs very much like what you wrote to do it, but it has to guess at which methods to use and which files to read first. Actually, I'm not sure how good MySQL is at this (though I hear good things), but Oracle and DB2 work that way -- they figure out an algorithm, built out of functions they know how to do, that logically solves your query and has a chance of doing it in good time (though they usually don't think long enough to go for the best plan).

    You're forcing the DBMS to play its worst game - reading all the data to get the answer. Notice that your query necessarily touches every row in each of your designs except DB 4. Still, the DB beat or tied Perl in all the passes except DB 4, and you probably wrote the DB test code more quickly for those passes (assuming you're as good with SQL as with Perl). You could probably tweak DB performance up a bit with some of the suggestions the monks made, but I wouldn't expect it to beat your DB 4 program.

    Your program has the advantage of knowing (or accidentally finding out :) that all of the answer fits in memory while you put it together - all of the responses for the first question minus all the responses for the second.... It's purpose-built for the problem and knows facts about the data that the DB might not "consider" in building a plan. However, if you scale to 200M or so respondents you'll find your program breaking down, while the DB solution will probably grind out your answer using different methods from its toolkit, possibly in reasonably linear time (once you cross the elbow where tempfile swapping starts). We do routinely give up quickest for flexible, but many of us also give up simple for scalable, and quick answers for quick coding. I don't want to try to write your program for a different application where locking and versioning become important.

    So yes, a special purpose engine is right for this problem if response time is important, and you're on the right track (UKBrowser nailed it). But if you find out you want to ask a different type of question you have to build another engine for the question, and people have this annoying habit of asking a different question right after they read a report. Which is why I'd build a DB anyhow, even if you don't use it for this particular application.