http://qs321.pair.com?node_id=1000799


in reply to Processing ~1 Trillion records

Without seeing the script it is impossible to say how it could be optimized!

Replies are listed 'Best First'.
Re^2: Processing ~1 Trillion records
by aossama (Acolyte) on Oct 25, 2012 at 10:45 UTC
    I am very sorry, guys, but I can't show the exact code as it's confidential; it goes something like this.
    # use DBI Oracle
    my $dbh = DBI->connect(
        "dbi:Oracle:host=server.domain.com; sid=sid; port=1521",
        "username", "password"
    ) or die "Can't connect to database $DBI::errstr\n";

    # The SELECT query
    # This query returns ~9,450,000,000 records
    $sql_chr = qq{select column_a, g.column_b, column_c, column_d, column_e
        from table_f f, table_g g
        where f.column_a = g.column_b
          and column_c != '-/-'
          and column_d  = 'STR'
          and column_e  = 'STR1'
          and column_a  = 9};

    my $sth_chr = $dbh->prepare( $sql_chr );

    if ($sth_chr->execute) {
        my $s = 1;
        while (my (@dat) = $sth_chr->fetchrow) {
            ##if ($dat[2]) { $tmarker = translate($dat[2]); }
            ##else         { $tmarker = "00000"; }
            $tmarker = translate($dat[2]);
            #$data{$dat[0]}{$dat[1]} = $dat[2];
            $data{$dat[0]}{$dat[1]} = $tmarker;
            $lines{$dat[0]}     = "";
            $markers{$dat[1]}   = "";
            $tmarkers{$tmarker} = "";
            #if (!defined $dat_chr1[1]) { $dat_chr1[1] = 0; }
            if (!defined $dat[3]) { $dat[3] = 0; }
            if (!defined $dat[4]) { $dat[4] = 0; }
            #if ($dat[3] eq undef) { $dat[3] = 0; }
            #print $dat_chr1[1] . " here is the missing data\n";
            #$datachr{$dat[0]}{$dat[1]}{$dat[3]} = $dat[4];
            $datachr{$dat[1]}{$dat[3]} = $dat[4];
            $chr{$dat[3]} = "";
            # print "$s--> $cropid-->$dat[0]-->$dat[1]-->$dat[2]-->$dat[3]-->$dat[4]-->$tmarker\n";
            $s++;
        }
    }

    #### printing the files
    my $count_markers = keys %markers;
    my $count_lines   = keys %lines;
    # my $count_chr = keys %chr;

    # print the header
    my $count_chr = 1;
    my $checkchr  = printmcdchr($csvfile, $count_markers, $count_chr);

    # print the content
    my $i = 1;
    foreach my $m (sort { $a cmp $b } keys %markers) {
        print OUT "$m $i\n";
        $i++;
    }

      You seem to be accumulating lots of data in the hashes; are you sure it all fits in memory? As soon as you force the computer to swap memory pages to disk, processing time grows insanely!

      It might help to tie the hashes to a DBM file (DB_File, MLDBM, ...) or use SQLite or some other database to hold the temporary data. Doing as much work as you can upfront in the Oracle database would most probably be even better. Sometimes a use DB_File; tie %data, 'DB_File', 'filename.db'; is all you need to change something from unacceptably slow to just fine.
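      For example, a minimal sketch of what that tie might look like (the file names are placeholders); note that a flat DB_File tie only stores plain key/value pairs, so a nested structure like %data would need MLDBM instead:

          use strict;
          use warnings;
          use DB_File;

          # Flat hashes can be tied straight to an on-disk DB file so they
          # no longer have to fit in RAM.
          tie my %markers, 'DB_File', 'markers.db' or die "Cannot tie markers.db: $!";
          tie my %lines,   'DB_File', 'lines.db'   or die "Cannot tie lines.db: $!";

          # ... fill %markers and %lines inside the fetch loop exactly as before ...

          untie %markers;
          untie %lines;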

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        Is this like using Redis to store/retrieve the key/value pairs?

      If you only need to provide default values for your output, you can do that in SQL directly. I think Oracle has the NVL() function for that.
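      For illustration, a hedged sketch of how those defaults could be pushed into the query itself, reusing the placeholder names from the example above, so the if (!defined ...) checks in the Perl loop become unnecessary:

          # NVL(x, 0) returns 0 whenever x is NULL, so the defaults arrive
          # already filled in; wrap whichever selected columns can actually
          # come back NULL in the real query.
          my $sql_chr = qq{
              select column_a, g.column_b, column_c,
                     NVL(column_d, 0) as column_d,
                     NVL(column_e, 0) as column_e
                from table_f f, table_g g
               where f.column_a = g.column_b
                 and column_c != '-/-'
                 and column_d  = 'STR'
                 and column_e  = 'STR1'
                 and column_a  = 9
          };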

      If that is the only thing your program is doing, then the bottleneck is most likely Oracle, which is not really suited for quickly producing reporting outputs. There are more specialized databases for that, like Sybase IQ.

      Update: Thinking more about this, if your database is not idle during the 16 days, the transaction facilities of Oracle will also reduce the database's speed. Oracle will try to present a coherent state of the database to each connection, and thus will keep rollback logs for the whole 16 days while your query runs. This creates additional overhead that may be slowing your query down. I would look at setting up a separate instance, or even a dedicated reporting database which has the appropriate indices for your queries, instead of using the production database. Consider importing the data from the production database into your reporting database by restoring a backup of the production database. This also serves as a way to find out whether your backups can be restored at all.

        And if you have enough modifications going on during that time, you'll fill your rollback segment, and the transaction will abort abnormally. (at least, that's how it used to be ... dunno if the latest version still does)

        When I've had to run long transactions on an Oracle database where I didn't have enough storage for a second copy, I had to play with the pragma commands to set the isolation level ... unfortunately, it was almost a decade ago, and I don't remember what the command was. serializable is what keeps coming to mind, but I don't think that's right.

      To quote Tim Bunce

      - while(@row = $sth->fetchrow_array) { }
          - one column: 51,155 fetches per second
          - 20 columns: 24,032 fetches per second
      - while($row = $sth->fetchrow_arrayref) { }
          - one column: 58,653 fetches per second - approximately 12% faster
          - 20 columns: 49,390 fetches per second - approximately 51% faster
      - while($row = shift(@$rowcache) || shift(@{$rowcache=$sth->fetchall_arrayref(undef, $max_rows)})) { }
          - one column: 348,140 fetches per second - by far the fastest!
          - 20 columns: 42,128 fetches per second - now slower than fetchrow_arrayref!
          - Why? Balance time saved making fewer calls with time spent managing more memory
          - Do your own benchmarks to find what works best for your situations
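      Applied to the statement handle from the original post, the fastest variant above might look roughly like this; $max_rows is a tuning knob, and the || [] guard follows the idiom in the DBI docs:

          my $max_rows = 10_000;    # batch size: trade call overhead against memory
          my $rowcache;
          $sth_chr->execute;
          while (my $row = shift(@$rowcache)
                         || shift(@{ $rowcache = $sth_chr->fetchall_arrayref(undef, $max_rows) || [] })) {
              # $row is an array ref; $row->[0] .. $row->[4] correspond to @dat[0..4]
              my $tmarker = translate($row->[2]);
              $data{ $row->[0] }{ $row->[1] } = $tmarker;
              # ... rest of the per-row work as before ...
          }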

      Profile.

      Timings I would find interesting: the time taken by the $sth_chr->execute statement, the translate call (is it something that could be done more efficiently in the select itself?), the time spent in the if(!defined($x)){$x = 0} statements versus having the DB pre-populate those values, the time spent sorting the keys, and anything else that shows up in a profiling run.

      This (as stated in other posts on this thread) is speculation, but I wonder if the sort could be sped up (if it is a bottleneck) by writing the keys out unsorted, partitioning the data, sorting each partition individually, and merging the results. The other possibility, if the bottleneck is the output rather than the sort, is that the additional I/O would increase the time. Without profiling data it is impossible to know where to focus.
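      As a rough first cut, before reaching for a full profiler such as Devel::NYTProf, coarse timers around the major phases would already show whether the time goes into the execute, the fetch loop, or the output stage. A sketch with Time::HiRes, reusing the handles from the original post:

          use Time::HiRes qw(gettimeofday tv_interval);

          my $t0 = [gettimeofday];
          $sth_chr->execute;
          warn sprintf "execute took %.1f s\n", tv_interval($t0);

          $t0 = [gettimeofday];
          my $rows = 0;
          while (my @dat = $sth_chr->fetchrow) {
              $rows++;
              # ... existing per-row work ...
          }
          warn sprintf "fetch loop took %.1f s for %d rows\n", tv_interval($t0), $rows;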

      --MidLifeXis

        Indeed, if you can extract a list of what the so-called “markers” and “lines” are (which of course you can), then the problem decomposes itself quite-nicely into partitions.   All that you really need are indexes on the columns in question.

        It might also be advantageous to define a view that corresponds to this query, so that you can examine and then optimize-the-hades-out-of that view.   In some DBMSes, defining a view serves to introduce the query as one that it now knows to be frequently-used, and it certainly puts that “6 days long” requirement squarely into the DBA’s management field-of-view.
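        A hedged sketch of the sort of DDL that might go with this advice (the index and view names are made up, and the placeholder columns are reused from the example query; in practice the DBA would pick the real ones):

            # index the join / filter columns, and wrap the long-running query in a view
            $dbh->do(q{ create index idx_f_col_a on table_f (column_a) });
            $dbh->do(q{ create index idx_g_col_b on table_g (column_b) });
            $dbh->do(q{
                create or replace view v_marker_report as
                select column_a, g.column_b, column_c, column_d, column_e
                  from table_f f, table_g g
                 where f.column_a = g.column_b
            });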

      You want to read up on bind_columns because it "is the most efficient way to fetch data" (according to the DBI docs).

      my $sth_chr = $dbh->prepare( $sql_chr );
      $sth_chr->execute;
      my %row;
      # bind each selected column to a hash entry keyed by its lowercase name (from the DBI docs)
      my $rv = $sth_chr->bind_columns( \( @row{ @{ $sth_chr->{NAME_lc} } } ) );
      while ($sth_chr->fetch) {
          # access data with $row{column_a}, $row{column_b}, etc...
      }

      If the real query selects columns that are restricted to one value in the where clause then you can speed it up tremendously by not returning the static data. In the example query, you only care about column_c so you could remove everything else from the select (including the join). If this isn't the case, you might want to come up with a better example.
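      For illustration only, and assuming the example is accurate and the filter columns all live in table_g, the query might shrink to something like this (a sketch, not a drop-in replacement; dropping the join is only safe if it neither filters out nor multiplies rows):

          # only column_c actually varies per row in the example, so the
          # constant columns and the join have been removed from the select
          my $sql_chr = qq{
              select column_c
                from table_g g
               where g.column_b  = 9
                 and column_c   != '-/-'
                 and column_d    = 'STR'
                 and column_e    = 'STR1'
          };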

      Timings? Where do those 16 days go?

        While executing the while loop