
Re^2: Processing ~1 Trillion records

by aossama (Acolyte)
on Oct 25, 2012 at 10:45 UTC


in reply to Re: Processing ~1 Trillion records
in thread Processing ~1 Trillion records

I am very sorry, guys, but I can't show the exact code as it's confidential; it goes like this:
use DBI;    # Oracle via DBD::Oracle

my $dbh = DBI->connect(
    "dbi:Oracle:host=server.domain.com;sid=sid;port=1521",
    "username", "password"
) or die "Can't connect to database $DBI::errstr\n";

# The SELECT query
# This query returns ~9,450,000,000 records
$sql_chr = qq{
    select column_a, g.column_b, column_c, column_d, column_e
      from table_f f, table_g g
     where f.column_a = g.column_b
       and column_c != '-/-'
       and column_d  = 'STR'
       and column_e  = 'STR1'
       and column_a  = 9
};

my $sth_chr = $dbh->prepare( $sql_chr );
if ($sth_chr->execute) {
    my $s = 1;
    while (my (@dat) = $sth_chr->fetchrow) {
        ##if ($dat[2]) { $tmarker = translate($dat[2]); }
        ##else         { $tmarker = "00000"; }
        $tmarker = translate($dat[2]);
        #$data{$dat[0]}{$dat[1]} = $dat[2];
        $data{$dat[0]}{$dat[1]} = $tmarker;
        $lines{$dat[0]}     = "";
        $markers{$dat[1]}   = "";
        $tmarkers{$tmarker} = "";
        #if (!defined $dat_chr1[1]) { $dat_chr1[1] = 0; }
        if (!defined $dat[3]) { $dat[3] = 0; }
        if (!defined $dat[4]) { $dat[4] = 0; }
        #if ($dat[3] eq undef) { $dat[3] = 0; }
        #print $dat_chr1[1] . " here is the missing data\n";
        #$datachr{$dat[0]}{$dat[1]}{$dat[3]} = $dat[4];
        $datachr{$dat[1]}{$dat[3]} = $dat[4];
        $chr{$dat[3]} = "";
        #print "$s--> $cropid-->$dat[0]-->$dat[1]-->$dat[2]-->$dat[3]-->$dat[4]-->$tmarker\n";
        $s++;
    }
}

#### printing the files
my $count_markers = keys %markers;
my $count_lines   = keys %lines;
#my $count_chr = keys %chr;

# print the header
my $count_chr = 1;
my $checkchr = printmcdchr($csvfile, $count_markers, $count_chr);

# print the content
my $i = 1;
foreach my $m (sort { $a cmp $b } keys %markers) {   # cmp, not eq: eq is not a valid sort comparator
    print OUT "$m $i\n";
    $i++;
}

Re^3: Processing ~1 Trillion records
by Jenda (Abbot) on Oct 25, 2012 at 12:08 UTC

    You seem to be accumulating lots of data in the hashes; are you sure it all fits in memory? As soon as you force the computer to swap memory pages to disk, the processing time grows insanely!

    It might help to tie the hashes to a DBM file (DB_File, MLDBM, ...) or to use SQLite or some other database to hold the temporary data. Doing as much work as you can upfront in the Oracle database would most probably be even better, though. Sometimes a use DB_File; tie %data, 'DB_File', 'filename.db'; is all you need to change something from unacceptably slow to just fine.
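    A minimal sketch of that tie (the filename is made up for illustration; note that DB_File stores a flat hash, so a nested structure like $data{$line}{$marker} needs MLDBM or a composite key):

        use strict;
        use warnings;
        use Fcntl;      # O_RDWR, O_CREAT
        use DB_File;    # exports $DB_HASH

        # Tie %data to an on-disk hash so it no longer has to fit in RAM.
        tie my %data, 'DB_File', 'filename.db', O_RDWR|O_CREAT, 0666, $DB_HASH
            or die "Cannot tie filename.db: $!";

        # Emulate the nested $data{$line}{$marker} with a composite key:
        $data{"line42\0marker7"} = "00100";

        untie %data;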

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      Is this like using Redis to store/retrieve the key-value pairs?

        Yes. You can use Redis itself; it seems it does have a Perl binding. The whole point is to make sure the process fits in memory and that the data that had to be moved to disk is accessed and updated efficiently.
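        A minimal sketch with the Redis module from CPAN (server address and keys are made up for illustration):

            use strict;
            use warnings;
            use Redis;

            # Connect to a Redis server; adjust server => 'host:port' to your setup.
            my $redis = Redis->new(server => '127.0.0.1:6379');

            # Keep the key/value pairs outside the Perl process's memory.
            $redis->set('line42:marker7' => '00100');
            my $tmarker = $redis->get('line42:marker7');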

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.

Re^3: Processing ~1 Trillion records
by Corion (Patriarch) on Oct 25, 2012 at 12:29 UTC

    If you only need to provide default values for your output, you can do that in SQL directly. I think Oracle has the NVL() function for that.
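    For example, a sketch using the placeholder columns from the post (NVL(x, y) returns y when x is NULL):

        # Let Oracle supply the defaults instead of the Perl-side
        #   if (!defined $dat[3]) { $dat[3]=0; }
        $sql_chr = qq{
            select column_a, g.column_b, column_c,
                   nvl(column_d, 0) as column_d,
                   nvl(column_e, 0) as column_e
              from table_f f, table_g g
             where f.column_a = g.column_b
        };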

    If that is the only thing your program is doing, then the bottleneck is most likely Oracle, which is not really suited for quickly producing reporting outputs. There are more specialized databases for that, like Sybase IQ.

    Update: Thinking more about this, if your database is not idle during the 16 days, the transaction facilities of Oracle will also slow things down. Oracle tries to present a coherent state of the database to each connection, and thus will keep rollback logs for the whole 16 days while your query runs. This creates additional overhead that may be slowing your query down.

    I would look at setting up a separate instance, or even a dedicated reporting database which has the appropriate indices for your queries, instead of using the production database. Consider importing the data from the production database into your reporting database by restoring a backup of the production database; this also serves as a way to find out whether your backups can be restored at all.

      And if you have enough modifications going on during that time, you'll fill your rollback segment, and the transaction will abort abnormally. (at least, that's how it used to be ... dunno if the latest version still does)

      When I've had to run long transactions on an Oracle database where I didn't have enough storage for a second copy, I had to play with the pragma commands to set the isolation level ... unfortunately, it was almost a decade ago, and I don't remember what the command was. serializable is what keeps coming to mind, but I don't think that's right.

Re^3: Processing ~1 Trillion records
by clueless newbie (Curate) on Oct 25, 2012 at 12:51 UTC

    To quote Tim Bunce:

    - while (@row = $sth->fetchrow_array) { }
        * one column:  51,155 fetches per second
        * 20 columns:  24,032 fetches per second
    - while ($row = $sth->fetchrow_arrayref) { }
        * one column:  58,653 fetches per second - approximately 12% faster
        * 20 columns:  49,390 fetches per second - approximately 51% faster
    - while ($row = shift(@$rowcache) || shift(@{$rowcache = $sth->fetchall_arrayref(undef, $max_rows)})) { }
        * one column: 348,140 fetches per second - by far the fastest!
        * 20 columns:  42,128 fetches per second - now slower than fetchrow_arrayref!
        * Why? Balance time saved making fewer calls with time spent managing more memory
        * Do your own benchmarks to find what works best for your situations
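    Spelled out, the fastest variant above would look like this against the posted statement handle (the batch size is a knob to benchmark, not a recommendation):

        # Fetch rows in batches to amortize the per-call overhead.
        my $max_rows = 10_000;
        my $rowcache = [];
        while (my $row = shift(@$rowcache)
                      || shift(@{ $rowcache = $sth_chr->fetchall_arrayref(undef, $max_rows) || [] })) {
            # $row is an array ref: $row->[0] is column_a, $row->[1] is column_b, ...
        }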
Re^3: Processing ~1 Trillion records
by MidLifeXis (Monsignor) on Oct 25, 2012 at 12:29 UTC

    Profile.

    Timings that I think would be interesting: the time taken in the $sth_chr->execute statement; the translate call (is it something that could be done more efficiently in the select?); the time spent in the if (!defined($x)) { $x = 0 } statements versus having the DB pre-populate that for you; the time spent sorting the keys; and anything else that shows up in a profiling run.
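    A typical way to get those timings is Devel::NYTProf (the script name is hypothetical):

        perl -d:NYTProf process_records.pl   # profile the full run
        nytprofhtml                          # render the report into ./nytprof/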

    This (as stated in other posts on this thread) is speculation, but I wonder if the sort could be sped up (if it is a bottleneck) by extracting the keys unsorted, partitioning them, sorting each partition individually, and merging the results (see the sketch below). The other possibility, if the output is the bottleneck rather than the sort, is that the additional I/O would increase the time. Without profiling data it is impossible to know where to focus.
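    A minimal sketch of that partition-and-merge idea (bucketing by the first character of the key is an arbitrary illustrative choice):

        # Partition keys by leading character, sort each bucket, and emit
        # the buckets in order; because the buckets themselves are ordered
        # by prefix, the concatenation is globally sorted.
        my %bucket;
        push @{ $bucket{ substr($_, 0, 1) } }, $_ for keys %markers;

        my $i = 1;
        for my $prefix (sort keys %bucket) {
            for my $m (sort @{ $bucket{$prefix} }) {
                print OUT "$m $i\n";
                $i++;
            }
        }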

    --MidLifeXis

      Indeed, if you can extract a list of what the so-called “markers” and “lines” are (which of course you can), then the problem decomposes quite nicely into partitions. All that you really need are indexes on the columns in question.

      It might also be advantageous to define a view that corresponds to this query, so that you can examine and then optimize the hades out of that view. In some DBMSes, defining a view introduces the query as one the system now knows to be frequently used, and it certainly puts that “16 days” requirement squarely into the DBA’s field of view.

Re^3: Processing ~1 Trillion records
by Mr. Muskrat (Canon) on Oct 25, 2012 at 17:39 UTC

    You want to read up on bind_columns because it "is the most efficient way to fetch data" (according to the DBI docs).

    my $sth_chr = $dbh->prepare( $sql_chr );
    $sth_chr->execute;
    my %row;
    # from the DBI docs
    my $rv = $sth_chr->bind_columns( \( @row{ @{ $sth_chr->{NAME_lc} } } ) );
    while ($sth_chr->fetch) {
        # access data with $row{column_a}, $row{column_b}, etc...
    }

    If the real query selects columns that are restricted to one value in the where clause then you can speed it up tremendously by not returning the static data. In the example query, you only care about column_c so you could remove everything else from the select (including the join). If this isn't the case, you might want to come up with a better example.
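    With the example query, that trimming might look like this (a sketch; whether the join can really be dropped depends on the real query):

        # Only column_c varies; the other selected columns are fixed by
        # the where clause, and the join served only as a filter.
        $sql_chr = qq{
            select column_c
              from table_f
             where column_c != '-/-'
               and column_d  = 'STR'
               and column_e  = 'STR1'
               and column_a  = 9
        };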

Re^3: Processing ~1 Trillion records
by erix (Prior) on Oct 25, 2012 at 10:57 UTC

    Timings? Where do those 16 days go?

      While executing the while loop
