Re^4: speeding up row by row lookup in a large db

by punkish (Priest)
on Mar 23, 2009 at 02:09 UTC


in reply to Re^3: speeding up row by row lookup in a large db
in thread speeding up row by row lookup in a large db

So, the cell table has 1 million rows. Each cell has (in the current study) a weather-data table of 7,300 rows by 9 columns. There are 400 sets of weather data, so there is some level of normalization.

I serialized and compressed the weather data (using Storable via Data::Serializer) and stored each set as a BLOB. Unfortunately, the cost of extracting the BLOB, de-serializing it, and uncompressing it is still high: about 28 ms per set, which is really slow. In fact, it is quicker to simply pull the 7,300 rows out of the table than to pull out one BLOB row and unravel it.
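For concreteness, here is a minimal sketch of that store-and-unravel round trip, assuming an SQLite database accessed via DBI; the table and column names (met_blob, set_id, dat) are made up for illustration:

#!/usr/bin/perl
# Hypothetical sketch of the BLOB approach described above.
# Assumes SQLite via DBI; met_blob/set_id/dat are invented names.
use strict;
use warnings;
use DBI qw(:sql_types);
use Data::Serializer;

my $ser = Data::Serializer->new(
    serializer => 'Storable',
    compress   => 1,                  # needs Compress::Zlib installed
);

my $dbh = DBI->connect( 'dbi:SQLite:dbname=met.db', '', '', { RaiseError => 1 } );
$dbh->do( 'CREATE TABLE IF NOT EXISTS met_blob ( set_id INTEGER PRIMARY KEY, dat BLOB )' );

# one weather set: 7300 days x 9 columns (dummy values)
my $weather = [ map { [ map { int rand 1000 } 1 .. 9 ] } 1 .. 7300 ];

# store it as a single compressed BLOB
my $sth = $dbh->prepare( 'INSERT INTO met_blob ( set_id, dat ) VALUES ( ?, ? )' );
$sth->bind_param( 1, 1 );
$sth->bind_param( 2, $ser->serialize( $weather ), SQL_BLOB );
$sth->execute;

# pull it back out and unravel it -- this round trip is what costs ~28 ms per set
my ( $blob ) = $dbh->selectrow_array( 'SELECT dat FROM met_blob WHERE set_id = ?', undef, 1 );
my $back = $ser->deserialize( $blob );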

I will look at Cache::Memcached.

--

when small people start casting long shadows, it is time to go to bed

Replies are listed 'Best First'.
Re^5: speeding up row by row lookup in a large db
by BrowserUk (Patriarch) on Mar 23, 2009 at 03:27 UTC
    So, the cell table has 1 million rows. Each cell has (in the current study) a weather-data table of 7,300 rows by 9 columns. There are 400 sets of weather data, so there is some level of normalization.

    Are you saying that each of the million CELL rows references one of 400 sets of weather data (each of which is 7300x9)?

    And that for each CELL you wish to load the appropriate weather dataset?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Are you saying that each of the million CELL rows references one of 400 sets of weather data (each of which is 7300x9)? And that for each CELL you wish to load the appropriate weather dataset?

      Yes. Fortunately, not all at the same time, but I want to reduce the time to load *all* the data for _a_ cell. The time I am getting as of now is around 30 ms.

      Update: Because, in the next iteration of the project, the number of cells is going to jump to about 1.9 million, and the size of the weather data set might increase (currently we have 20 years of data, hence 20*365 = 7300 rows). Hence all the focus on reducing the time to retrieve all the data for any given cell.

      --

      when small people start casting long shadows, it is time to go to bed

        Hm. Then load the cell<->met index as one array, and the 400 x 7300 x 9 datasets as another 3D array.

        In the following dummied-up test, loading all the data takes around 45 seconds and 2.9 GB. That's space- and comma-separated ASCII. You could certainly speed the loading up using Storable.

        And accessing one value within the dataset associated with each cell happens at a rate of 27,300 cells per second. That means you could access (one value for) all 1 million cells in just over 36 seconds.

        Update: 800,000+ cells per second! Less than 2 seconds to process the million cells.

        #! perl -slw
        use strict;
        use Storable;
        use Time::HiRes qw[ time ];

        our $N ||= 1e3;

        my $start1 = time;

        my @cells;
        open CELLS, '<', 'cells.dat' or die $!;
        m[(\d+)\s+(\d+)] and $cells[ $1 ] = $2 while <CELLS>;
        close CELLS;

        my @met; $#met = 401;
        for my $met ( 1 .. 400 ) {
            open IN, '<', sprintf "met%04d.dat", $met or die "dat $met : $!";
            local $/;
            my @data; $#data = 7300;
            @data = map [ split ', ' ], map{ split "\n" } <IN>;
            close IN;
            push @met, \@data;
        }
        printf 'All data loaded in %.2f seconds', time() - $start1;
        <>;

        my $start2 = time;
        for my $cell ( 1 .. $N ) {
            my $row = int rand 7300;
            my $col = int rand 9;
            my $value = $met[ $cells[ $cell ] ][ $row ][ $col ];
        }
        printf "Accessed $N met datasets at a rate of %.2f\n",
            $N / ( time - $start2 );

        __END__
        c:\test\752472>752472.pl
        All data loaded in 41.81 seconds
        Accessed 1000 met datasets at a rate of 27317.87

        c:\test\752472>752472 -N=1e6
        All data loaded in 28.31 seconds
        Accessed 1e6 met datasets at a rate of 842459.96

        cells.dat is a flat ASCII file with records associating each cell with its met dataset:

        c:\test\752472>head CELLS.DAT
        0000001 0170
        0000002 0147
        0000003 0349
        0000004 0251
        0000005 0104
        0000006 0006
        0000007 0121
        0000008 0312
        0000009 0325
        0000010 0131

        And the metNNNN.dat files look like:

        c:\test\752472>head MET0001.dat
        650, 81, 762, 183, 875, 391, 191, 871, 712
        686, 359, 795, 893, 570, 551, 308, 229, 515
        140, 420, 808, 387, 642, 744, 800, 151, 215
        136, 151, 934, 113, 454, 934, 75, 134, 332
        651, 349, 599, 377, 269, 994, 886, 697, 260
        712, 42, 166, 428, 300, 533, 331, 821, 60
        625, 367, 483, 652, 461, 986, 350, 538, 343
        715, 58, 683, 261, 427, 947, 662, 195, 564
        49, 831, 230, 921, 330, 720, 998, 435, 975
        299, 502, 850, 997, 285, 829, 694, 946, 131

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Update: Because, in the next iteration of the project, the number of cells is going to jump to about 1.9 million, and the size of the weather data set might increase (currently we have 20 years of data, hence 20*365 = 7300 rows). Hence all the focus on reducing the time to retrieve all the data for any given cell.

        As a quick test, I increased the weather datasets to 50 years (50*365*9) and the number of cells to 3 million. I also packed the rows (Corion's idea!) to save a little space and time. The result is that you'd need 1.4 GB of RAM to hold the data, 25 seconds to load it all, and just under 19 seconds to process all 3 million:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        our $N ||= 1e3;

        my $start1 = time;

        my @cells;
        open CELLS, '<', 'cells.dat' or die $!;
        m[(\d+)\s+(\d+)] and $cells[ $1 ] = $2 while <CELLS>;
        close CELLS;

        my @met = [];
        for my $met ( 1 .. 400 ) {
            open IN, '<', sprintf "met%04d.dat", $met or die "dat $met : $!";
            local $/;
            my @data; $#data = 7300;
            @data = map{ pack 'V*', split ', ', $_ } map{ split "\n" } <IN>;
            close IN;
            push @met, \@data;
        }
        printf 'All data loaded in %.2f seconds', time() - $start1;
        <>;

        my $start2 = time;
        for my $cell ( 1 .. $N ) {
            my $row = int rand 7300;
            my $col = int rand 9;
            my $rowData = $met[ $cells[ $cell ] ][ $row ];
            my @values  = unpack 'V*', $rowData;
            my $value   = $values[ $col ];
        }
        printf "Accessed $N met datasets at a rate of %.2f\n",
            $N / ( time - $start2 );

        __END__
        c:\test\752472>752472 -N=3e6
        All data loaded in 24.97 seconds
        Accessed 3e6 met datasets at a rate of 161281.65

        That still leaves plenty of headroom on your 32 GB machine, even if you stick with a 32-bit process, and is orders of magnitude faster than any DB solution will achieve.

        If you need to go faster still, you could store and load the datasets as 2D piddles (which would also save a little memory, though that doesn't seem to be a problem as is), and do your math using PDL, which would certainly benefit performance if you are doing anything like the typical manipulations applied to weather data.
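        A minimal sketch of that idea, assuming PDL (and PDL::IO::FastRaw) is installed; the file name and the averaging step are just illustrative:

        use strict;
        use warnings;
        use PDL;
        use PDL::IO::FastRaw;

        # one met dataset as a 2D piddle: 9 values per day x 7300 days
        # (dummy random data; PDL dims are column-major, so the 9 columns come first)
        my $met = random( 9, 7300 ) * 1000;

        # save/load as raw binary -- no ASCII parsing on reload
        writefraw( $met, 'met0001.pdl' );
        my $loaded = readfraw( 'met0001.pdl' );

        # one value: column 3 of day 100
        my $value = $loaded->at( 3, 100 );

        # whole-dataset math is vectorised, e.g. the per-day mean of the 9 columns
        my $daily_means = $loaded->average;
        print $daily_means->slice( '0:4' ), "\n";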


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
