Re: speeding up row by row lookup in a large db

by sundialsvc4 (Abbot)
on Mar 22, 2009 at 01:27 UTC ( [id://752317] )


in reply to speeding up row by row lookup in a large db

Let's take a history lesson from ... punched cards, of all things. When you have a large amount of data, sort it. If you have multiple data streams to integrate, sort each one by the same key.

When you do this, all the records sharing a given key value will be adjacent, and any gap in the sequence is known to contain no keys at all. No searching is required, and you never need to consider anything more than this record and the preceding one.
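
A minimal sketch of that merge pattern, assuming two tab-separated files already sorted on a shared numeric key (the file names and record layout are illustrative only):

    use strict;
    use warnings;

    open my $cells, '<', 'cells_sorted.txt' or die $!;   # assumed file names
    open my $met,   '<', 'met_sorted.txt'   or die $!;

    my $c = <$cells>;
    my $m = <$met>;

    while ( defined $c and defined $m ) {
        my ($kc) = split /\t/, $c;
        my ($km) = split /\t/, $m;

        if    ( $kc < $km ) { $c = <$cells> }   # no partner yet: advance cells
        elsif ( $kc > $km ) { $m = <$met> }     # advance met
        else {                                  # keys match: handle the pair
            handle_pair( $c, $m );
            $m = <$met>;                        # met may have many rows per key
        }
    }

    sub handle_pair {
        my ( $cell_rec, $met_rec ) = @_;
        # real per-pair work goes here
    }

Each file is read exactly once, front to back, and nothing is ever searched.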

The thing that is eating your lunch here is a familiar one: thousands (or millions) of repetitive searches. “Come with me back to the days when more data than this was processed using much less computer power than you will find today in a dishwasher. It was not a world of hard drives: it was a world of magnetic tapes.” The procedures that were adopted out of necessity back then are just as efficient today.

The payoff for your efforts can be ... stunning.

Replies are listed 'Best First'.
Re^2: speeding up row by row lookup in a large db
by punkish (Priest) on Mar 22, 2009 at 02:32 UTC
    Appreciate the advice, sundialsvc4. The data actually are sorted. I figured out the problem with the help of the most excellent Devel::NYTProf. The problem was in pulling the met data, which is 7300 rows by 9 columns for each cell. This transaction was taking about 28 to 29 ms. I have to figure out how to either compress this data or represent it in some other way so that it gets pulled out quickly.
    --

    when small people start casting long shadows, it is time to go to bed

      Was the cell table composed of 100,000 or 1,000,000 rows? If 100,000, then most assuredly keep it in memory as a hash (i.e., read the whole thing only once). That might not be a bad idea even if it's a million. And obviously do the same with the tiny lc and dist tables.
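
      A minimal sketch of reading the whole cell table into a hash once, up front; the database file, table, and column names are guesses at the schema:

          use strict;
          use warnings;
          use DBI;

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=study.db', '', '',
                                  { RaiseError => 1 } );

          # read the cell table exactly once, keyed by cell id
          my %cell;
          my $sth = $dbh->prepare('SELECT cell_id, lat, lon FROM cell');
          $sth->execute;
          while ( my $row = $sth->fetchrow_arrayref ) {
              my ( $id, @attrs ) = @$row;
              $cell{$id} = [@attrs];
          }

          # later, inside the per-row loop, this replaces a SQL round trip:
          # my $attrs = $cell{$some_cell_id};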

      Consider storing the data in each table as a blob, but first pack it, and blob the packed entity. IIRC, sqlite keeps data as strings, so it's more voluminous than the actual data. And then you have the cost of all those atoi's to decode it.
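
      A minimal sketch of the pack-then-BLOB idea, packing 9 native doubles per row into one scalar; the met_blob table and its columns are assumptions:

          use strict;
          use warnings;
          use DBI qw(:sql_types);

          my $dbh = DBI->connect( 'dbi:SQLite:dbname=study.db', '', '',
                                  { RaiseError => 1 } );

          sub store_met {
              my ( $set_id, $rows ) = @_;   # $rows: ref to array of 9-element rows
              my $blob = pack 'd*', map { @$_ } @$rows;
              my $sth  = $dbh->prepare(
                  'INSERT INTO met_blob (set_id, data) VALUES (?, ?)');
              $sth->bind_param( 1, $set_id );
              $sth->bind_param( 2, $blob, SQL_BLOB );
              $sth->execute;
          }

          sub fetch_met {
              my ($set_id) = @_;
              my ($blob) = $dbh->selectrow_array(
                  'SELECT data FROM met_blob WHERE set_id = ?', undef, $set_id );
              my @flat = unpack 'd*', $blob;
              my @rows;
              push @rows, [ splice @flat, 0, 9 ] while @flat;   # regroup into 9-wide rows
              return \@rows;
          }

      Packing avoids both the string storage overhead and the per-value numeric conversions on the way back out.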

      And if your data can be organized by lat-long, consider a small database for each lat-long region.
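
      A minimal sketch of routing to one SQLite file per lat/long tile; the 10-degree tile size and file-naming scheme are assumptions:

          use strict;
          use warnings;
          use DBI;
          use POSIX qw(floor);

          my %dbh_for_tile;   # cache of open handles, one per region file

          sub dbh_for {
              my ( $lat, $lon ) = @_;
              my $tile = sprintf 'tile_%d_%d',
                         floor( $lat / 10 ), floor( $lon / 10 );
              $dbh_for_tile{$tile} ||= DBI->connect(
                  "dbi:SQLite:dbname=$tile.db", '', '', { RaiseError => 1 } );
              return $dbh_for_tile{$tile};
          }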

      Also, consider the CPAN module Cache::Memcached.
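
      A minimal sketch of fronting the per-set lookup with Cache::Memcached; the server address, key scheme, and the fetch_met_from_db stub are hypothetical:

          use strict;
          use warnings;
          use Cache::Memcached;

          my $memd = Cache::Memcached->new( { servers => ['127.0.0.1:11211'] } );

          sub fetch_met_from_db {                    # stand-in for the real SQLite fetch
              my ($set_id) = @_;
              return [ map { [ (0) x 9 ] } 1 .. 7300 ];   # placeholder data
          }

          sub get_met_cached {
              my ($set_id) = @_;
              my $key  = "met:$set_id";
              my $data = $memd->get($key);           # hit: already deserialized
              unless ($data) {
                  $data = fetch_met_from_db($set_id);    # miss: go to the database
                  $memd->set( $key, $data );             # references frozen via Storable
              }
              return $data;
          }

      With only 400 distinct weather sets, most lookups should become cache hits after warm-up.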

        So, the cell table has 1 million rows. Each cell has (in the current study) a table of weather data that is 7300 rows by 9 columns. There are 400 sets of weather data shared among the cells, so there is some level of normalization.

        I serialized and compressed the weather data (using Storable via Data::Serializer) and stored each set as a BLOB. Unfortunately, the cost of extracting the BLOB, de-serializing it, and uncompressing it is still high... about 28 ms per set, which is really slow. In fact, it is quicker to simply pull out 7300 rows from the table than to pull out one BLOB row and unravel it.
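
        For reference, a minimal sketch of the serialize/compress/BLOB round trip described above; the serializer settings are assumptions about the original setup:

            use strict;
            use warnings;
            use Data::Serializer;

            my $ser = Data::Serializer->new(
                serializer => 'Storable',
                compress   => 1,             # requires Compress::Zlib
            );

            # placeholder for one 7300-row by 9-column weather set
            my @met_rows = map { [ (0) x 9 ] } 1 .. 7300;

            # storing: flatten the structure into one string for the BLOB column
            my $blob = $ser->serialize( \@met_rows );

            # retrieving: the step measured at roughly 28 ms per set
            my $met = $ser->deserialize($blob);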

        I will look at Cache::Memcached.

        --

        when small people start casting long shadows, it is time to go to bed
