Strategy for managing a very large database with Perl

punkish has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Strategy for managing a very large database with Perl (Video) by BrowserUk (Patriarch) on Jun 18, 2010 at 08:33 UTC
I'd store it as video. Bear with me. If we take a look at your table definition, we can exclude (for now) t_id and point_id as they are just a part of the RDBMS data representation rather than the actual data. Year and day effectively denote 'frame'. That leaves 3 ints and 3 reals. Assuming (I know foolish), that 2 of those reals represent X & Y coordinates. We might also think of the 3 integers as being roughly equivalent to RGB. and the last as gamma (or alpha). 20 years of days is just over 7000 frames, which at say 30 frames per second equates to a video clip of just over 4 minutes. 100 billion rows / 7000 frames (assuming a roughly 16/9 frame shape) gives roughly a 6000 x 2500 frame size. That's approximately 10x the size of an HD video frame. So 40 minutes of HD movie. And 1 hour of HD movie is about 1 GB to download. Now, that's mostly an academic exercise as whilst video codec compression is highly developed, it is lossy, and so not really much good for academic data. However, it could lead to thinking about a different way of storing your data that might both reduce the storage requirements and lend itself to the type of queries that will be made of it. For example, if your data can be represented by a (say 6000 x 2500) 2D array of 3 ints and 1 float per point per day. Then that equates to a 6e3 x 2.5e3 x 16-bytes frame per day. Or 228 MB raw. * 20 years (7000) = 1.6 TB. Much less than your 10 TB projection. Where has the savings come from? First we've moved the year/day information out of every record. It's now just stored once in the name of the file. We've also moved the X & Y coordinates out of every record as they are now implicit in the position of the data within the file. And your typical queries are straight forward: "the values of a <variable> for a given point* or a given set of points for a particular year+yday combination.", becomes: locate the appropriate file (date), and a direct read of 16 bytes from a directly calculated offset for each point or given set of points. No data loss, just a more compact representation that lends itself directly to the types of queries anticipated. Of course, once you store the data this way, you can then use some other video techniques to compress it further--without loss*. You might for example store a "full frame" for the first day of every month, and then a set of { X,Y new value(s) } changes per day in between. This might make sense if only a small subset of the full frame is likely to change on any given day. (Much like video phones transmit change sets that are applied to periodic full frames.) The full dataset for any given date can quickly be reconstructed (losslessly) by grabbing the preceding "full frame" and then applying the change sets to bring it up to date. And if the particular query being satisfied only applies to a small area of the full frame then pre-filtering the change sets speeds the processing. Finally, if the 3 ints and the real for each X,Y position are compatible with representation as RGB & gamma (or one of the other color encodings), then displaying (sub-areas of) the frames as a movie would be a powerful way of visualising the changes in the datasets over time. Just a thought that normalised relational forms are often not the best way of storing and manipulating datasets. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply]
Re^2: Strategy for managing a very large database with Perl (Video) by punkish (Priest) on Jun 18, 2010 at 12:52 UTC
BrowerUk> I'd store it as video. Bear with me. First, a wonderfully whacked out suggestion. I wouldn't expect any less from you, BrowserUk, as you always come up with an intriguing possibility. A few specific responses: First, a series of video frames is not that outlandish, actually. A video frame is nothing but a series of 2D images that flutter by our eyes at 30 fps. So, it is akin to storing a bunch of 2D images. Doing locational searches is, as you suggested, finding the correct file, and then finding the subset from that file. But, these would be a lot of very large images... a pretty bad combination. My dataset's 2D dims are 4,587 X 2,889 = 13,251,843 points. 23 years is 23 * 365 (or 366) = 8395 (or 8418) images. Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval. See my response to moritz at Re^2: Strategy for managing a very large database with Perl. What am I optimizing for? Space? I don't really care about space, because disk drive is a lot cheaper than my time. Integrity? Once loaded, the data are read-only. Ease and speed of retrieval? Yes and Yes. I only care about how easily and quickly I can get data that I want. Say, I want to create an image of variable a over an area for a particular day. That is a simple db query `SELECT var FROM table WHERE year = ? AND yday = ?`. Note: A columnar database such as Vertica or MonetDB might be very speed efficient for these queries, but those db lack spatial searches, and suffer from the lack of ease aspect. Nevertheless, your idea is very intriguing, and I am going to do a few tests with array storage. Of course, as mentioned in an earlier posting, the data are already in NetCDF, an array storage format, so I have to do something new that overcomes the shortcomings of the current format. The main shortcoming of the current format is the inability to do spatial searches for arbitrary spatial bounds. Thanks again, for a wonderful response. -- when small people start casting long shadows, it is time to go to bed	[reply] [d/l]
Re^3: Strategy for managing a very large database with Perl (Video) by BrowserUk (Patriarch) on Jun 18, 2010 at 14:45 UTC
Plus, each "pixel" in the image is an array of my 6 or 7 values. So, we are back to complexity in retrieval. I don't know why you say that? The code below plucks a rectangle of data points of a specified size from a specified year/day "image". I dummied up two days worth of data files: `C:\test>dir tmp 18/06/2010 14:28 371,260,960 2010.168.bin 18/06/2010 14:22 371,260,960 2010.169.bin 2 File(s) 742,521,920 bytes` [download] And this shows the code plucking 10 x 10 x 7 datasets from various positions within each of those files (with the output redirected for clarity). The code is just a little math, a read and an unpack--most of the posted code is just parsing the arguments and formatting the output and timing: for /l %y in (0,500,2500) do @845309 2010 169 2293:9 %y:9 >nul [2010 169 2293:9 0:9] Took 0.020 seconds [2010 169 2293:9 500:9] Took 0.017 seconds [2010 169 2293:9 1000:9] Took 0.017 seconds [2010 169 2293:9 1500:9] Took 0.017 seconds [2010 169 2293:9 2000:9] Took 0.019 seconds [2010 169 2293:9 2500:9] Took 0.017 seconds for /l %y in (0,500,2500) do @845309 2010 168 2293:9 %y:9 >nul [2010 168 2293:9 0:9] Took 0.021 seconds [2010 168 2293:9 500:9] Took 0.017 seconds [2010 168 2293:9 1000:9] Took 0.017 seconds [2010 168 2293:9 1500:9] Took 0.066 seconds [2010 168 2293:9 2000:9] Took 0.023 seconds [2010 168 2293:9 2500:9] Took 0.017 seconds [download] And here 100 x 100 x 7 data points. Very linear as expected. for /l %y in (0,500,2500) do @845309 2010 169 2293:99 %y:99 >nul [2010 169 2293:99 0:99] Took 0.115 seconds [2010 169 2293:99 500:99] Took 0.115 seconds [2010 169 2293:99 1000:99] Took 0.117 seconds [2010 169 2293:99 1500:99] Took 0.116 seconds [2010 169 2293:99 2000:99] Took 0.115 seconds [2010 169 2293:99 2500:99] Took 0.116 seconds for /l %y in (0,500,2500) do @845309 2010 168 2293:99 %y:99 >nul [2010 168 2293:99 0:99] Took 0.125 seconds [2010 168 2293:99 500:99] Took 0.116 seconds [2010 168 2293:99 1000:99] Took 0.114 seconds [2010 168 2293:99 1500:99] Took 0.115 seconds [2010 168 2293:99 2000:99] Took 0.115 seconds [2010 168 2293:99 2500:99] Took 0.115 seconds [download] So, very simple code and very fast. And the entire uncompressed dataset (23 * 365.25 * 354MB) = < 3TB. With compression, that could be as little as 1.3 TB. Though you'd have a pay the price for unpacking--~30 seconds per file. `18/06/2010 14:28 237,173,932 2010.168.bin.gz 18/06/2010 14:22 175,868,626 2010.169.bin.bz2` [download] But the main point of partitioning your dataset this way is that you reduce the search space to 1/8th of 1% as soon as you specify the year/day. And there is no searching of indexes involved in the rest of the query. Just a simple calculation and a direct seek. Anyway, t'is your data and your employers money :) The (flawed) test demo code. #! perl -slw use strict; use Time::HiRes qw[ time ]; use constant { XMAX => 4587, YMAX => 2889, REC_SIZE => 7 * 4, }; my( $year, $day, $xRange, $yRange ) = @ARGV; my( $xStart, $xEnd ) = split ':', $xRange $xEnd += $xStart; my( $yStart, $yEnd ) = split ':', $yRange; $yEnd += $yStart; my $start = time; open BIN, '<:perlio', "tmp/$year.$day.bin" or die $!; binmode BIN; my $xLen = ( $xEnd - $xStart + 1 ) * REC_SIZE; for my $y ( $yStart .. $yEnd ) { my $pos = ( $y * XMAX * REC_SIZE ) + $xStart * REC_SIZE; seek BIN, $pos, 0; my $read = sysread( BIN, my $rec, $xLen ) or die $!; my @recs = unpack '(A28)*', $rec; for my $x ( $xStart .. $xEnd ) { my( $a, $b, $c, $d, $e, $f, $g ) = unpack 'N7', $recs[ $x - $x +Start ]; printf "%4d.%03d : %10u %10u %10u %10u %10u %10u %10u\n", $year, $day, $a, $b, $c, $d, $e, $f, $g//0; } } close BIN; printf STDERR "[@ARGV] Took %.3f seconds\n", time() - $start; [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP an inspiration; A true Folk's Guy	[reply] [d/l] [select]
Re^4: Strategy for managing a very large database with Perl (Video) by punkish (Priest) on Jun 18, 2010 at 15:05 UTC
Re^5: Strategy for managing a very large database with Perl (Video) by BrowserUk (Patriarch) on Jun 18, 2010 at 15:45 UTC
Re^4: Strategy for managing a very large database with Perl (Video) by Corion (Patriarch) on Jun 18, 2010 at 16:06 UTC
Re^5: Strategy for managing a very large database with Perl (Video) by BrowserUk (Patriarch) on Jun 18, 2010 at 16:22 UTC
Re: Strategy for managing a very large database with Perl by dws (Chancellor) on Jun 18, 2010 at 05:55 UTC
A typical search would be to retrieve all the values of a <variable> for or for a particular year+yday combination. If those are the typical search patterns, is a relational database the right hammer to reach for? Why not flat files? Appending to a flat file is a quick operation. If you keep one flat file per day, scanning for the subset of records for that day is trivial. If you have to do a full scan through all the data, one file per day is also easy, and allows the scan to be parallelized if you ever need to throw more machines at the problem. 10Tb over 20 years is 500Mb a year, or very roughly 1.5Mb a day. That's not an unreasonable size for a flat file.	[reply]
Re^2: Strategy for managing a very large database with Perl by punkish (Priest) on Jun 18, 2010 at 12:13 UTC
punkish>> A typical search would be to retrieve all the values punkish>> of for or <for a set of points> punkish>> <for a particular year+yday combination>. dws> If those are the typical search patterns, Those are typical search patterns. There are a few other search patterns as well, so I have to find a format that can accommodate all of them. dws> is a relational database the right hammer to reach for? Dunno. Hence this thread. dws> Why not flat files? Well, the data are already in some kind of flat files. Actually, they are in files, but not flat files. They are in NetCDF format. The format is very compact, and very suitable for storing array of arrays, however, it is not very amenable to spatial searches. Hence, my desire to put the data in a database. As shown in my original post, the db table is very simple, almost like a flat file (with a little overhead per "line" added by the db). However, it allows me to JOIN this simple table with a location table. I can do a spatial search on the location table, retrieve the value of points that fall within my search, and then retrieve the variables from this simple but mongo table. -- when small people start casting long shadows, it is time to go to bed	[reply]
Re: Strategy for managing a very large database with Perl by moritz (Cardinal) on Jun 18, 2010 at 06:34 UTC
To me, partitioning by year sounds like a good idea.	[reply]
Re^2: Strategy for managing a very large database with Perl by punkish (Priest) on Jun 18, 2010 at 12:29 UTC
moritz> To me, partitioning by year sounds like a good idea. Thanks. I might try that. That would result in 23 tables, each with 4_850_174_538 rows by however many columns. That is still almost 5 billion rows, but better than 115 billion. Now, here is where some db chops are needed. Is a db really inefficient managing 115 billion rows as opposed to 5 billion? Well, yes, for certain operations it can be so. But, as far as I understand, a db manages data in pages, so, it is analogous to reading data from file line by line, except, each line, in this case, is a page. The db knows which page to go to with the help of indexes. So, it doesn't matter how big the table is. The db goes to the right page to grab the data. In this way, it is analogous to knowing the correct file offset to go to the right place in the file to read the data. Sure, I could implement all that crap myself, but the db has already figured it out. Since my data are pretty much read-only, what do I gain by tinkering with the rather simple scheme I have right now? I can optimize for space, but 10 TB doesn't sound very big, especially when some of my colleagues are deploying 100s of TB for satellite imagery. I can optimize for rows, and while 115 billion is indeed a very large number, does it affect my operation? In the end, I want to optimize for ease and speed of retrieval, so that is where I want to concentrate on. The partitioning docs indicate that that cause might be helped by virtue of having smaller INDEXes for specific searches. Thanks for the suggestion. -- when small people start casting long shadows, it is time to go to bed	[reply]
Re: Strategy for managing a very large database with Perl by JavaFan (Canon) on Jun 18, 2010 at 07:15 UTC
Considering your title, I wonder, would you have asked different questions, or expect a different answer, if you were going to manage the database with Python? Or C? Or something else? I really think you'd get much better answers in a database forum than here.	[reply]
Re^2: Strategy for managing a very large database with Perl by punkish (Priest) on Jun 18, 2010 at 12:21 UTC
JavaFan> Considering your title, I wonder, would you have JavaFan> asked different questions, or expect a different answer, JavaFan> if you were going to manage the database with Python? JavaFan> Or C? Or something else? Don't know if would have asked a different question if managing the db with Python or C, because I would probably have had a different mindset. JavaFan> I really think you'd get much better answers in a JavaFan> database forum than here. You are possibly correct, but there are db wizards here on Perlmonks, and they are Perl-friendly and knowledgable, so I wouldn't be surprised if they came up with good suggestions. They already have. See BrowserUk's wonderfully whacked out suggestion: Re: Strategy for managing a very large database with Perl (Video). That is why I love Perlmonks. -- when small people start casting long shadows, it is time to go to bed	[reply]
Re: Strategy for managing a very large database with Perl by Fletch (Bishop) on Jun 18, 2010 at 14:06 UTC
TMTOWTDI, and sometimes they don't involve Perl: this size data set is getting into the realm where something like Pig might be more suitable. The cake is a lie. The cake is a lie. The cake is a lie.	[reply]
Re: Strategy for managing a very large database with Perl by metaperl (Curate) on Jun 18, 2010 at 16:18 UTC
Have you considered any of the NOSQL databases, such as Redis? Also, since you are using Postgres, maybe ask on their user lists... the guys who maintain the database often answer themselves. The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]