Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery

Re^4: Best way to store/access large dataset?

by stonecolddevin (Parson)
on Jun 22, 2018 at 19:44 UTC ( #1217252=note: print w/replies, xml ) Need Help??

in reply to Re^3: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

I'd be curious to see what you read that said Postgres wasn't recommended for lots of read operations. I don't think I've ever heard that before.

If you're deeply invested in mariadb, it's probably fine. mysql has a lot of pitfalls, but people use it in large scale cases all the time.

Regardless, my personal preference is Postgres. I don't think there would be an issues using it for high read volume or processing a large number of calculations, but it depends on what kind of traffic it's going to be taking. If it's a really specialized case, it's probably worth looking into some ETL (extract/transform/load) on AWS using EMR (Elastic MapReduce) and/or Athena.

The key things here are how much data you're dealing with, how many calculations you need to perform, and how resource intensive those calculations are. I think Postgres will be just fine up to several million rows but if you're doing a ton of joining it might get hairy and be better to spread the work out a bit.

Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

  • Comment on Re^4: Best way to store/access large dataset?

Replies are listed 'Best First'.
Re^5: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 22, 2018 at 20:07 UTC

    I did a quick google on Postgres pro's and cons. I did just read further in the comments that it's solvable by "adding connection pooling on front."

    I'm open to anything really. The boss didn't wanna pony up the cash to get someone in here that could make solid we're just winging it! One of my colleagues is familiar with Mariadb, so we went with it.

    The database holds environmental sample data. Each sample contains just over three million data points. For what I'm describing here, I have to pull just under three million of those points for around 200 samples worth of data. (200-300 should be the normal data load.) That initial pull of data will be around 1.8 billion calculations if the qualifiers are relatively simple. The qualifiers are user definable, so they could range from simple greater than/less than, to various combinations of percentages of different values from the database. Following that comes this script which will ultimately perform an additional ~49 million calculations on the summary table to find the unique attributes.(A chain of greater than less than qualifiers based on attribute count and category count for each attribute in each category.)

      So when you say "pull calculations", are you talking about performing calculations in the script or pulling data from the database? If you're doing several million/billion calculations against a datasource, it's probably better to try to do some map reducing in a parallel fashion using something like dynamodb and spark/emr. Pulling the rows won't be so hard but having the database crunch a bunch of numbers gets hairy if it's not optimized as such.

      Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

        That's where we aren't sure what to do really. There are three values that correspond to an attribute. Lets say X,Y, and Z. There are a series of qualifiers that have to be met: Lets say Q1, Q2, and Q3. So Q1 will be X has to be greater than Y. Q2 will be X-Y has to be greater than Z. And Q3 will be (X*Y)/Z needs to be greater than (X-Y)*Z. All three of these conditions need to be met in order to say the attribute is present. A binary value of 1.

        I just made those up, so they may not even make sense. But it should illustrate the point.

        Then each group of data I am interested in looking at has a million attributes, and I'm interested in comparing 200 or so of these groups at a time. So I have to either incorporate those qualifiers in a select statement of some sort, or incorporate them into a script that manipulates those raw database values after they are selected. **It would be pertinent to note that the groups of 200 change, and so do the qualifiers.

        Once the binary data sets are created, then those groups have to be evaluated by their assigned category. (A static value assigned to the dataset in the database.) That evaluation is the other part of my question in this thread. How to group the categories, and look for unique attributes.

        Again, the database doesn't exist yet, so I'm working from tab delimited binary datasets where I have already processed the qualifiers.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1217252]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (6)
As of 2021-04-14 05:57 GMT
Find Nodes?
    Voting Booth?

    No recent polls found