
Re^6: Best way to store/access large dataset?

by stonecolddevin (Parson)
on Jun 26, 2018 at 18:47 UTC (#1217453)

in reply to Re^5: Best way to store/access large dataset?
in thread Best way to store/access large dataset?

So when you say "pull calculations", are you talking about performing calculations in the script or pulling data from the database? If you're doing several million or billion calculations against a data source, it's probably better to do some map-reduce work in parallel using something like DynamoDB and Spark/EMR. Pulling the rows won't be so hard, but having the database crunch a bunch of numbers gets hairy if it's not optimized for that.

Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

Re^7: Best way to store/access large dataset?
by Speed_Freak (Sexton) on Jun 26, 2018 at 22:11 UTC

    That's where we aren't really sure what to do. There are three values that correspond to an attribute. Let's say X, Y, and Z. There is a series of qualifiers that have to be met: let's say Q1, Q2, and Q3. Q1 will be that X has to be greater than Y. Q2 will be that X-Y has to be greater than Z. And Q3 will be that (X*Y)/Z needs to be greater than (X-Y)*Z. All three of these conditions need to be met in order to say the attribute is present, which is recorded as a binary value of 1.

    I just made those up, so they may not even make sense. But it should illustrate the point.
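A minimal Python sketch of the three made-up qualifiers above. The names `QUALIFIERS` and `attribute_present` are hypothetical, and treating Z = 0 as "not present" is an assumption the post does not address. The qualifiers live in a list because they are expected to change between runs.

```python
# Hypothetical qualifier list: each entry takes (x, y, z) and returns True/False.
QUALIFIERS = [
    lambda x, y, z: x > y,                                  # Q1
    lambda x, y, z: (x - y) > z,                            # Q2
    # Q3; assumption: a zero Z fails the qualifier instead of raising an error.
    lambda x, y, z: z != 0 and (x * y) / z > (x - y) * z,
]


def attribute_present(x, y, z, qualifiers=QUALIFIERS):
    """Return 1 if every qualifier holds for this attribute, else 0."""
    return int(all(q(x, y, z) for q in qualifiers))


# Three illustrative (X, Y, Z) rows; the values are invented for the example.
rows = [(10.0, 2.0, 1.0), (5.0, 6.0, 1.0), (10.0, 2.0, 9.0)]
flags = [attribute_present(x, y, z) for x, y, z in rows]
print(flags)
```

Swapping the qualifiers then only means passing a different list; the loop over the rows stays the same.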

    Then each group of data I am interested in looking at has a million attributes, and I'm interested in comparing 200 or so of these groups at a time. So I have to either incorporate those qualifiers in a select statement of some sort, or incorporate them into a script that manipulates those raw database values after they are selected. Note that the groups of 200 change, and so do the qualifiers.
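For the "select statement" option, here is a rough sketch using Python's built-in `sqlite3` module with an in-memory database. The table `measurements` and its columns are assumptions made up for illustration, since the real database does not exist yet. The qualifiers go into a `CASE` expression, and the set of datasets is passed as parameters so it can change per query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (dataset, attribute) with its three raw values.
conn.execute(
    "CREATE TABLE measurements ("
    "dataset_id INTEGER, attribute_id INTEGER, x REAL, y REAL, z REAL)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 10.0, 2.0, 1.0),
        (1, 2, 5.0, 6.0, 1.0),
        (2, 1, 10.0, 2.0, 9.0),
        (2, 2, 12.0, 3.0, 1.0),
    ],
)

# The qualifiers are evaluated by the database; a zero z yields NULL in SQLite,
# which falls through to the ELSE branch and is reported as 0.
query = """
SELECT dataset_id, attribute_id,
       CASE WHEN x > y
             AND (x - y) > z
             AND (x * y) / z > (x - y) * z
            THEN 1 ELSE 0 END AS present
FROM measurements
WHERE dataset_id IN (?, ?)
ORDER BY dataset_id, attribute_id
"""
result = conn.execute(query, (1, 2)).fetchall()
print(result)
```

Whether this beats doing the comparison in a script depends on the database and data volume; the sketch only shows that both placements of the qualifiers are possible.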

    Once the binary data sets are created, then those groups have to be evaluated by their assigned category. (A static value assigned to the dataset in the database.) That evaluation is the other part of my question in this thread. How to group the categories, and look for unique attributes.

    Again, the database doesn't exist yet, so I'm working from tab delimited binary datasets where I have already processed the qualifiers.
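One possible reading of "group the categories, and look for unique attributes", sketched in Python. The file layout (a header row, then dataset name, category, and one 0/1 column per attribute) is an assumption, as is the definition of "unique": present in at least one dataset of a category and in no dataset of any other category. The function name and sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented sample in the assumed tab-delimited layout.
SAMPLE = (
    "dataset\tcategory\ta1\ta2\ta3\ta4\n"
    "d1\tcircle\t1\t0\t1\t0\n"
    "d2\tcircle\t1\t1\t0\t0\n"
    "d3\tsquare\t0\t1\t0\t1\n"
    "d4\tsquare\t0\t1\t0\t0\n"
)


def unique_attributes(handle):
    """Map each category to the attributes present only in that category."""
    reader = csv.reader(handle, delimiter="\t")
    names = next(reader)[2:]          # attribute names from the header row
    present = defaultdict(set)        # category -> attributes flagged 1 anywhere
    for row in reader:
        category = row[1]
        present[category].update(
            name for name, flag in zip(names, row[2:]) if flag == "1"
        )
    unique = {}
    for category, attrs in present.items():
        # Union of everything seen in the other categories.
        others = set().union(*(a for c, a in present.items() if c != category))
        unique[category] = attrs - others
    return unique


result = unique_attributes(io.StringIO(SAMPLE))
print(result)
```

If "unique" should instead mean present in every dataset of a category, the per-category `update` would become a set intersection.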

      I think your best bet here is to ETL. I have a feeling you're going to want to query this stuff again at some point, and depending on what you're doing you may want to store your calculated results.
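As a rough illustration of the ETL idea (extract, transform, load) with stored results, here is a small Python sketch. The input layout, the `presence` table, and the `etl` function are all hypothetical, and SQLite stands in for whatever database ends up being used.

```python
import csv
import io
import sqlite3

# Invented raw input: one attribute per line with its X, Y, Z values.
RAW = "attribute_id\tx\ty\tz\n1\t10\t2\t1\n2\t5\t6\t1\n3\t12\t3\t1\n"


def etl(handle, conn, dataset_id):
    """Extract raw rows, apply the qualifiers, and load the 0/1 results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS presence ("
        "dataset_id INTEGER, attribute_id INTEGER, present INTEGER)"
    )
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):       # extract
        x, y, z = float(rec["x"]), float(rec["y"]), float(rec["z"])
        present = int(                                         # transform
            x > y
            and (x - y) > z
            and z != 0                                         # assumption: zero Z fails
            and (x * y) / z > (x - y) * z
        )
        rows.append((dataset_id, int(rec["attribute_id"]), present))
    with conn:                                                 # load, one transaction
        conn.executemany("INSERT INTO presence VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = etl(io.StringIO(RAW), conn, dataset_id=1)
stored = conn.execute(
    "SELECT attribute_id, present FROM presence ORDER BY attribute_id"
).fetchall()
print(loaded, stored)
```

Because the qualifiers change, stored results would need to be tagged with the qualifier set that produced them, or recomputed; the sketch leaves that out.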

      The question here will be how these attributes relate to the data. Is it a tree? Is it one to many? Is it many to many? Does each record have a million attributes, or is a group determined by a certain delimiter and that group has a total of a million attributes?

      If you're looking at a tree data structure, it's probably best to either use Postgres' JSON options or something like DynamoDB or a database built specifically for handling tree data (Neo4j was the last one I knew of, but I'm sure others have come about). If it's one row to a million attributes, you probably want to look into some sort of partitioning. This still probably calls for something like DynamoDB, since joining that amount of data is going to be a nightmare. Someone like erix can correct me, but my personal experience with large amounts of data that require a lot of assembly/transformation has been to use DynamoDB/Cassandra + Storm/EMR+Spark.

      I'm not sure if this is all stuff you've considered yet, so please forgive me if I'm just reiterating what's been brought up previously. You might be able to get away with storing this all in an RDBMS and using an EMR cluster to perform the calculations and transformations if you can partition and do everything in memory, but I feel like you're still going to have trouble joining all that data together in a traditional RDBMS without getting clever.

        ETL? And there is a core group of files that will be repeatedly analyzed. But the overall sets change. So files can be added or removed from the calculations as needed.

        Each record has 3 "columns" of data with a million rows per column. There are a couple other static values that are a single value. I believe that is one to many? And the samples can be grouped by another singular static value stored with the record. (The shape identifier.)

        I'm pretty lost when it comes to the database stuff, so I'm going to point my colleagues here and see what they say honestly!
