PerlMonks
I think your best bet here is ETL. I have a feeling you're going to want to query this data again at some point, and depending on what you're doing you may want to store your calculated results.

The question is how these attributes relate to the data. Is it a tree? Is it one-to-many? Is it many-to-many? Does each record have a million attributes, or is a group determined by a certain delimiter, with that group having a million attributes in total? If you're looking at a tree data structure, it's probably best to use either Postgres' JSON options or something like DynamoDB, or a database built specifically for handling tree data (Neo4j was the last one I knew of, but I'm sure others have come about since). If it's one row to a million attributes, you probably want to look into some sort of partitioning. That still probably calls for something like DynamoDB, since joining that amount of data is going to be a nightmare.

Someone like erix can correct me, but my personal experience with large amounts of data that require a lot of assembly/transformation has been to use Dynamo/Cassandra plus Storm, or EMR plus Spark. I'm not sure if this is all stuff you've considered yet, so please forgive me if I'm just reiterating what's been brought up previously. You might be able to get away with storing this all in an RDBMS and using an EMR cluster to perform the calculations and transformations, if you can partition and do everything in memory, but I feel like you're still going to have trouble joining all that data together in a traditional RDBMS without getting clever.

Three thousand years of beautiful tradition, from Moses to Sandy Koufax, you're god damn right I'm living in the fucking past

In reply to Re^8: Best way to store/access large dataset?
by stonecolddevin
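As a rough illustration of the ETL step discussed above: if the attributes do form a tree, one common transform is to flatten each record into (path, value) rows before bulk-loading them into a table (or a JSONB column) rather than a table with a million columns. This is only a sketch in Python; the record shape and field names are hypothetical, not from the original thread.

```python
def flatten(node, prefix=""):
    """Flatten a nested attribute tree into a list of (dotted_path, value) rows.

    Each leaf becomes one row, suitable for loading as
    (record_id, attr_path, value) tuples in a later load step.
    """
    rows = []
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten(value, path))  # recurse into subtrees
        else:
            rows.append((path, value))         # leaf attribute
    return rows

# Hypothetical record for illustration only.
record = {"sensor": {"temp": 21.5, "gps": {"lat": 40.7, "lon": -74.0}}}
rows = flatten(record)
# rows → [("sensor.temp", 21.5), ("sensor.gps.lat", 40.7), ("sensor.gps.lon", -74.0)]
```

Stored this way, the attribute paths partition naturally (e.g. by path prefix or record id), which is the kind of layout a DynamoDB/Cassandra-style store or a partitioned Postgres table can handle without monster joins.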