Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

First of all, I would look at tuning the database, and pushing part of the work into the database. Databases are usually good at aggregation, for example, but depending on your key space size, aggregation or sorting may be prohibitive on your database server unless the indices needed for that already exist.

If working with the database is not a good option, I would write the query results to disk into one or more files, and simply launch the processing in parallel for each file. This approach relies on the fact that all your aggregators are symmetric. For example sum() and count() and max() are symmetric aggregators, because it doesn't matter in which order you visit the rows to find them.

If you need more complicated aggregators, like TOP 10 or a HAVING filter, things will require much more thought - Google Sawzall and MapReduce have produced some papers on symmetric aggregators for other than the trivial stuff.

I've used small SQLite databases to produce and store the intermediate results and ATTACHed the databases to create the final results from them. In my experience SQLite is not suitable for holding (and JOINing) data if the file size goes above 2GB, but that experience was with SQLite 2. Changes to the Btree backend may have improved that limit.

Without knowing about how the data is fetched from the database and how the restructuring is to be done, it's hard to suggest a proper optimization. Just be warned that for example some tied hashes do not like to have more than 4GB keys.


In reply to Re: Processing ~1 Trillion records by Corion
in thread Processing ~1 Trillion records by aossama

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (7)
As of 2022-08-16 07:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?