comment on

A trillion records? A trillion *bytes* is roughly 1TB. Let's assume that your records are, on average, 32 bytes - they're probably bigger, but that doesn't really matter. So you need to read 32TB, process it, and write 32TB. I don't think it's at all unreasonable to take 16 days for 64TB of I/O.

As you've been told, you need to profile your code. You actually need to do that for any performance problem, not just this one.

I, of course, have not profiled your code, and so everything from here on is mere speculation, but I bet that you are I/O bound. You have at least three places where I/O may be limiting you. Reading from the database (especially if you've only got one database server); transmitting data across the network from the database to the machine your perl code is running on; and writing the CSV back out. At least the first two of those can be minimised by partitioning the data and the workload and parallelising everything across multiple machines. You *may* be able to partition the data such that you can have seperate workers producing CSV files too.

In reply to Re: Processing ~1 Trillion records by DrHyde
in thread Processing ~1 Trillion records by aossama

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


laziness, impatience, and hubris
	PerlMonks