Re^3: efficient perl code to count, rank

by Tanktalus (Canon)
on Jul 18, 2021 at 21:39 UTC


in reply to Re^2: efficient perl code to count, rank
in thread efficient perl code to count, rank

  • You don't need to read, pipe, write to disk. Many/most dbs have an import-from-csv option that can read from local disk directly. Yes, it's still a read/write operation, with the added overhead of any required indexing, but we've eliminated the pipe. Not that a local pipe is necessarily slow, but a 62GB reduction in memory copying is still a 62GB reduction in memory copying.
  • If you create the table with the proper indexing at the outset, the load will be a bit slower, but every operation after that will be significantly faster, and can work directly from disk. You don't need to sort anything if the index is already sorted, and you don't need to count anything if the db already knows the count. (A rough sketch of these first two points follows below this list.)
  • And, finally, the big one: the database will be significantly faster and better optimised for swapping intermediate values to disk and back than the kernel is. Leaving this to the kernel is an option, and it works, as you've found out, but what the kernel swaps to disk may not be exactly what you want swapped to disk. A database has an entire memory management unit for exactly this purpose: it reads what it needs from disk (whether tables or indexes), discards stuff, and saves intermediate information back to temporary space if necessary. Most likely, though, it won't even do that, but will read the data (possibly multiple times) from disk and discard anything it isn't using at the moment. This sounds slow, but the reality is that the original code is also doing this, just more so: with swapping, the kernel decides what gets written to disk, whereas a database would likely guess correctly more often. Continuing from there, if you have the right indexes at load time, the data the database needs to read may not be any of the data from the actual table; it may only be the indexes (which means loading less data per pass, and only the relevant data), which would let it read more at a time and possibly never read any piece of data twice. Maybe. But, with the right query, the database can figure this out.

    If the output data starts to get too big, the database can even persist it to disk to be further manipulated, in a specialised structure (that is, a database table), that is amenable to continued modifications as per all of the above caching and discarding and what have you, until it is done producing the output, and then it can return that data more or less straight from disk.
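
A rough sketch of the first two points, using SQLite via DBI purely for illustration (the table name, column names and file names are made up, and a real load would use the engine's native CSV import, e.g. the sqlite3 shell's .mode csv / .import, rather than row-by-row inserts from Perl):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Illustration only: "scores", "col1" and the file names are placeholders.
    my $dbh = DBI->connect( "dbi:SQLite:dbname=scores.db", "", "",
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do("CREATE TABLE IF NOT EXISTS scores (id INTEGER, col1 REAL)");

    # Index built at load time: filling the table is a bit slower, but later
    # queries on col1 can be answered from the index, without sorting in Perl.
    $dbh->do("CREATE INDEX IF NOT EXISTS idx_col1 ON scores (col1)");

    # Toy stand-in for the CSV import.
    my $ins = $dbh->prepare("INSERT INTO scores (id, col1) VALUES (?, ?)");
    $ins->execute( $_, rand() ) for 1 .. 10_000;
    $dbh->commit;

    # Nothing is sorted or counted in Perl; the engine walks the index.
    my ($total) = $dbh->selectrow_array("SELECT COUNT(*) FROM scores");
    my ($below) = $dbh->selectrow_array(
        "SELECT COUNT(*) FROM scores WHERE col1 < ?", undef, 0.5 );

    print "rows: $total, rows ranking below 0.5: $below\n";
    $dbh->disconnect;
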

Your basic problem is that you've long since outgrown your system's memory, and you're hitting kernel swapping. Other than that, the rest of your supposition is probably entirely correct. Dealing with this problem is non-trivial, and it is one of the things that actual databases are good at.

A couple of jobs ago, when I worked for one of the big-three commercial databases, I had access to systems with 256GB RAM. If I had needed to do what you are doing, back then on those machines, yeah, perl in-memory likely would have sufficed, and been faster than using the commercial db I had access to (as you rightfully point out, my solution comes with some overhead). But we all have to work within the constraints given to us, and if your system has a "paltry" 16GB RAM, that's your constraint, and you have to find an algorithm that takes it into account. Such algorithms are out there, and they've generally already been implemented by the database systems, so there's no need to reinvent that wheel; just re-use it.

Also, when your management likes the output you just produced, they're going to ask for more and more analytics. You just know it's going to happen. Throw-away code is very rarely thrown away. And then re-querying the database for more stuff is going to be entirely trivial.


Replies are listed 'Best First'.
Re^4: efficient perl code to count, rank
by haj (Vicar) on Jul 18, 2021 at 22:52 UTC

    Yeah, agreed on the database-can-read-CSV issue. That eliminates this overhead.

    But then, the code example of tybalt89 (I had prepared something very similar to run benchmarks) doesn't swap, regardless of how big the dataset is. Time is more or less linear with the number of records. My (not very up-to-date) system processes about 20000 records per minute, which means I wouldn't stand a chance of processing 14M records in four hours. NYTProf shows that most of the time goes into preparing and printing the output file. It doesn't even help a lot if the output goes to an SSD.
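
    In case anyone wants to reproduce the profiling, this is roughly how I run it (the script name is just a placeholder for the counting/ranking script):

        $ perl -d:NYTProf count_rank.pl data.csv    # writes ./nytprof.out
        $ nytprofhtml                               # turn it into an HTML report
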

    I wonder what indexing you would apply to the problem at hand? If you can provide an example, I'd be happy to run it against my SQLite or postgres server on the same system for comparison. I don't mind working with databases at all (how could I: I've been working as a product manager for database engines for some years). But in this case the suggestions to use a database (or MCE) all came with little concrete help for the OP and his program. tybalt89 and I found an actual performance issue which, when fixed, gives several orders of magnitude of acceleration. How much gain do you expect from switching to a database?

    How much familiarity with SQL and database functions do the database aficionados expect from the OP? Is this actually helping or is this saying "look how smart I am!"?

    Also, when your management likes the output you just produced, they're going to ask for more and more analytics.
    I can confirm that from my own experience. But then, management doesn't ask for a 260GB CSV file; they usually want "two or three slides". One of my most successful Perl programs fell into that category. The evaluation ran once per week for several years. It could have used a database, but it didn't. Actually, no one cared.
      Just a reminder: you were the first one to suggest that the OP needs more RAM; see Re: efficient perl code to count, rank

      Anything can be done with Perl, but search and sort operations that require Perl to keep all data in memory are usually more easily solved (read: out-of-the-box) with a DB.

      Otherwise they require re-implementing sophisticated algorithms to manually "swap" RAM and disk structures, which doesn't qualify as out-of-the-box for me.

      NB: But WHETHER the OP really needs such operations is still unclear!

      We are still speculating about what exactly he wanted to be ranked/sorted. (As demonstrated, sorting 14M entries in RAM with Perl is still feasible in under 2 minutes, but how does it scale with larger data?)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Yeah, the RAM problem is the one which becomes immediately apparent when looking at the code: hence my wording "a first guess". As has already been written in this thread (and demonstrated by tybalt89's code), it can be eliminated by working through the file line by line, so with a small change in the code the RAM problem no longer exists. Also, it has nothing to do with sorting; it's just the attempt to slurp a 62GB file into an array. In the followups to the article you quoted, "sorting" isn't even mentioned, because it is irrelevant.
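
        Not the OP's actual code, just a minimal sketch of that difference (the file name is a placeholder):

            use strict;
            use warnings;

            # Slurping: the whole 62GB file has to fit in RAM at once.
            # open my $fh, '<', 'data.csv' or die $!;
            # my @rows = <$fh>;                 # memory grows with the file

            # Line by line: memory stays roughly constant however big the file is.
            open my $fh, '<', 'data.csv' or die "data.csv: $!";
            while ( my $line = <$fh> ) {
                chomp $line;
                my @fields = split /,/, $line;
                # ... update per-column tallies here, then forget the line ...
            }
            close $fh;
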

        We are still speculating what exactly he wanted to be ranked/sorted.

        Looking at the code presented in the original posting should be considered an option. tybalt89 came up with the following explanation, which matches my own interpretation:

        You were doing the ranking sort for each column...
        I'm simply assuming that the OP's code performs the operation they want to be done, albeit inefficiently. In that code there is not one sort over 14M entries; there are a thousand sorts (one per column). The OP's code performs these 1000 sorts 14M times, and that's why it won't finish in time, even for small arrays.
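
        To make that concrete, an illustrative sketch (not the OP's code; @rows is just toy data standing in for the real rows):

            use strict;
            use warnings;

            my @rows  = map { [ map { rand() } 1 .. 5 ] } 1 .. 100;   # toy data
            my $ncols = @{ $rows[0] };

            # The pattern described above: the per-column sort is redone inside the
            # per-row loop, so the same work is repeated once per row.
            #
            # for my $row (@rows) {
            #     for my $c ( 0 .. $ncols - 1 ) {
            #         my @sorted = sort { $a <=> $b } map { $_->[$c] } @rows;
            #         # ... find $row->[$c] in @sorted to get its rank ...
            #     }
            # }

            # The fix: sort each column once, remember the ranks, then only look them up.
            my @rank_of;    # $rank_of[$c]{$value} = 1-based rank of $value in column $c
            for my $c ( 0 .. $ncols - 1 ) {
                my @sorted = sort { $a <=> $b } map { $_->[$c] } @rows;
                $rank_of[$c]{ $sorted[$_] } //= $_ + 1 for 0 .. $#sorted;
            }
            for my $row (@rows) {
                my @ranks = map { $rank_of[$_]{ $row->[$_] } } 0 .. $ncols - 1;
                print join( ",", @ranks ), "\n";
            }
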

        I hope that the Monks will eventually give tybalt89's article the ranking it deserves.
