Re: efficient perl code to count, rank

by Tanktalus (Canon)
on Jul 17, 2021 at 20:11 UTC ( [id://11135110] )


in reply to efficient perl code to count, rank

Simple rule: when you have data, think database. I don't care if it's mysql, postgres, Db2, or sqlite, but use a db.

In this particular case, especially, you'll probably find it runs faster to load the data, ask the db to give you the output, and discard the db than to try to do it all in memory. Databases are often optimised for exactly this type of data access, and they make much more efficient use of disk (time-wise especially) than swapping/paging does. SQLite will probably suffice, but if not, mysql and postgres are free, not difficult to install on most Linux distros, and not really much harder on Windows, even without WSL.

I know, you asked a perl question, and there's not a lick of perl in this answer. But the reality is that while you can do this in any Turing-complete language, you should still use the right tool for the job, and in this case a database is that right tool.
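A minimal sketch of that route with DBI and DBD::SQLite (the file name, the two columns, and the GROUP BY are invented here, since the OP's actual layout and ranking rules aren't reproduced in this node):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Assumed layout: a CSV with a name and a numeric score on each line.
    my $dbh = DBI->connect('dbi:SQLite:dbname=counts.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });
    $dbh->do('CREATE TABLE IF NOT EXISTS data (name TEXT, score REAL)');

    my $ins = $dbh->prepare('INSERT INTO data (name, score) VALUES (?, ?)');
    open my $fh, '<', 'input.csv' or die "input.csv: $!";
    while (<$fh>) {
        chomp;
        my ($name, $score) = split /,/;
        $ins->execute($name, $score);
    }
    close $fh;
    $dbh->commit;

    # Let the db do the counting and the ranking.
    my $sth = $dbh->prepare(
        'SELECT name, COUNT(*) AS n FROM data GROUP BY name ORDER BY n DESC');
    $sth->execute;
    while (my ($name, $n) = $sth->fetchrow_array) {
        print "$name\t$n\n";
    }
    $dbh->disconnect;

For 14M rows you'd batch the commits or use the sqlite3 shell's .import, but the shape of the solution stays the same.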

Re^2: efficient perl code to count, rank
by haj (Vicar) on Jul 17, 2021 at 20:35 UTC

    I doubt that a database will help here.

    This is a linear pass through a large dataset, and through the whole dataset: no query, no selection, no filter, no relations or anything where an index would help, no need for transactions or concurrent access. The hashes that collect the results are a good enough surrogate for an in-memory database (and they should fit into memory).
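    Roughly, that single pass looks like this (the key column and the output format are made up, since the OP's exact layout isn't reproduced in this node):

      use strict;
      use warnings;

      my %count;
      open my $fh, '<', 'input.csv' or die $!;
      while (<$fh>) {
          chomp;
          my ($key) = split /,/;   # assumed: the key sits in the first column
          $count{$key}++;          # one linear pass, everything stays in RAM
      }
      close $fh;

      # Only the (much smaller) result hash is sorted for the ranking.
      my $rank = 0;
      for my $key (sort { $count{$b} <=> $count{$a} } keys %count) {
          printf "%d\t%s\t%d\n", ++$rank, $key, $count{$key};
      }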

    A database comes with significant overhead: You need to INSERT 14 million records before you can even start. This means reading 62GB from disk, piping them through your database socket, and then the database writes them to disk again. And then you SELECT all of them, which means reading from disk and piping them through the database socket again. How would a database compensate for this?

      A database comes with significant overhead: You need to INSERT 14 million records before you can even start

      That's not necessarily true: one can (for instance, in postgres) use the SQL-machinery against a so-called foreign table, which uses a text file as its underlying data. That means no INSERT is necessary.

      Making a foreign table takes no time (after all, it's really just configuration), but of course any reading or sorting over many GBs will take approximately as long as in any other programming language. The advantage would be access via SQL. (BTW, I'm not saying such access via a database is useful for the OP; for him there would be the added overhead of learning this particular trick.)

      (And yes, I did try it out: retrieving a few values from a foreign table that sits on top of a 27GB CSV file (1250 columns, 2M rows), via

      SELECT column10, column20 FROM junk.broadcsv ORDER BY column100 DESC LIMIT 10

      took ~10 minutes on my old 8GB desktop.)
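      For reference, a rough sketch of creating such a foreign table through DBI/DBD::Pg (the connection details, schema, file path and column list are invented; file_fdw reads the file as seen by the server process, and installing the extension typically requires superuser rights):

        use strict;
        use warnings;
        use DBI;

        # Assumed connection details.
        my $dbh = DBI->connect('dbi:Pg:dbname=junkdb', 'postgres', '',
                               { RaiseError => 1 });

        $dbh->do('CREATE EXTENSION IF NOT EXISTS file_fdw');
        $dbh->do('CREATE SCHEMA IF NOT EXISTS junk');
        $dbh->do('CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw');
        $dbh->do(q{
            CREATE FOREIGN TABLE junk.broadcsv (
                column1 TEXT,
                column2 TEXT
                -- ... one declaration per CSV column ...
            )
            SERVER csv_srv
            OPTIONS (filename '/data/broad.csv', format 'csv', header 'true')
        });

        # From here on, the SELECT shown above runs against junk.broadcsv
        # as if it were an ordinary (read-only) table.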

        one can (for instance, in postgres) use the SQL-machinery against a so-called foreign table,

        Fair enough! That would be especially convenient if the database offers some aggregation functions to perform the logic of the Perl code, so that you wouldn't even need to SELECT all the stuff back. That's beyond my postgres-fu, though.

      • You don't need to read, pipe, and write to disk as separate steps. Many/most dbs have an import-from-CSV option that can read from local disk directly. Yes, it's still a read/write operation, with the added overhead of any required indexing, but we've eliminated the pipe. Not that a local pipe is necessarily slow, but a 62GB reduction in memory copying is still a 62GB reduction in memory copying.
      • If you create the table with the proper indexing at the outset, which makes the load a bit slower, then all operations after that will be significantly faster and can run directly from disk. You don't need to sort anything if the index is already sorted, and you don't need to count anything if the db already knows the count (a sketch of this follows below this list).
      • And, finally, the big one: the database will be significantly faster and better optimised for swapping intermediate values to disk and back than the kernel is. Leaving this to the kernel is an option, and it works, as you've found out, but what the kernel swaps to disk may not be exactly what you want swapped to disk. A database has an entire memory-management layer for exactly this purpose: it reads what it needs from disk (whether tables or indexes), discards what it doesn't, and saves intermediate information back to temporary space if necessary. Most likely, though, it won't even do that; it will read all of the data, possibly multiple times, from disk, and discard anything it isn't using at the current moment. That sounds slow, but the original code is doing the same thing, only more so: there, the kernel decides what gets written to disk, whereas a database would likely guess correctly more often. Continuing from there: with the right indexes at load time, the data the database needs to read may not be any of the data from the actual table at all, only the indexes (which means loading less data per pass, and only the relevant data), which would let it load more at a time and possibly never read any piece of data twice. Maybe. But, with the right query, the database can figure this out.

        If the output data starts to get too big, the database can even persist it to disk to be further manipulated, in a specialised structure (that is, a database table) that is amenable to continued modification, as per all of the above caching and discarding and what have you, until it is done producing the output, at which point it can return that data more or less straight from disk.
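      A sketch of the indexing point above, again with SQLite through DBI (the table, columns and file name are invented; the idea is that the GROUP BY can be answered from the index alone, without touching the table):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=ranks.db', '', '',
                               { RaiseError => 1 });

        # Index created up front: the load gets a bit slower, but the
        # aggregation below becomes a covering-index scan.
        $dbh->do('CREATE TABLE IF NOT EXISTS data (name TEXT, score REAL)');
        $dbh->do('CREATE INDEX IF NOT EXISTS data_name_idx ON data (name)');

        # ... bulk load here, e.g. prepared INSERTs in one transaction,
        #     or the sqlite3 shell's ".mode csv" / ".import" commands ...

        # COUNT(*) per name comes off the index; nothing is sorted in
        # Perl, and only the (small) aggregated result is fetched back.
        my $rows = $dbh->selectall_arrayref(
            'SELECT name, COUNT(*) AS n FROM data GROUP BY name ORDER BY n DESC');
        printf "%s\t%d\n", @$_ for @$rows;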

      Your basic problem is that you've long since exceeded your system's memory, and you're hitting kernel swapping. Other than that, the rest of your supposition is probably entirely correct. Dealing with this problem is non-trivial, and it is one of the things that actual databases are good at.

      A couple of jobs ago, when I worked for one of the big-three commercial databases, I had access to systems with 256GB of RAM. If I had needed to do what you are doing on those machines, then yes, perl in-memory would likely have sufficed, and been faster than using the commercial db I had access to (as you rightly point out, my solution comes with some overhead). But we all have to work within the constraints given to us, and if your system has a "paltry" 16GB of RAM, that's your constraint, and you have to find an algorithm that takes it into account. Such algorithms exist, and they've generally already been implemented by the database systems, so there's no need to reinvent that wheel; just re-use it.

      Also, when your management likes the output you just produced, they're going to ask for more and more analytics. You just know it's going to happen. Throw-away code is very rarely thrown away. And then re-querying the database for more stuff is going to be entirely trivial.

        Yeah, agreed on the database-can-read-CSV issue. That eliminates this overhead.

        But then, the code example of tybalt98 (I had prepared something very similar to run benchmarks) doesn't swap, regardless of how big the dataset is. Time is more or less linear in the number of records. My (not very up-to-date) system processes about 20000 records per minute, which means I wouldn't stand a chance of processing 14M records in four hours. NYTProf shows that most of the time goes into preparing and printing the output file. It doesn't even help much if output goes to an SSD.

        I wonder what indexing you would apply to the problem at hand? If you can provide an example, I'd be happy to run it against my SQLite or postgres server on the same system for comparison. I don't mind working with databases at all (how could I: I've been working as a product manager for database engines for some years). But in this case the suggestions to use a database (or MCE) all came with little concrete help for the OP and his program. tybalt98 and I found an actual performance issue which, when fixed, gives several orders of magnitude acceleration. How much gain do you expect from switching to a database?

        How much familiarity with SQL and database functions do the database aficionados expect from the OP? Is this actually helping or is this saying "look how smart I am!"?

        Also, when your management likes the output you just produced, they're going to ask for more and more analytics.
        I can confirm that from my own experience. But then, management doesn't ask for a 260GB CSV file; they usually want "two or three slides". One of my most successful Perl programs fell into that category. The evaluation ran once per week for several years. It could have used a database, but it didn't. Actually, no one cared.
      > How would a database compensate for this?

      Databases use structures like B-trees for indexing and sorting, and they are pretty good at balancing these trees between RAM and disk.°

      They optimize the trade-off between memory and time.

      Perl is pretty bad at sorting linear data that is partly in RAM and partly on disk.

      (Well, you could tie an AoA to a file representing a table. Sorting that array would go without much RAM but would mean constant overhead with each access. This is certainly worth a shot, but as I said, DBs have this already optimized.)
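      Something along these lines, with Tie::File (the key column and cache size are made up; note that the list of indices still lives in RAM, and every comparison re-reads two lines from disk, which is exactly the per-access overhead mentioned above):

        use strict;
        use warnings;
        use Tie::File;

        # One record per line; cap Tie::File's read cache at ~100MB so
        # the bulk of the file stays on disk.
        tie my @records, 'Tie::File', 'input.csv', memory => 100 * 1024 * 1024
            or die "Cannot tie input.csv: $!";

        # Assumed sort key: the numeric value in the second CSV column.
        sub key_of { (split /,/, $_[0])[1] }

        # Sort indices rather than records: only the index list is held
        # in RAM, each comparison fetches its two lines through the tie.
        my @order = sort { key_of($records[$a]) <=> key_of($records[$b]) }
                    0 .. $#records;

        print "$records[$_]\n" for @order[0 .. 9];   # e.g. the ten smallest records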

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      update

      °) from WP B-tree

        Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as disks ...

        ... for the purpose of efficiently managing index pages for large random access files. The basic assumption was that indexes would be so voluminous that only small chunks of the tree could fit in main memory.

        > (Well, you could tie an AoA to a file representing a table. Sorting that array would go without much RAM but would mean constant overhead with each access. This is certainly worth a shot, but as I said, DBs have this already optimized.)

        I don't think this will work, because of the way merge sort is implemented in Perl.

        It first compares all pairs, breadth first, which would mean far too many disk operations and no sensible caching option:

        DB<7> sort { print "$a,$b |"; $a <=> $b } reverse 1..16
        16,15 |14,13 |12,11 |10,9 |8,7 |6,5 |4,3 |2,1 |15,13 |15,14 |11,9 |11,10 |13,9 |13,10 |13,11 |13,12 |7,5 |7,6 |3,1 |3,2 |5,1 |5,2 |5,3 |5,4 |9,1 |9,2 |9,3 |9,4 |9,5 |9,6 |9,7 |9,8 |
        DB<8>

        explained

        16,15 |14,13 |12,11 |10,9 |8,7 |6,5 |4,3 |2,1 |   # sort pairs
        15,13 |15,14 |11,9 |11,10                          # sort 1st and 2nd 4-tuple
        |13,9 |13,10 |13,11 |13,12                         # sort 1st 8-tuple
        |7,5 |7,6 |3,1 |3,2                                # sort 3rd and 4th 4-tuple
        |5,1 |5,2 |5,3 |5,4                                # sort 2nd 8-tuple
        |9,1 |9,2 |9,3 |9,4 |9,5 |9,6 |9,7 |9,8 |          # sort 16-tuple

        Caching/memoizing data read from disk would be far more efficient if the chunks were chosen strictly depth first.

        FWIW WP lists some divide-and-conquer approaches for merge-sort

        https://en.wikipedia.org/wiki/Mergesort#Parallel_merge_sort
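        For comparison, a rough sketch of a chunked (external) merge sort in Perl, where each record is read from disk only twice: once when its chunk is sorted in RAM, and once when the sorted chunks are merged (file name, chunk size and key column are invented):

          use strict;
          use warnings;
          use File::Temp qw(tempfile);

          my $chunk_size = 1_000_000;             # records per in-memory chunk
          sub key_of { (split /,/, $_[0])[1] }    # assumed numeric key, 2nd column

          # Phase 1: sort RAM-sized chunks and spill each to a temp file.
          my @chunks;
          open my $in, '<', 'input.csv' or die $!;
          until (eof $in) {
              my @buf;
              while (defined(my $line = <$in>)) {
                  chomp $line;
                  push @buf, $line;
                  last if @buf >= $chunk_size;
              }
              my ($fh) = tempfile(UNLINK => 1);
              print {$fh} "$_\n" for sort { key_of($a) <=> key_of($b) } @buf;
              seek $fh, 0, 0;
              push @chunks, $fh;
          }
          close $in;

          # Phase 2: k-way merge, keeping one current line per chunk in RAM.
          my @head;
          for my $fh (@chunks) {
              my $line = <$fh>;
              chomp $line if defined $line;
              push @head, $line;
          }
          while (grep { defined } @head) {
              my ($i) = sort { key_of($head[$a]) <=> key_of($head[$b]) }
                        grep { defined $head[$_] } 0 .. $#head;
              print "$head[$i]\n";
              my $next = readline $chunks[$i];
              chomp $next if defined $next;
              $head[$i] = $next;
          }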

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

