In-place sort with order assignment

by BrowserUk (Patriarch)
on Sep 19, 2010 at 14:12 UTC ( [id://860699] )

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

I have a fairly large hash (keys, no values), and I want to assign numeric values in the sort order of the keys. E.g. if the hash contains the keys 'A' .. 'F', I want to assign 1 to A, 2 to B, and so on.

For a small hash this is trivial. Sort the keys into an array, and then use that array in a hash slice to assign the values:

undef $hash{ $_ } for 'a' .. 'f';
@ordered = sort keys %hash;
@hash{ @ordered } = 1 .. @ordered;
pp \%hash;
# { a => 1, b => 2, c => 3, d => 4, e => 5, f => 6 }

But for a large hash--millions of keys--that will potentially push my memory consumption into swapping. If not during the sorting, then during the hash slice assignment.

Any thoughts on how this can be done without breaking the memory bank? (Speed is not a huge priority here.)


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: In-place sort with order assignment
by Corion (Patriarch) on Sep 19, 2010 at 14:29 UTC

    I don't see a really perlish way to do this, so my approach would be to repeatedly sort the data using heaps, first taking the bottom n elements, then the next n elements, and so on. This can be done relatively efficiently using heaps, except that you have to sort the data keys %hash / n times. I think you can optimize the second and subsequent passes by caching the items at offsets n, 2*n and so on, so you get to discard items that are "too high" earlier. I'm not sure whether not inserting items that will get discarded anyway will save you much time.
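
    A minimal core-Perl sketch of that chunked bottom-n idea, assuming %hash is already populated; the re-sorted @buf is a cheap stand-in for a real heap, and the chunk size $n is arbitrary (untested):

    my $n    = 10_000;     # how many keys to rank per pass
    my $rank = 0;
    my $bound;             # largest key already ranked

    while ($rank < keys %hash) {
        my @buf;                          # the next-smallest $n unranked keys
        keys %hash;                       # reset the each() iterator
        while (defined( my $k = each %hash )) {
            next if defined $bound && $k le $bound;    # already ranked
            next if @buf == $n && $k ge $buf[-1];      # too high for this pass
            @buf = sort @buf, $k;
            pop @buf if @buf > $n;
        }
        $hash{$_} = ++$rank for @buf;     # hand out the next block of ranks
        $bound = $buf[-1];
    }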

    Alternatively, you could check whether an external sort program can handle the data size; then I'd write out the keys using each and read them back in with while (<$fh>) { ... }.

    Which tradeoff is faster will really depend, because with the first approach, you're basically sorting your data multiple times, while with the second approach, your external sort program needs a strategy to sort the data.

Re: In-place sort with order assignment
by Limbic~Region (Chancellor) on Sep 19, 2010 at 14:21 UTC
    BrowserUk,
    In your example, you have created two large lists. Perhaps you could get away with just one:
    my $cnt = 0;
    for (sort keys %hash) {
        $hash{$_} = ++$cnt;
    }
    Now, assuming that doesn't work, you can take a multi-pass approach using each (which consumes almost no memory). Essentially, you have two data structures designed to maintain order (pick your poison). On the first pass, you keep track of the bottom N elements, leaving the 2nd data structure empty. On the second pass, you start recording the next bottom N elements while assigning the values of the first N. On the 3rd pass they swap functions. Wash, rinse, repeat.

    Cheers - L~R

      The problem is, the statement sort keys %hash has already created the two large lists: the input list and the output list. After that, the hash slice assignment is just reusing space acquired during the sort.

        It appears to me (perhaps wrongly so) that the problem reduces to: how to print all the keys of a hash without Perl making a new list of all the keys of that hash in order for the print to work? The assumption is that doubling the size of the storage for the key values would exceed physical memory.

        In a more general sense: how to call a sub for each hash entry as the hash table is traversed.

        If that can be done, then output all keys to a file (call a print routine for each entry). Call a system sort routine for that file. The Perl program may be paged out. Many files may be created and lots of memory may be consumed, but when that sort finishes, there is a file that is in hash key order and all the physical memory that system sort used is now free.

        Perl program reads that big file (millions of lines) and assigns sequential numbers to each entry.

        Why wouldn't that work?
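
        A minimal sketch of that flow, assuming %hash is already populated; the file names and the use of the system's sort(1) are illustrative only (untested):

        # 1) Stream the keys out one per line; each() never builds a full key list.
        open my $out, '>', 'keys.txt' or die "open: $!";
        keys %hash;                                  # reset the iterator
        while (defined( my $k = each %hash )) {
            print {$out} "$k\n";
        }
        close $out;

        # 2) Let the external sort do its shuffling on disk.
        system('sort', '-o', 'keys.sorted', 'keys.txt') == 0
            or die "external sort failed: $?";

        # 3) Read the sorted file back a line at a time, assigning sequential numbers.
        open my $in, '<', 'keys.sorted' or die "open: $!";
        while (my $k = <$in>) {
            chomp $k;
            $hash{$k} = $.;                          # $. is the current line number
        }
        close $in;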

        BrowserUk,
        You didn't comment on my multi-pass each approach. It is an ugly solution but there is no reason I can think of that it wouldn't work.

        Cheers - L~R

Re: In-place sort with order assignment
by sundialsvc4 (Abbot) on Sep 19, 2010 at 16:15 UTC

    For “millions of keys,” consider tying it to, say, a Berkeley DB file. For such a file, the each() function will return the values in ascending order by key, without the need for a separate sort.

    Include a field, sort_position, initially zero. Then, after the structure is loaded, walk the keys and assign the values sequentially.
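
    A minimal sketch of that tie-and-walk idea, assuming DB_File with a BTREE backing file; the file name is made up, and re-storing a value mid-iteration is assumed to be safe here (untested):

    use Fcntl;
    use DB_File;

    # A BTREE-backed hash hands its keys back in sorted order.
    tie my %h, 'DB_File', 'words.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "tie: $!";

    # ... load the keys here, each with a sort_position of 0 ...

    my $pos = 0;
    while (my ($k, $v) = each %h) {
        $h{$k} = ++$pos;       # walk the keys, assigning positions sequentially
    }
    untie %h;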

    Since you are, probably of necessity, creating a persistent representation of the hash (an ISAM file), try to structure your entire process to take advantage of that persistence. The first run, which loads the file for the first time, would be the most expensive.

    Also consider ... doing it the COBOL way. Instead of relying upon random-access data structures of any kind, use an external sort to sort both the master file and any update files by the same key. Then, process the two streams sequentially, in so-called “unit record” fashion. You just might be astounded at how much faster it works ...

Re: In-place sort with order assignment
by salva (Canon) on Sep 19, 2010 at 15:45 UTC
    # untested!
    my %h = (...);
    my @k;
    while (keys %h) {
        my $k = each %h;
        push @k, $k;
        delete $h{$k};
    }
    @k = sort @k;
    my $ix = 0;
    while (@k) {
        $h{shift @k} = $ix++;
    }

      I tried that, and I think it might work, but it takes an extraordinary amount of time. So much so that I never actually saw it complete. To counter that, I tried extracting and deleting the keys in batches of 1000:

      my @ordered;
      while( my $n = keys %hash ) {
          my @temp;
          for ( 1 .. $n < 1000 ? $n : 1000 ) {
              push @temp, scalar each %hash;
          }
          delete @hash{ @temp };
          push @ordered, @temp;
      }

      But it still takes an extraordinary amount of time.

      Update: The above has a precedence problem with the ternary expression, which results in the for loop operating from 1 to 1.

      Add a set of parens and things go much more quickly:

      my @ordered;
      while( my $n = keys %hash ) {
          my @temp;
          for ( 1 .. ( $n < 1000 ? $n : 1000 ) ) {
              push @temp, scalar each %hash;
          }
          delete @hash{ @temp };
          push @ordered, @temp;
      }

      BTW: If I ever manage to get the keys out of the hash and into an array without blowing memory, then I intend to use your Sort::Key to sort them in-place. But there doesn't seem to be an in-place version of keysort that doesn't take a callback?

      I can use keysort_inplace { $_ } @ordered, but doesn't the callback slow things down?

      I can use rsort_inplace @array and just assign the numbers reversed, but it seems there should be a sort_inplace() version?


        BTW: If I ever manage to get the keys out of the hash and into an array without blowing memory, then I intend to use your Sort::Key to sort them in-place. But there doesn't seem to be an in-place version of keysort that doesn't take a callback?

        It seems the interface is not very consistent... actually, the function exists, it is Sort::Key::_sort_inplace.

        In any case, Perl's builtin sort will perform equally fast, as @k = sort @k gets optimized by the interpreter into an in-place operation.

        BTW, using the quicksort algorithm instead of the default merge sort will also help to reduce memory usage.
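
        For reference, a one-line sketch of that knob, assuming a perl recent enough to honour the sort pragma's private subpragmas (5.8 onwards at the time of writing):

        use sort '_quicksort';   # lexically scoped; avoids mergesort's O(N) scratch buffer
        @k = sort @k;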

Re: In-place sort with order assignment
by creamygoodness (Curate) on Sep 20, 2010 at 14:58 UTC
    You can use Sort::External for stuff like this.
    use Sort::External;
    my $sortex = Sort::External->new;
    while ( my ($k, $v) = each %hash ) {    # aliases, so no memory hit
        $sortex->feed($k);
    }
    $sortex->finish;
    my $counter = 1;
    while (defined($_ = $sortex->fetch)) {
        $hash{$_} = $counter++;
    }
Re: In-place sort with order assignment
by JavaFan (Canon) on Sep 20, 2010 at 13:13 UTC
    There's a simple, quadratic, solution: repeatedly find the smallest key that hasn't gotten a value yet. Give it the next value.
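
    A minimal sketch of that quadratic approach, assuming every key starts out with an undefined value (untested):

    my $rank = 0;
    while (1) {
        my $min;
        keys %hash;                            # reset the each() iterator
        while (defined( my $k = each %hash )) {
            next if defined $hash{$k};         # already numbered
            $min = $k if !defined $min || $k lt $min;
        }
        last unless defined $min;              # nothing left to number
        $hash{$min} = ++$rank;
    }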

    But I would go a different way: write all the keys to a file, one key per line. Call sort(1). Read the sorted file. Assign $. to each value.

    And I'd look into finding the sort order at the moment you're creating the huge hash.

      And I'd look into finding the sort order at the moment you're creating the huge hash.

      The big hash is effectively uniq'ing millions from hundreds or thousands of millions of inputs. Those inputs are "words" extracted from "lines" read from an input file. The sort is to order those "words". There's no way to determine the ordering without sorting.

      But I would go a different way: write all the keys to a file, one key per line. Call sort(1). Read the sorted file. Assign $. to each value.

      Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?

      If so, you might be onto something.


        Can you extract 10 million keys from a hash (without blowing memory); write them to a file; start an external process (without forcing the current process to be swapped out), that will read them from that file and write them to other files several times; before reading them back into memory from the file; all in 108 seconds?
        On my computer? No, I wouldn't even be able to read in 10 million keys without blowing memory.

        On your computer? I've no idea - I don't have access. You can try it out (after writing the code, you'll have your answer within 2 minutes). Or you can just dismiss it out of hand, without knowing for sure. Your call - it's your problem after all.

        My first attempt would be to skip the hash altogether and just run the 'keys' straight through "/usr/bin/sort -u" as they are extracted. But perhaps the hash is needed for other than eliminating duplicates. (Yes, I have "sort -u" on Windows -- it is just one of many useful things that came with cygwin.)
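
        A minimal sketch of that pipe, where splitting on whitespace stands in for the real word extraction and the output file name is made up (untested; list-form pipe opens need a platform that supports them, e.g. cygwin):

        open my $sorter, '|-', 'sort', '-u', '-o', 'words.sorted'
            or die "can't start sort: $!";
        while (my $line = <STDIN>) {
            print {$sorter} "$_\n" for split ' ', $line;
        }
        close $sorter or die "sort -u failed: $?";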

        - tye        

Re: In-place sort with order assignment
by dasgar (Priest) on Sep 19, 2010 at 20:55 UTC

    Out of curiosity, when the "fairly large hash (keys, no values)" is created, are the keys being created in the correct "sorted" order?

    If the answer is yes, then I think that Tie::IxHash may be helpful in that it will "preserve the order in which the hash elements were added", which eliminates the need to sort the keys after creating them.
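
    A tiny sketch of what that would look like, assuming the keys really did arrive pre-sorted; @words_in_key_order is a hypothetical stand-in for the real input (untested):

    use Tie::IxHash;

    tie my %hash, 'Tie::IxHash';
    $hash{$_} = undef for @words_in_key_order;   # insertion order is preserved
    my $i = 0;
    $hash{$_} = ++$i for keys %hash;             # keys come back in insertion order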

    If the answer is no, then the module suggestion above probably wouldn't be much help. Not sure what to suggest for this scenario.

      are the keys being created in the correct "sorted" order?

      Unfortunately not. Otherwise I could just assign the values at the same time I create the keys.

      Also, the implementation of Tie::IxHash is such that it consumes ~3x the memory of a standard hash.


Re: In-place sort with order assignment
by TomDLux (Vicar) on Sep 20, 2010 at 13:53 UTC

    Reading this made me think of Data Structures classes in university. In C, I would read the keys into a tree structure rather than a hash, then assign sequence numbers by doing an in-order traversal. But the extra pointers involved probably waste nearly as much space as the duplicate list, and rebalancing the tree burns a lot of CPU.

    The external sort is probably the best Perl solution.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Re: In-place sort with order assignment
by ikegami (Patriarch) on Sep 20, 2010 at 13:55 UTC

    I believe @a = sort @a; works in place. O(N) memory, O(N log N) time.

    my @keys;
    keys(%hash);    # Reset iterator
    #while (defined( my $k = each(%hash) )) {    # if 0 or '' are possible keys
    while (my $k = each(%hash)) {
        push @keys, $k;
    }
    @keys = sort @keys;
    for (0..$#keys) {
        $hash{$keys[$_]} = $_;
    }
      I believe @a = sort @a; works in place. O(1) memory, O(N) time.

      That isn't the (whole) problem. Creating the array from the keys of the hash effectively doubles memory usage, before you ever get to the sort.

      This has already been discussed in the thread.


        Seems to me the very complex solutions presented use at least that much memory too, so consider it a simplification.
Re: In-place sort with order assignment
by changma_ha (Sexton) on Sep 20, 2010 at 04:48 UTC

    Hi BrowserUk. I tried to solve your problem. Maybe I didn't get the result exactly as you wanted, but I hope my code will help you.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use Data::Dumper;

    my $count = 0;
    my $i;
    my %hash = (
        A => [],
        B => [],
        C => [],
    );
    foreach my $key (sort keys %hash) {
        $count++;
        for ($i = 1; $i <= $count; $i++) {
            $hash{$key} = $i;
        }
        print Dumper \%hash;
    }
Re: In-place sort with order assignment
by sundialsvc4 (Abbot) on Sep 21, 2010 at 12:33 UTC

    Could I perchance repeat my suggestion of, “do it the COBOL way?”

    External disk-based sorting ... easily done in Perl ... can accommodate millions of records, and runs with counter-intuitive speed. When this is done to any data set, you know two things:

    1. All records having any particular key value are adjacent.
    2. Any time you encounter a gap in the key-sequence, you know that nothing’s in it, and you know exactly how big the gap is, all without “searching.”

    Therefore, work with a sorted file, then also sort your input file(s) in the same sequence. Having done so, you can now process the two files together, sequentially, without “searching.”

    In the early days of movies, anytime Hollywood wanted to film a real digital computer, the operators would start a tape-sort. All those spinning tapes made wonderful footage. You may or may not have noticed that they never slowed down or ran backwards. They processed very large amounts of data, very efficiently, doing so without large amounts of RAM or random-access storage, neither one of which was available at the time. You can expect “more than hundred-fold increases” in efficiency for many algorithms ... including the time spent doing the sorts, which is “far less time than you think.”

    Never forget that “memory is a disk-file, too.”

    “Don’t diddle the code to make it faster:   choose a better algorithm.”
    -- Kernighan & Plauger; The Elements of Programming Style.

      Could I perchance repeat my suggestion of, “do it the COBOL way?”

      No. I read you the first time.

      If you read the rest of the thread, you'd see that option has already been tested and comprehensively dismissed as a possibility.

      As for all the "old-timer" logic. Been there, done that. (Don't ever wanna go back:).

      In the mid-80's I sped up the sorting of 6 million University entrance examination results on twinned PDP-11/24s from 10 days to just over a day (by reversing the order of the joins involved).

      Today, I've sorted 60% more records, in perl, in 108 seconds.

      I can't start work with a sorted file, because the input file needs to be pre-processed to extract the 1 billion words from the 100 million lines. Those 1 billion words need to be uniq'd to reduce their volume by roughly 3 orders of magnitude before sorting.

      I could (externally) sort the 1 billion words and then uniq them, but that means O(N log N) where N is 1 billion, rather than O(N log N) where N is 1 million. Now, assuming that each record takes 0.0000113 seconds to sort--that's based upon my Perl sort test; no external sort is going to achieve that--the trade-off of sorting before uniq'ing versus after is:

      1. 1e9 * log10( 1e9 ) * 0.0000113s ≈ 28.25 hours.
      2. 1e6 * log10( 1e6 ) * 0.0000113s ≈ 1 minute 7.8 seconds.

      Now, those per-item timings are based upon an in-memory sort. To get a true picture we'd need to use figures from an external sort. So, I sorted 1 million items and the sort time was 0.00001535 seconds per item; then I sorted 10 million items and the time was 0.00002198 seconds per item. That's a 43% increase in time per item, for ten times the data.

      So for 100 million, say 0.00003143 seconds per item; and for 1 billion, say 0.00004494. Which means externally sorting 1 billion items is likely to be closer to 5 days than 4. I hope you now see that uniq'ing before the sort is imperative.

      Just as not everything new is better than the old, not everything old is better than the new.

      It's simple physics. If you squeeze a large volume of data (~10GB) through a small hole (~1MB of RAM, as used by GNU sort), you have to shuffle it on and off disc many times. "Memory may be a disc too", but it's bloody fast compared to real disc. Conversely, disc makes bloody lousy memory.

      “Don’t diddle the code to make it faster: choose a better algorithm.”

      Uniq'ing then sorting in memory: 108 seconds.

      Pre-external sorting and then uniq'ing: 4.68125 days.

      Which do you see as the better algorithm?


        BrowserUk,
        As I read through this post, a couple of my own posts popped into my head (What Is A Word? and How many words does it take?). When you say "word" and "line", do you mean englishish text? I would be pretty surprised to see a dictionary of 1+ million real words, but I guess it is possible. If this process has to be repeated multiple times, and if most, if not all, words from one run to the next will be seen again, then perhaps there is an even faster way of doing this (using a pre-built dictionary).

        Cheers - L~R
