
Re: Rosetta Code: Long List is Long

by Anonymous Monk
on Dec 02, 2022 at 00:22 UTC ( [id://11148487] )


in reply to Rosetta Code: Long List is Long

Maybe it is not worth anything, but this just exploits a loophole: the ridiculous fixed-length 6-byte a..z keys, and counts that fit in a single byte. $r is only there to quickly convert to/from base-27, nothing more. Sure, it all could be optimized. Excluding 0 from the base-27 representation (i.e. using base 27 instead of 26) avoids padding with leading zeroes. It follows that the $buf_in length is a wasteful 387,420,489 instead of 308,915,776, but, what the heck, I can afford to be royally wasteful with this implementation. Decrementing the count down from 255 instead of just incrementing from 0 is a further silly optimization which I only see at this late time of day, and surely I have overlooked better improvements. Of course, $MAX is 27**6 (or would be 26**6).
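
To illustrate the loophole outside the script (a minimal sketch, not part of the code below): tr/// turns each letter into a non-zero base-27 digit, so every 6-letter key maps to a unique integer that Math::GMPz converts back and forth:

use strict;
use warnings;
use Math::GMPz ':mpz';

my $r = Rmpz_init;

# 'a'..'z' become the base-27 digits '1'..'q' (digit 0 is never used,
# so a 6-letter key is always exactly 6 digits: no leading-zero padding).
for my $key ( 'aaaaaa', 'monkey', 'zzzzzz' ) {
    Rmpz_set_str( $r, $key =~ tr/a-z1-9/1-9a-z/r, 27 );
    my $n = Rmpz_get_ui( $r );

    # ... and back: base-27 digits '1'..'9','a'..'q' map back to 'a'..'z'.
    Rmpz_set_ui( $r, $n );
    my $back = Rmpz_get_str( $r, 27 ) =~ tr/1-9a-z/a-z/r;
    printf "%s -> %u -> %s\n", $key, $n, $back;
}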

I can't confirm what eyepopslikeamosquito says about memory: with the original llil.pl I see the Working Set go to approx. 2.9 GB. Mine doesn't exceed ~530 MB.

llil start
get_properties : 15 secs
sort + output : 103 secs
total : 118 secs

my_test start
get_properties : 31 secs
sort + output: 10 secs
total: 41 secs

...

use strict;
use warnings;
use feature 'say';
use Math::GMPz ':mpz';
use Sort::Packed 'sort_packed';

@ARGV or die "usage: $0 file...\n";
my @llil_files = @ARGV;

warn "my_test start\n";
my $tstart1 = time;

# $MAX is the base-27 value of 'zzzzzz', the largest possible key.
my $r = Rmpz_init;
Rmpz_set_str( $r, 'zzzzzz' =~ tr/a-z1-9/1-9a-z/r, 27 );
my $MAX = Rmpz_get_ui( $r );

# One byte per possible key, pre-set to 0xFF (count 0).
my ( $buf_in, $buf_out ) = ( "\xFF" x $MAX, '' );

for my $fname ( @llil_files ) {
    open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
    while ( <$fh> ) {
        chomp;
        my ( $word, $count ) = split /\t/;
        # byte offset = base-27 value of the word; counts go downwards from 0xFF
        $word =~ tr/a-z1-9/1-9a-z/;
        Rmpz_set_str( $r, $word, 27 );
        vec( $buf_in, Rmpz_get_ui( $r ), 8 ) -= $count;
    }
    close( $fh ) or die "error: close '$fname': $!";
}

# Any byte still 0xFF was never seen; for the rest, the match position
# is the key's base-27 value, which converts back to the word.
while ( $buf_in =~ /[^\xFF]/g ) {
    Rmpz_set_ui( $r, @- );
    $buf_out .= pack 'aa6', $&, Rmpz_get_str( $r, 27 ) =~ tr/1-9a-z/a-z/r
}

my $tend1 = time;
warn "get_properties : ", $tend1 - $tstart1, " secs\n";
my $tstart2 = time;

# 7-byte records with the count byte first, so an ascending byte sort
# gives descending counts after the 255 - $count reversal below.
sort_packed C7 => $buf_out;

while ( $buf_out ) {
    my ( $count, $word ) = unpack 'Ca6', substr $buf_out, 0, 7, '';
    printf "%s\t%d\n", $word, 255 - $count
}

my $tend2 = time;
warn "sort + output: ", $tend2 - $tstart2, " secs\n";
warn "total: ", $tend2 - $tstart1, " secs\n";

What follows is fiction, not implemented in code, and can be ignored. I said 'ridiculous' above, but in fact I do remember the original LLIL thread; I'm not sure now, but back then I thought the keys were expected to be significantly longer than qw/foo bar aaaaaa/, etc. (genetic sequences?). That would mean multi-GB of input files which are mostly keys, so just keeping those keys in RAM is out of the question, not to mention building and keeping hashes and working with them.

I thought about high-quality hashing (xxHash?) of the keys and sparsely storing (Judy?) values indexed by the resulting integer, where each value is e.g. a 64-bit packed integer comprised of

  • file id
  • offset of start of line containing unique key first seen (tell)
  • count (updated as files are read in)

After all the files are consumed, the value positions within the array (i.e. the indexes) are no longer important. IF the densely-packed array data (i.e. after discarding zero values) fits in RAM, then the problem is solved: sort the packed data and produce the output, which, yes, would require randomly reading lines (i.e. the real keys) from the input files AGAIN, based on the stored file id and line position.
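
A rough sketch of that record layout (not implemented; the field widths here are an arbitrary illustration, and a 32-bit offset would cap each input file at 4 GB):

use strict;
use warnings;

# Hypothetical fixed-width record for one unique key:
#   16-bit file id, 32-bit byte offset of the line, 16-bit running count
# = 8 bytes, i.e. the "64-bit-packed integer" mentioned above.
my $TEMPLATE = 'nNn';

my $record = pack $TEMPLATE, 2, 1_234_567, 42;   # file #2, offset 1234567, count 42

my ( $file_id, $offset, $count ) = unpack $TEMPLATE, $record;
printf "file %d, offset %d, count %d\n", $file_id, $offset, $count;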

Replies are listed 'Best First'.
Re^2: Rosetta Code: Long List is Long
by Anonymous Monk on Dec 08, 2022 at 13:59 UTC

    For reference, the results of llil2d.pl (11148585) on my hardware are:

    llil2d start
    get_properties : 10 secs
    sort + output : 21 secs
    total : 31 secs
    2317524 Kbytes of RAM were used

    (the report about resident RAM size was produced with the same few lines of code as in the script below)

    In fact, "words" are so short (and few) that I realized (later, after I wrote 11148660 with results there) that I don't need any indirection and can simply use Judy::HS: "words" are kept in RAM all the time, both in my in-memory flat "DB" and, opaquely, somewhere in Judy code.

    And then there's a solution in Perl which is both faster and requires much less memory, and generates identical output:

    my_test start
    get_properties : 13 secs
    sort + output : 7 secs
    total : 20 secs
    349124 Kbytes of RAM were used

    A short integer for the count is not required in this example; it only demonstrates that the count can be any length and is not limited to a single byte as in my previous "solution" in this thread. The same goes for the 10 bytes the "words" are padded to: it can be anything that fits the longest word, and the words can differ in length.
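
    As an aside, the vec() expression in the code below addresses the 16-bit count at the front of each fixed-width record: record $val starts at byte $val * $DATA_SIZE, so as a 16-bit element it is number $val * $DATA_SIZE / $COUNT_SIZE_BYTES, and vec()'s 16-bit elements are big-endian, matching pack 'n'. A tiny sketch with made-up words and counts:

    use strict;
    use warnings;

    my $DATA_TEMPLATE    = 'nZ10';   # 16-bit count followed by a NUL-padded 10-byte word
    my $DATA_SIZE        = 12;
    my $COUNT_SIZE_BYTES = 2;
    my $COUNT_SIZE_BITS  = 16;
    my $COUNT_MAX        = 2 ** $COUNT_SIZE_BITS - 1;

    # Two records: "foo" seen 3 times, "barbar" seen 7 times.  Counts are
    # stored as $COUNT_MAX - count so an ascending sort means descending counts.
    my $data = pack( $DATA_TEMPLATE, $COUNT_MAX - 3, 'foo' )
             . pack( $DATA_TEMPLATE, $COUNT_MAX - 7, 'barbar' );

    # Add 5 to the count of record #1 ("barbar"): byte offset 1 * $DATA_SIZE,
    # i.e. 16-bit element 1 * $DATA_SIZE / $COUNT_SIZE_BYTES.
    vec( $data, 1 * $DATA_SIZE / $COUNT_SIZE_BYTES, $COUNT_SIZE_BITS ) -= 5;

    my ( $count, $word ) = unpack $DATA_TEMPLATE, substr $data, $DATA_SIZE, $DATA_SIZE;
    printf "%s\t%d\n", $word, $COUNT_MAX - $count;   # prints: barbar  12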

    use strict;
    use warnings;
    use Judy::HS qw/ Set Get Free /;
    use Sort::Packed 'sort_packed';

    my $DATA_TEMPLATE    = 'nZ10';
    my $DATA_SIZE        = 12;
    my $COUNT_SIZE_BYTES = 2;
    my $COUNT_SIZE_BITS  = 16;
    my $COUNT_MAX        = int( 2 ** $COUNT_SIZE_BITS - 1 );

    @ARGV or die "usage: $0 file...\n";
    my @llil_files = @ARGV;

    warn "my_test start\n";
    my $tstart1 = time;

    my ( $data, $current ) = ( '', 0 );
    my $judy;

    for my $fname ( @llil_files ) {
        open( my $fh, '<', $fname ) or die $!;
        while ( <$fh> ) {
            chomp;
            my ( $word, $count ) = split /\t/;
            ( undef, my $val ) = Get( $judy, $word );
            if ( defined $val ) {
                # known word: $val is its record number, update its 16-bit count field
                vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                    $COUNT_SIZE_BITS ) -= $count
            }
            else {
                # new word: append a record (count stored as $COUNT_MAX - count,
                # so an ascending sort later means descending counts)
                $data .= pack $DATA_TEMPLATE, $COUNT_MAX - $count, $word;
                Set( $judy, $word, $current );
                $current ++
            }
        }
    }
    Free( $judy );

    my $tend1 = time;
    warn "get_properties : ", $tend1 - $tstart1, " secs\n";
    my $tstart2 = time;

    sort_packed "C$DATA_SIZE", $data;

    while ( $data ) {
        my ( $count, $word ) = unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
        printf "%s\t%d\n", $word, $COUNT_MAX - $count
    }

    my $tend2 = time;
    warn "sort + output : ", $tend2 - $tstart2, " secs\n";
    warn "total : ", $tend2 - $tstart1, " secs\n";

    use Memory::Usage;
    my $m = Memory::Usage->new;
    $m->record;
    warn $m->state->[0][3], " Kbytes of RAM were used\n";

    What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.

    So I'm returning to my idea of keeping only hashes and offsets into the files. The results I posted in 11148660 are of course valid; there I was using 64-bit hashes as Judy::L integer keys.

    In fact, 32-bit hashes generated with xxHash have a few collisions for 10 million 6-character words, while 64-bit hashes have none. I'm not qualified to predict up to what set size it is safe to expect no collisions. Maybe the 128-bit hashes used below are overkill.
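
    For a rough feel, here is a back-of-the-envelope birthday-bound estimate (a sketch only, assuming ideally uniform hashes and roughly 10 million unique words as in this test): the expected number of colliding pairs among n keys hashed into b bits is about n*(n-1)/2**(b+1), which is sizeable at 32 bits and vanishingly small at 64 and 128 bits.

    use strict;
    use warnings;

    my $n = 10_000_000;    # roughly the number of unique words in this test
    for my $bits ( 32, 64, 128 ) {
        # classic birthday approximation: expected number of colliding pairs
        my $expected = $n * ( $n - 1 ) / 2 ** ( $bits + 1 );
        printf "%3d-bit hashes: ~%.3g expected collisions\n", $bits, $expected;
    }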

    Just for entertainment I decided to write the "indirect" solution with 128-bit hashes. Therefore I'm not using Judy::L, but the same Judy::HS, with the keys being 32-character hex-encoded hashes. Otherwise it is almost the same plan and code. The data template layout was chosen arbitrarily and can be adjusted.

    Of course, the words are not alphabetically sorted in the output, but OTOH that was not an original LLIL requirement. Results:

    my_test start
    get_properties : 21 secs
    sort + output : 23 secs
    total : 44 secs
    841880 Kbytes of RAM were used

    I think the same amount of RAM would be used if the words were not 6 but 600 or 6000 bytes long. The relatively fast 2nd phase (considering the huge number of random reads) is due to the NVMe storage here.

    (Off topic: I managed to install Crypt::xxHash under Windows/Strawberry, but Judy was too much of a challenge for me. I wrote a solution very close to the code below, using not Judy but a very thin Inline::C wrapper around an AVL library (google for jsw_avltree). It uses approx. the same RAM and ~2x the time for the 1st phase, but also ~2.5x the time for the 2nd phase, which is exactly the same code on the same hardware (dual boot). I don't know if Windows I/O is just that much slower.)

    use strict;
    use warnings;
    use Judy::HS qw/ Set Get Free /;
    use Crypt::xxHash 'xxhash3_128bits_hex';
    use Sort::Packed 'sort_packed';

    my $DATA_TEMPLATE    = 'nnNn';   # word count, file index, word position, word length
    my $DATA_SIZE        = 10;
    my $COUNT_SIZE_BYTES = 2;
    my $COUNT_SIZE_BITS  = 16;
    my $COUNT_MAX        = int( 2 ** $COUNT_SIZE_BITS - 1 );

    @ARGV or die "usage: $0 file...\n";
    my @llil_files = @ARGV;

    warn "my_test start\n";
    my $tstart1 = time;

    my ( $data, $current ) = ( '', 0 );
    my $judy;

    for my $idx ( 0 .. $#llil_files ) {
        open( my $fh, '<', $llil_files[ $idx ]) or die $!;
        until ( eof $fh ) {
            my $pos = tell $fh;
            $_ = <$fh>;
            chomp;
            my ( $word, $count ) = split /\t/;
            # the word itself is not stored: only its 128-bit hash (as the Judy key)
            # plus file index, line offset and word length in the packed record
            my $xx = xxhash3_128bits_hex( $word, 0 );
            ( undef, my $val ) = Get( $judy, $xx );
            if ( defined $val ) {
                vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                    $COUNT_SIZE_BITS ) -= $count
            }
            else {
                $data .= pack $DATA_TEMPLATE,
                    $COUNT_MAX - $count, $idx, $pos, length $word;
                Set( $judy, $xx, $current );
                $current ++
            }
        }
    }
    Free( $judy );

    my $tend1 = time;
    warn "get_properties : ", $tend1 - $tstart1, " secs\n";
    my $tstart2 = time;

    sort_packed "C$DATA_SIZE", $data;

    # re-read each word from its source file by (file index, offset, length)
    my @fh;
    open $fh[ $_ ], '<', $llil_files[ $_ ] for 0 .. $#llil_files;

    while ( $data ) {
        my ( $count, $idx, $pos, $len ) =
            unpack $DATA_TEMPLATE, substr $data, 0, $DATA_SIZE, '';
        sysseek $fh[ $idx ], $pos, 0;
        sysread $fh[ $idx ], my( $word ), $len;
        printf "%s\t%d\n", $word, $COUNT_MAX - $count
    }

    my $tend2 = time;
    warn "sort + output : ", $tend2 - $tstart2, " secs\n";
    warn "total : ", $tend2 - $tstart1, " secs\n";

    use Memory::Usage;
    my $m = Memory::Usage->new;
    $m->record;
    warn $m->state->[0][3], " Kbytes of RAM were used\n";

      Thank you, Anonymous Monk! I tried and enjoyed your solution, including a variant that doubles the record size, e.g. a 32-bit count and a 20-character key length.

      $ diff llil_judyhs1.pl llil_judyhs2.pl
      7,10c7,10
      < my $DATA_TEMPLATE = 'nZ10';
      < my $DATA_SIZE = 12;
      < my $COUNT_SIZE_BYTES = 2;
      < my $COUNT_SIZE_BITS = 16;
      ---
      > my $DATA_TEMPLATE = 'NZ20';
      > my $DATA_SIZE = 24;
      > my $COUNT_SIZE_BYTES = 4;
      > my $COUNT_SIZE_BITS = 32;
      $ time perl llil_judyhs1.pl big1.txt big2.txt big3.txt >out1.txt
      my_test start
      get_properties : 10 secs
      sort + output : 5 secs
      total : 15 secs
      353468 Kbytes of RAM were used

      real    0m14.770s
      user    0m14.669s
      sys     0m0.084s

      $ time perl llil_judyhs2.pl big1.txt big2.txt big3.txt >out2.txt
      my_test start
      get_properties : 10 secs
      sort + output : 5 secs
      total : 15 secs
      473784 Kbytes of RAM were used

      real    0m15.073s
      user    0m14.938s
      sys     0m0.119s

        I also ran it in parallel using MCE: 7 workers, each processing a range of leading characters.

        #!/usr/bin/env perl
        # https://perlmonks.org/?node_id=11148669

        use warnings;
        use strict;
        use Judy::HS qw/ Set Get Free /;
        use Sort::Packed 'sort_packed';
        use MCE;

        my $DATA_TEMPLATE    = 'nZ10';
        my $DATA_SIZE        = 12;
        my $COUNT_SIZE_BYTES = 2;
        my $COUNT_SIZE_BITS  = 16;
        my $COUNT_MAX        = int( 2 ** $COUNT_SIZE_BITS - 1 );

        @ARGV or die "usage: $0 file...\n";
        my @llil_files = @ARGV;
        for (@llil_files) {
            die "Cannot open '$_'" unless -r "$_";
        }

        # MCE gather and parallel routines.
        my $DATA = '';

        sub gather_routine {
            $DATA .= $_[0];
        }

        sub parallel_routine {
            my $char_range = $_;
            my ( $data, $current, $judy ) = ( '', 0 );
            for my $fname (@llil_files) {
                open( my $fh, '<', $fname ) or die $!;
                while ( <$fh> ) {
                    if (/^[${char_range}]/) {
                        chomp;
                        my ( $word, $count ) = split /\t/;
                        ( undef, my $val ) = Get( $judy, $word );
                        if ( defined $val ) {
                            vec( $data, $val * $DATA_SIZE / $COUNT_SIZE_BYTES,
                                $COUNT_SIZE_BITS ) -= $count
                        }
                        else {
                            $data .= pack $DATA_TEMPLATE, $COUNT_MAX - $count, $word;
                            Set( $judy, $word, $current );
                            $current ++
                        }
                    }
                }
                close $fh;
            }
            Free( $judy );
            MCE->gather( $data );
        }

        # Run parallel using MCE.
        warn "my_test start\n";
        my $tstart1 = time;

        MCE->new(
            input_data  => ['a-d','e-h','i-l','m-p','q-t','u-x','y-z'],
            max_workers => 7,
            chunk_size  => 1,
            posix_exit  => 1,
            gather      => \&gather_routine,
            user_func   => \&parallel_routine,
            use_threads => 0,
        )->run(1);

        my $tend1 = time;
        warn "get_properties : ", $tend1 - $tstart1, " secs\n";
        my $tstart2 = time;

        sort_packed "C$DATA_SIZE", $DATA;

        $| = 0; # enable output buffering

        while ( $DATA ) {
            my ( $count, $word ) = unpack $DATA_TEMPLATE, substr $DATA, 0, $DATA_SIZE, '';
            printf "%s\t%d\n", $word, $COUNT_MAX - $count
        }

        my $tend2 = time;
        warn "sort + output : ", $tend2 - $tstart2, " secs\n";
        warn "total : ", $tend2 - $tstart1, " secs\n";

        __END__

        $ time perl mce_judyhs.pl big1.txt big2.txt big3.txt >out3.txt
        my_test start
        get_properties : 5 secs
        sort + output : 5 secs
        total : 10 secs

        real    0m9.794s
        user    0m35.719s
        sys     0m0.257s

      Thanks anonymonk. Excellent work!

      Though I've never used them, I've heard good things about Judy Arrays and maintain a list of references on them at PM. Might get around to actually using them one day. :)

      What if "words" are significantly longer? With approx. 10e6 unique words in this test, if they were each hundreds of bytes, then several GB of RAM would be used just to keep them. Perhaps impractical.

      Good question! Apologies, my initial test file generator was very primitive. To try to help answer your question I've quickly whipped up a test file generator that generates much longer keys (up to around 200 characters in length) and longer counts too. I was conservative with the counts because I didn't want to disqualify folks using 32-bit ints.

      # gen-long-llil.pl
      # Crude program to generate a LLiL test file with long names and counts
      #   perl gen-long-llil.pl long1.txt 600
      use strict;
      use warnings;
      use autodie;

      {
          my $ordmin = ord('a');
          my $ordmax = ord('z') + 1;

          # Generate a random word
          sub gen_random_word {
              my $word  = shift;    # word prefix
              my $nchar = shift;    # the number of random chars to append
              for my $i (1 .. $nchar) {
                  $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
              }
              return $word;
          }
      }

      my $longworda = join '', 'a' .. 'z';
      my $longwordz = join '', reverse('a' .. 'z');
      my $longcount = 1_000_000;

      sub create_long_test_file {
          my $fname   = shift;
          my $howmany = shift;
          open( my $fh_out, '>', $fname );

          # Some with no randomness
          for my $h ( 1 .. $howmany ) {
              for my $i ( 1 .. 8 ) {
                  my $cnt   = $longcount + $i - 1;
                  my $worda = $longworda x $i;
                  my $wordz = $longwordz x $i;
                  print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
              }
          }

          # Some with randomness
          my $wordlen = 1;
          for my $h ( 1 .. $howmany ) {
              for my $i ( 1 .. 8 ) {
                  my $cnt   = $longcount + $i - 1;
                  my $worda = $longworda x $i;
                  my $wordz = $longwordz x $i;
                  for my $c ( 'a' .. 'z' ) {
                      for my $z ( 1 .. 2 ) {
                          print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                          print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                      }
                  }
              }
          }
      }

      my $outfile = shift;
      my $count   = shift;
      $outfile or die "usage: $0 outfile count\n";
      $count   or die "usage: $0 outfile count\n";
      $count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
      print "generating short long test file '$outfile' with count '$count'\n";
      create_long_test_file( $outfile, $count );
      print "file size=", -s $outfile, "\n";

      I ran it like this:

      > perl gen-long-llil.pl long1.txt 600
      generating short long test file 'long1.txt' with count '600'
      file size=65616000
      > perl gen-long-llil.pl long2.txt 600
      generating short long test file 'long2.txt' with count '600'
      file size=65616000
      > perl gen-long-llil.pl long3.txt 600
      generating short long test file 'long3.txt' with count '600'
      file size=65616000

      Then reran my two biggish benchmarks with a mixture of files:

      > perl llil2d.pl big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >perl2.tmp
      llil2d start
      get_properties : 11 secs
      sort + output : 23 secs
      total : 34 secs

      > llil2a big1.txt big2.txt big3.txt long1.txt long2.txt long3.txt >cpp2.tmp
      llil2 start
      get_properties : 6 secs
      sort + output : 5 secs
      total : 11 secs

      > diff cpp2.tmp perl2.tmp

      Improved test file generators welcome.

      Updated Test File Generators

      These were updated to allow a "\n" (rather than "\r\n") on Windows after this was pointed out here. Curiously, \n seems to be slower than \r\n on Windows if you don't set binmode! I am guessing that chomp is slower with \n than with \r\n on a Windows text stream.

      gen-llil.pl

      # gen-llil.pl
      # Crude program to generate a big LLiL test file to use in benchmarks
      # On Windows running:
      #   perl gen-llil.pl big2.txt 200 3     - produces a test file with size = 35,152,000 bytes
      #                                         (lines terminated with "\r\n")
      #   perl gen-llil.pl big2.txt 200 3 1   - produces a test file with size = 31,636,800 bytes
      #                                         (lines terminated with "\n")
      # On Unix, lines are terminated with "\n" and the file size is always 31,636,800 bytes
      use strict;
      use warnings;
      use autodie;

      {
          my $ordmin = ord('a');
          my $ordmax = ord('z') + 1;

          # Generate a random word
          sub gen_random_word {
              my $word  = shift;    # word prefix
              my $nchar = shift;    # the number of random chars to append
              for my $i (1 .. $nchar) {
                  $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
              }
              return $word;
          }
      }

      sub create_test_file {
          my $fname   = shift;
          my $count   = shift;
          my $wordlen = shift;
          my $fbin    = shift;
          open( my $fh_out, '>', $fname );
          $fbin and binmode($fh_out);
          for my $c ( 'aaa' .. 'zzz' ) {
              for my $i (1 .. $count) {
                  print {$fh_out} gen_random_word( $c, $wordlen ) . "\t" . 1 . "\n";
              }
          }
      }

      my $outfile = shift;
      my $count   = shift;
      my $wordlen = shift;
      my $fbin    = shift;   # default is to use text stream (not a binary stream)
      defined($fbin) or $fbin = 0;
      $outfile or die "usage: $0 outfile count wordlen\n";
      $count   or die "usage: $0 outfile count wordlen\n";
      print "generating test file '$outfile' with count '$count' (binmode=$fbin)\n";
      create_test_file($outfile, $count, $wordlen, $fbin);
      print "file size=", -s $outfile, "\n";

      gen-long-llil.pl

      # gen-long-llil.pl
      # Crude program to generate a LLiL test file with long names and counts
      #   perl gen-long-llil.pl long1.txt 600
      # On Windows running:
      #   perl gen-long-llil.pl long1.txt 600     - produces a test file with size = 65,616,000 bytes
      #                                             (lines terminated with "\r\n")
      #   perl gen-long-llil.pl long1.txt 600 1   - produces a test file with size = 65,107,200 bytes
      #                                             (lines terminated with "\n")
      # On Unix, lines are terminated with "\n" and the file size is always 65,107,200 bytes
      use strict;
      use warnings;
      use autodie;

      {
          my $ordmin = ord('a');
          my $ordmax = ord('z') + 1;

          # Generate a random word
          sub gen_random_word {
              my $word  = shift;    # word prefix
              my $nchar = shift;    # the number of random chars to append
              for my $i (1 .. $nchar) {
                  $word .= chr( $ordmin + int( rand($ordmax - $ordmin) ) );
              }
              return $word;
          }
      }

      my $longworda = join '', 'a' .. 'z';
      my $longwordz = join '', reverse('a' .. 'z');
      my $longcount = 1_000_000;

      sub create_long_test_file {
          my $fname   = shift;
          my $howmany = shift;
          my $fbin    = shift;
          open( my $fh_out, '>', $fname );
          $fbin and binmode($fh_out);

          # Some with no randomness
          for my $h ( 1 .. $howmany ) {
              for my $i ( 1 .. 8 ) {
                  my $cnt   = $longcount + $i - 1;
                  my $worda = $longworda x $i;
                  my $wordz = $longwordz x $i;
                  print {$fh_out} "$worda\t$cnt\n$wordz\t$cnt\n";
              }
          }

          # Some with randomness
          my $wordlen = 1;
          for my $h ( 1 .. $howmany ) {
              for my $i ( 1 .. 8 ) {
                  my $cnt   = $longcount + $i - 1;
                  my $worda = $longworda x $i;
                  my $wordz = $longwordz x $i;
                  for my $c ( 'a' .. 'z' ) {
                      for my $z ( 1 .. 2 ) {
                          print {$fh_out} $worda . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                          print {$fh_out} $wordz . gen_random_word( $c, $wordlen ) . "\t" . (1000000 + $z) . "\n";
                      }
                  }
              }
          }
      }

      my $outfile = shift;
      my $count   = shift;
      my $fbin    = shift;   # default is to use text stream (not a binary stream)
      defined($fbin) or $fbin = 0;
      $outfile or die "usage: $0 outfile count\n";
      $count   or die "usage: $0 outfile count\n";
      $count =~ /^\d+$/ or die "error: count '$count' is not a number\n";
      print "generating short long test file '$outfile' with count '$count' (binmode=$fbin)\n";
      create_long_test_file( $outfile, $count, $fbin );
      print "file size=", -s $outfile, "\n";

      Updated this node with new test file generators so you can generate test files that are the same size on Unix and Windows. That is, by setting $fbin you can make the line ending "\n" on Windows, instead of "\r\n". See Re^2: Rosetta Code: Long List is Long (faster) for more background.

        Thanks for paying attention to my doubts; perhaps I wasn't very clear. What I meant was the total length of the unique words, i.e. the hash keys. It would be roughly equal to the size of the output file, which is almost the same for both the original test and the parent node. I don't think it's worth the effort to create a simulation with a few-GB output file.
