http://qs321.pair.com?node_id=11115208


in reply to Re: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Hi vr,

Wow! Curiosity moved me to try with MCE. But first, I optimized Nick's code. Also, workers now send only the relevant data and not the entire list. This decreases the time needed for the manager process to output results.

Update 1: Slight optimization to Choroba's example (i.e. >> 1). Added collatz_compress.
Update 2: Added collatz_count.
Update 3: Improved collatz_count.
Update 4: Added Inline C times.

Demo of choroba, 1nickt, compress, and count, all in one.

use strict;
use warnings;
use feature 'say';

use MCE::Flow;

sub collatz_choroba {
    my ($n) = @_;
    my @seq = $n;
    push @seq, ( $seq[-1] >> 1, 3 * $seq[-1] + 1 )[ $seq[-1] % 2 ]
        while $seq[-1] != 1;
    return scalar @seq;
}

sub collatz_1nickt {
    my ($n) = @_;
    my @seq = $n;
    push @seq, $n = $n % 2 ? 3 * $n + 1 : $n >> 1
        while $n != 1;
    return scalar @seq;
}

sub collatz_compress {
    my ($n) = @_;
    my @seq = $n;

    # Notation and compressed dynamics (2:15 into the video)
    # https://www.youtube.com/watch?v=t1I9uHF9X5Y
    # consider the map
    #   C(x) = $n % 2 ? 3 * $n + 1 : $n / 2
    # compress the map to the dynamically equivalent
    #   T(x) = $n % 2 ? (3 * $n + 1) / 2 : $n / 2
    # compress loop iteration when odd sequence

    push @seq, $n % 2
        ? ( $n = 3 * $n + 1, $n = $n >> 1 )
        : ( $n = $n >> 1 )
        while $n != 1;

    return scalar @seq;
}

sub collatz_count {
    my ($n) = @_;
    my $steps = 1;

    # count using the T(x) notation
    $n % 2
        ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
        : ( $steps += 1, $n = $n >> 1 )
        while $n != 1;

    return $steps;
}

#*collatz = \&collatz_choroba;    # choose collatz here
#*collatz = \&collatz_1nickt;
#*collatz = \&collatz_compress;
 *collatz = \&collatz_count;

my @sizes;

MCE::Flow->init(
    max_workers => MCE::Util::get_ncpu(),
    gather      => \@sizes,
    bounds_only => 1,
);

mce_flow_s sub {
    my ($start, $end) = @{ $_[1] };
    my @ret = map { [ $_, collatz($_) ] } $start .. $end;
    MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] );
}, 1, 1e6;

MCE::Flow->finish;

say "@$_" for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ];
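A quick way to convince oneself that the four variants agree on the count is a hypothetical check (not part of the benchmark, simply appended to the end of the script above, after the subs are defined):

# Hypothetical check: all four variants should report the same
# sequence length for small starting values.
for my $n ( 2 .. 10_000 ) {
    my @len = map { $_->($n) }
        \&collatz_choroba, \&collatz_1nickt, \&collatz_compress, \&collatz_count;
    die "mismatch at $n: @len" if grep { $_ != $len[0] } @len;
}
say "all variants agree on 2 .. 10_000";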

Demo caching by vr

use strict;
use warnings;
use feature 'say';

use MCE::Flow;

my %cache = ( 1 => [ 1, 1, undef ]);
#
# [ seq_length, start_num (same as key), next_num ]
#

sub collatz_vr {
    my $n = shift;
    return if exists $cache{ $n };

    my ( @tmp, $new_n );

    while ( !exists $cache{ $n } ) {
        $new_n = $n % 2 ? 3 * $n + 1 : $n >> 1;
        push @tmp, $cache{ $n } = [ -$#tmp, $n, $new_n ];
        $n = $new_n
    }

    my $d = $#tmp + $cache{ $n }[ 0 ];
    $_->[ 0 ] += $d for @tmp;
}

my @sizes;

mce_flow_s {
    max_workers => MCE::Util::get_ncpu(),
    gather      => \@sizes,
    bounds_only => 1,
}, sub {
    my ($start, $end) = @{ $_[1] };
    collatz_vr($_) for $start .. $end;

    MCE->gather(
        ( sort { $b->[0] <=> $a->[0] } @cache{ $start .. $end } )[ 0 .. 19 ]
    );
}, 1, 1e6;

MCE::Flow->finish;

say "@$_[ 1, 0 ]" for ( sort { $b->[0] <=> $a->[0] } @sizes )[ 0 .. 19 ];
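For clarity, a small hypothetical trace (run in the same file, after collatz_vr is defined) shows the [ seq_length, start_num, next_num ] layout the cache ends up with for n = 3:

# Hypothetical trace of the cache contents after one call (same file as above).
collatz_vr(3);
for my $k ( 3, 10, 5, 16, 8, 4, 2, 1 ) {
    my ( $len, $start, $next ) = @{ $cache{$k} };
    say "$k => seq_length=$len start=$start next=" . ( $next // 'undef' );
}
# prints, e.g.:  3 => seq_length=8 start=3 next=10
#                1 => seq_length=1 start=1 next=undef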

Run times on a 32-core machine (SMT disabled)

The times include launching Perl, loading modules, spawning workers, reaping workers, and output (~ 0.100 seconds). See the next post for the Inline C demonstration.

max_workers => 1

   collatz_vr         3.617s   1.0x
   collatz_choroba   14.896s   1.0x
   collatz_1nickt    12.264s   1.0x
   collatz_compress   8.932s   1.0x
   collatz_count      7.021s   1.0x
   collatz_count_c1   0.741s   1.0x   Inline C
   collatz_count_c2   0.625s   1.0x   Inline C

max_workers => 2

   collatz_vr         2.707s   1.3x
   collatz_choroba    7.538s   2.0x
   collatz_1nickt     6.141s   2.0x
   collatz_compress   4.490s   2.0x
   collatz_count      3.537s   2.0x
   collatz_count_c1   0.398s   1.9x   Inline C
   collatz_count_c2   0.339s   1.8x   Inline C

max_workers => 4

   collatz_vr         1.906s   1.9x
   collatz_choroba    3.948s   3.8x
   collatz_1nickt     3.274s   3.7x
   collatz_compress   2.365s   3.8x
   collatz_count      1.873s   3.7x
   collatz_count_c1   0.235s   3.2x   Inline C
   collatz_count_c2   0.203s   3.1x   Inline C

max_workers => 8

   collatz_vr         1.458s   2.5x
   collatz_choroba    1.995s   7.5x
   collatz_1nickt     1.655s   7.4x
   collatz_compress   1.209s   7.4x
   collatz_count      0.961s   7.3x
   collatz_count_c1   0.149s   5.0x   Inline C
   collatz_count_c2   0.132s   4.7x   Inline C

max_workers => 16

   collatz_vr         0.976s   3.7x
   collatz_choroba    1.034s  14.4x
   collatz_1nickt     0.858s  14.3x
   collatz_compress   0.637s  14.0x
   collatz_count      0.509s  13.8x
   collatz_count_c1   0.114s   6.5x   Inline C
   collatz_count_c2   0.105s   6.0x   Inline C

max_workers => 32

   collatz_vr         0.837s   4.3x
   collatz_choroba    0.593s  25.1x
   collatz_1nickt     0.497s  24.7x
   collatz_compress   0.379s  23.6x
   collatz_count      0.310s  22.6x
   collatz_count_c1   0.112s   6.6x   Inline C
   collatz_count_c2   0.108s   5.8x   Inline C

Interesting. For vr's demo, every worker starts with an empty cache, meaning workers do not have cached results from prior chunks. This is the reason it does not scale as well as the non-cache demonstrations.

collatz_count with 2 cores (3.537s) matches collatz_vr with 1 core (3.617s).
collatz_count with 4 cores (1.873s) matches collatz_vr with 4 cores (1.906s).
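As an aside, the private-cache effect is easy to make visible by printing each worker's cache size. Below is a minimal sketch (not part of the benchmark above, and using a simple memoized length helper rather than collatz_vr); every worker rebuilds entries the other workers already computed.

# Minimal sketch: each worker process keeps its own %cache, so entries
# computed by one worker are never seen by another.
use strict; use warnings; use feature 'say';
use MCE::Flow;

my %cache = ( 1 => 1 );   # n => sequence length

sub collatz_len {
    my $n = shift;
    my @path;
    while ( !exists $cache{$n} ) {
        push @path, $n;
        $n = $n % 2 ? 3 * $n + 1 : $n >> 1;
    }
    my $len = $cache{$n};
    $cache{$_} = ++$len for reverse @path;
    return $cache{ $path[0] // $n };
}

mce_flow_s { max_workers => 4, chunk_size => 25_000, bounds_only => 1 }, sub {
    my ( $start, $end ) = @{ $_[1] };
    collatz_len($_) for $start .. $end;
    MCE->say( "worker " . MCE->wid . " cached " . scalar( keys %cache ) . " entries" );
}, 1, 1e5;

MCE::Flow->finish;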

Kind regards, Mario

Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (Inline C)
by marioroy (Prior) on Apr 11, 2020 at 04:37 UTC

    Hi,

    There was one thing left to try, and that is using Inline::C. I captured the top 20 for 1e9 and 1e10.

    Update 1: Set chunk_size to reduce IPC. Tried staying within CPU cache by keeping @ret small. 1e10 now completes in 4 minutes (20% improvement).

    Update 2: Added collatz_count_c2 using GCC compiler intrinsics.

use strict;
use warnings;
use feature 'say';

use MCE::Flow;

use Inline C => Config =>
    CCFLAGSEX => '-O2 -fomit-frame-pointer',
    clean_after_build => 0;

use Inline C => <<'END_OF_C_CODE';

#include <stdlib.h>
#include <stdint.h>

#if defined(_WIN32)
#define strtoull _strtoui64
#endif

void collatz_count_c1( SV* _n )
{
    uint64_t n, steps = 1;  /* include 1 */

    Inline_Stack_Vars;

#ifdef __LP64__
    n = SvUV( _n );
#else
    n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif

    /* count using the T(x) notation */
    do {
        n % 2
            ? ( steps += 2, n = (3 * n + 1) >> 1 )
            : ( steps += 1, n = n >> 1 );
    } while ( n != 1 );

    Inline_Stack_Reset;
    Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
    Inline_Stack_Done;
}

void collatz_count_c2( SV* _n )
{
    uint64_t n, steps = 1;  /* include 1 */

    Inline_Stack_Vars;

#ifdef __LP64__
    n = SvUV( _n );
#else
    n = strtoull( SvPV_nolen( _n ), NULL, 10 );
#endif

    /* based on GCC compiler intrinsics demonstration by Alex Shirley */
    /* https://stackoverflow.com/questions/38114205/c-collatz-conjecture-optimization */
    /* https://www.go4expert.com/articles/builtin-gcc-functions-builtinclz-t29238 */

    if ( n % 2 == 0 ) {
        steps += __builtin_ctz(n);   /* account for all evens */
        n >>= __builtin_ctz(n);      /* always returns an odd */
    }

    /* when we enter we're always working on an odd number */
    do {
        n = 3 * n + 1;
        steps += __builtin_ctz(n) + 1;   /* account for odd and even */
        n >>= __builtin_ctz(n);          /* always returns an odd */
    } while ( n != 1 );

    Inline_Stack_Reset;
    Inline_Stack_Push( sv_2mortal( newSVuv( steps ) ) );
    Inline_Stack_Done;
}

END_OF_C_CODE

sub collatz_count {
    my ($n) = @_;
    my $steps = 1;   # count 1

    # count using the T(x) notation
    $n % 2
        ? ( $steps += 2, $n = (3 * $n + 1) >> 1 )
        : ( $steps += 1, $n = $n >> 1 )
        while $n != 1;

    return $steps;
}

no warnings 'once';

#*collatz = \&collatz_count;      # choose collatz here
#*collatz = \&collatz_count_c1;   # using the T(x) notation
 *collatz = \&collatz_count_c2;   # using compiler intrinsics

my $m = shift || 1e6;
my ( @sizes, $chunk_size );

$chunk_size  = int( $m / MCE::Util::get_ncpu() / 40 + 1 );
$chunk_size += 1 if $chunk_size % 2;

MCE::Flow->init(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => $chunk_size,
    gather      => \@sizes,
    bounds_only => 1,
);

mce_flow_s sub {
    my ( $start, $end ) = @{ $_[1] };
    my @ret;

    for ( $start .. $end ) {
        push @ret, [ $_, collatz($_) ];
        @ret = ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ]
            if @ret > 100;
    }

    ( @ret > 20 )
        ? MCE->gather( ( sort { $b->[1] <=> $a->[1] } @ret )[ 0..19 ] )
        : MCE->gather( @ret );

}, 1, $m;

MCE::Flow->finish;

say "@$_" for ( sort { $b->[1] <=> $a->[1] } @sizes )[ 0..19 ];
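    To make the intrinsics concrete: __builtin_ctz(n) returns the number of trailing zero bits, i.e. how many times n can be halved before it becomes odd, so all of those even steps are credited in one shot. A rough Perl rendition of that trick (an illustration only, not used in the benchmark) looks like this:

# Hypothetical Perl rendition of the ctz trick used in collatz_count_c2.
use strict; use warnings; use feature 'say';

my $n  = 40;                       # binary 101000
my $tz = 0;
$tz++ until ( $n >> $tz ) & 1;     # trailing zero count, like __builtin_ctz
my $steps = $tz;                   # credit all halving (even) steps at once
$n >>= $tz;                        # 40 -> 5 in a single shift

say "$steps halving steps, n is now $n";   # 3 halving steps, n is now 5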

    Output

1..1e9

   collatz_count      6m10.877s   32 cores   Perl
   collatz_count_c1  11m20.134s    1 core    Inline C
   collatz_count_c2   8m56.683s    1 core    Inline C
   collatz_count_c1   0m23.443s   32 cores   29.0x
   collatz_count_c2   0m18.507s   32 cores   29.0x

   670617279 987
   848749995 977
   954843745 972
   954843751 972
   716132809 969
   537099606 966
   537099607 966
   268549803 965
   805649409 964
   805649410 964
   805649411 964
   805649415 964
   402824705 963
   604237057 961
   604237058 961
   604237059 961
   302118529 960
   906355586 959
   906355587 959
   906355588 959

1..1e10, 32 cores, Inline C

   collatz_count_c1   4m00.323s
   collatz_count_c2   3m09.774s

   9780657630 1133
   9780657631 1133
   4890328815 1132
   7335493223 1130
   6189322407 1122
   9283983611 1120
   8140184737 1094
   6105138553 1091
   9157707830 1089
   9157707831 1089
   4578853915 1088
   6868280873 1086
   5151210655 1083
   7726815983 1081
   9282648843 1058
   6961986633 1055
   5221489974 1052
   5221489975 1052
   2610744987 1051
   7832234962 1050

1..1e11, 32 cores, Inline C

   collatz_count_c1  40m50.177s
   collatz_count_c2  32m17.319s

   75128138247 1229
   84519155529 1224
   85833820801 1224
   63389366646 1221
   63389366647 1221
   64375365601 1221
   31694683323 1220
   95084049969 1219
   95084049970 1219
   95084049971 1219
   96563048402 1219
   96563048403 1219
   47542024985 1218
   48281524201 1218
   71313037476 1216
   71313037477 1216
   71313037478 1216
   71313037479 1216
   72422286302 1216
   72422286303 1216

    Look at C go!

    Regards, Mario

Re^3: Optimizing with Caching vs. Parallelizing (MCE::Map) (Better caching?)
by vr (Curate) on Apr 08, 2020 at 21:40 UTC

    Hi marioroy,

    I think I'm having an issue with MCE on Windows here, timing your code (collatz_vr):

use Time::HiRes 'time';
my $t = time;

# the whole script here

say time - $t;
MCE::Flow->finish;
say time - $t;

__END__

1 worker:
6.01482510566711
7.76495599746704

2 workers:
4.12953305244446
7.07751798629761

4 workers:
3.33010196685791
8.4802930355072

    The 1st measurement approximately matches your output, but 1.5-2 seconds per worker to shut down doesn't look OK to me.

    For vr's demo, every worker starts with an empty cache, meaning workers do not have cached results from prior chunks. This is the reason it does not scale as well as the non-cache demonstrations.

    In other words, lots and lots of work is needlessly duplicated.

_cache_collatz( $_ ) for 1 .. 1e6;
say scalar %cache;

%cache = ();

_cache_collatz( $_ ) for 1 + 4e5 .. 1e6;
say scalar %cache;

__END__

2168611
2168611

    So, effectively, in a situation with e.g. 10 workers, the 4 junior workers filling the cache for ranges up to 400_000 are free to slack off and not gather their results at all, and there is a large amount of overlap in the work of the 6 senior workers, too. For now, I have no solid idea how to parallelize this algorithm efficiently.

      Hi, vr

      The 1.5-2 seconds is likely threads freeing memory. Change my %cache to our %cache. MCE defaults to spawning threads on Windows. Try passing the MCE option use_threads => 0 to use (emulated) child processes instead.

our %cache = ( 1 => [ 1, 1, undef ]);   # for faster global cleanup

mce_flow_s {
    max_workers => MCE::Util::get_ncpu(),
    gather      => \@sizes,
    bounds_only => 1,
    use_threads => 0,
}, sub {
    …
}, 1, 1e6;