Re^5: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)

in reply to Re^4: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Greetings,

The prior demonstration sorts (qsorti) on a single core, which costs about 5 extra seconds for 1e8. That led me to try sorting in parallel inside user_end. Code for both UNIX and Windows is provided.
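
The idea behind sorting in parallel can be sketched without PDL or MCE: each worker keeps only the top N entries of its own slice, and the much smaller candidate lists are merged at the end. This is a minimal illustration on toy data. The helper name top_n and the two hard-coded partitions are assumptions made for the example and do not appear in the scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the $n largest [index, value] pairs,
# ordered by value descending, then by index ascending.
sub top_n {
    my ( $n, @pairs ) = @_;
    my @sorted = sort { $b->[1] <=> $a->[1] || $a->[0] <=> $b->[0] } @pairs;
    $#sorted = $n - 1 if @sorted > $n;
    return @sorted;
}

my @values = ( 5, 1, 9, 3, 7, 2, 8, 6 );    # toy data, not Collatz lengths
my @pairs  = map { [ $_, $values[$_] ] } 0 .. $#values;
my $n      = 3;

# Two partitions stand in for the slices handled by two workers.
my @parts = ( [ @pairs[ 0 .. 3 ] ], [ @pairs[ 4 .. 7 ] ] );

# Each "worker" sorts only its own partition and keeps its top $n.
my @candidates = map { top_n( $n, @$_ ) } @parts;

# The final merge sorts at most (partitions * $n) candidates.
my @merged = top_n( $n, @candidates );

# Reference result from sorting everything at once.
my @direct = top_n( $n, @pairs );

print join( ' ', map { "$_->[0]:$_->[1]" } @merged ), "\n";   # 2:9 6:8 4:7
```

Because every global top-N entry is also a top-N entry of its own partition, the merged result matches a full sort, while the expensive sorting is spread across the workers.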

Update 1: Output is not 100% consistent. As written, this approach is not suited to parallelism because of cache misses. The non-PDL solutions (linked here and here in the original post) handle cache misses, but this one does not yet.

Update 2: Output is now 100% consistent, made possible by MCE::relay.
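
For readers unfamiliar with MCE::relay, here is a minimal sketch of the property relied on here: blocks passed to MCE::relay run one at a time, in chunk_id order, even though the workers process chunks concurrently. It assumes MCE is installed. The random delay and the @order array are illustrative only and are not taken from the scripts.

```perl
use strict;
use warnings;
use MCE::Flow;

my @order;

MCE::Flow->init(
    max_workers => 4,
    chunk_size  => 1,
    init_relay  => 0,                          # enables relay
    gather      => sub { push @order, $_[0] },
);

mce_flow_s sub {
    my ( $mce, $seq_n, $chunk_id ) = @_;

    # Unordered part: workers finish this at different times.
    select( undef, undef, undef, rand(0.01) );

    # Ordered part: runs serially, by chunk_id. Called once per chunk.
    MCE::relay {
        MCE->gather( $chunk_id );
    };
}, 1, 12;

MCE::Flow->finish;

print "@order\n";
```

In the scripts, the same mechanism serializes the writes to the shared $lengths piddle, so that a chunk's "even positions ahead" update happens only after earlier chunks have done theirs.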

UNIX:

use strict;
use warnings;
use feature 'say';

BEGIN {
    # Does not work on Windows unfortunately.
    die "Sorry, this script requires a UNIX based OS, exiting...\n"
        if $^O eq 'MSWin32';
}

use PDL;
use File::Map; # ensure that Perl has File::Map before loading FastRaw
use PDL::IO::FastRaw;

use MCE::Signal '$tmp_dir';
use MCE::Flow;
use MCE::Candy;

{
    no warnings 'once'; $PDL::BIGPDL = 1;
    eval q{ PDL::set_autopthread_targ(1) };
}

use List::Util;
BEGIN { *_min = \&List::Util::min;          # collision
        *_max = \&List::Util::max }         # with PDL

use constant MAX    => shift || 1e7;
use constant TOP    => _min( 20, MAX );
use constant CHUNK  => _min( 40000, MAX );  # but keep it even
use constant MAXLEN => MAX * 1;             # ?? # x(1..2)

use Time::HiRes 'time';
my $t = time;

# create a raw file for lengths
writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" );

# memory map the raw file
my $lengths = mapfraw( "$tmp_dir/lengths" );
$lengths-> inplace-> setvaltobad( 1 );
$lengths-> set( 1, 1 );
$lengths-> set( 2, 2 );
$lengths-> set( 4, 3 );

my @top_seqs;

MCE::Flow->init(
    max_workers => MCE::Util::get_ncpu(),
    chunk_size  => CHUNK + 1,
    bounds_only => 1,
    init_relay  => 1,
    gather      => MCE::Candy::out_iter_array(\@top_seqs),

    user_end    => sub {
        # wait for any remaining workers to complete processing
        MCE->sync;

        my $size = MAX / MCE->max_workers + 1;
        my $from = ( MCE->wid - 1 ) * $size + 1;
        my $to   = $from + $size;

        $from++   if $from > 1;
        $to = MAX if $to   > MAX;

        my $lengths_c = $lengths-> slice([ $from, $to ]);
        $lengths_c-> badflag( 0 );

        my $sorted_i = $lengths_c-> qsorti;
        my $sorted   = $lengths_c-> index( $sorted_i );
        my $value    = $sorted-> at( $to - $from - TOP );
        my $pos      = vsearch_insert_leftmost( $value, $sorted );
        my $top_i    = $sorted_i-> slice([ $to - $from, $pos ]);

        ( my $result = $lengths_c
            -> index( $top_i )
            -> longlong
            -> bitnot
            -> cat( $top_i + $from )
            -> transpose
            -> qsortvec
            -> slice([], [ 0, TOP - 1 ])

        )-> slice([ 0 ], [])
         -> inplace
         -> bitnot;

        # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ],
        my $str = $result->string;
        $str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;
        my $ret = eval $str;

        MCE->gather( MCE->wid, @$ret );
    },
);

mce_flow_s sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my ( $from, $to ) = @{ $chunk_ref };

    my $seqs_c = $from + sequence( longlong, $to - $from + 1 );

    if ( $chunk_id == 1 ) {
        $seqs_c-> setbadat( 0 );
        $seqs_c-> setbadat( 1 );
        $seqs_c-> badvalue( 2 );
    }
    else {
        $seqs_c-> setbadat(  $from % 2 ? 1 : 0 );
        $seqs_c->    slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2;
        $seqs_c-> badvalue( 2 );
    }

    my $lengths_c = $lengths-> slice([ $from, $to ]);
    my $current   = zeroes( short, nelem( $seqs_c ));

    while ( any $seqs_c-> isgood ) {

        my ( $seqs_c_odd, $current_odd_masked )
            = where( $seqs_c, $current, $seqs_c & 1 );

        $current_odd_masked ++;
        $current ++;
        ( $seqs_c_odd *= 3 ) ++;
        $seqs_c >>= 1;

        my ( $seqs_cap, $lengths_cap, $current_cap )
            = where( $seqs_c, $lengths_c, $current,
                $seqs_c <= MAXLEN );

        my $lut = $lengths-> index( $seqs_cap );

        # "_f" is for "finished"

        my ( $seqs_f, $lengths_f, $lut_f, $current_f )
            = where( $seqs_cap, $lengths_cap, $lut, $current_cap,
                $lut-> isgood );

        $lengths_f .= $lut_f + $current_f;
        $seqs_f    .= 2;                    # i.e. BAD
    }

    # "_e" is for "at even positions, ahead"                    ##
                                                                ##
#   my $from_e = _max( $from * 2, $to ) + 2;        # bug       ##
    my $from_e = $from == 0 ? $to + 2 : $from * 2;  # fixed     ##
    my $to_e   = _min( $to * 2, MAXLEN );                       ##
                                                                ##
    MCE::relay {                                                ##
      ( $lengths-> slice([ $from_e, $to_e, 2 ])                 ##
          .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++   ##
              if $from_e <= MAXLEN;                             ##
    };                                                          ##

}, 0, MAX;

MCE::Flow->finish;

@top_seqs = ( sort { $b->[1] <=> $a->[1]} @top_seqs )[ 0 .. TOP - 1 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @top_seqs;

say {*STDERR} time - $t;
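
The "From PDL to Perl" step near the end of user_end can be tried in isolation. The sketch below applies the same substitution to a hard-coded string that is assumed to resemble how PDL stringifies a small two-column piddle (the exact whitespace PDL emits may differ). Each row's two numbers are swapped and commas are added, so the text becomes a valid Perl array-of-arrays literal.

```perl
use strict;
use warnings;

# Assumed shape of $result->string for two rows of [ length index ].
my $str = "\n[\n [ 3 10]\n [ 5 20]\n]\n";

# Swap the two numbers in each row and add the commas Perl needs:
# "[ 3 10]" becomes "[ 10,3],".
$str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;

# The rewritten text is now Perl source for a nested array reference.
my $ret = eval $str;
die "eval failed: $@" if $@;

printf "%d %d\n", @$_ for @$ret;   # prints "10 3" then "20 5"
```

Evaluating a string this way is only reasonable because the input comes from PDL's own output. For untrusted text, parsing the numbers explicitly would be the safer choice.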

Windows:

The following script works on Windows with up to 8 workers. Specifying more than 8 workers causes PDL to emit "PDL::Internal Error: data structure recursion limit exceeded (max 1000 levels)". I also tested on Linux and saw no problems there, even with 32 workers.

use strict;
use warnings;
use feature 'say';

use PDL;
use File::Map; # ensure that Perl has File::Map before loading FastRaw
use PDL::IO::FastRaw;

use MCE::Signal '$tmp_dir';
use MCE::Flow;
use MCE::Candy;

{
    no warnings 'once'; $PDL::BIGPDL = 1;
    eval q{ PDL::set_autopthread_targ(1) };
}

use List::Util;
BEGIN { *_min = \&List::Util::min;          # collision
        *_max = \&List::Util::max }         # with PDL

use constant MAX    => shift || 1e7;
use constant TOP    => _min( 20, MAX );
use constant CHUNK  => _min( 40000, MAX );  # but keep it even
use constant MAXLEN => MAX * 1;             # ?? # x(1..2)

use Time::HiRes 'time';
my $t = time;

# create a raw file for lengths
writefraw( ones( short, 3 + MAXLEN ), "$tmp_dir/lengths" );

my $max_workers = $^O eq 'MSWin32' ? 8 : MCE::Util::get_ncpu();
my @top_seqs;
my $lengths;

MCE::Flow->init(
    max_workers => _min( $max_workers, MCE::Util::get_ncpu() ),
    chunk_size  => CHUNK + 1,
    bounds_only => 1,
    init_relay  => 1,
    gather      => MCE::Candy::out_iter_array(\@top_seqs),

    user_begin  => sub {
        $lengths = mapfraw( "$tmp_dir/lengths" );
        if ( MCE->wid == 1 ) {
            $lengths-> inplace-> setvaltobad( 1 );
            $lengths-> set( 1, 1 );
            $lengths-> set( 2, 2 );
            $lengths-> set( 4, 3 );
        }
        MCE->sync;
    },

    user_end    => sub {
        # wait for any remaining workers to complete processing
        MCE->sync;

        my $size = MAX / MCE->max_workers + 1;
        my $from = ( MCE->wid - 1 ) * $size + 1;
        my $to   = $from + $size;

        $from++   if $from > 1;
        $to = MAX if $to   > MAX;

        my $lengths_c = $lengths-> slice([ $from, $to ]);
        $lengths_c-> badflag( 0 );

        my $sorted_i = $lengths_c-> qsorti;
        my $sorted   = $lengths_c-> index( $sorted_i );
        my $value    = $sorted-> at( $to - $from - TOP );
        my $pos      = vsearch_insert_leftmost( $value, $sorted );
        my $top_i    = $sorted_i-> slice([ $to - $from, $pos ]);

        ( my $result = $lengths_c
            -> index( $top_i )
            -> longlong
            -> bitnot
            -> cat( $top_i + $from )
            -> transpose
            -> qsortvec
            -> slice([], [ 0, TOP - 1 ])

        )-> slice([ 0 ], [])
         -> inplace
         -> bitnot;

        # From PDL to Perl: [ 0 1 ] becomes [ 1, 0 ],
        my $str = $result->string;
        $str =~ s/(\d+)\s+(\d+)(.*)/$2,$1$3,/g;
        my $ret = eval $str;

        MCE->gather( MCE->wid, @$ret );
    },
);

mce_flow_s sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my ( $from, $to ) = @{ $chunk_ref };

    my $seqs_c = $from + sequence( longlong, $to - $from + 1 );

    if ( $chunk_id == 1 ) {
        $seqs_c-> setbadat( 0 );
        $seqs_c-> setbadat( 1 );
        $seqs_c-> badvalue( 2 );
    }
    else {
        $seqs_c-> setbadat(  $from % 2 ? 1 : 0 );
        $seqs_c->    slice([ $from % 2 ? 1 : 0, $to - $from, 2 ]) .= 2;
        $seqs_c-> badvalue( 2 );
    }

    my $lengths_c = $lengths-> slice([ $from, $to ]);
    my $current   = zeroes( short, nelem( $seqs_c ));

    while ( any $seqs_c-> isgood ) {

        my ( $seqs_c_odd, $current_odd_masked )
            = where( $seqs_c, $current, $seqs_c & 1 );

        $current_odd_masked ++;
        $current ++;
        ( $seqs_c_odd *= 3 ) ++;
        $seqs_c >>= 1;

        my ( $seqs_cap, $lengths_cap, $current_cap )
            = where( $seqs_c, $lengths_c, $current,
                $seqs_c <= MAXLEN );

        my $lut = $lengths-> index( $seqs_cap );

        # "_f" is for "finished"

        my ( $seqs_f, $lengths_f, $lut_f, $current_f )
            = where( $seqs_cap, $lengths_cap, $lut, $current_cap,
                $lut-> isgood );

        $lengths_f .= $lut_f + $current_f;
        $seqs_f    .= 2;                    # i.e. BAD
    }

    # "_e" is for "at even positions, ahead"                    ##
                                                                ##
#   my $from_e = _max( $from * 2, $to ) + 2;        # bug       ##
    my $from_e = $from == 0 ? $to + 2 : $from * 2;  # fixed     ##
    my $to_e   = _min( $to * 2, MAXLEN );                       ##
                                                                ##
    MCE::relay {                                                ##
      ( $lengths-> slice([ $from_e, $to_e, 2 ])                 ##
          .= $lengths-> slice([ $from_e / 2, $to_e / 2 ])) ++   ##
              if $from_e <= MAXLEN;                             ##
    };                                                          ##

}, 0, MAX;

MCE::Flow->finish;

@top_seqs = ( sort { $b->[1] <=> $a->[1]} @top_seqs )[ 0 .. TOP - 1 ];

printf "Collatz(%5d) has sequence length of %3d steps\n", @$_
    for @top_seqs;

say {*STDERR} time - $t;

Results:

time perl script.pl

1e7:
    serial    15.311s   1 core
  parallel     7.898s   2 cores
  parallel     4.229s   4 cores
  parallel     2.244s   8 cores
  parallel     1.265s  16 cores
  parallel     0.815s  32 cores

1e8:
  serial    2m38.645s   1 core
  parallel    11.779s  32 cores  before: serial qsorti
  parallel     6.652s  32 cores  after : parallel qsorti

  parallel     2.656s  32 cores  non-PDL solution
  parallel     1.644s  32 cores  non-PDL solution with Inline::C
               https://perlmonks.org/?node_id=11115780

Collatz(63728127) has sequence length of 950 steps
Collatz(95592191) has sequence length of 948 steps
Collatz(96883183) has sequence length of 811 steps
Collatz(86010015) has sequence length of 798 steps
Collatz(98110761) has sequence length of 749 steps
Collatz(73583070) has sequence length of 746 steps
Collatz(73583071) has sequence length of 746 steps
Collatz(36791535) has sequence length of 745 steps
Collatz(55187303) has sequence length of 743 steps
Collatz(56924955) has sequence length of 743 steps
Collatz(82780955) has sequence length of 741 steps
Collatz(85387433) has sequence length of 741 steps
Collatz(63101607) has sequence length of 738 steps
Collatz(64040575) has sequence length of 738 steps
Collatz(93128574) has sequence length of 736 steps
Collatz(93128575) has sequence length of 736 steps
Collatz(94652411) has sequence length of 736 steps
Collatz(96060863) has sequence length of 736 steps
Collatz(46564287) has sequence length of 735 steps
Collatz(69846431) has sequence length of 733 steps

Regards, Mario

Re^6: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on Apr 22, 2022 at 13:11 UTC
I was intrigued by that Windows error for >8 workers; does that still happen with recent versions of PDL? If so, could you create a GitHub issue?
Re^7: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on Apr 23, 2022 at 16:15 UTC
Hi, etj. I updated the example because it was hanging on Windows with PDL 2.078. I will make a new MCE release and have MCE do this automatically: `PDL::set_autopthread_targ(1)`. The recursion limit is still an issue beyond 8 workers on the Windows platform. With the recent PDL 2.078 it now also happens randomly with 8 workers.
Re^8: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on Apr 24, 2022 at 13:15 UTC
A long, exhaustive search of the PDL source code (using https://github.com/PDLPorters/pdl/search?q=structure+recursion) shows where that message originates (which, interestingly, it looks like you retyped rather than copy-pasted: the actual message says "PDL:Internal"). The macro there (since at least v1.99987, from 1998) uses a process-global, function-static `__nrec` to attempt to track recursion depth. The problem on Windows is likely that Perl's "fork" there actually just makes a new thread. C global variables are still process-global, so that variable gets incremented by lots of different threads, both POSIX threads and process-faking threads. One possible fix is to use some sort of thread-local storage to limit the scope of that variable. A much better solution would be to change the relevant functions to pass a depth count as a stack parameter, which would obviate this whole problem. Separately, turning off PDL autopthread behaviour seems to me the correct behaviour for MCE. Otherwise you have two different types of parallelism, which seems likely to cause chaos.
Re^9: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on Apr 24, 2022 at 15:44 UTC
Re^8: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on May 03, 2022 at 19:55 UTC
Good news, everyone! PDL 2.079 has been released with the (I hope) fix for this, as mentioned in 11143248 above. See the separate announcement. marioroy, could you try it and see if it does in fact help?
Re^9: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by marioroy (Prior) on May 17, 2022 at 20:21 UTC
Re^10: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster) by etj (Deacon) on May 18, 2022 at 23:13 UTC