Re^7: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)

by marioroy (Prior)
on Apr 23, 2022 at 16:15 UTC [id://11143236]


in reply to Re^6: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
in thread Optimizing with Caching vs. Parallelizing (MCE::Map)

Hi, etj

I updated the example because it was hanging on Windows with PDL 2.078. I will make a new MCE release so that MCE does this automatically:

PDL::set_autopthread_targ(1)

The recursion limit is still an issue with more than 8 workers on Windows. With the recent PDL 2.078 it now also happens randomly with 8 workers.


Replies are listed 'Best First'.
Re^8: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
by etj (Deacon) on Apr 24, 2022 at 13:15 UTC
    A long, exhaustive search of the PDL source code (using https://github.com/PDLPorters/pdl/search?q=structure+recursion) shows where that message originates (which, interestingly, it looks like you retyped rather than copy-pasted; the actual message says "PDL:Internal").

    The macro there (since at least v1.99987, from 1998) uses a process-global, function-static __nrec to attempt to track recursion depth. The problem on Windows is probably that Perl's "fork" there actually just creates a new thread. C global variables are still process-global, so that variable gets incremented by many different threads, both POSIX threads and the threads that emulate processes.
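
    To illustrate the accounting problem, here is a minimal single-threaded sketch. It is not PDL's actual macro: the names guard_enter, guard_leave, simulate and the small DEPTH_LIMIT are all hypothetical. It only models several callers being a few levels deep at the same moment while sharing one counter.

```c
#include <assert.h>
#include <stdbool.h>

#define DEPTH_LIMIT 5   /* hypothetical limit, kept small for illustration */

/* One counter for the whole process: every caller, whichever thread
 * it runs on, bumps this same variable. */
static int shared_depth = 0;

/* Returns false when the shared counter says the limit has been reached. */
bool guard_enter(void) {
    if (shared_depth >= DEPTH_LIMIT) return false;
    shared_depth++;
    return true;
}

void guard_leave(void) {
    assert(shared_depth > 0);
    shared_depth--;
}

/* Model `workers` callers that are each `depth` levels deep at the same
 * time. Returns how many guard_enter() calls were refused. */
int simulate(int workers, int depth) {
    int entered = 0, refused = 0;
    for (int w = 0; w < workers; w++) {
        for (int d = 0; d < depth; d++) {
            if (guard_enter()) entered++;
            else refused++;
        }
    }
    /* Unwind so the counter is back at zero for the next run. */
    while (entered-- > 0) guard_leave();
    return refused;
}
```

    With these assumptions, one caller at depth 3 is never refused, but two callers at depth 3 already trip a limit of 5, even though neither is individually anywhere near it. Real threads would add data races on top of this.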

    One possible fix would be to use some sort of thread-local storage to limit the scope of that variable. A much better solution would be to change the relevant functions to pass a depth count as a stack parameter, which would obviate the whole problem.

    Separately, turning off PDL's autopthread behaviour seems to me the correct thing for MCE to do. Otherwise you have two different kinds of parallelism at once, which seems likely to cause chaos.

      The change to using a recurse_count argument in the relevant function (PDL.make_physical and friends) has now been implemented in git (doesn't break ABI as the API functions are now wrappers for the recurse_count versions).
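
      A rough sketch of that shape, for readers who have not looked at the commit. This is not the PDL source: the node type, the RECURSE_LIMIT value, the return codes and both function names are hypothetical stand-ins. The point is only that the count travels on the call stack while the public function keeps its old signature.

```c
#include <assert.h>
#include <stddef.h>

#define RECURSE_LIMIT 1000  /* hypothetical limit */

/* Hypothetical stand-in for a structure whose parent must be
 * made physical before it can be. */
typedef struct node {
    struct node *parent;
    int physical;
} node;

/* Internal version: the depth is an argument, so every chain of calls
 * carries its own count, whichever thread it runs on.
 * Returns 0 on success, -1 if the chain is deeper than RECURSE_LIMIT. */
int make_physical_recurse(node *n, int recurse_count) {
    assert(recurse_count >= 0);
    if (recurse_count > RECURSE_LIMIT) return -1;
    if (n == NULL || n->physical) return 0;
    if (make_physical_recurse(n->parent, recurse_count + 1) != 0) return -1;
    n->physical = 1;
    return 0;
}

/* Public entry point: same one-argument signature as before, so existing
 * callers do not need to change. */
int make_physical(node *n) {
    return make_physical_recurse(n, 0);
}
```

      In this sketch a three-node chain succeeds, and a node that is its own parent is rejected once the count passes the limit, without any shared state.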
Re^8: Optimizing with Caching vs. Parallelizing (MCE::Map) (PDL: faster)
by etj (Deacon) on May 03, 2022 at 19:55 UTC
    Good news, everyone! PDL 2.079 has been released with the (I hope) fix for this, as mentioned in 11143248 above. See separate announcement.

    marioroy, could you try it and see whether it does in fact help?

      Confirming that I no longer see recursion limit warnings :)

      I tried various examples on PerlMonks. I get WARNINGs with PDL 2.079 when running demonstration 11115875. That was not the case when the example was created.

      WARNING: PDL::Primitive::vsearch_insert_leftmost does not handle bad values.
      WARNING: PDL::Primitive::vsearch_insert_leftmost does not handle bad values.
      ...

      Unfortunately, vr's serial demonstration 11116069 runs poorly: 42s, versus 19s with the autopthread target set to 1. It requires disabling auto-parallelization:

      PDL::set_autopthread_targ(1);

      Using the 2nd example 11116094 (the one for Windows, which also runs on UNIX), there is no improvement beyond 8 workers on Windows. On Linux there is no such problem: adding workers, up to the number of logical cores, reduces the run time. So I updated the code to cap at 8 workers on Windows. Strawberry Perl 5.32.1.1 PDL edition (with the included PDL 2.021) is noticeably faster than PDL 2.079. Is that the case for you?

      PDL 2.021 using Strawberry Perl 5.32.1.1 PDL edition (max 8 workers on Windows)
      perl demo_win.pl 1e7 :  3.778 seconds
      perl demo_win.pl 1e8 : 37.008 seconds

      PDL 2.079 using Strawberry Perl 5.32.1.1 PDL edition (PDL updated to 2.079)
      perl demo_win.pl 1e7 :  4.143 seconds
      perl demo_win.pl 1e8 : 42.302 seconds

      I ran the same demonstration on Ubuntu Linux 20.04 using Perl 5.30.0 and PDL 2.079.

      perl demo_win.pl 1e7 :  3.288 seconds  ( 8 workers)
      perl demo_win.pl 1e8 : 41.899 seconds

      perl demo_win.pl 1e7 :  2.331 seconds  (16 workers)
      perl demo_win.pl 1e8 : 23.329 seconds

      perl demo_win.pl 1e7 :  1.642 seconds  (24 workers)
      perl demo_win.pl 1e8 : 16.941 seconds

      perl demo_win.pl 1e7 :  1.310 seconds  (32 workers)
      perl demo_win.pl 1e8 : 13.198 seconds

      perl demo_win.pl 1e7 :  1.139 seconds  (40 workers)
      perl demo_win.pl 1e8 : 10.925 seconds

      perl demo_win.pl 1e7 :  1.004 seconds  (48 workers)
      perl demo_win.pl 1e8 :  9.791 seconds

      perl demo_win.pl 1e7 :  0.946 seconds  (56 workers)
      perl demo_win.pl 1e8 :  8.913 seconds

      perl demo_win.pl 1e7 :  0.877 seconds  (64 workers)
      perl demo_win.pl 1e8 :  8.305 seconds

        Glad to hear the recursion problem is solved.

        I haven't tried this, I'm afraid, as I'm currently fixing up PDL's macro mechanism, which has required a bit of a rejig of the whole code-generation stuff.

        It would be incredibly helpful if someone could run the two versions of PDL with Devel::NYTProf and reply here with roughly where the slowdown is. I appreciate it might be within the C code, but more information seems like it would be better.
