http://qs321.pair.com?node_id=579552


in reply to Could there be ThreadedMapReduce (and/or ForkedMapReduce) instead of DistributedMapReduce?

I haven't looked into MapReduce myself so this may not be what you're after, but if you're trying to take the pain out of threading I'd recommend having a look at Parallel::ForkManager. I've used it in the past and it seems sufficiently robust.
  • Comment on Re: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?

Replies are listed 'Best First'.
Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?
by tphyahoo (Vicar) on Oct 20, 2006 at 11:23 UTC
    I originally started with P::FM as well, because it did look simple. But this uses forks, not threads, and so couldn't be applied to my problem.

    (Specifically, concatenating a string in a distributed way, where order wasn't important: using parallel processing to concatenate a string, where order of concatenation doesn't matter.)

    It didn't work, in a way that violated my expectations, because parallelization, like I said, is painful. In this case I would have needed to use threads.

    Perhaps there could be a ForkedMapReduce as well as a ThreadedMapReduce. The important thing to me is a reliable abstraction that Does What I Mean.

      The important thing to me is a reliable abstraction that Does What I Mean.

      What's wrong with this?

        It has locks.

        BrowserUK missed that on the first go, and had to add them later. So, it's not a clean abstraction. If ikegami hand't stepped in, we might have had a subtle, hard to find bug. (There's several layers of dialogue between these two about that, and these are smart people, so that's what I'm talking about parallelism being very hard to wrap your mind around.)

        Here, it seems like not such a big deal. But the coded that I wanted to parallelize was also an artificually simple example.

        The point is, you couldn't just cut and paste code that was executing serially, and change it in one place, and have it executing parallelly. Close maybe, but not quite. But what if my example hadn't been so simple? What if there had been two loops I wanted to processes parallely; what if they interacted somehow? Pain.

        That said, I am giving this code a careful study to see if I can apply it to the threadedreduce stub code I am working on at Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?

Re^2: Could there be a ThreadedMapReduce (instead of DistributedMapReduce)?
by tphyahoo (Vicar) on Oct 20, 2006 at 16:07 UTC
    Okay, I have here stub code for a parallel "threadedreduce", using what I think is pretty much functional style programming.

    My goal is along the same lines as my post yesterday concerning concating letters (mentioned in op), but more honed. Basically, I have a slow function for demo purposes, and I want to speed it up by using code executed in parallel; but in such a way that I can abstract the ugly bigs (locks, threads, etcetera) into a module that I maintain separately.

    So, here, my function in in the coderef:  $slow_matches_b It's slow because it sleeps for one second. And what it does it, it returns true if the input matches the letter b.

    I then have sub slowgrep, which should do the same thing that grep does, but in a more explicit way. slowgrep is built on sub reduce. This function is similar to the reduce you get in List::Util, but like Frank Sinatra, here, I did it my way. Could be it's not quite the same. I actually modeled it after the javascript reduce example code in the joelonsoftware essay Can Your Programming Language Do This?.

    Reduce is one of the basic building blocks of functional programming, and half of the google's celebrated MapReduce. This particular implementation of Reduce executes serially.

    Then I have the sub fastgrep, which should also do the same thing that grep does, but be faster because it uses some form of parallelization. Every line of this function is the same as slowgrep, except it is based on threadedreduce instead of reduce.

    threadedreduce should do the same thing as reduce except... it doesn't. As the tests show. In fact, it is just a stub, and actually it has forks not threads, but you get the idea. I think this function could be implemented either with forks, or with threads, or with POE, or whatever your favorite technique for cranking up parallelization in your environment. The point is that it lets programmers process lists in a parallel way while hiding away the complexity of the parallelized code.

    This is where I am hoping the monastery can step in and help me out. Can someone out there make threadedreduce work? threadedreduce could then be used to build threadedgrep, threadedUserAgent (same as UserAgent::Parallel, but you can WWW::Mechanize as the UA instead of the UA inherited from LWP::UserAgent)... or... well... lots of things. And in a way that's beautiful, functional, and maintainable.

    $ ./test_reduce.pl not ok 1 - parallel-y executing code works # Failed test 'parallel-y executing code works' # in ./test_reduce.pl at line 15. fast matches: $VAR1 = []; ok 2 - serially executing code works slow matches: $VAR1 = [ 'blee', 'blah', 'bloo' ]; 1..2 # Looks like you failed 1 test of 2. $ cat test_reduce.pl #!/usr/bin/perl use strict; use warnings; use Test::More qw( no_plan ); use Data::Dumper; use Parallel::ForkManager; my $slow_matches_b = sub { sleep 1; return 1 if $_[0] =~ /b/; }; my $test_strings = [ ('blee','blah','bloo', 'qoo', 'fwee' ) ]; my $fast_matches = fastgrep( $slow_matches_b, $test_strings ); ok( @$fast_matches, "parallel-y executing code works" ); print "fast matches: " . Dumper($fast_matches); # should dump out blee, blah bloo, but not fwee or qoo my $slow_matches = slowgrep( $slow_matches_b, $test_strings ); ok( @$slow_matches, "serially executing code works" ); print "slow matches: " . Dumper($slow_matches); sub fastgrep { my $test_function = shift; my $array = shift; my $grep_builder = sub { my $matches = shift; my $test_el = shift; push @$matches, $test_el if $test_function->($test_el); return $matches; }; return threadedreduce ( $grep_builder, $array, []) } sub slowgrep { my $test_function = shift; my $array = shift; my $grep_builder = sub { my $matches = shift; my $test_el = shift; push @$matches, $test_el if $test_function->($test_el); return $matches; }; return reduce( $grep_builder, $array, []) } # just a stub, hoping someone can help me with the threading sub threadedreduce { my $function = shift; my $array = shift; my $init = shift; my $pm=new Parallel::ForkManager(10); my $result = $init; for my $el ( @$array) { $pm->start and next; $result = $function->( $result, $el); $pm->finish; } $pm->wait_all_children; return $result; } sub reduce { my $function = shift; my $array = shift; my $init = shift; my $result = $init; for my $el ( @$array) { $result = $function->( $result, $el) } return $result; }
    UPDATE: I'm actually thinking the promising avenue of approach may to use Parallel::Queue, along the lines diotalevi suggested yesterday in Re: using parallel processing to concatenate a string, where order of concatenation doesn't matter.