PerlMonks  

How to use my two processors

by remzak (Acolyte)
on Jan 18, 2010 at 04:06 UTC ( [id://817926] )


in reply to Need a faster way to find matches

I managed to cut the run time in half (thank you all for your help). Along the way, the algorithm changed in a way that now gives the logic an opportunity to run in parallel; previously it was strictly sequential.

I tried to use threads, but the numerous calls to create threads added far too much overhead for any speed improvement. My program is not memory intensive; usage maxes out at just over 4MB before the program finishes. Yet with threads, the required memory exceeded the 2GB on my machine.
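
For what it's worth, I suspect the per-call thread creation is the problem, and that a small, fixed pool of workers created once up front and fed through a queue would avoid it. A rough sketch of that shape using Thread::Queue (find_pairs here is just a stand-in for my real per-item work):

    #! perl -slw
    use strict;
    use threads;
    use Thread::Queue;

    my $Q = Thread::Queue->new;

    ## Stand-in for the real per-item work.
    sub find_pairs { my $n = shift; return $n * 2 }

    ## Create the workers once; they live for the whole run,
    ## so thread-creation overhead is paid only twice.
    my @workers = map {
        threads->create( sub {
            while( defined( my $item = $Q->dequeue ) ) {
                find_pairs( $item );
            }
        } );
    } 1 .. 2;

    $Q->enqueue( $_ ) for 1 .. 1000;    ## submit work as it is generated
    $Q->enqueue( undef ) for @workers;  ## one sentinel per worker
    $_->join for @workers;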

Dual-core machines are very common these days, so I was hoping to rewrite the program to use both processors. My idea was to have a service run on one processor (taking requests to find pairs of numbers) while the main program looks for valid numbers, then wait for all the pairs to be found and finish the last step of the process.

I'm using Windows (it sounds like Perl behaves slightly differently there). I need some type of inter-process communication mechanism to submit new requests and, eventually, to wait for and receive back the (hash table) results. At this point I don't need detailed explanations, just your insights and recommendations on which modules are best suited to this problem.
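
To make the request/response shape I'm imagining concrete: two queues, one carrying requests in and one carrying results back, sketched here with threads and Thread::Queue (all names illustrative; I don't know whether this is the right tool on Windows):

    use strict;
    use threads;
    use Thread::Queue;

    my $requests = Thread::Queue->new;
    my $results  = Thread::Queue->new;

    ## The "service": pulls numbers off the request queue and pushes
    ## back "key=value" strings for the main thread's hash. (Plain
    ## strings, since queued data must be shareable between threads.)
    my $service = threads->create( sub {
        while( defined( my $n = $requests->dequeue ) ) {
            $results->enqueue( "$n=" . ( $n * 2 ) );  ## placeholder work
        }
        $results->enqueue( undef );                   ## signal completion
    } );

    $requests->enqueue( $_ ) for 1 .. 100;  ## main thread keeps producing
    $requests->enqueue( undef );            ## no more requests

    ## Wait for and collect the results into the hash.
    my %hash;
    while( defined( my $r = $results->dequeue ) ) {
        my( $k, $v ) = split /=/, $r;
        $hash{ $k } = $v;
    }
    $service->join;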

Replies are listed 'Best First'.
Re: How to use my two processors
by BrowserUk (Patriarch) on Jan 18, 2010 at 13:00 UTC

    There are some gains to be had from using threads, but not as much as you might hope for. The following are the results of a single-threaded test, and of a multi-threaded version operating on the same list using 1, 2, 3 & 4 threads:

    C:\test>817827
    Took 1.622
    Odd numbers in list: 4000
    Pairs found: 99135

    C:\test>817827-t -T=1
    Took 1.790000
    Odd numbers in list: 4000
    Pairs found: 99135

    C:\test>817827-t -T=2
    Took 1.365000
    Odd numbers in list: 4000
    Pairs found: 99135

    C:\test>817827-t -T=3
    Took 1.047500
    Odd numbers in list: 4000
    Pairs found: 99135

    C:\test>817827-t -T=4
    Took 0.832500
    Odd numbers in list: 4000
    Pairs found: 99135

    As you can see, the overhead of threading costs you relative to the non-threaded version in the 1-thread case. You make a small gain using two threads, and further small gains using 3 or 4, at the cost of complexity. The tests were run on a 4-core system.

    Note: These are only test apps; the results produced--a single-level hash with composite keys--may not suit your purposes.

    Non-threaded:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];
    use Data::Dump qw[ pp ];

    our $N ||= 4000;
    our $S ||= 1;
    srand( $S );

    ## Generate $N random odd 32-bit numbers.
    my @list = map 1 | int( rand( 2**32 ) ), 1 .. $N;

    my $start = time;

    ## Record a pair "$i:$_" for every later element whose
    ## bitwise AND with $list[ $i ] equals 1.
    my %hash;
    for my $i ( 0 .. $#list ) {
        use integer;
        my $v = $list[ $i ];
        ( $v & $_ ) == 1 and undef $hash{ "$i:$_" }
            for @list[ $i+1 .. $#list ];
    }

    printf "Took %.3f\n", time() - $start;
    print 'Odd numbers in list: ', scalar @list;
    print 'Pairs found: ', scalar keys %hash;
    #<>; pp \%hash;

    Threaded

    #! perl -slw
    use strict;
    use threads;
    use Time::HiRes qw[ time ];
    use Data::Dump qw[ pp ];

    $|++;
    our $N ||= 4000;
    our $T ||= 4;
    our $S ||= 1;
    srand( $S );

    ## Generate $N random odd 32-bit numbers.
    my @list = map 1 | int( rand( 2**32 ) ), 1 .. $N;

    my $cStart = 0;
    my $cSize = int( @list / $T );
    my $start = time;

    ## Partition the pair space: each thread pairs every element
    ## against its own chunk [$lo .. $hi] of the list.
    my @threads = map {
        my( $t ) = threads->create( sub {
            my( $lo, $hi ) = @_;
            use integer;
            my $tid = threads->tid;
            my @keys;
            for my $i ( 0 .. $#list ) {
                my $v = $list[ $i ];
                my $start = $i > $lo ? $i+1 : $lo;
                ( $v & $_ ) == 1 and push @keys, "$i:$_"
                    for @list[ $start .. $hi ];
            }
            return @keys;
        }, $cStart, $_ < $T ? $cStart + $cSize + 1 : $#list );
        $cStart += $cSize;
        $t;
    } 1 .. $T;

    ## Merge each thread's keys into one deduplicating hash.
    my %hash;
    for ( @threads ) {
        my @keys = $_->join;
        undef $hash{ $_ } for @keys;
    }

    printf "Took %.6f\n", time() - $start;
    print 'Odd numbers in list: ', scalar @list;
    print 'Pairs found: ', scalar keys %hash;
    <>;
    pp \%hash;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I appreciate your code; the partitioning approach was very helpful to see. I will see if I can incorporate it into my program. (I've rearranged the logic so that the search for pairs is embedded in the algorithm that creates the list... that one step brought a big performance improvement, but it makes using threads more difficult.)

      Using your program, the gains on my machine tail off at six or seven threads. To get the results below, I added some loops and ran each thread count 20 times to get an average (a rough sketch of that harness is below the numbers). Here are the results:

      Threads Used:  1   Time: 2.193750
      Threads Used:  2   Time: 1.646408
      Threads Used:  3   Time: 1.418214
      Threads Used:  4   Time: 1.311989
      Threads Used:  5   Time: 1.228125
      Threads Used:  6   Time: 1.221352
      Threads Used:  7   Time: 1.218237
      Threads Used:  8   Time: 1.211988
      Threads Used:  9   Time: 1.233855
      Threads Used: 10   Time: 1.212500
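
      The harness was nothing fancier than something along these lines (a sketch; run_test stands in for one timed run of your search at a given thread count):

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time sleep ];

      ## Stand-in for one run of the threaded search at thread count $t.
      sub run_test { my $t = shift; sleep( 1 / $t ) }

      for my $t ( 1 .. 10 ) {
          my $total = 0;
          for ( 1 .. 20 ) {               ## 20 runs per thread count
              my $start = time;
              run_test( $t );
              $total += time() - $start;
          }
          printf "Threads Used: %d Time: %f\n", $t, $total / 20;
      }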

      What I would really like to try is a parallel process/thread that receives new numbers to "pair up" while the main process keeps working to build the numbers. That process would need an event mechanism to accept a new number and, at the end, to return its list. I think POE provides just such a framework. This is much more complicated than partitioning the algorithm, so I'll try to leverage what you've shown me first.

        If you ran that on a 2-core system, I am pleasantly surprised that you continued to make appreciable gains with more than 2 threads, and very surprised that they were still relatively worthwhile for 3 and 4. Are you sure you don't have 2 CPUs, each with 2 hyper-threads?

        I've little experience of POE. Historically, I've been wary of the overhead of IPC via *nix pipes, but it was demonstrated to me a year or two ago that things have moved on markedly since my days with HPUX-10, so it may work for you.

        As for pipelined threads: the problem is that unless the producer thread produces a steady and relatively high stream of output, the consumer thread tends to get a timeslice, wake up, process one item and then have to relinquish the rest of its timeslice because there's no more work waiting for it. That tends to lead to a low ratio of work done to context switch. That's why I favour the partitioning approach.

        Pipelined processes tend to alleviate that problem by buffering the pipe, thereby deferring the context switch until there is enough work queued to make it worthwhile. You can emulate that with threads by batching up the producer's output, but it tends to lead to code that must be re-tuned manually for each hardware setup: you tune it for 2 CPUs, then have to re-tune it when you move to a 4-core box, and so on. Even a different workload can throw off the tuning; you set it up on a development box, then move it to an identical production box and have to re-tune because the workload mix differs. A sketch of the batching idea follows.
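
        The producer-side change is usually no more than accumulating items locally and enqueueing them in blocks, with the batch size as the knob that needs the hand-tuning described above (a sketch; BATCH and the queue shape are illustrative):

        #! perl -slw
        use strict;
        use threads;
        use Thread::Queue;

        use constant BATCH => 100;  ## the hand-tuned knob

        my $Q = Thread::Queue->new;

        ## Consumer: wakes to find a whole batch queued, so each
        ## timeslice does useful work instead of one item per switch.
        my $consumer = threads->create( sub {
            my $count = 0;
            while( defined( my $item = $Q->dequeue ) ) { ++$count }
            print "Consumed $count items";
        } );

        ## Producer: accumulate locally, enqueue one block at a time.
        my @batch;
        for my $item ( 1 .. 10_000 ) {
            push @batch, $item;
            if( @batch >= BATCH ) {
                $Q->enqueue( @batch );   ## one queue op per BATCH items
                @batch = ();
            }
        }
        $Q->enqueue( @batch ) if @batch; ## flush the remainder
        $Q->enqueue( undef );            ## end-of-work sentinel
        $consumer->join;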

        I've encountered similar problems with both event-driven and co-routine setups. If you buffer too little, you end up wasting cycles context switching to the consumer too frequently. If you buffer too much, the producer spins cycles waiting for the consumer to catch up.

        Good luck. Let us know how you get on. (I'm intrigued by the purpose of this.)


