Or, to look at it another way, you are passing the input a line at a time to the worker threads -- so two things are happening:
all of the thread communication overhead is being carried by every line -- I think this is the real issue.
you are streaming the output, so Thread::Pool arranges for the output of each job to be done in job order.
Having looked at the code, the way it does this is quite cunning: when a do routine completes a job before earlier jobs have completed, the results are put to one side, and the thread loops round to do the next job. When a job completes and all earlier jobs have completed, the thread outputs its results along with any results put aside earlier which can now be output.
Note that this means that if one job takes a very long time indeed, all the other threads will continue taking jobs and generating and storing the output. This is a Good Thing, because slower jobs are not blocking faster ones. However, I don't see a mechanism to stop jobs being dispatched if the streamed output backs up. But that could be because I haven't studied the code hard enough.
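That put-aside mechanism can be sketched roughly like this (my own illustration, not Thread::Pool's actual code; names are invented):

```perl
use strict;
use warnings;

# Hypothetical sketch of the reordering: each job carries a sequence
# number; results arriving out of order are parked in %pending until
# every earlier result has been emitted.
my %pending;    # seq => result, for jobs that finished early
my $next = 0;   # sequence number of the next result due out

sub take_result {
    my ($seq, $result) = @_;
    $pending{$seq} = $result;
    my @flushed;
    while (exists $pending{$next}) {
        push @flushed, delete $pending{$next};
        $next++;
    }
    return @flushed;    # in job order; may be empty if still waiting
}
```

So if job 1 finishes before job 0, its result is simply parked until job 0 arrives and releases both.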
So apart from the cost of all this cunning, I don't think this is the issue.
Without changing the structure of what you have too much, I'd change:
$pool->job($_) while (<STDIN>);
to something that reads a bunch of lines at a time (say 100K bytes' worth), something like:
while (!eof(STDIN)) {
    my $buffer ;
    read STDIN, $buffer, 100*1024 or die "$!" ;
    if (!eof(STDIN)) {
        my $nl = rindex($buffer, "\n") + 1 or die "No \\n in lump ?" ;
        if (my $trim = $nl - length($buffer)) {
            seek STDIN, $trim, 1 or die "Seek failed ??" ;
            substr($buffer, $trim) = '' ;
        } ;
    } ;
    $pool->job($buffer) ;
} ;
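One caveat with that loop: seek on STDIN only works when STDIN is a real file, and will fail if it is a pipe. An alternative (also untested, and the name is my own) that avoids seek altogether is to read a fixed-size chunk and then complete the trailing partial line with an ordinary readline:

```perl
use strict;
use warnings;

# Seek-free chunker: read a fixed-size chunk, then finish the partial
# last line with readline, so every lump ends on a newline boundary.
# Works on pipes as well as seekable files.
sub next_lump {
    my ($fh, $size) = @_;
    my $buffer;
    read($fh, $buffer, $size) or return undef;   # 0 bytes => end of input
    if ($buffer !~ /\n\z/ && !eof($fh)) {
        $buffer .= readline($fh);   # complete the partial last line
    }
    return $buffer;
}

# while (defined(my $lump = next_lump(\*STDIN, 100*1024))) {
#     $pool->job($lump);
# }
```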
(I haven't tested this, so caveat emptor !). Then the processing subroutine needs to break its input into lines and process into another buffer, and then return that lump of results. Something like:
sub do {
    my ($lump) = @_ ;
    my $result = '' ;
    while ($lump =~ m/(.+)/g) {   # Ignores blank lines
        my $line = $1 ;
        ... whatever ...
        $result .= ...whatever... . "\n" ;
    } ;
    return $result ;
} ;
(also untested). There's some saving to be made by removing the my ($lump) = @_ and matching with while ($_[0] =~ m/(.+)/g) directly instead, since $_[0] is an alias for the argument and so avoids copying the lump into a lexical.
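That aliased version would look something like this (untested sketch; the sub name is mine, and uppercasing stands in for the real per-line work):

```perl
use strict;
use warnings;

# Matching directly on $_[0], the alias into @_, saves copying the
# whole lump into a lexical. Uppercasing each line is a placeholder
# for the real processing.
sub process_lump {
    my $result = '';
    while ($_[0] =~ m/(.+)/g) {   # ignores blank lines
        my $line = $1;
        $result .= uc($line) . "\n";
    }
    return $result;
}
```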
This will amortise the job overhead over a reasonable chunk of lines. Obviously you can experiment to see if larger chunks means faster processing.
BTW, your do subroutine starts:
sub do {
    chomp;
    my $input = $_;
    $input = shift;
which doesn't look right to me. The arguments are passed to do() in the usual way, so hacking around with $_ appears redundant. Indeed, the chomp acts on $_ before $input is overwritten by the shift, so the actual argument is never chomped at all.
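I'd expect it to start simply by taking the argument from @_, something like this (sketch; returning the chomped line is just a placeholder body):

```perl
use strict;
use warnings;

# Cleaner opening for the 'do' subroutine: take the argument from @_
# directly and chomp it; no need to go via $_ at all.
sub do {
    my ($input) = @_;
    chomp $input;
    # ... whatever processing ...
    return $input;   # placeholder: return the chomped line
}
```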