
Greetings, dideod.yang,

The regular expressions in your code present an opportunity to run in parallel. With multiple cores at our disposal, let us take Perl for a spin. Please find below the serial and parallel demonstrations.

Serial

use strict;
use warnings;

open my $input_fh,  "<", "test.txt"   or die "open error: $!";
open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

while (<$input_fh>) {
    if (/^sample\s+(\S+)/) {
        print $sample_fh $1, "\n";
    }
    elsif (/^good\s+(\S+)/) {
        print $good_fh $1, "\n";
    }
}

close $input_fh;
close $sample_fh;
close $good_fh;

Parallel

use strict;
use warnings;

use MCE;

open my $sample_fh, ">", "sample.txt" or die "open error: $!";
open my $good_fh,   ">", "good.txt"   or die "open error: $!";

# worker function

sub task {
    my ( $mce, $slurp_ref, $chunk_id ) = @_;
    my ( $sample_buf, $good_buf ) = ('', '');

    # open an in-memory file handle on the chunk (scalar ref)
    open my $input_fh, "<", $slurp_ref or die "open error: $!";

    # append to buffers inside the loop
    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            $sample_buf .= $1 . "\n";
        }
        elsif (/^good\s+(\S+)/) {
            $good_buf .= $1 . "\n";
        }
    }

    close $input_fh;

    # Send buffers to the manager process to print accordingly.
    # This prevents parallel workers from garbling output handles.

    MCE->print($sample_fh, $sample_buf);
    MCE->print($good_fh, $good_buf);
}

# spawn workers early, optionally
my $mce = MCE->new(
    chunk_size  => '2m',  # 2 megabytes
    max_workers => 4,
    use_slurpio => 1,
    user_func   => \&task,
)->spawn;

# process input file(s)
$mce->process({ input_data => "test.txt" });

# shutdown workers
$mce->shutdown;

# close output handles
close $sample_fh;
close $good_fh;
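The matching logic inside the worker can be tried on its own, without MCE, since it only needs a reference to a string. The following is a minimal sketch under that assumption. The helper name split_chunk and the sample data are hypothetical and not part of the scripts above.

```perl
use strict;
use warnings;

# Hypothetical helper: the body of task() without the MCE plumbing,
# so it can be exercised on a plain string.
sub split_chunk {
    my ($slurp_ref) = @_;
    my ( $sample_buf, $good_buf ) = ( '', '' );

    # In-memory file handle: reads lines from the scalar behind the ref
    open my $input_fh, "<", $slurp_ref or die "open error: $!";

    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            $sample_buf .= $1 . "\n";
        }
        elsif (/^good\s+(\S+)/) {
            $good_buf .= $1 . "\n";
        }
    }

    close $input_fh;
    return ( $sample_buf, $good_buf );
}

# Made-up chunk standing in for what MCE passes when use_slurpio is set
my $chunk = "sample a1\ngood b1\nother c1\nsample a2\n";
my ( $sample, $good ) = split_chunk( \$chunk );

print "sample lines:\n$sample";
print "good lines:\n$good";
```

Reading from a scalar ref this way avoids splitting the chunk into a list of lines first, which is likely why the worker takes that approach with a 2 megabyte chunk.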

50 million test

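The tests were timed on a system with an NVMe SSD. Notice the user times: the parallel run consumes about 22.9s of CPU time versus 22.0s for the serial run, so MCE adds little overhead, while elapsed time drops from 22.2s to 5.9s with 4 workers.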

$ time perl test_serial.pl

real    0m22.225s
user    0m22.018s
sys     0m0.171s

$ time perl test_parallel.pl

real    0m5.887s
user    0m22.925s
sys     0m0.293s
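To put the timings above into numbers, here is a small sketch that derives speedup, parallel efficiency, and extra CPU time from them. The figures are copied from the two runs. The variable names are my own, and "efficiency" here simply means speedup divided by the worker count.

```perl
use strict;
use warnings;

# Timings copied from the two runs above, in seconds
my %serial   = ( real => 22.225, user => 22.018, sys => 0.171 );
my %parallel = ( real => 5.887,  user => 22.925, sys => 0.293 );
my $workers  = 4;    # max_workers in the parallel script

# Wall-clock speedup and how much of the ideal 4x was reached
my $speedup    = $serial{real} / $parallel{real};
my $efficiency = $speedup / $workers;

# Extra CPU time (user + sys) spent by the parallel run
my $cpu_serial   = $serial{user} + $serial{sys};
my $cpu_parallel = $parallel{user} + $parallel{sys};
my $cpu_overhead = $cpu_parallel / $cpu_serial - 1;

printf "speedup:      %.2fx\n",  $speedup;
printf "efficiency:   %.0f%%\n", 100 * $efficiency;
printf "CPU overhead: %.1f%%\n", 100 * $cpu_overhead;
```

By this measure the parallel run is a little under 3.8 times faster with 4 workers, at a cost of under 5% additional CPU time.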

Regards, Mario

In reply to Re: About text file parsing by marioroy
in thread About text file parsing by dideod.yang
