About text file parsing

by dideod.yang (Sexton)
on Aug 28, 2018 at 23:15 UTC ( [id://1221282] )

dideod.yang has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks. I have some work that involves parsing text files. Sometimes the files have a great many lines, about 50,000,000... I always open the text file with the open function, and then I have to wait a long time for my script to complete... Do you have any special technique for reading a text file faster? Below are a sample file and my sample script.
###### test.txt########
sample AA
sample BB
Not sample CC
good boy
good yyy
bad aaa
open(FILE,"test.txt"); while(<FILE>){ if(/^sample\s+(\S+)/){push @sample,$1} if(/^good\s+(\S+)/){push @good,$1} } close(FILE);

Replies are listed 'Best First'.
Re: About text file parsing -- MCE
by Discipulus (Canon) on Aug 29, 2018 at 07:27 UTC
    Hello dideod.yang,

    if your file is huge, line-by-line processing will be slow with any variation of the algorithm. But you can throw more CPUs at it with, hopefully, better results. While parallel programming is not so easy to implement correctly in Perl, a gentle monk, marioroy, spent a lot of time and energy to help us, producing MCE, and the second example in its documentation can easily be modified to suit your needs.

    The example uses MCE::Loop to work on the file in chunks: pay attention to the OS-dependent implementations inside the mce_loop_f call below and choose the one appropriate for your OS.

    # from MCE docs: https://metacpan.org/pod/MCE
    use MCE::Loop;

    MCE::Loop::init {
        max_workers => 8, use_slurpio => 1
    };

    my $pattern  = 'something';
    my $hugefile = 'very_huge.file';

    my @result = mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        # Quickly determine if a match is found.
        # Process the slurped chunk only if true.

        if ($$slurp_ref =~ /$pattern/m) {
            my @matches;

            # The following is fast on Unix, but performance degrades
            # drastically on Windows beyond 4 workers.

            open my $MEM_FH, '<', $slurp_ref;
            binmode $MEM_FH, ':raw';
            while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
            close $MEM_FH;

            # Therefore, use the following construction on Windows.

            while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
                my $line = $1;   # save $1 to not lose the value
                push @matches, $line if ($line =~ /$pattern/);
            }

            # Gather matched lines.

            MCE->gather(@matches);
        }

    } $hugefile;

    print join('', @result);

    L*

    UPDATE: you may also be interested in some other techniques you can find in my library

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: About text file parsing
by Corion (Patriarch) on Aug 29, 2018 at 08:45 UTC

    Have you timed how fast you can read the file at all? Maybe reading the file is what limits your speed?

    Do you have enough RAM to keep all the data you are extracting in arrays? Maybe writing the output into separate files immediately makes things faster. At least it makes certain that your program uses far less RAM.
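    For example, a quick baseline along these lines (a rough sketch, reusing the OP's file name) shows how much of the runtime is pure reading, before any regex or array work:

    use strict;
    use warnings;
    use Time::HiRes qw(time);    # core module, gives sub-second timing

    my $t0    = time;
    my $lines = 0;

    open my $fh, '<', 'test.txt' or die "open: $!";
    $lines++ while <$fh>;        # read only: no regex, no capture, no storage
    close $fh;

    printf "read %d lines in %.1f seconds\n", $lines, time - $t0;

    Whatever your real script needs beyond that baseline is being spent in the regexes and in growing the arrays.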

Re: About text file parsing
by davido (Cardinal) on Aug 29, 2018 at 18:45 UTC

    I did the following:

    perl -e 'open my $outfh, ">", "sample.txt"; while ($i++ < 50_000_000) {print $outfh "abcdefghijklmnopqrstuvwxyz0123456789\n";}'

    On my laptop with an SSD that took about fifteen seconds to run. Then I did this:

    perl -E 'open my $infh, "<", "sample.txt"; while(<$infh>) {$i++} say $i;'

    And that took about eight seconds to run. In the case of your code, the while() {...} loop invokes the regex engine, does a capture, and pushes onto two arrays. If, say, 50% of the lines in your file are "hits", you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays.
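    If you want to check that estimate against your own data, something like this rough sketch will do (Devel::Size is a CPAN module, not core, and the 12-character capture size is made up); it measures a smaller array and extrapolates linearly:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    # One million distinct 12-character "captures" as a stand-in sample.
    my @sample = map { sprintf 'value%07d', $_ } 1 .. 1_000_000;

    my $bytes = total_size(\@sample);
    printf "1M captures: %.0f MB; 25M would be very roughly %.1f GB\n",
        $bytes / 2**20, 25 * $bytes / 2**30;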

    If your run-times for the code segment you demonstrated are under 30-45 seconds, you're probably doing about as well as can be expected for a single process working with one file. If the time is over a couple of minutes, you're probably swamping memory and doing a lot of paging behind the scenes. If that's the case, consider writing entries to a couple of output files instead of pushing into @good and @sample. This adds IO overhead to the process, but removes the memory pressure, which is probably generating even more IO overhead behind the scenes at a much lower layer.
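    Something along these lines is what I mean (a minimal sketch; the output file names are only illustrative):

    use strict;
    use warnings;

    open my $in,     '<', 'test.txt'   or die "open: $!";
    open my $sample, '>', 'sample.out' or die "open: $!";
    open my $good,   '>', 'good.out'   or die "open: $!";

    while (<$in>) {
        # Write each capture straight to disk instead of keeping it in memory.
        print {$sample} "$1\n" if /^sample\s+(\S+)/;
        print {$good}   "$1\n" if /^good\s+(\S+)/;
    }

    close $_ for $in, $sample, $good;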

    Once the 'sample' and 'good' files are written, you can process them line by line to do whatever you would have done with the arrays. Another alternative: instead of pushing onto @sample and @good, do the processing that would later happen on those arrays just in time, for each line of the input file. I.e.:

    my %dispatch = (
        sample => sub { my $capture = shift; },   # do something with $capture
        good   => sub { my $capture = shift; },   # do something with $capture
    );

    while (<FILE>) {
        if (/^(sample|good)\s+(\S+)/) {
            $dispatch{$1}->($2);
        }
    }

    As long as # do something with $capture does not include storing the entire capture into an array, this should pretty much wipe out the large memory footprint.


    Dave

Re: About text file parsing
by bliako (Monsignor) on Aug 29, 2018 at 10:52 UTC

    If you put the file on a RAM disk (as TheloniusMonk suggested; see the reply below) and also keep your results in RAM (push @sample...), you will need somewhat more RAM to run it. Copying the file to the RAM disk, which the OS will do for you via cp, also takes some time (although reading it line by line from a normal disk with your script will probably take longer). Then there are SSDs and mechanical drives, and with each the time benefit will be different. This is the easiest approach that requires no extra code from you.

    The additional benefit of the RAM-disk route is that you can keep your input files there for multiple perl runs, until the next reboot or until you remove them from RAM. So the second time you run a similar script, looking for different patterns, you will see a bigger time benefit because the input is already in RAM.

    If you go the parallel way (as Discipulus mentioned), you are bound by the total IO bandwidth of your hard disk, so the gain may be less than simply multiplying by the number of parallel workers. I am also not sure whether splitting the file and copying the pieces to physically different hard disks gains you anything. If the content of your file is just independent lines (e.g. it is not XML spanning multiple lines), you can break that large file into smaller chunks (and keep it that way) and see if that helps parallelisation, in conjunction with storing the chunks on different disks. Edit: split -l 1000000 input.txt will split the input into chunks of 1,000,000 lines each (on unix).

    If your Not sample CC lines are numerous, you can filter them out before running all the different regexes on each line of input, or even before running the perl script at all: for example via grep -v 'Not sample CC' input.txt | perl ..., or with a perl one-liner filter, though I am not sure perl beats grep here. Of course the lines to be filtered out must share a common regex.

    And finally, if you do manage to remove all the Not sample CC lines, it is worth trying the following to see if it is faster (caveat: the keys of %inp will come out in random order, not in insertion order as with arrays):

    open(FILE,"test.txt"); my %inp = (); while(<FILE>){ if(/^(.+?)\s+(\S+)/){ $inp{$1} = $2 } } close(FILE);

    Edit: If you want to pass the output of your command above to another command for further processing, the problem of waiting for a process to finish before feeding its output to the next command was solved a long time ago: it is called a pipeline, and it is essentially what you see in the unix-style cmd1 | cmd2 | cmd3 .... cmd1 starts emitting results as soon as it reads its input (if it is a simple program like yours above); its output is immediately read by cmd2, which spits out its own output as soon as its first line of input arrives, and so on to cmd3, which gives you output almost as soon as cmd1 has read its first line, plus the propagation time. So you save a lot of time, and results start appearing almost immediately. The proviso is that processing one line or chunk of input must be independent of the lines that follow it.
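    To make your script usable as such a pipeline stage, it only needs to read STDIN and print each result as soon as it sees it. A minimal sketch (the filter.pl name and the surrounding commands are only illustrative), to be run as e.g. grep -v 'Not sample CC' test.txt | perl filter.pl | your_next_command:

    #!/usr/bin/perl
    # filter.pl - emit "sample<TAB>value" or "good<TAB>value" per matching line
    use strict;
    use warnings;

    $| = 1;   # autoflush, so the next command in the pipe sees each line immediately

    while (my $line = <STDIN>) {
        if    ($line =~ /^sample\s+(\S+)/) { print "sample\t$1\n" }
        elsif ($line =~ /^good\s+(\S+)/)   { print "good\t$1\n"   }
    }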

      I meant OP could store the arrays as files on a RAM disk, not the input file necessarily - though that is an interesting extra idea.
Re: About text file parsing
by SuicideJunkie (Vicar) on Aug 29, 2018 at 17:42 UTC

    XY problem question here...
    What are you going to be using those arrays for? Are they huge, or is the sample/good a small subset of the input data? There may be better ways to approach the entire task.

    Perhaps consider something more like this:

    use strict;
    use warnings;

    open my $ifh,         '<', 'test.pl'       or die;
    open my $ofh_samples, '>', 'samples.txt'   or die;
    open my $ofh_good,    '>', 'goodlines.txt' or die;

    $|=1;
    my $total     = -s 'test.pl';
    my $progress  = 0;
    my $linecount = 0;

    while (my $line = <$ifh>) {
        $linecount++;
        $progress += length($line);

        print $ofh_samples "$1\n" if $line =~ /^sample\s+(\S+)/;
        print $ofh_good    "$1\n" if $line =~ /^good\s+(\S+)/;

        printf "Processing... %3.1f%% completed... \r", 100*$progress/$total
            unless $linecount % 100;
    }
    That will keep little more than one line in memory at a time, and you can then deal with the pieces separately. The printf line gives a rough progress display; drop it (and the $total/$progress bookkeeping) if you don't want one.

Re: About text file parsing
by marioroy (Prior) on Aug 30, 2018 at 12:48 UTC

    Greetings, dideod.yang,

    The regular expressions in your code present an opportunity for running in parallel. With parallel cores among us (our friends), let us take Perl for a spin. Please find below serial and parallel demonstrations.

    Serial

    use strict;
    use warnings;

    open my $input_fh,  "<", "test.txt"   or die "open error: $!";
    open my $sample_fh, ">", "sample.txt" or die "open error: $!";
    open my $good_fh,   ">", "good.txt"   or die "open error: $!";

    while (<$input_fh>) {
        if (/^sample\s+(\S+)/) {
            print $sample_fh $1, "\n";
        }
        elsif (/^good\s+(\S+)/) {
            print $good_fh $1, "\n";
        }
    }

    close $input_fh;
    close $sample_fh;
    close $good_fh;

    Parallel

    use strict;
    use warnings;

    use MCE;

    open my $sample_fh, ">", "sample.txt" or die "open error: $!";
    open my $good_fh,   ">", "good.txt"   or die "open error: $!";

    # worker function
    sub task {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;
        my ( $sample_buf, $good_buf ) = ('', '');

        # open file handle to scalar ref
        open my $input_fh, "<", $slurp_ref;

        # append to buffers inside the loop
        while (<$input_fh>) {
            if (/^sample\s+(\S+)/) {
                $sample_buf .= $1 . "\n";
            }
            elsif (/^good\s+(\S+)/) {
                $good_buf .= $1 . "\n";
            }
        }

        close $input_fh;

        # Send buffers to the manager process to print accordingly.
        # This prevents parallel workers from garbling output handles.
        MCE->print($sample_fh, $sample_buf);
        MCE->print($good_fh, $good_buf);
    }

    # spawn workers early, optionally
    my $mce = MCE->new(
        chunk_size  => '2m',   # 2 megabytes
        max_workers => 4,
        use_slurpio => 1,
        user_func   => \&task,
    )->spawn;

    # process input file(s)
    $mce->process({ input_data => "test.txt" });

    # shutdown workers
    $mce->shutdown;

    # close output handles
    close $sample_fh;
    close $good_fh;

    50 million test

    The tests were timed on a system with a NVMe SSD. Notice the user times. MCE has low overhead.

    $ time perl test_serial.pl

    real    0m22.225s
    user    0m22.018s
    sys     0m0.171s

    $ time perl test_parallel.pl

    real    0m5.887s
    user    0m22.925s
    sys     0m0.293s

    Regards, Mario

      Hi again,

      One may want to have the manager-process receive and loop through @sample and @good. That will incur an additional CPU core for the manager-process itself.

      use strict;
      use warnings;

      use MCE;

      open my $sample_fh, ">", "sample.txt" or die "open error: $!";
      open my $good_fh,   ">", "good.txt"   or die "open error: $!";

      # worker function
      sub task {
          my ( $mce, $slurp_ref, $chunk_id ) = @_;
          my ( @sample, @good );

          # open file handle to scalar ref
          open my $input_fh, "<", $slurp_ref;

          # append to arrays inside the loop
          while (<$input_fh>) {
              if (/^sample\s+(\S+)/) {
                  push @sample, $1;
              }
              elsif (/^good\s+(\S+)/) {
                  push @good, $1;
              }
          }

          close $input_fh;

          # send arrays to the manager-process
          MCE->gather(\@sample, \@good);
      }

      # manager function
      sub gather {
          my ( $sample, $good ) = @_;

          # process sample
          for ( @{ $sample } ) { ; }

          # process good
          for ( @{ $good } ) { ; }
      }

      # spawn workers early, optionally
      my $mce = MCE->new(
          chunk_size  => '1m',   # 1 megabyte
          max_workers => 4,
          use_slurpio => 1,
          user_func   => \&task,
          gather      => \&gather,
      )->spawn;

      # process input file(s)
      $mce->process({ input_data => "test.txt" });

      # shutdown workers
      $mce->shutdown;

      # close output handles
      close $sample_fh;
      close $good_fh;

      The extra time comes from workers appending to local arrays. Likewise, the manager-process receiving and looping through the arrays. There are 4 workers and the manager process running simultaneously on a machine with 4 real cores.

      $ time perl test_demo.pl

      real    0m9.932s
      user    0m43.956s
      sys     0m0.452s

      Update:

      Interestingly, Perl v5.20 and higher take 2x longer to run, and I'm not sure why; possibly the regular expressions? It is on my TODO list to investigate. The above was captured with Perl v5.18.2 on the same machine.

      $ time /opt/perl-5.20.3/bin/perl test_demo.pl

      real    0m20.858s
      user    1m20.164s
      sys     0m8.488s

      Regards, Mario

        Once again, hi :)

        Using a simplified demonstration, the regular expressions appear to be 3x slower in Perl v5.20 and higher. I'm not sure why.

        use strict;
        use warnings;

        use MCE;

        sub task {
            my ( $mce, $slurp_ref, $chunk_id ) = @_;

            # open file handle to scalar ref
            open my $input_fh, "<", $slurp_ref;

            while (<$input_fh>) {
                if (/^sample\s+(\S+)/) { ; }
                elsif (/^good\s+(\S+)/) { ; }
            }

            close $input_fh;
        }

        MCE->new(
            chunk_size  => '1m',
            max_workers => 4,
            use_slurpio => 1,
            user_func   => \&task
        );

        MCE->process({ input_data => "test.txt" });
        MCE->shutdown;

        Results

        $ time /opt/perl-5.8.9/bin/perl -I. test_demo.pl
          real 0m3.826s    user 0m14.352s    sys 0m0.133s

        $ time /opt/perl-5.10.1/bin/perl -I. test_demo.pl
          real 0m4.369s    user 0m16.935s    sys 0m0.126s

        $ time /opt/perl-5.12.5/bin/perl -I. test_demo.pl
          real 0m4.889s    user 0m18.944s    sys 0m0.134s

        $ time /opt/perl-5.14.4/bin/perl -I. test_demo.pl
          real 0m4.860s    user 0m18.865s    sys 0m0.127s

        $ time /opt/perl-5.16.3/bin/perl -I. test_demo.pl
          real 0m4.815s    user 0m18.724s    sys 0m0.129s

        $ time /opt/perl-5.18.4/bin/perl -I. test_demo.pl
          real 0m4.668s    user 0m18.356s    sys 0m0.116s

        $ time /opt/perl-5.20.3/bin/perl -I. test_demo.pl
          real 0m14.195s   user 0m49.155s    sys 0m7.282s

        $ time /opt/perl-5.22.4/bin/perl -I. test_demo.pl
          real 0m14.316s   user 0m49.586s    sys 0m7.041s

        $ time /opt/perl-5.24.3/bin/perl -I. test_demo.pl
          real 0m14.612s   user 0m50.251s    sys 0m7.531s

        $ time /opt/perl-5.26.1/bin/perl -I. test_demo.pl
          real 0m14.212s   user 0m49.418s    sys 0m6.999s

        $ time /opt/perl-5.28.0/bin/perl -I. test_demo.pl
          real 0m14.308s   user 0m49.476s    sys 0m7.137s

        Regards, Mario

Re: About text file parsing
by TheloniusMonk (Sexton) on Aug 29, 2018 at 08:31 UTC
    Your arrays might occupy gigs of virtual memory, which might be too much for your machine. Also there are overheads for the internal format in which Perl stores arrays. You might be better off creating a RAM drive to store these as files in memory, so that a specific amount of memory is reserved in advance and it is less memory than is required for Perl arrays. It depends on your OS how to achieve this. And any solution also depends on what you are about to do with these arrays.
Re: About text file parsing
by stevieb (Canon) on Aug 29, 2018 at 14:36 UTC

    Although I really like Discipulus's approach of processing the file in parallel, I thought I'd throw out Tie::File as an option. I've used it successfully a couple of times over the years. It doesn't load the entire file at once; instead, it reads it in chunks and presents the file as an array.

    Instead of doing:

    while (<$fh>){ ... }

    You'd do something like the following after tying the file with the module:

    for (@fh){ ... }
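    For reference, the tie itself might look something like this (a minimal sketch; Tie::File opens read-write by default, so the Fcntl mode shown here makes it read-only). Note that every element access goes through the tie layer, so on a 50-million-line file this will likely be slower than a plain while loop; its appeal is random access and a small memory footprint:

    use strict;
    use warnings;
    use Fcntl 'O_RDONLY';
    use Tie::File;

    # Present test.txt as an array without slurping it all into memory.
    tie my @fh, 'Tie::File', 'test.txt', mode => O_RDONLY
        or die "Cannot tie test.txt: $!";

    my (@sample, @good);

    for (@fh) {
        push @sample, $1 if /^sample\s+(\S+)/;
        push @good,   $1 if /^good\s+(\S+)/;
    }

    untie @fh;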
Re: About text file parsing
by tybalt89 (Monsignor) on Aug 30, 2018 at 18:58 UTC

    See if it is faster reading big chunks at a time, like this simple test case (of course, modify it for your file).
    This only runs the regexes once for each chunk, instead of once per line.

    #!/usr/bin/perl
    # https://perlmonks.org/?node_id=1221282

    open my $fh, '<', \<<END;
###### test.txt########
sample AA
sample BB
Not sample CC
good boy
good yyy
bad aaa
END

    local $/ = \1e6;   # or bigger chunk depending on your memory size

    while(<$fh>)       # read big chunk
    {
        $_ .= do { local $/ = "\n"; <$fh> // '' };   # read any partial line

        push @sample, /^sample\s+(\S+)/gm;
        push @good,   /^good\s+(\S+)/gm;
    }
    close($fh);

    print "sample = @sample\n good = @good\n";

    Outputs:

    sample = AA BB
     good = boy yyy

      That's cool, tybalt89. Every day I learn something new about Perl.

      I ran the serial and parallel versions with "test.txt" containing 50 million lines. There is no slowness with Perl v5.20 and higher.

      Serial

      use strict;
      use warnings;

      open my $input_fh,  '<', 'test.txt'   or die "open error: $!";
      open my $sample_fh, '>', 'sample.txt' or die "open error: $!";
      open my $good_fh,   '>', 'good.txt'   or die "open error: $!";

      # tybalt89's technique running serially
      # see https://www.perlmonks.org/?node_id=1221387

      local $/ = \2e6;   # or bigger chunk depending on your memory size

      while (<$input_fh>) {                                 # read big chunk
          $_ .= do { local $/ = "\n"; <$input_fh> // '' };  # read any partial line

          print $sample_fh join("\n", /^sample\s+(\S+)/gm), "\n";
          print $good_fh   join("\n", /^good\s+(\S+)/gm  ), "\n";
      }

      close $input_fh;
      close $sample_fh;
      close $good_fh;

      Parallel

      use strict;
      use warnings;

      use MCE;

      open my $sample_fh, '>', 'sample.txt' or die "open error: $!";
      open my $good_fh,   '>', 'good.txt'   or die "open error: $!";

      # tybalt89's technique running parallel
      # see https://www.perlmonks.org/?node_id=1221387

      MCE->new(
          chunk_size  => '1m',
          max_workers => 4,
          use_slurpio => 1,
          input_data  => 'test.txt',
          user_func   => sub {
              my ( $mce, $slurp_ref, $chunk_id ) = @_;
              local $_ = ${ $slurp_ref };

              MCE->print($sample_fh, join("\n", /^sample\s+(\S+)/gm), "\n");
              MCE->print($good_fh,   join("\n", /^good\s+(\S+)/gm  ), "\n");
          }
      )->run;

      close $sample_fh;
      close $good_fh;

      Demo

      $ time /opt/perl-5.26.1/bin/perl demo_serial.pl

      real    0m15.662s
      user    0m15.025s
      sys     0m0.607s

      $ time /opt/perl-5.26.1/bin/perl demo_parallel.pl

      real    0m4.042s
      user    0m15.617s
      sys     0m0.345s

      Regards, Mario

Re: About text file parsing
by Marshall (Canon) on Aug 29, 2018 at 19:01 UTC
    Processing 50 million lines is going to be "slow".

    Do not push to an array if you can process the line right now!

    while(<FILE>) {
        if(/^sample\s+(\S+)/){ process_sample($1) }
        if(/^good\s+(\S+)/)  { process_good($1)   }
    }
Re: About text file parsing
by Anonymous Monk on Aug 29, 2018 at 06:57 UTC

    Hi,

    How many gigabytes of RAM do you have? How many gigabytes is your file (4.7GB)?
