
How to optimize a regex on a large file read line by line ?

by John FENDER (Acolyte)
on Apr 16, 2016 at 13:35 UTC

John FENDER has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm currently running some basic tests on parsing huge files for security work, searching them with a basic regex. As the files can be more than 10 GB, I can't load them fully into memory, so I have to read them line by line. My standard test is to count the number of lines and to search for the 123456$ regexp. I need to do both: count the number of lines in the file and count the number of matches. Here is my code:
open (FH, '<', "../Tests/10-million-combos.txt");
$counter  = 0;
$counter2 = 0;
while (<FH>) {
    if (/123456$/) { ++$counter2; }
    ++$counter;
}
print "Num. Line : $counter - Occ : $counter2\n";
close FH;
It's simple, but for a simple file of 2 GB it takes 12.6 minutes!!! I suspect I did something wrong, as Perl is a fast language, but I'm not good enough to know what. Please help! Thanks.

Re: How to optimize a regex on a large file read line by line ?
by AnomalousMonk (Archbishop) on Apr 16, 2016 at 15:19 UTC
    ... for a simple file of 2 Gb it takes 12,6 min ...

    Wait... Over 12 minutes to process a 2 GB file in the simple way you've shown?!? I put together a 10,000,000 line file of 200 characters per line, with the last six characters '000000' .. '999999', and processing with your code took just over 20 seconds on my laptop (update: although some later runs took just over 40 seconds). (Generating the file only took about 40 seconds!)

    If I understand your 12 minute claim correctly, I have a sneaking suspicion that you're not showing us the code you're actually running. It's important to show real code and not "It's just like as if it was this code..."

    Update: If, however, the time is actually on the order of 12 seconds, I honestly don't think you're going to do a great deal better; such a time would seem pretty good to me.


    Give a man a fish:  <%-{-{-{-<

      I'm currently hiding nothing :).

      I have the latest ActiveState Perl installed on my machine (ActivePerl-5.22.1.2201-MSWin32-x64-299574).

      I've uploaded to my FTP both files I used for my tests. I'm running Windows 10 Home Edition (it's my personal laptop, as I'm at home these days), with a quad core at 3.1 GHz and 16 GB of RAM.

      To give you an idea, a grep + wc command gives me a result in 10 s, Java or C# in 30 s, C++ in 48 s, PHP 7 in 50 s, Ruby in 85 s, Python in 346 s, PowerShell in 682 s, VBS in 1031 s, Free Pascal in 72.58 s, VB.NET in 100.63 s...

      Maybe it's something related to the Perl distribution, you think? I will try with another distribution.

        How do you grep line by line?

        I suppose grep does the same thing I suggested before: reading large chunks into memory and trying to match multiple lines at once.

        Another option is to fork into four children, each processing a quarter of the file, to use the full power of your machine.

        And BTW, using lexical variables declared with my should help a little too.
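
        For illustration, a minimal untested sketch of that chunked-read idea (the chunk size and file name are my assumptions; essentially the same pattern appears in the serial code later in this thread): read a fixed-size block, extend it to the next newline so no line is split, then let /m anchor $ at every line end.

            use strict;
            use warnings;

            my $chunk_size = 8 * 1024 * 1024;              # assumed: 8 MB per read
            my ( $lines, $hits ) = ( 0, 0 );

            open my $fh, '<', '../Tests/10-million-combos.txt' or die "open: $!";

            while ( read( $fh, my $buf, $chunk_size ) ) {
                $buf .= <$fh> unless eof $fh;              # finish the partial line at the chunk end
                $lines += ( $buf =~ tr/\n// );             # count newlines in the chunk
                $hits  += () = $buf =~ /123456\r?$/mg;     # /m lets $ match at every line end
            }

            close $fh;
            print "Num. Line : $lines - Occ : $hits\n";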

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        But do you confirm that the processing time with Perl for the OPed code is in excess of 12 minutes? That's what would be shocking to me.

        Someone else would have to advise about differences between distributions (I'm running Strawberry 5.14.4.1 for my tests (update: on Windows 7)), but I would be flabbergasted by such a performance difference.


        Give a man a fish:  <%-{-{-{-<

        By the way, here is the full 2 GB dictionary I'm using for tests:

        http://mab.to/tbT8VsPDm

        Please give me your execution times with the same code, and your platform; it's interesting.

Re: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 16, 2016 at 14:32 UTC
Re: How to optimize a regex on a large file read line by line ?
by graff (Chancellor) on Apr 16, 2016 at 16:29 UTC
    Not that this would make a big difference in terms of run-time, but you don't have to keep your own counter for the number of lines in the file. The predefined global variable $. does that for you (cf. the perlvar man page):
    print "Num. Line : $. - Occ : $counter2\n";
    A few other observations...

    I fetched the "10-million-combos.txt.zip" file you cited in one of the replies above, and noticed that it contains just the one text file. In terms of benchmarking, you might find that a command-line operation like this:

    unzip -p 10-million-combos.txt.zip | perlscript
    is likely to be faster than having the perl script read an uncompressed version of the file from disk, because piping output from "unzip -p" involves fetching just 23 MB from disk, as opposed to 112 MB to read the uncompressed version. (Disk access time is always a factor for stuff like this.)

    Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/. UPDATE: actually, there would be 2 matches on a windows system, and I find those two on my machine if I search for /123456\r\n$/.

    I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline, assuming that this would be the fastest possible way to do your regex search-and-count, but then I tried it out on your actual data and got a surprise (running on a macbook pro, osx 10.10.5, 2.2GHz intel core i7, 4GB ram):

    $ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
    0
            3.30 real         3.25 user         0.01 sys
    $ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
    0
            3.23 real         3.22 user         0.01 sys
    $ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
    0
            3.18 real         3.17 user         0.01 sys
    $ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
    9835513 lines, 0 matches
            1.96 real         1.89 user         0.02 sys
    $ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
    9835513 lines, 0 matches
            1.96 real         1.93 user         0.02 sys
    $ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
    9835513 lines, 0 matches
            1.93 real         1.90 user         0.02 sys
    I ran each command three times in rapid succession, to check for timing differences due to system cache behavior and other unrelated variables. Perl is consistently faster by about 33% (and can report total line count along with match count, which the grep utility cannot do).

    (If I remove the "$" from the regex, looking for 123456 anywhere on any line, I find three matches, and the run times are just a few percent longer overall.)

      "The predefined global variable $. does that for you"

      Wasn't aware of this trick, thanks !

      "Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/."

      Ahem, sounds like I've done something wrong while zipping the file. The 19x MB file containing 10 million passwords is now updated correctly. You will find 10000000 lines in it, and 61466 matching the regex 123456$.

      "unzip -p 10-million-combos.txt.zip | perlscript"

      Currently I'm working on the txt file only, but it's interesting. I've done your test like this:

      echo 1:%time%
      unzip -p 10-million-combos.zip | grep 123456$ | wc -l
      echo 2:%time%
      grep 123456$ 10-million-combos.txt | wc -l
      echo 3:%time%
      pause

      Result :

      1:19:16:46,11
      61466
      2:19:16:48,43
      61466
      3:19:16:49,00

      0.58 s in plain text, 2.27 s with the zip file piped.

      Now with your command line:

      zip piped : 3,89
      unzip -p "C:\Users\admin\Desktop\10-million-combos.zip" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

      plain text : 5,16
      type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

      perl direct : 2,29
      perl "demo.pl"

      Fastest on my side remains direct access to the plain text file, either using grep or perl. Amazing to see the perl unzip pipeline go faster than plain text access with a one-liner... The shell is strange sometimes...

      "I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline"

      I'm using the one you can find in the Unix utils; I suppose it's the GNU one ported to Windows. --version gives me: grep (GNU grep) 2.4.2.

      Now grep vs perl
      echo %time%& grep 123456$ C:\Users\admin\Desktop\10-million-combos.txt | wc -l& echo %time%
      echo %time%& type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"& echo.&echo %time%
      echo %time%& perl demo.pl& echo %time%

      Gives me:

      19:43:28,91/61466/19:43:29,51 for grep (0,6)
      19:45:29,51/61466/19:45:34,71 for perl (5,2)
      19:46:13,27/61466/19:46:15,47 for perl (direct) (2,2)
        Thanks for showing your comparison of the unzip pipeline vs. reading uncompressed text. I had said that the former would be faster (because of less reading from disk), but without actually testing it. (I think I must have encountered at least a couple situations in the past where some process finished more quickly if I read compressed data from disk, rather than uncompressed, but I don't know what may have been different in those cases.)

        Having now tested it for this situation (multiple times in quick succession to check for consistency), the difference in timing was negligible or slightly favoring reading the uncompressed file, so it seems my initial idea about the role of disk access was wrong: either it really doesn't make any difference, or else whatever difference it makes is washed out by the added overhead of the extra unzip process and/or the pipeline itself.

        (The perl one-liner was still faster than the compiled "grep" utility on my machine, but YMMV - different machines will have different versions / compilations of both Perl and grep.)

      I think the problem comes from the huge file. How long does the same request take on your computer on the 1.9 GB dictionary?

      http://mab.to/tbT8VsPDm

Re: How to optimize a regex on a large file read line by line ?
by Athanasius (Archbishop) on Apr 16, 2016 at 13:49 UTC

    Hello John FENDER, and welcome to the Monastery!

    Since you don't print a result until the loop has finished, it appears that you expect the regex to match only once. In that case, you can cut the time substantially[1] by exiting the loop as soon as a match is found:

    while (<FH>) {
        ++$counter;
        if (/123456$/) {
            ++$counter2;
            last;
        }
    }

    See perlsyn#Loop-Control.

    [1] By half, on the average, if the matching line appears in a random location within the file.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hello Athanasius! Thanks for your answer: I don't want to leave my loop until I know how many users with the password 123456$ I have in the file. Cheers.

        Ah yes, I see. In that case, you’re going to have to read through the whole file, and I doubt there’s much you can do to speed up the loop.

        BTW, when I saw the regex /123456$/, I assumed you wanted to match 123456 at the end of a line — that’s what the $ anchor means in a regex. If you want to match a literal $, you need to escape it: m{123456\$} or:

        use strict;
        use warnings;
        use autodie;
        ...
        my $password = '123456';

        open(FH, '<', "../Tests/10-million-combos.txt");

        my $counter  = 0;
        my $counter2 = 0;

        while (<FH>) {
            ++$counter;
            ++$counter2 if /\Q$password/;
        }

        print "Num. Line : $counter - Occ : $counter2\n";

        close FH;

        See quotemeta.

        Hope that helps,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 17, 2016 at 09:23 UTC

    Hello John FENDER,

    The following is a parallel demonstration using MCE::Flow and MCE::Shared.

    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow {
        chunk_size => '1m', max_workers => 8, use_slurpio => 1,
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        my ( $numlines, $occurances ) = ( 0, 0 );

        while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
            $numlines++;
            $occurances++ if ( $1 =~ /123456\r/ );
        }

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );
    }, $fh;

    close $fh;

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

    The following construction reads the plain text file directly if already unzipped.

    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
        chunk_size => '1m', max_workers => 8, use_slurpio => 1,
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        my ( $numlines, $occurances ) = ( 0, 0 );

        while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
            $numlines++;
            $occurances++ if ( $1 =~ /123456\r/ );
        }

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );
    }, "10-million-combos.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

      Update: Shortened the code.

      Hello again,

      Slurping requires two regular expressions: one for breaking the chunk into actual lines and the other for the query. Below, workers instead receive an array reference containing some number of lines and run slightly faster, possibly due to needing only one regex.

      use strict;
      use warnings;

      use MCE::Flow;
      use MCE::Shared;

      open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";   # read via unzip -p, as in the first demonstration

      my $counter1 = MCE::Shared->scalar( 0 );
      my $counter2 = MCE::Shared->scalar( 0 );

      mce_flow {
          chunk_size => '1m', max_workers => 8,
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my $numlines   = @{ $chunk_ref };
          my $occurances = 0;

          for ( @{ $chunk_ref } ) {
              $occurances++ if /123456\r/;
          }

          $counter1->incrby( $numlines );
          $counter2->incrby( $occurances );
      }, $fh;

      close $fh;

      print "Num lines : ", $counter1->get(), "\n";
      print "Occurances: ", $counter2->get(), "\n";

      And finally, the construction for reading the plain text file directly.

      use strict;
      use warnings;

      use MCE::Flow;
      use MCE::Shared;

      my $counter1 = MCE::Shared->scalar( 0 );
      my $counter2 = MCE::Shared->scalar( 0 );

      mce_flow_f {
          chunk_size => '1m', max_workers => 8,
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my $numlines   = @{ $chunk_ref };
          my $occurances = 0;

          for ( @{ $chunk_ref } ) {
              $occurances++ if /123456\r/;
          }

          $counter1->incrby( $numlines );
          $counter2->incrby( $occurances );
      }, "10-million-combos.txt";

      print "Num lines : ", $counter1->get(), "\n";
      print "Occurances: ", $counter2->get(), "\n";
        Hello marioroy,

        It's impressive! And I've tested your code successfully. In the end, the Strawberry distribution works best on my system.

        "It's works even well with the mixed file (multiple kind of EOF) ! Benchmarking both the 3 methods, i've found 32,53 and 33,07 for the two codes provided kindfully. My current code (works only on a cr+lf or lf file) have done 33,76".

        It's impressive to see the 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from my code, which doesn't load all the cores like that. Strange!

        Very happy anyway; I'm now close to the best performance I could get on my laptop with Perl!

        . Grep : 10,71
        . Java : 25,95
        . C# : 30,05
        . Perl : 32,53
        . C++ : 41,3
        . PHP : 52,31
        . Free Pascal : 76,46
        . Delphi 7 : 78,14
        . VB.NET : 100,15
        . Python : 315,13
        . PowerShell : 681,93
        . VBS : 1031,63
        . Ruby : Failed to parse the file correctly.
Re: How to optimize a regex on a large file read line by line ?
by RichardK (Parson) on Apr 16, 2016 at 14:51 UTC

    How long are the lines in your file? and how many lines is it reading in total? Maybe reading it a line at a time is not the best approach for your data set.

      How long? Well, it can vary depending on the extract you make and the data you analyze. Some logs are huge, more than 2 GB... For a start: 10000000 lines for the passwords log, 185866729 lines for the dictionary file. The entries are not very long, nothing more than 8 or 16 chars I would say.

        There's no point trying to optimize your code if you're not sure what your data looks like. However index will be faster than a regex if you're only looking for a fixed string.

        As other people have recommended, profile your code and find out where the time is going.
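
        For illustration, a minimal untested sketch of the index() idea above, counting matches only at end-of-line as in the OP's regex (the file name is the OP's). It uses rindex, the search-from-the-end variant, so an earlier occurrence in a line cannot mask one at its end:

            use strict;
            use warnings;

            my $needle = '123456';
            my ( $counter, $counter2 ) = ( 0, 0 );

            open my $fh, '<', '../Tests/10-million-combos.txt' or die "open: $!";

            while ( my $line = <$fh> ) {
                ++$counter;
                $line =~ s/\r?\n\z//;                 # strip LF or CRLF
                my $pos = rindex $line, $needle;      # last occurrence, if any
                ++$counter2 if $pos >= 0 && $pos == length($line) - length($needle);
            }

            close $fh;
            print "Num. Line : $counter - Occ : $counter2\n";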

Re: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 16, 2016 at 14:34 UTC
    Could you show a small, representative sample of the input, anonymized if necessary?
      As I'm working for evaluation purposes on public data at first, it's not an issue. You can find here both a 10 million line file with user passwords and an extract of a 2 GB dictionary, cut at 100 MB. http://john.fender.free.fr/Dev/PerlMonk/QueryOnPerl/
Re: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 18, 2016 at 01:26 UTC

    I'd appreciate it if someone would take this "big buffer" approach, adapt it to the test case, and get timings for it. I'm stuck on this small tablet so I can't test it myself.

    http://ideone.com/LzaQI0

    I don't even know how to paste it into this post, sorry

      Update: Changed the chunk_size option from '1m' to '24m'. The time drops down to 3.2 seconds via MCE with FS cache purged ( sudo purge ) before running on a Macbook Pro laptop. Previously, this was taking 6.2 seconds for chunk_size => '1m'. The time is ~ 1 second if the file resides in FS cache.

      Update: Added the 'm' modifier to the regex operation.

      Update: Ensuring the file does not live in FS cache, the time is 7.8 seconds running serially and 6.2 seconds running on many cores for the ~ 2 GB plain text file. Once in FS cache, the time is 5.4 seconds serially and 0.9 seconds via MCE.

      Update: The unzipping of the file met that the file resided in FS cache afterwards. One doesn't normally flush FS memory typically. But, I met to do so before running. I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly. The reason is that workers do not involved the manager process when reading.

      Anonymous Monk, the following is a parallel demonstration of the online code. Yes, reading line by line is not necessary; thus performance increases by 5x over the serial version. This is also faster than the previous parallel demonstrations by a large factor.

      The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The online serial demonstration completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads; seeing nearly 6x is really good, and I did not expect that.

      use strict;
      use warnings;

      use MCE::Flow;
      use MCE::Shared;

      my $counter1 = MCE::Shared->scalar( 0 );
      my $counter2 = MCE::Shared->scalar( 0 );

      mce_flow_f {
          chunk_size => '24m', max_workers => 8, use_slurpio => 1,
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;

          my $numlines   = $$chunk_ref =~ tr/\n//;
          my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

          $counter1->incrby( $numlines );
          $counter2->incrby( $occurances );
      }, "Dictionary2GB.txt";

      print "Num lines : ", $counter1->get(), "\n";
      print "Occurances: ", $counter2->get(), "\n";

        How do you handle a chunk that ends in the middle of the pattern? I did it by completing the partial line (see code line with comment "finish partial line").

        Thanks for the timings. If possible, would you please also get a time for the grep+wc on your machine so we can tell how both these solutions compare to it.

      The code is incomplete because a match could span two chunks.

      You need to seek back the longest possible match (here 8) before reading the next chunk.

      Actually the correct number is something like min(p, m)

      with p = chunksize - pos

      and m = the length of the longest possible match
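
      As an illustration of that overlap idea, here is a minimal untested sketch (counting matches only; the chunk size is an assumption and the file name is the one used elsewhere in the thread): carry the last m-1 bytes of each chunk into the next one so a match straddling a boundary is still seen, and count a match only if it ends past the carried bytes so nothing is counted twice. For this particular line-anchored search, completing the partial line as in the other replies is simpler.

          use strict;
          use warnings;

          my $chunk_size = 8 * 1024 * 1024;   # assumed chunk size
          my $m          = 8;                 # longest possible match: "123456\r\n"
          my $hits       = 0;
          my $tail       = '';                # carry-over from the previous chunk

          open my $fh, '<', 'Dictionary2GB.txt' or die "open: $!";

          while ( read( $fh, my $buf, $chunk_size ) ) {
              my $carry = length $tail;
              $buf = $tail . $buf;

              # require the newline explicitly so the anchor cannot match at a
              # chunk boundary; a match wholly inside the carried bytes was
              # already counted in the previous iteration
              while ( $buf =~ /123456\r?\n/g ) {
                  $hits++ if $+[0] > $carry;  # $+[0] is the end offset of this match
              }

              $tail = length($buf) >= $m ? substr( $buf, 1 - $m ) : $buf;
          }

          close $fh;
          print "Occ : $hits\n";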

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        The match is only within one line; that's the purpose of the line

        $_ .= <$fh> // '';

        It completes a partial line.

Re: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 21, 2016 at 15:26 UTC

    Update: The time is 2.2 seconds using the same demonstration below on a Mac running the upcoming MCE 1.706 release. Running with four workers also completes in 2.2 seconds. Basically, I have reached the underlying hardware limitation.

    Today, I looked at MCE to compare timings with the 2 GB plain text file residing in FS cache and not. Increasing the chunk_size value is beneficial, especially when the file does not exist in the OS-level FS cache.

    With an update to the code, simply increasing the chunk_size value from '1m' to '24m', the run now takes 3.2 seconds to complete.

    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
        chunk_size => '24m', max_workers => 8, use_slurpio => 1,
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;

        my $numlines   = $$chunk_ref =~ tr/\n//;
        my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );
    }, "Dictionary2GB.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

    One day, I will try another technique inside MCE to see if IO performance can be improved upon.

    Resolved.

      How fast is grep+wc on your machine?

        Grep and egrep run slowly on the Mac and I do not know why.

        wc -l   :  2.162 seconds
        grep -c : 45.316 seconds
      I already wanted to remark that since the FS is the bottleneck, I'm not sure parallelizing helps (there's only one FS).

      When comparing with grep/wc, please also compare the one-worker case, because grep shouldn't be parallelizing (AFAIK).

      BTW: while we never saw the bash script, I suppose we have to call wc separately to also get the total number of lines (which makes the comparison even more complicated, because that second command would need to read the file again).

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        Update: I am providing updated results because background processes were running previously. I rebooted my laptop and realized that things were running faster, which meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO (applies to use_slurpio => 1). Previously, I was unable to run below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

        I ran the same tests from a Linux VM via Parallels Desktop with the 2 GB plain text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the binary grep command runs much faster under Linux.

        ## FS cache purged inside Linux and on Mac OS X before running.

        wc -l       : 1.732 secs.  from virtual disk
        grep -c     : 1.912 secs.  from virtual disk
        total       : 3.644 secs.

        wc -l       : 1.732 secs.  from virtual disk
        grep -c     : 0.884 secs.  from FS cache
        total       : 2.616 secs.

        Perl script : 3.910 secs.  non-MCE using 1 core

                      MCE 1.705    MCE 1.706
        with MCE    : 4.357 secs.  4.015 secs.  using 1 core
        with MCE    : 3.228 secs.  2.979 secs.  using 2 cores
        with MCE    : 2.884 secs.  2.624 secs.  using 3 cores
        with MCE    : 2.908 secs.  2.501 secs.  using 4 cores

        ## Dictionary2GB.txt residing inside FS cache on Linux.

        wc -l       : 1.035 secs.
        grep -c     : 0.866 secs.
        total       : 1.901 secs.

        Perl script : 2.314 secs.  non-MCE using 1 core

                      MCE 1.705    MCE 1.706
        with MCE    : 2.344 secs.  2.337 secs.  using 1 core
        with MCE    : 1.349 secs.  1.345 secs.  using 2 cores
        with MCE    : 0.961 secs.  0.932 secs.  using 3 cores
        with MCE    : 0.820 secs.  0.775 secs.  using 4 cores

        On Linux, it takes at least 3 workers to run as fast as wc and grep combined with grep reading from FS cache.

        Below, the serial code and MCE code respectively.

        use strict;
        use warnings;

        my $size = 24 * 1024 * 1024;
        my ( $numlines, $occurances ) = ( 0, 0 );

        open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

        while ( read( $fh, my $b, $size ) ) {
            $b .= <$fh> unless ( eof $fh );
            $numlines   += $b =~ tr/\n//;
            $occurances += () = $b =~ /123456\r?$/mg;
        }

        close $fh;

        print "Num lines : $numlines\n";
        print "Occurances: $occurances\n";

        Using MCE for running on multiple cores.

        use strict;
        use warnings;

        use MCE::Flow;
        use MCE::Shared;

        my $counter1 = MCE::Shared->scalar( 0 );
        my $counter2 = MCE::Shared->scalar( 0 );

        mce_flow_f {
            chunk_size => '24m', max_workers => 4, use_slurpio => 1,
        },
        sub {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;

            my $numlines   = $$chunk_ref =~ tr/\n//;
            my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

            $counter1->incrby( $numlines );
            $counter2->incrby( $occurances );
        }, "/home/mario/Dictionary2GB.txt";

        print "Num lines : ", $counter1->get(), "\n";
        print "Occurances: ", $counter2->get(), "\n";

        Kind regards, Mario.

        Update: Added the serial code. I am happy that IO in MCE is not too far behind. One day, I will try another technique. IO aside, any CPU-intensive operations such as regexes do benefit from running with multiple workers.

        Yes, IO will only go as fast as the underlying IO capabilities. MCE does sequential IO, meaning only one worker reads at any given time. The regex operation benefits from having multiple workers. Eventually, IO becomes the bottleneck.

        1 worker : 9.437 secs.
        2 workers: 4.480 secs.
        3 workers: 3.248 secs.
        4 workers: 3.236 secs.
        8 workers: 3.240 secs.

        Below, removed counting and regex from the equation and running with 1 worker. It completes as fast as IO allows in 3.256 seconds.

        mce_flow_f {
            chunk_size => '24m', max_workers => 1, use_slurpio => 1,
        },
        sub { }, 'Dictionary2GB.txt';

        The following serial code, reader only and without MCE, takes 2.864 seconds to read directly from the PCIe-based SSD drive, not from FS cache.

        use strict;
        use warnings;

        my $size = 24 * 1024 * 1024;

        open my $fh, '<', 'Dictionary2GB.txt' or die "$!";

        while ( read( $fh, my $b, $size ) ) {
            $b .= <$fh> unless eof $fh;   # avoid reading past EOF
        }

        close $fh;
