Re: How to optimize a regex on a large file read line by line ?

by graff (Chancellor)
on Apr 16, 2016 at 16:29 UTC


in reply to How to optimize a regex on a large file read line by line ?

Not that this would make a big difference in terms of run-time, but you don't have to keep your own counter for the number of lines in the file. The predefined global variable $. does that for you (cf. the perlvar man page):
print "Num. Line : $. - Occ : $counter2\n";
A few other observations...

I fetched the "10-million-combos.txt.zip" file you cited in one of the replies above, and noticed that it contains just the one text file. In terms of benchmarking, you might find that a command-line operation like this:

unzip -p 10-million-combos.txt.zip | perlscript
is likely to be faster than having the perl script read an uncompressed version of the file from disk, because piping output from "unzip -p" involves fetching just 23 MB from disk, as opposed to 112 MB to read the uncompressed version. (Disk access time is always a factor for stuff like this.)
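(Here "perlscript" just stands for whatever counting script gets the data; a minimal sketch of one that reads the uncompressed stream from the pipe on STDIN:)

#!/usr/bin/perl
use strict;
use warnings;

# "unzip -p" writes the decompressed text to the pipe, so we read
# it line by line from STDIN and never touch an uncompressed file.
my $matches = 0;
while (<STDIN>) {
    $matches++ if /123456$/;
}
print "$. lines, $matches matches\n";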

Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/. UPDATE: actually, there would be 2 matches on a Windows system, and I find those two on my machine if I search for /123456\r\n$/.
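(One way to get the same count on both platforms, sketched here but not tested against your data, is to let the regex tolerate an optional carriage return before the end of line:)

# matches whether the file has Unix (\n) or Windows (\r\n) line endings
$counter2++ if /123456\r?$/;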

I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline, assuming that this would be the fastest possible way to do your regex search-and-count, but then I tried it out on your actual data and got a surprise (running on a MacBook Pro, OS X 10.10.5, 2.2 GHz Intel Core i7, 4 GB RAM):

$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.30 real         3.25 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.23 real         3.22 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.18 real         3.17 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.89 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.93 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.93 real         1.90 user         0.02 sys
I ran each command three times in rapid succession, to check for timing differences due to system cache behavior and other unrelated variables. Perl is consistently faster by about 33% (and can report total line count along with match count, which the grep utility cannot do).

(If I remove the "$" from the regex, looking for 123456 anywhere on any line, I find three matches, and the run times are just a few percent longer overall.)
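(To time the anchored vs. unanchored patterns in-process, without shell and pipe overhead muddying the numbers, the core Benchmark module works; a sketch, assuming an uncompressed copy of the file sits in the current directory:)

use strict;
use warnings;
use Benchmark qw(timethese);

my $file = '10-million-combos.txt';   # assumed local, uncompressed copy

# Run each variant three times, as in the shell tests above.
timethese( 3, {
    anchored   => sub { count_matches($file, qr/123456$/) },
    unanchored => sub { count_matches($file, qr/123456/)  },
});

sub count_matches {
    my ($path, $re) = @_;
    open my $fh, '<', $path or die "$path: $!";
    my $n = 0;
    while (<$fh>) { $n++ if /$re/ }
    close $fh;
    return $n;
}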

Replies are listed 'Best First'.
Re^2: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 16, 2016 at 17:55 UTC

    "The predefined global variable $. does that for you"

    Wasn't aware of this trick, thanks!

    "Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/."

    Ahem, sounds like I did something wrong while zipping the file. The 19x MB file containing 10 million passwords is now updated the right way. You will find 10,000,000 lines in it, and 61,466 matching the regex 123456$.

    "unzip -p 10-million-combos.txt.zip | perlscript"

    Currently I'm working on the txt file only, but it's interesting. I ran your test like this:

    echo 1:%time%
    unzip -p 10-million-combos.zip | grep 123456$ | wc -l
    echo 2:%time%
    grep 123456$ 10-million-combos.txt | wc -l
    echo 3:%time%
    pause

    Result :

    1:19:16:46,11
    61466
    2:19:16:48,43
    61466
    3:19:16:49,00

    0,58 s reading the plain text directly, 2,27 s with the zip file piped.

    Now with your command line:

    zip piped : 3,89
    unzip -p "C:\Users\admin\Desktop\10-million-combos.zip" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"
    plain text : 5,16
    type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"
    perl direct : 2,29
    perl "demo.pl"

    Fastest on my side stays the direct access to the plain-text file, whether using grep or perl. Amazing to see the piped unzip with perl go faster than plain-text access with a one-liner... The shell is strange sometimes...

    "I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline"

    I'm using the one you can find in the Unix utils; I suppose it's the GNU one ported to Windows. --version gives me: grep (GNU grep) 2.4.2.

    Now grep vs. perl:
    echo %time%& grep 123456$ C:\Users\admin\Desktop\10-million-combos.txt | wc -l& echo %time%
    echo %time%& type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"& echo.& echo %time%
    echo %time%& perl demo.pl& echo %time%

    Gives me:

    19:43:28,91/61466/19:43:29,51 for grep (0,6)
    19:45:29,51/61466/19:45:34,71 for perl (5,2)
    19:46:13,27/61466/19:46:15,47 for perl (direct) (2,2)
      Thanks for showing your comparison of the unzip pipeline vs. reading uncompressed text. I had said the former would be faster (because of less reading from disk), but without actually testing it. (I think I must have encountered at least a couple of situations in the past where some process finished more quickly when I read compressed data from disk rather than uncompressed, but I don't know what may have been different in those cases.)

      Having now tested it for this situation (multiple times in quick succession to check for consistency), the difference in timing was negligible or slightly favoring reading the uncompressed file, so it seems my initial idea about the role of disk access was wrong: either it really doesn't make any difference, or else whatever difference it makes is washed out by the added overhead of the extra unzip process and/or the pipeline itself.

      (The perl one-liner was still faster than the compiled "grep" utility on my machine, but YMMV - different machines will have different versions / compilations of both Perl and grep.)

Re^2: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 16, 2016 at 18:00 UTC

    I think the matter comes from the huge file. How long did the same request take on your computer on the 1.9 GB dictionary?

    http://mab.to/tbT8VsPDm
