
Not that this would make a big difference in terms of run-time, but you don't have to keep your own counter for the number of lines in the file. The predefined global variable $. does that for you (cf. the perlvar man page):

print "Num. Line : $. - Occ : $counter2\n";
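For context, here is a minimal sketch of a counting loop that relies on $. rather than a hand-kept line counter. The sub name count_matches and the in-memory sample data are invented for illustration; only $counter2 and the print statement come from your script.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines matching $re on an open filehandle.
# Returns (total lines read, number of matching lines).
sub count_matches {
    my ($fh, $re) = @_;
    my $counter2 = 0;
    my $lines    = 0;
    while (my $line = <$fh>) {
        $counter2++ if $line =~ $re;
        $lines = $.;    # $. holds the current line number of the last-read filehandle
    }
    return ($lines, $counter2);
}

# Made-up sample data in an in-memory filehandle, just for demonstration.
my $sample = "alice 123456\nbob 654321\ncarol 123456\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";
my ($lines, $counter2) = count_matches($fh, qr/123456$/);
close($fh);

print "Num. Line : $lines - Occ : $counter2\n";
```

In a real script you would pass a handle opened on the actual file (or STDIN) instead of the sample string.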

A few other observations...

I fetched the "10-million-combos.txt.zip" file you cited in one of the replies above, and noticed that it contains just the one text file. In terms of benchmarking, you might find that a command-line operation like this:

unzip -p 10-million-combos.txt.zip | perlscript

is likely to be faster than having the perl script read an uncompressed version of the file from disk, because piping output from "unzip -p" involves fetching just 23 MB from disk, as opposed to 112 MB to read the uncompressed version. (Disk access time is always a factor for stuff like this.)

Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/. UPDATE: actually, there would be 2 matches on a Windows system, because two lines in the file end in "\r\n" rather than a bare "\n"; I find those two on my machine if I search for /123456\r\n$/.
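To illustrate the line-ending issue, here is a small sketch with made-up sample lines (not taken from your file). The helper name count_with and the /123456\r?$/ variant are my own suggestions, not something from your script; the optional "\r" is one way to make the same pattern work for both kinds of line ending.

```perl
use strict;
use warnings;

# Made-up sample lines: one with a Unix ending, one with a Windows (CRLF) ending.
my @lines = ("user1 123456\n", "user2 123456\r\n");

# Hypothetical helper: count the lines that match a given pattern.
sub count_with {
    my ($re, @input) = @_;
    my $n = grep { $_ =~ $re } @input;   # grep in scalar context returns a count
    return $n;
}

# "$" only matches at the very end or just before a final "\n",
# so the "\r" in a CRLF line blocks the match.
my $strict = count_with(qr/123456$/, @lines);

# Allowing an optional "\r" accepts both line endings.
my $tolerant = count_with(qr/123456\r?$/, @lines);

print "strict: $strict, tolerant: $tolerant\n";
```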

I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline, assuming that this would be the fastest possible way to do your regex search-and-count, but then I tried it out on your actual data and got a surprise (running on a MacBook Pro, OS X 10.10.5, 2.2GHz Intel Core i7, 4GB RAM):

$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.30 real         3.25 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.23 real         3.22 user         0.01 sys
$ unzip -p 10-million-combos.txt.zip | time grep -c 123456$
0
        3.18 real         3.17 user         0.01 sys

$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.89 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.96 real         1.93 user         0.02 sys
$ unzip -p 10-million-combos.txt.zip | time perl -ne 'BEGIN{$n=0} $n++ if /123456$/; END{print "$. lines, $n matches\n"}'
9835513 lines, 0 matches
        1.93 real         1.90 user         0.02 sys

I ran each command three times in rapid succession, to check for timing differences due to system cache behavior and other unrelated variables. Perl consistently finishes in about 40% less time than grep (roughly 1.95 seconds versus 3.2 seconds), and it can report total line count along with match count in the same pass, which "grep -c" by itself does not do.

(If I remove the "$" from the regex, looking for 123456 anywhere on any line, I find three matches, and the run times are just a few percent longer overall.)

In reply to Re: How to optimize a regex on a large file read line by line ? by graff
in thread How to optimize a regex on a large file read line by line ? by John FENDER
