speed up one-line "sort|uniq -c" perl code

relaxed137 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: speed up one-line "sort\|uniq -c" perl code by l2kashe (Deacon) on Apr 10, 2003 at 01:49 UTC
In the testing I did, apparently an inline split is faster than using the -F command line arg... here's my results # File stats ;) code/monks> wc -l input.list 5202 input.list # my way code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\\|/))[9]}++' input.list 846 10.155.240.2 943 10.155.240.3 3413 10.155.240.1 0.160u 0.000s 0:00.15 106.6% 0+0k 0+0io 339pf+0w # your way code/monks> time perl -n -a -F\\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 846 10.155.240.2 943 10.155.240.3 3413 10.155.240.1 0.310u 0.000s 0:00.30 103.3% 0+0k 0+0io 336pf+0w [download] As the file got bigger the gap widened.. I started with maybe 100 lines or so, and the difference was non-existant, but once I hit 1000, my way was maybe 25% faster, at 5200 lines my way appears to be 50% faster... While I was posting I decided to go ahead and create a really big file just for fun.. heres the output # again stats... code/monks> ls -l input.list -rw-r--r-- 1 ericmc users 2477055 2003-04-09 21:42 input.list code/monks> wc -l input.list 100202 input.list # my way code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\\|/))[9]}++' input.list 16046 10.155.240.2 17948 10.155.240.3 66208 10.155.240.1 2.720u 0.020s 0:02.73 100.3% 0+0k 0+0io 339pf+0w # your way.. code/monks> time perl -n -a -F\\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 16046 10.155.240.2 17948 10.155.240.3 66208 10.155.240.1 5.610u 0.010s 0:05.62 100.0% 0+0k 0+0io 336pf+0w [download] So it doesnt look like it matters after a certain size, but at least it shaves that little bit extra off.. ;) Update: Im noticing that you aren't actually sorting anything in perl.. it could be that your time is getting chewed up however you are doing your sorting.. I would move from a one liner to say.. #!/usr/bin/perl # update2: this is parse.pl while (<>) { $f = ( split('\\|') )[9]; # this needs to get optomized.. ($data{$f}->[0] = $f) =~ s/\.//g unless( $data{$f} ); $data{$f}->[1]++; } for ( sort { $data{$a}->[0] <=> $data{$b}->[0] } keys %data ) { print "$data{$_}->[1]\t$_\n"; } Sample run, Note after each run I go through and change one of the lin +es so that we arent running into OS caching. # file statscode/monks> wc -l input.list + 100202 input.list code/monks> ls -l input.list -rw-r--r-- 1 ericmc users 2410847 2003-04-09 22:11 input.list # and the first way code/monks> time perl -n -a -F\\\\| -e'END { print "$v\t$k\n" while($k, +$v)=each%h} $h{$F[9]}++' input.list 16046 10.155.240.2 17948 10.155.240.3 66208 39.39.39.39 5.490u 0.000s 0:05.49 100.0% 0+0k 0+0io 336pf+0w # now for my first iteration without sorting.. code/monks> time perl -n -a -e'END { print "$v\t$k\n" while($k,$v)=eac +h%h} $h{(split(/\\|/))[9]}++' input.list 17948 10.155.240.3 66208 39.39.39.39 16046 40.40.40.40 2.750u 0.000s 0:02.74 100.3% 0+0k 0+0io 339pf+0w # and now with the code from above... code/monks> time ./parse.pl input.list 66208 39.39.39.39 16046 40.40.40.40 17948 41.41.41.41 2.190u 0.010s 0:02.26 97.3% 0+0k 0+0io 349pf+0w [download] YMMV and you may want to sort on the count of hits as opposed to the IP, but thats a simple replace in the sort.. /* And the Creator, against his better judgement, wrote man.c */	[reply] [d/l] [select]
Re: speed up one-line "sort\|uniq -c" perl code by tachyon (Chancellor) on Apr 10, 2003 at 01:44 UTC
Essentially you have: `#!/usr/bin/perl my %h; my @res; # declare outside the loop open F, $ARGV[0] or die $!; while(<F>) { @res = split /\\|/; $h{$res[9]}++; } close F; print "$h{$_}\t$_\n" for keys %h;` [download] You may find this faster (maybe not) as we declare the @res array outside the loop which avoids creating an anonymous array to hold the split values, and then destroying it after each iteration of the loop. I suspect this may be happening in your one liner as even on a dog slow old PII 233 I have similar scripts that will parse megabyte squid/httpd log files in a few seconds - definitely not minutes. Alternatively slurping the file into memory and using a regex to process it in one pass may be faster it you have a major disk IO issue as the choke rather than CPU cycles. This is a spend more memory to get more speed approach. It is typically much faster to read in large chunks of data than iterate over a file linewise. `#!/usr/bin/perl { local $/; open F, $ARGV[0] or die $!; $all = <F>; close F; @ips = $all =~ m/^(?:[^\\|]+\\|){9}([^\\|]+)/gm; undef $all; # try to reclaim the memory } my %h; $h{$_}++ for @ips; print "$h{$_}\t$_\n" for keys %h;` [download] If neither of these work code it in C or get better hardware or just make a cup of coffee while it runs or nice it up to the max so it takes over the entire system. cheers tachyon s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print	[reply] [d/l] [select]
Re: speed up one-line "sort\|uniq -c" perl code by perrin (Chancellor) on Apr 10, 2003 at 02:32 UTC
I know this isn't really the answer you were looking for, but I think you should consider getting a better sort program. I don't think there are any limitations on the GNU one, and it is typically faster than doing the same thing in Perl.	[reply]
Re^2: speed up one-line "sort\|uniq -c" perl code (speed) by tye (Sage) on Apr 10, 2003 at 16:47 UTC
sort needs both time and space to perform the sort no matter how cleverly implemented. I find it hard to imagine a system that is so poorly configured that it can't handle sorting a paultry 500kB file. But I don't think that really matters in this particular case. There is a reason that "sort -u" came to be. It is much slower to sort all 57000 instances of several IPs and then throw all but one of each away. So I think "sort \| uniq -c" would be much slower than using Perl. Unfortunately, it doesn't appear that even GNU sort has bothered to implement a -u option that counts the duplicates. - tye	[reply]
Re: Re^2: speed up one-line "sort\|uniq -c" perl code (speed) by relaxed137 (Acolyte) on Apr 10, 2003 at 17:53 UTC
Thanks for making me realize a typo. The file that I am parsing is 500MB, not 500kB.... That's why sort freaks out.	[reply]
Re: speed up one-line "sort\|uniq -c" perl code by Jasper (Chaplain) on Apr 10, 2003 at 12:57 UTC
Just a silly optimisation, really, and I doubt one where your code is losing out greatly, but: `print $_ , $i++ % 2 ? "\n" : "\t" for %h;` [download] Is a bit faster than `each`ing through the hash. Jasper	[reply] [d/l] [select]
Re: speed up one-line "sort\|uniq -c" perl code by BrowserUk (Patriarch) on Apr 11, 2003 at 04:10 UTC
In some fairly crude testing on a test file of 500,000 randomly generated lines (18MB) to fit the pattern of data you described, this seems to run about 400% quicker. 40 seconds rather 160. `no warnings; $^W=0; open my $in, $ARGV[0] or die "Couldn't open $ARGV[0]:$!"; my ($buffer, %h) = ''; keys %h = 1024*500; while (sysread($in, $buffer, 16384, length $buffer)) { $h{$1}++ while $buffer =~ m[^(?:.+?\\|){9}([^\|]+)\\|]mg; $buffer = substr($buffer, rindex($buffer, "\n")); } print scalar keys %h;` [download] You may need to adjust the regex, and you might squeeze some more out by playing with the read size and/or the hash preallocation. In my tests, I had variable results from both, but the differences were of within the bounds of run-to-run error. Especially the latter which I am not really sure how the number of buckets relates to keys. Examine what is said, not who speaks. 1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong. 2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible 3) Any sufficiently advanced technology is indistinguishable from magic. Arthur C. Clarke.	[reply] [d/l]
Re^2speed up one-line "sort\|uniq -c" perl code by relaxed137 (Acolyte) on Apr 16, 2003 at 00:28 UTC
this looks like it is going to end up being faster. I'm doing some testing with speed, but it looks like your numbers are going to be comparible with mine. THANKS!	[reply]
Re: speed up one-line "sort\|uniq -c" perl code by maksl (Pilgrim) on Apr 14, 2003 at 19:46 UTC
you would gain a speed by using lesser processes in your bash one liner,instead of "sort \|uniq": `sort -u` [download] try to use nl for the line number: `sort -u \| nl -ba` [download] Update: bad answer .. this question was not asked .. thx Aristotle	[reply] [d/l] [select]
Re^2: speed up one-line "sort\|uniq -c" perl code (not the OP's question) by Aristotle (Chancellor) on Apr 14, 2003 at 20:27 UTC
Uhm, he's not at all interested in a line number, what he wants is an occurence count per word. Makeshifts last the longest.	[reply]


Come for the quick hacks, stay for the epiphanies.
	PerlMonks