Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Sorting IP addresses, lots of them, quickly

by gloryhack (Deacon)
on May 16, 2007 at 07:55 UTC ( [id://615707]=CUFP: print w/replies, xml ) Need Help??

Got a huge pile of IP addresses to sort and don't have all day to do it?

Update This is just a variation on the GRT, really, but it's a tad faster than the one in their paper.

use Socket; sub gloryhackish { my @in = @addresses; my @sorted = map { $_ = inet_ntoa($_) } (sort (map {$_ = inet_aton($_)} @in)); return @sorted; } __END__ Testing with 11818 addresses from my local blacklist: Benchmark: running gloryhackish, grt, schwartzian for at least 10 CPU +seconds... gloryhackish: 11 wallclock secs (10.50 usr + 0.06 sys = 10.56 CPU) @ +22.25/s (n=235) grt: 11 wallclock secs (10.57 usr + 0.03 sys = 10.60 CPU) @ 16.32/s ( +n=173) schwartzian: 11 wallclock secs (11.02 usr + 0.02 sys = 11.04 CPU) @ +8.42/s (n=93) Rate schwartzian grt gloryhackish schwartzian 8.42/s -- -48% -62% grt 16.3/s 94% -- -27% gloryhackish 22.3/s 164% 36% --

Replies are listed 'Best First'.
Re: Sorting IP addresses, lots of them, quickly
by salva (Canon) on May 16, 2007 at 09:52 UTC
    and faster...
    use Benchmark qw(cmpthese); use Socket; use Sort::Key qw(ukeysort keysort); my $n = 1000000; my @address = map { join ('.', map {int rand 256 } 0..3) } 0..$n; print "\@address populated\n"; sub gloryhackish { my @in = @address; my @sorted = map { $_ = inet_ntoa($_) } (sort (map {$_ = inet_aton($_)} @in)); } sub ks { my @sorted = keysort { inet_aton($_) } @address; } sub uks { my @sorted = ukeysort { /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/; ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 } @ +address; } cmpthese -1, { gloryhackish => \&gloryhackish, keysort => \&ks, ukeysort => \&uks };
    on my PC, the results for 1 million elements are:
    s/iter gloryhackish keysort ukeysort gloryhackish 24.4 -- -29% -48% keysort 17.2 42% -- -26% ukeysort 12.7 92% 35% --
      Ooh, I like it! ++ and then some!

      I had to install Sort::Key to give it a whirl, and I can't manage to get the same 92% number, but I still like it just fine. Here's what I get, best of three, with ST and GRT thrown into the mix:

      @address populated Benchmark: running gloryhackish, grt, ks, schwartzian, uks for at leas +t 30 CPU seconds... gloryhackish: 34 wallclock secs (33.28 usr + 0.19 sys = 33.47 CPU) @ + 0.12/s (n=4) grt: 34 wallclock secs (34.62 usr + 0.06 sys = 34.68 CPU) @ 0 +.12/s (n=4) ks: 34 wallclock secs (34.04 usr + 0.03 sys = 34.07 CPU) @ 0 +.18/s (n=6) schwartzian: 55 wallclock secs (54.62 usr + 0.11 sys = 54.73 CPU) @ +0.04/s (n=2) (warning: too few iterations for a reliable count) uks: 36 wallclock secs (35.53 usr + 0.24 sys = 35.77 CPU) @ 0 +.17/s (n=6) s/iter schwartzian grt gloryhackish uks + ks schwartzian 27.4 -- -68% -69% -78% + -79% grt 8.67 216% -- -3% -31% + -35% gloryhackish 8.37 227% 4% -- -29% + -32% ukeysort 5.96 359% 45% 40% -- + -5% keysort 5.68 382% 53% 47% 5% + --

      Still, even at 47% I'm suitably impressed and will be using keysort. It strikes me as odd that in your results keysort was bested by ukeysort... hardware differences, perhaps? (Mine's an AMD64 X2 4400+ w/2G DDR2 RAM.)

      Thanks for pointing Sort::Key out to me!

      Edit: Running on my original list of 11818 addresses, gloryhackish comes in 31% faster than ukeysort but 47% slower than keysort. Hmmmm.... curious, but still convincing.

Re: Sorting IP addresses, lots of them, quickly
by jdporter (Paladin) on May 16, 2007 at 15:26 UTC

    Strange... No matter what I do, I don't get benchmark results that show gloryhackish performing anywhere near as well as grt. Maybe it's just hardware...

    Anyway, try throwing this into the mix:

    sub inplace { my @in = @address; $_ = inet_aton($_) for @in; @in = sort @in; $_ = inet_ntoa($_) for @in; my @sorted = @in; }

    For me, this performs almost as well (but not quite) as grt. It's probably the assignment from sort that's the killer. If only perl had an in-place sort option...

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
      If only perl had an in-place sort option...

      Actually it has, @in = sort @in; is optimized as an in-place sort!

        Do you have any documentation for that? The sort manpage doesn't mention any such thing.

        A word spoken in Mind will reach its own level, in the objective world, by its own weight
      I'm seeing what appears to be a hardware dependence, too. On an old dual P-II/450 with 1G of RAM, I get (with my original 11818 address list):
      Rate ukeysort grt schwartzian gloryhackish + keysort ukeysort 1.63/s -- -20% -21% -39% + -70% grt 2.04/s 25% -- -0% -23% + -62% schwartzian 2.05/s 26% 0% -- -23% + -62% gloryhackish 2.66/s 63% 30% 30% -- + -51% keysort 5.41/s 232% 165% 163% 103% + --

      Somehow, ukeysort went to the back of the line, with GRT not being a whole lot zippier and inconsequentially slower than ST. On a similar machine with just a single P-II/500 and 640M of RAM, though:

      Rate schwartzian ukeysort grt gloryhackish + keysort schwartzian 2.28/s -- -22% -27% -42% + -71% ukeysort 2.92/s 28% -- -7% -25% + -63% grt 3.14/s 38% 8% -- -20% + -60% gloryhackish 3.91/s 72% 34% 25% -- + -50% keysort 7.86/s 245% 169% 151% 101% + --

      GRT beats ukeysort again and is well ahead of ST. But on an Athlon 3900+ w/1G RAM:

      Rate schwartzian grt ukeysort gloryhackish + keysort schwartzian 6.83/s -- -37% -46% -52% + -76% grt 10.9/s 59% -- -13% -24% + -61% ukeysort 12.6/s 84% 16% -- -12% + -55% gloryhackish 14.2/s 108% 31% 13% -- + -50% keysort 28.2/s 312% 159% 124% 98% + --

      None of these machines are memory constrained, none are swapping, and none are particularly busy doing other things. All are running the same version of Debian with the stock kernels, running the exact same test code just copied from machine to machine. keysort is always out front, but unlike my AMD64 the gloryhackish stays in the #2 spot every time, while the others wiggle around depending upon the machine used.

      'Tis a puzzle, seems to point toward something at a very low level being the source of the disparate numbers. The only conclusion I can come up with is that keysort is the way to go when a large number of IP addresses must be sorted in a time constrained environment, gloryhackish might be a good choice if no non-core module dependencies are allowed, even though it might not always be the second best choice. It's good to know about Sort::Key, in any event.

        The ukeysort variant is slower than the others because converting from the IPv4 dot notation to an unsigned int in Perl is much, much, much, slower than calling inet_aton. If an XSUB were used to make the conversion ukeysort would be the fastest!

        But anyway, as the data set size increases, the conversion cost from IPv4 to uint would become less decisive than the comparison function used inside the sort, and at some point the ukeysort solution would become the fastest of all the alternatives (and BTW, it also uses less memory! :-)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: CUFP [id://615707]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (6)
As of 2024-03-28 09:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found