Your skill will accomplishwhat the force of many cannot PerlMonks

### Fastest way to lookup a point in a set

by eyepopslikeamosquito (Bishop)
 on Aug 05, 2017 at 05:47 UTC Need Help??

eyepopslikeamosquito has asked for the wisdom of the Perl Monks concerning the following question:

I have a large set of points. Each point has an x and y coordinate. The range of values for x and y are -2,147,483,648 to 2,147,483,647 (i.e. 32-bit signed int). I am running a 64-bit version of Perl; this code is not required to run in a 32-bit environment. I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.

The most obvious way I could think of to do this is in Perl is with a hash. My first attempt is shown below. For the unique hash key, I tried three different ways to do it:

• String: \$x . ':' . \$y
• pack/unpack: pack "ii", \$x, \$y
• 64-bit int: (\$y << 32) | (\$x & 0xFFFFFFFF)
The string was the fastest in my simple benchmark shown below.

```Benchmark: timing 200000 iterations of Big, Pak, Str...
Big: 6 wallclock secs ( 5.64 usr + 0.00 sys = 5.64 CPU) @ 35454.71/s
+(n=200000)
Pak: 4 wallclock secs ( 4.47 usr + 0.00 sys = 4.47 CPU) @ 44752.74/s
+(n=200000)
Str: 4 wallclock secs ( 3.95 usr + 0.00 sys = 3.95 CPU) @ 50594.49/s
+(n=200000)

Suggestions for a faster lookup are welcome.

```use strict;
# use warnings;
use Benchmark qw(timethese);

my @points = (
[  0,  0 ],
[ -1, -2 ],
[  1,  2 ],
[ -1,  2 ],
[  1, -2 ],
[  0,  1 ],
[  1,  0 ],
[ -1,  0 ],
[  0, -1 ],
[  2147483647,  2147483647 ],
[  2147483647, -2147483647 ],
[ -2147483647,  2147483647 ],
[ -2147483647, -2147483647 ],
[  -1,  2147483647 ],
[  2147483647, 1 ],
[  -2,  2147483646 ],
[  1,   -2147483647 ],
[  1234561,  1234562 ],
[  1234563,  -1234564 ],
[  -1234565,  1234566 ],
[  -1234567,  -1234568 ],
[  10,  11 ],
[  11,  12 ],
[  12,  13 ],
[  13,  14 ],
[  14,  15 ],
[  15,  16 ],
[  16,  17 ],
[  17,  18 ],
[  18,  19 ],
[  19,  20 ],
[  1001,  1002 ],
[  1003,  1004 ],
[  1005,  1006 ],
[  1007,  1008 ],
[  1009,  1010 ],
[  1011,  1012 ],
[  1013,  1014 ],
[  1015,  1016 ],
[  1017,  1018 ],
[  1019,  1020 ],
[  -1001,  -1002 ],
[  -1003,  -1004 ],
[  -1005,  -1006 ],
[  -1007,  -1008 ],
[  -1009,  -1010 ],
[  -1011,  -1012 ],
[  -1013,  -1014 ],
[  -1015,  -1016 ],
[  -1017,  -1018 ],
[  -1019,  -1020 ],
[  99910,  99911 ],
[  99911,  99912 ],
[  99912,  99913 ],
[  99913,  99914 ],
[  99914,  99915 ],
[  99915,  99916 ],
[  99916,  99917 ],
[  99917,  99918 ],
[  99918,  99919 ],
[  99919,  99920 ],
[  -99910,  -99911 ],
[  -99911,  -99912 ],
[  -99912,  -99913 ],
[  -99913,  -99914 ],
[  -99914,  -99915 ],
[  -99915,  -99916 ],
[  -99916,  -99917 ],
[  -99917,  -99918 ],
[  -99918,  -99919 ],
[  -99919,  -99920 ],
[  1099910,  1099911 ],
[  1099911,  1099912 ],
[  1099912,  1099913 ],
[  1099913,  1099914 ],
[  1099914,  1099915 ],
[  1099915,  1099916 ],
[  1099916,  1099917 ],
[  1099917,  1099918 ],
[  1099918,  1099919 ],
[  1099919,  1099920 ],
[  -1099910,  -1099911 ],
[  -1099911,  -1099912 ],
[  -1099912,  -1099913 ],
[  -1099913,  -1099914 ],
[  -1099914,  -1099915 ],
[  -1099915,  -1099916 ],
[  -1099916,  -1099917 ],
[  -1099917,  -1099918 ],
[  -1099918,  -1099919 ],
[  -1099919,  -1099920 ],
[  91099910,  91099911 ],
[  91099911,  91099912 ],
[  91099912,  91099913 ],
[  91099913,  91099914 ],
[  91099914,  91099915 ],
[  91099915,  91099916 ],
[  91099916,  91099917 ],
[  91099917,  91099918 ],
[  91099918,  91099919 ],
[  91099919,  91099920 ],
[  -91099910,  -91099911 ],
[  -91099911,  -91099912 ],
[  -91099912,  -91099913 ],
[  -91099913,  -91099914 ],
[  -91099914,  -91099915 ],
[  -91099915,  -91099916 ],
[  -91099916,  -91099917 ],
[  -91099917,  -91099918 ],
[  -91099918,  -91099919 ],
[  -91099919,  -91099920 ],
[  491099910,  491099911 ],
[  491099911,  491099912 ],
[  491099912,  491099913 ],
[  491099913,  491099914 ],
[  491099914,  491099915 ],
[  491099915,  491099916 ],
[  491099916,  491099917 ],
[  491099917,  491099918 ],
[  491099918,  491099919 ],
[  491099919,  491099920 ],
[  -491099910,  -491099911 ],
[  -491099911,  -491099912 ],
[  -491099912,  -491099913 ],
[  -491099913,  -491099914 ],
[  -491099914,  -491099915 ],
[  -491099915,  -491099916 ],
[  -491099916,  -491099917 ],
[  -491099917,  -491099918 ],
[  -491099918,  -491099919 ],
[  -491099919,  -491099920 ],
);
my \$npoints = @points;

sub str_hash {
# print "string_hash---------------\n";
my %cells;
# insert the points into the hash
for my \$p (@points) {
my \$h = \$p->[0] . ':' . \$p->[1];
my ( \$x, \$y ) = split ':', \$h;
# print "x='\$p->[0]' y='\$p->[1]' h='\$h' (x='\$x' y='\$y')\n";
if (\$x != \$p->[0]) { die; }
if (\$y != \$p->[1]) { die; }
\$cells{\$h} = undef;
# ++\$cells{\$h};
}
scalar(keys %cells) == \$npoints or die;
# lookup each points in the hash
for my \$p (@points) {
my \$h = \$p->[0] . ':' . \$p->[1];
exists \$cells{\$h} or die;
}
exists \$cells{'notfound'} and die;
exists \$cells{'notfound2'} and die;
exists \$cells{'notfound3'} and die;
return \%cells;
}

sub big_hash {
# print "bigint_hash---------------\n";
my %cells;
# insert the points into the hash
for my \$p (@points) {
my \$h = (\$p->[1] << 32) | (\$p->[0] & 0xFFFFFFFF);
my \$x =  \$h & 0x00000000FFFFFFFF;
my \$y = (\$h & 0xFFFFFFFF00000000) >> 32;
if (\$x >> 31) { \$x -= 0xFFFFFFFF + 1 }
if (\$y >> 31) { \$y -= 0xFFFFFFFF + 1 }
# print "x='\$p->[0]' y='\$p->[1]' h='\$h' (x='\$x' y='\$y')\n";
if (\$x != \$p->[0]) { die; }
if (\$y != \$p->[1]) { die; }
\$cells{\$h} = undef;
# ++\$cells{\$h};
}
scalar(keys %cells) == \$npoints or die;
# lookup each point in the hash
for my \$p (@points) {
my \$h = (\$p->[1] << 32) | (\$p->[0] & 0xFFFFFFFF);
exists \$cells{\$h} or die;
}
exists \$cells{'notfound'} and die;
exists \$cells{'notfound2'} and die;
exists \$cells{'notfound3'} and die;
return \%cells;
}

sub pak_hash {
# print "unpack_hash---------------\n";
my %cells;
# insert the points into the hash
for my \$p (@points) {
my \$h = pack "ii", \$p->[0], \$p->[1];
my ( \$x, \$y ) =  unpack "ii", \$h;
# print "x='\$p->[0]' y='\$p->[1]' h='\$h' (x='\$x' y='\$y')\n";
if (\$x != \$p->[0]) { die; }
if (\$y != \$p->[1]) { die; }
\$cells{\$h} = undef;
# ++\$cells{\$h};
}
scalar(keys %cells) == \$npoints or die;
# lookup each point in the hash
for my \$p (@points) {
my \$h = pack "ii", \$p->[0], \$p->[1];
exists \$cells{\$h} or die;
}
exists \$cells{'notfound'} and die;
exists \$cells{'notfound2'} and die;
exists \$cells{'notfound3'} and die;
return \%cells;
}

sub str_look {
my \$cells = shift;
for my \$p (@points) {
# my \$h = \$p->[0] . ':' . \$p->[1];
exists \$cells->{\$p->[0] . ':' . \$p->[1]} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

sub big_look {
my \$cells = shift;
for my \$p (@points) {
# my \$h = (\$p->[1] << 32) | (\$p->[0] & 0xFFFFFFFF);
exists \$cells->{(\$p->[1] << 32) | (\$p->[0] & 0xFFFFFFFF)} or die
+;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

sub pak_look {
my \$cells = shift;
for my \$p (@points) {
# my \$h = pack "ii", \$p->[0], \$p->[1];
exists \$cells->{pack "ii", \$p->[0], \$p->[1]} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

my \$str_ref = str_hash();
my \$big_ref = big_hash();
my \$pak_ref = pak_hash();

timethese 200000, {
Str => sub { str_look(\$str_ref) },
Big => sub { big_look(\$big_ref) },
Pak => sub { pak_look(\$pak_ref) },
};

References

Replies are listed 'Best First'.
Re: Fastest way to lookup a point in a set
by BrowserUk (Pope) on Aug 05, 2017 at 06:16 UTC

For pure speed, nothing will beat Perl's hashes. You can do it faster in C, but the penalty of calling in and out from Perl will alway slow you down whatever you try.

The real problem is that you'll rapidly run out of space at which point you'll compromise from fastest, to get fastest with less space.

At that point you'll be looking at some form of sparse bitmap of which there are many algorithms but none that cannot be undone by uncooperative data, so you need to get to know the distribution of your data and tailor your algorithms accordingly. If your data is truly random, then you may be out of luck.

A supersearch for my name and judy arrays will turn up a compact, single file version of the Judy array code that compiled clean and worked very well for that application. Still slower than hashes, but far more compact. Just hope you don't find any bugs, because the Judy code is horribly complex.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Fastest way to lookup a point in a set
by marioroy (Parson) on Aug 05, 2017 at 06:57 UTC

Hello eyepopslikeamosquito,

From testing using Perl 5.18.2, join(':', @\$p) runs faster than \$p->[0] .':'. \$p->[1]. Although, it runs only slightly faster on newer Perl releases.

```sub str_look {
my \$cells = shift;
for my \$p (@points) {
# my \$h = \$p->[0] . ':' . \$p->[1];
# exists \$cells->{\$p->[0] . ':' . \$p->[1]} or die;
exists \$cells->{ join(':', @\$p) } or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

Results:

```str_look concat 34843.21/s
str_look join   42643.92/s

Update: Added st2_look including mat_hash by kcott.

```...

sub st2_look {
my \$cells = shift;
for my \$p (@points) {
exists \$cells->{ join(':',@\$p) } or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

my \$str_ref = str_hash();
my \$st2_ref = str_hash();
my \$big_ref = big_hash();
my \$pak_ref = pak_hash();
my \$mat_ref = mat_hash();

timethese 200000, {
Str => sub { str_look(\$str_ref) },
St2 => sub { st2_look(\$st2_ref) },
Big => sub { big_look(\$big_ref) },
Pak => sub { pak_look(\$pak_ref) },
Mat => sub { mat_look(\$mat_ref) },
};

Benchmark from Perl 5.18.2 (default Perl) on Mac OS X 10.11.6.

Benchmark from Perl 5.22.3 on Mac OS X.

```config_args='-Dprefix=/opt/perl-5.22.3 -sder -Dusethreads -Accflags=-m
+sse4.2'

Benchmark from Perl 5.24.1 on Mac OS X.

```config_args='-Dprefix=/opt/perl-5.24.1 -sder -Dusethreads -Accflags=-m
+sse4.2'

Regards, Mario

Update: Tried Perl's state feature at the end of the post.

I ran under CentOS Linux release 7.3.1611, a Parallels Desktop VM, using the default Perl v5.16.3.

I also tried pre-computing the points as strings for comparison. That reaches near 100k per second or double the performance of st2_look.

```{
my @points_str;

sub st3_look {
my \$cells = shift;
unless (@points_str) {
for my \$p (@points) {
push @points_str, join(':', @\$p);
}
}
for my \$p (@points_str) {
exists \$cells->{\$p} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}
}

my \$str_ref = str_hash();
my \$st2_ref = str_hash();
my \$st3_ref = str_hash();
my \$big_ref = big_hash();
my \$pak_ref = pak_hash();
my \$mat_ref = mat_hash();

timethese 200000, {
Str => sub { str_look(\$str_ref) },
St2 => sub { st2_look(\$st2_ref) },
St3 => sub { st3_look(\$st3_ref) },
Big => sub { big_look(\$big_ref) },
Pak => sub { pak_look(\$pak_ref) },
Mat => sub { mat_look(\$mat_ref) },
};

Results from the same VM, CentOS 7.3, Perl v5.16.3.

Another way is via Perl's state feature. However, it fails in list context.

```sub st3_look {
my \$cells = shift;
state @points_str = map { join ':', @{\$_} } @points;
...
}

Initialization of state variables in list context currently forbidden
+at test.pl line 266, near "@points;"
Execution of test.pl aborted due to compilation errors.

But, works fine for an anonymous array.

```use feature qw(state);

sub st3_look {
my \$cells = shift;
state \$points_str = [ map { join ':', @{\$_} } @points ];
for my \$p (@{ \$points_str }) {
exists \$cells->{\$p} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

Regards, Mario

Re: Fastest way to lookup a point in a set
by kcott (Bishop) on Aug 05, 2017 at 07:48 UTC

G'day eyepopslikeamosquito,

I tried using a matrix: \$cells{x_coord}{y_coord}.

For comparison, here's the benchmark using your original code:

```Benchmark: timing 200000 iterations of Big, Pak, Str...
Big: 10 wallclock secs (10.36 usr +  0.02 sys = 10.38 CPU) @ 19
+267.82/s (n=200000)
Pak:  8 wallclock secs ( 9.81 usr +  0.03 sys =  9.84 CPU) @ 20
+325.20/s (n=200000)
Str:  9 wallclock secs ( 8.30 usr +  0.02 sys =  8.32 CPU) @ 24
+038.46/s (n=200000)

So, while it looks like your machine is about twice as fast as mine, the results are roughly equivalent.

I then changed the last part of your code to:

```sub mat_hash {
my %cells;

\$cells{\$_->[0]}{\$_->[1]} = undef for @points;
my \$ncells = 0;
\$ncells += keys %{\$cells{\$_}} for keys %cells;
\$ncells == \$npoints or die;
exists \$cells{\$_->[0]}{\$_->[1]} or die for @points;
exists \$cells{'notfound'} and die;
exists \$cells{'notfound2'} and die;
exists \$cells{'notfound3'} and die;
return \%cells;
}

sub mat_look {
my \$cells = shift;
exists \$cells->{\$_->[0]}{\$_->[1]} or die for @points;
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

my \$str_ref = str_hash();
my \$big_ref = big_hash();
my \$pak_ref = pak_hash();
my \$mat_ref = mat_hash();

timethese 200000, {
Str => sub { str_look(\$str_ref) },
Big => sub { big_look(\$big_ref) },
Pak => sub { pak_look(\$pak_ref) },
Mat => sub { mat_look(\$mat_ref) },
};

I'm pretty sure mat_hash and mat_look capture all the same operations as your *_hash and *_look functions; however, it wouldn't hurt for a second look at that — preferably with a non-popping eye :-)

Other than the shebang line

```#!/usr/bin/env perl

the remainder of the code is unchanged.

I ran the benchmark three times:

```Benchmark: timing 200000 iterations of Big, Mat, Pak, Str...
Big: 10 wallclock secs (10.06 usr +  0.02 sys = 10.08 CPU) @ 19
+841.27/s (n=200000)
Mat:  8 wallclock secs ( 7.51 usr +  0.01 sys =  7.52 CPU) @ 26
+595.74/s (n=200000)
Pak: 10 wallclock secs ( 9.63 usr +  0.01 sys =  9.64 CPU) @ 20
+746.89/s (n=200000)
Str:  8 wallclock secs ( 7.78 usr +  0.02 sys =  7.80 CPU) @ 25
+641.03/s (n=200000)

Benchmark: timing 200000 iterations of Big, Mat, Pak, Str...
Big: 10 wallclock secs (10.01 usr +  0.02 sys = 10.03 CPU) @ 19
+940.18/s (n=200000)
Mat:  8 wallclock secs ( 7.52 usr +  0.01 sys =  7.53 CPU) @ 26
+560.42/s (n=200000)
Pak: 10 wallclock secs ( 9.72 usr +  0.01 sys =  9.73 CPU) @ 20
+554.98/s (n=200000)
Str:  8 wallclock secs ( 8.00 usr +  0.02 sys =  8.02 CPU) @ 24
+937.66/s (n=200000)

Benchmark: timing 200000 iterations of Big, Mat, Pak, Str...
Big: 10 wallclock secs (10.05 usr +  0.01 sys = 10.06 CPU) @ 19
+880.72/s (n=200000)
Mat:  8 wallclock secs ( 7.47 usr + -0.01 sys =  7.46 CPU) @ 26
+809.65/s (n=200000)
Pak:  9 wallclock secs ( 9.66 usr +  0.02 sys =  9.68 CPU) @ 20
+661.16/s (n=200000)
Str:  9 wallclock secs ( 8.09 usr +  0.01 sys =  8.10 CPU) @ 24
+691.36/s (n=200000)

So, it looks like Mat is slightly faster than Str (the previous fastest). Your post stressed speed but, in case it matters, the increased number of keys will equate to greater memory usage.

I'm using:

```\$ perl -v | head -2 | tail -1
This is perl 5, version 26, subversion 0 (v5.26.0) built for darwin-th

— Ken

Ken:

I'd've saved myself a little time had I read the entire thread before going off and trying a double-hash. Mine was essentially the same as yours, but I was concerned about auto-vivification making the hash grow without bounds during use. When I tested with my anti-autovivification bit, I noticed a speed boost on the double-hash vs. the others when doing lookups for items that don't exist.

I modified all the lookup routines accept a check count so I could specify the number of points to check, and built a points2 array with a bunch of points that aren't in the original. So my lookup routine looks like:

```sub dbl_look {
my \$cells = shift;
my \$chk = shift;
my \$cnt = 0;
for my \$p (@points2[0 .. \$chk]) {
(exists \$cells->{\$p->[0]} and exists \$cells->{\$p->[0]}{\$p->[1]
+}) and ++\$cnt;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

I then called each of the routines with the original data, the original data plus the same amount of non-existing points, and then the original data plus twice the number of non-existing points:

```timethese 200000, {
StjA => sub { stj_look(\$stj_ref, \$npoints },
StjB => sub { stj_look(\$stj_ref, 2*\$npoints },
StjC => sub { stj_look(\$stj_ref, 3*\$npoints },
...
};

The results were:

```\$ perl pm_1196786.pl
131, keys 131
str: 131
131, keys 131
stj: 131
big: 131
pak: 131
dbl: 131
Benchmark: timing 200000 iterations of BigA, BigB, BigC, DblA, DblB, D
+blC, PakA, PakB, PakC, StjA, StjB, StjC, StrA, StrB, StrC...
BigA:  7 wallclock secs ( 6.51 usr +  0.00 sys =  6.51 CPU) @ 30
+698.39/s (n=200000)
BigB: 13 wallclock secs (12.95 usr +  0.00 sys = 12.95 CPU) @ 15
+439.25/s (n=200000)
BigC: 19 wallclock secs (18.97 usr +  0.00 sys = 18.97 CPU) @ 10
+543.52/s (n=200000)
DblA:  6 wallclock secs ( 6.50 usr +  0.00 sys =  6.50 CPU) @ 30
+773.97/s (n=200000)
DblB: 10 wallclock secs ( 9.23 usr +  0.00 sys =  9.23 CPU) @ 21
+659.09/s (n=200000)
DblC: 12 wallclock secs (11.97 usr +  0.00 sys = 11.97 CPU) @ 16
+709.83/s (n=200000)
PakA:  6 wallclock secs ( 6.31 usr +  0.00 sys =  6.31 CPU) @ 31
+685.68/s (n=200000)
PakB: 13 wallclock secs (12.45 usr +  0.00 sys = 12.45 CPU) @ 16
+060.39/s (n=200000)
PakC: 18 wallclock secs (18.34 usr +  0.00 sys = 18.34 CPU) @ 10
+902.75/s (n=200000)
StjA:  7 wallclock secs ( 6.91 usr +  0.00 sys =  6.91 CPU) @ 28
+960.32/s (n=200000)
StjB: 14 wallclock secs (13.56 usr +  0.00 sys = 13.56 CPU) @ 14
+746.00/s (n=200000)
StjC: 20 wallclock secs (19.94 usr +  0.00 sys = 19.94 CPU) @ 10
+031.10/s (n=200000)
StrA:  6 wallclock secs ( 6.36 usr +  0.00 sys =  6.36 CPU) @ 31
+446.54/s (n=200000)
StrB: 13 wallclock secs (12.67 usr +  0.00 sys = 12.67 CPU) @ 15
+782.83/s (n=200000)
StrC: 19 wallclock secs (18.69 usr +  0.00 sys = 18.69 CPU) @ 10
+702.63/s (n=200000)

Dbl fares pretty well against the others when a significant amount of lookups fail.

Full source...

...roboticus

When your only tool is a hammer, all problems look like your thumb.

That's a good point. Depending on usage and distribution, short-circuiting on "exists \$cells[\$x]" could also improve speed. In fact, knowing more about the distribution, organising the matrix as "\$cells[\$y][\$x]", instead of "\$cells[\$x][\$y]", could be another improvement.

— Ken

Simplified hash and extended bloom(ish) filter.
by BrowserUk (Pope) on Aug 05, 2017 at 20:34 UTC

You might also look at this simplified I::C hash. I'm sure it is slower than Perl's hashes, but it is also more compact if space is a problem.

And if bloom filters float your boat, then my 10x32-bit bit-tables from 1x 64-bit hash technique described in this thread might be of interest. You only need read my posts for code and roboticus' for the stats; the rest are noise.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Fastest way to lookup a point in a set
by LanX (Cardinal) on Aug 05, 2017 at 19:37 UTC
> I have a large set of points. ... I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.

How large is "large", how many is "many"?

I have the impression that it's rather a space problem, my approach would be to split a hash into a HoH to allow swapping of the second tier.

This works well if you can organize your look-ups in a way (ordering or caching) that accesses to the second tier of hashes is bundled, such that you have a minimal amount of swapping. see also Re: write hash to disk after memory limit and Re: Small Hash a Gateway to Large Hash?

For instance looking up \$point{\$x}{\$y} would be the fastest if you can you can bundle all look-ups to points (\$x,*) for one fixed \$x.

tl;dr maybe I'm missing the point...(?)

On a side note: are you aware of multi-dim hashes in Perl?

```C:\Windows\system32>perl -e "\$x=1;\$y=2;\$h{\$x,\$y}=4; print \$h{\$x,\$y}"
4

If yes, why do you you need to join the keys by yourself???

> String: \$x . ':' . \$y

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

On a side note: are you aware of multi-dim hashes in Perl?
No, I wasn't aware of them. Curiously, when I added them to my original test, they turned out to be slower than join. That is, Multi:
```exists \$cells->{\$p->[0],\$p->[1]} or die;
was slower than Str:
```exists \$cells->{ join ':', @{\$p} } or die;
Benchmark results:
```Str:   7 wallclock secs ( 6.58 usr + 0.00 sys =.58 CPU) @ 30404.38/s (
+n=200000)
Multi: 8 wallclock secs ( 7.98 usr + 0.00 sys =.98 CPU) @ 25050.10/s (
+n=200000)

Update: though Multi was faster than:

```exists \$cells->{ join ':', \$p->[0], \$p->[1] } or die;

> they turned out to be slower than join.

Interesting, I think that's because you address the points separately in multi (2 lookups) while join can optimize over an array (one list returned)

Maybe try to see how it compares with join over 2 elements.

I wouldn't be surprised if the result is equally fast. Multi dimensional hashes are an old (Perl 4) but somehow obscure feature. They probably never optimized it over the equivalence to join.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Maybe stringification is faster
```use Data::Dump qw/pp dd/;

local \$" = \$; ;

\$p=[1,2];
\$h{"@\$p"}=12;

\$h{3,4}=34;

dd %h;

```("3\x1C4", 34, "1\x1C2", 12)

no time to benchmark, sorry! (no pun intended ;-)

##### update

Probably is fastest to keep points all the time in that representation.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

You did not answer his question though. Can you batch the queries or do you need to run them one by one?

If batching is allowed, then LanX has more or less revealed the solution. The problem is then very similar to sorting, and the fastest way to sort is probably the usual radix sort. (Maybe with AVX512 the balance has tipped in favor of merge sort; but I haven't tested that.)

Once you have binned/sorted the queries, the lookup is essentially a linear pass over the (sorted) dataset. Don't loop over queries, map them.

Re: Fastest way to lookup a point in a set (compares Perl 5.8.9 - 5.26.0)
by marioroy (Parson) on Aug 06, 2017 at 18:01 UTC

It's been a while and thought to compare between Perl 5.8.9 - 5.26.0. Also cperl 5.24.2. First, I modified mat_look to safegaurd from auto-vivification. Thanks, roboticus. Results were obtains from a Mac laptop running at 2.6 GHz (i7 Haswell).

```sub mat_look {
my \$cells = shift;
for my \$p (@points) {
exists \$cells->{\$p->[0]} or die;
exists \$cells->{\$p->[0]}{\$p->[1]} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

Results from Perl 5.8.9 - 5.26.0.

```timethese 200000, {
Big => sub { big_look(\$big_ref) }, # \$cells->{ (\$p->[1] << 32) | (\$
+p->[0] & 0xFFFFFFFF) }
Lan => sub { lan_look(\$lan_ref) }, # \$cells->{ "@\$p" }
Mat => sub { mat_look(\$mat_ref) }, # \$cells->{ \$_->[0] }{ \$_->[1] }
Pak => sub { pak_look(\$pak_ref) }, # \$cells->{ pack "ii", \$p->[0],
+\$p->[1] }
St2 => sub { st2_look(\$st2_ref) }, # \$cells->{ join(':', @\$p) }
St3 => sub { st3_look(\$st3_ref) }, # \$cells->{ \$str } # optimized
Str => sub { str_look(\$str_ref) }, # \$cells->{ \$p->[0] .':'. \$p->[1
+] }
};

Notice that stringification "@\$p" reaches join(':', @\$p) in later releases, 5.22+.

Results from cperl-5.24.2c.

Regards, Mario

Update: Added results from a 4.0 GHz machine running CentOS 7.3.

The script may be useful in the future for comparing with later versions of Perl.

```timethese 200000, {
Big => sub { big_look(\$big_ref) }, # \$cells->{ (\$p->[1] << 32) | (\$
+p->[0] & 0xFFFFFFFF) }
Kgb => sub { kgb_look(\$kgb_ref) }, # \$cells->{ \$str } # optimized
Lan => sub { lan_look(\$lan_ref) }, # \$cells->{ "@\$p" }
Mat => sub { mat_look(\$mat_ref) }, # \$cells->{ \$_->[0] }{ \$_->[1] }
Pak => sub { pak_look(\$pak_ref) }, # \$cells->{ pack "ii", \$p->[0],
+\$p->[1] }
St2 => sub { st2_look(\$st2_ref) }, # \$cells->{ join(':', @\$p) }
St3 => sub { st3_look(\$st3_ref) }, # \$cells->{ \$str } # optimized
Str => sub { str_look(\$str_ref) }, # \$cells->{ \$p->[0] .':'. \$p->[1
+] }
};

So, here is the test.pl script. Basically, the OP script with extra solutions by various monks in this thread.

Perhaps, future processors may reach 5 GHz. The following was captured on a CentOS 7.3 machine running at 4.0 GHz. These days, processors are equipped with Turbo Boost allowing up to 4.0 GHz for single task.

The native Perl on CentOS 7.3 is v5.16.3. I've gone ahead and compiled Perl v5.26.0 for comparison.

```config_args='-Dprefix=/opt/perl-5.26.0 -sder -Dusethreads -Accflags=-m
+sse4.2'

Regards, Mario

Re: Fastest way to lookup a point in a set
by karlgoethebier (Abbot) on Aug 05, 2017 at 19:47 UTC

Yes i know, this is terribly slow and so on. But i had some fun to play with it.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex(\$ENV{KARL});'Help

Update: Reran and posted results from matching Perl versions: perl-5.24.2 and cperl-5.24.2c.

Update: Added results from Perl 5.26.0 for comparison.

With little change, the code can run just as fast.

```use strict;
use warnings;
use Data::Dump qw(pp);
use feature qw(say);

my @points = map { pp(\$_) } (
[ 0,           0 ],
[ -1,          -2 ],
[ 1,           2 ],
[ -1,          2 ],
[ 1,           -2 ],
[ 0,           1 ],
[ 1,           0 ],
[ -1,          0 ],
[ 0,           -1 ],
[ 2147483647,  2147483647 ],
[ 2147483647,  -2147483647 ],
[ -2147483647, 2147483647 ],
[ -2147483647, -2147483647 ],
);

kgb_str_look( kgb_str_hash() );

sub kgb_str_hash {
my %cells = map { \$_ => undef } @points;
\%cells;
}

sub kgb_str_look {
my \$cells = shift;
for (@points) {
( exists \$cells->{\$_} ) ? say \$_ : die;
}
}

In the spirit of fun, added kgb to the test script. The mat_hash was updated to safeguard from auto-vivification. See this post.

```use Data::Dump qw(pp);

sub kgb_hash {
my %cells = map { pp(\$_) => undef } @points;
\%cells;
}

{
my @points_str;

sub kgb_look {
my \$cells = shift;
unless (@points_str) {
for my \$p (@points) {
push @points_str, pp(\$p);
}
}
for my \$p (@points_str) {
exists \$cells->{\$p} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}
}

my \$kgb_ref = kgb_hash();

...

timethese 200000, {
Big => sub { big_look(\$big_ref) }, # \$cells->{ (\$p->[1] << 32) | (\$
+p->[0] & 0xFFFFFFFF) }
Kgb => sub { kgb_look(\$kgb_ref) }, # \$cells->{ \$str } # optimized
Lan => sub { lan_look(\$lan_ref) }, # \$cells->{ "@\$p" }
Mat => sub { mat_look(\$mat_ref) }, # \$cells->{ \$_->[0] }{ \$_->[1] }
Pak => sub { pak_look(\$pak_ref) }, # \$cells->{ pack "ii", \$p->[0],
+\$p->[1] }
St2 => sub { st2_look(\$st2_ref) }, # \$cells->{ join(':', @\$p) }
St3 => sub { st3_look(\$st3_ref) }, # \$cells->{ \$str } # optimized
Str => sub { str_look(\$str_ref) }, # \$cells->{ \$p->[0] .':'. \$p->[1
+] }
};

Results obtained from a Mac laptop running at 2.6 GHz, comparing perl-5.24.2 against cperl-5.24.2c.

Results from Perl 5.26.0 for comparison.

For some reason, the state feature doesn't support initialization in list context.

```sub kgb_look {
my \$cells = shift;
state @points_str = map { pp(\$_) } @points;
...
}

Initialization of state variables in list context currently forbidden
+at test.pl line 341, near "@points;"
Execution of test.pl aborted due to compilation errors.

Alrighty, state initialization works fine for an anonymous array.

```use feature qw(state);

sub kgb_look {
my \$cells = shift;
state \$points_str = [ map { pp(\$_) } @points ];
for my \$p (@{ \$points_str }) {
exists \$cells->{\$p} or die;
}
exists \$cells->{'notfound'} and die;
exists \$cells->{'notfound2'} and die;
exists \$cells->{'notfound3'} and die;
}

Regards, Mario

"...little change..."

Ouch! Yes, sure - just calling it once. My fault. Very nice. See also. Thank you very much Mario and best regards, Karl

«The Crux of the Biscuit is the Apostrophe»

This signature has been bowdlerized because of a special request.

Re: Fastest way to lookup a point in a set
by talexb (Canon) on Aug 08, 2017 at 16:05 UTC

I'm late to this party, but did you try a database solution? I would give SQLite a shot.

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Hello talexb,

The following is a demonstration comparing a plain hash against CDB_File, DB_File. and SQLite_File.

```use strict;
use warnings;
use feature 'state';

use CDB_File;
use DB_File;
use SQLite_File;
use Time::HiRes 'time';

my @points = (
[  0,  0 ],
[ -1, -2 ],
[  1,  2 ],
[ -1,  2 ],
...
);

sub plain_hash {
my %hash;
\$hash{ join(':',@{\$_}) } = 1 for @points;
\%hash;
}

sub cdb_hash {
my %hash;

# create CDB file
my \$cdb = CDB_File->new("t.cdb", "t.cdb.\$\$")
or die "\$!\n";
\$cdb->insert( join(':',@{\$_}), 1 ) for @points;
\$cdb->finish;

# use CDB file
tie %hash, 'CDB_File', "t.cdb" or die "\$!\n";
\%hash;
}

sub db_hash {
my %hash;

# create DB file
tie %hash, 'DB_File', "t.db", O_CREAT|O_RDWR, \$DB_BTREE;
\$hash{ join(':',@{\$_}) } = 1 for @points;
untie %hash;

# use DB file
tie %hash, 'DB_File', "t.db", O_RDWR, \$DB_BTREE;
\%hash;
}

sub sql_hash {
tie my %hash, 'SQLite_File', "sql.db";
\$hash{ join(':',@{\$_}) } = 1 for @points;
\%hash;
}

sub look {
my \$cells = shift;
state \$points_str = [ map { join(':',@{\$_}) } @points ];
for my \$p (@{ \$points_str }) {
exists \$cells->{\$p} or die;
}
}

sub bench {
my ( \$desc, \$func   ) = @_;
my ( \$cells, \$iters ) = ( \$func->(), 50000 );

my \$start = time;
look(\$cells) for 1..\$iters;
my \$elapse = time - \$start;

printf "%s duration : %0.03f secs\n", \$desc, \$elapse;
printf "%s lookups  : %d / sec\n",
\$desc, @points * \$iters / \$elapse;
}

bench( "plain", \&plain_hash );
bench( "  cdb", \&cdb_hash   );
bench( "   db", \&db_hash    );
bench( "  sql", \&sql_hash   );

Results: The native Perl on Mac OS X 10.11.6 is v5.18.2. The hardware is a Haswell i7 chip at 2.6 GHz.

Regarding CDB_File and DB_File performance, these run better with Perl 5.20 and later releases.

```\$ perl test.pl
plain duration : 0.639 secs
plain lookups  : 10243321 / sec
cdb duration : 5.692 secs
cdb lookups  : 1150711 / sec
db duration : 9.746 secs
db lookups  : 672046 / sec

\$ /opt/perl-5.26.0/bin/perl test.pl
plain duration : 0.517 secs
plain lookups  : 12677776 / sec
cdb duration : 3.881 secs
cdb lookups  : 1687545 / sec
db duration : 6.134 secs
db lookups  : 1067780 / sec
sql duration : 184.338 secs
sql lookups  : 35532 / sec

Regards, Mario

Thanks for that .. I wonder what happens when the number of data points increases by a few orders of magnitude? That's where I would expect a database solution to start to overtake the hash-based solution.

Alex / talexb / Toronto

Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

did you try a database solution?

I did. It was so spectactularly much slower that I didn't bother posting it (and I used postgres).

Create A New User
Node Status?
node history
Node Type: perlquestion [id://1196786]
Approved by Athanasius
Front-paged by kcott
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2020-11-25 19:06 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found

Notices?