Benchmarking "Are all these characters in this sentence?"

(Edit: Updated based on comments below, and cleaned up. I've broken out the tests from the benchmark, done the tests using Test::More, and factored out the list of test cases.

I've also added a few more tests from tassilo, although some of them failed the tests and are commendted out for now. )

Another Edit: repellent wins!! unpack is the way to go...

_Another_ Another Edit: tye wins! Not building a list is even better... See tye's post. I have updated this post w/ the tye results.

_Another_^3 Edit: BrowserUK wins! Well, after the fix so it isn't so destructive of the testArray.

A variant of BrowserUK's approach that uses substr comes in second, but a long ways behind chop.

_Another_^4 Edit: BrowserUK wins again!

buk2 is by far the fastest, but fails a "0", "0" test, so buk2_len which is a bit slower is the real winner, still 10 times faster than the competition for some test sets.

For REALLY fast routines being benchmarked, copying args is clearly a very bad idea. However, other routines like tye0 to tye2 which were the fastest for a while don't benefit as much from removing the argument copies (see the _opt variants in the results).

_Another_^5 Edit: BrowserUK wins again (again)!

buk3 is by far the fastest and passes all the tests

In check if all certain chars are found in a sentence, tallulah asked for improvements on a routine trying to check whether a given sentence contains all of a set of specified characters.

Many monks suggested solutions, with way more than one way to do it.

Here are the results of some benchmarks on all of the suggested approaches on my Macbook Pro, using the provided 5.8.8 build.

I was quite surprised by the results, actually... I'd have bet good money that the slice approach was going to beat anything else, unless Tanktalus's List::MoreUtils approaches won. I promise, I was going to post this no wonder which implementation won :)

Where a solution was regex based, I also added a variant using "study", to see how that impacted performance.

If anyone wants to add a few test cases then run the tests, invoke the script with a -t argument.

Detailed results behind readmore, to fix Meditations page formatting:

Short                                                                   
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                              105/s              --
Tanktalus_AllRegex_Study                           127/s             22%
Tanktalus_AllRegex                                 130/s             24%
varian_hash                                        146/s             40%
RMGir_slice                                        207/s             98%
moritz_BuildRegex_WithStudy                        243/s            133%
moritz_BuildRegex                                  261/s            149%
ysth_loookahead                                    350/s            234%
Tanktalus_AllIndex                                 411/s            293%
unpack_allindex                                    444/s            324%
unpack_allrindex                                   448/s            328%
tassilo_listutils_r                                581/s            455%
tye2                                               655/s            525%
tye1                                               661/s            531%
RMGir_index                                        667/s            537%
repellent_unpack                                   770/s            635%
repellent_unpack_opt                               792/s            657%
tye2_opt                                           807/s            671%
tye1_opt                                           815/s            678%
tye0                                               829/s            692%
tye0_opt                                           830/s            693%
buk_substr                                        1152/s           1000%
buk                                               1829/s           1648%
buk2_len                                          4443/s           4144%
buk2                                              4610/s           4305%
buk3                                              5144/s           4814%
                                                                        
                                                                        
LongShort                                                               
                                                    Rate     varian_hash
varian_hash                                       4.90/s              --
RMGir_slice                                       5.88/s             20%
tallulah_OriginalPost                              168/s           3319%
Tanktalus_AllRegex_Study                           173/s           3422%
Tanktalus_AllRegex                                 209/s           4171%
moritz_BuildRegex_WithStudy                        243/s           4867%
moritz_BuildRegex                                  396/s           7983%
ysth_loookahead                                    575/s          11640%
Tanktalus_AllIndex                                 731/s          14822%
unpack_allrindex                                   799/s          16201%
unpack_allindex                                    821/s          16650%
tassilo_listutils_r                                914/s          18549%
RMGir_index                                       1017/s          20652%
tye2                                              1018/s          20671%
tye1                                              1046/s          21234%
repellent_unpack                                  1158/s          23518%
repellent_unpack_opt                              1207/s          24515%
tye2_opt                                          1386/s          28180%
tye0_opt                                          1399/s          28442%
tye0                                              1414/s          28746%
tye1_opt                                          1440/s          29269%
buk_substr                                        1761/s          35815%
buk                                               2849/s          58012%
buk2_len                                          9489/s         193482%
buk2                                              9686/s         197486%
buk3                                             11005/s         224395%
                                                                        
                                                                        
ShortLong                                                               
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                             3.20/s              --
Tanktalus_AllRegex                                4.42/s             38%
Tanktalus_AllRegex_Study                          4.42/s             38%
moritz_BuildRegex_WithStudy                       5.36/s             67%
moritz_BuildRegex                                 5.41/s             69%
varian_hash                                       8.18/s            156%
ysth_loookahead                                   9.26/s            189%
RMGir_index                                       14.4/s            351%
tassilo_listutils_r                               15.9/s            396%
RMGir_slice                                       15.9/s            396%
Tanktalus_AllIndex                                16.0/s            401%
repellent_unpack_opt                              16.8/s            425%
repellent_unpack                                  17.1/s            435%
unpack_allrindex                                  20.0/s            525%
unpack_allindex                                   20.2/s            531%
tye2_opt                                          25.5/s            696%
tye0_opt                                          25.7/s            703%
tye0                                              25.7/s            704%
tye1_opt                                          26.2/s            719%
tye2                                              26.9/s            739%
tye1                                              27.4/s            755%
buk_substr                                        64.0/s           1900%
buk                                               67.9/s           2021%
buk2_len                                           888/s          27649%
buk2                                               939/s          29247%
buk3                                              1109/s          34553%
                                                                        
                                                                        
LongLong                                                                
                                                    Rate tallulah_OriginalPost
tallulah_OriginalPost                             3.20/s              --
Tanktalus_AllRegex_Study                          4.46/s             40%
Tanktalus_AllRegex                                4.50/s             41%
varian_hash                                       4.59/s             43%
moritz_BuildRegex                                 5.41/s             69%
moritz_BuildRegex_WithStudy                       5.41/s             69%
RMGir_slice                                       6.80/s            112%
ysth_loookahead                                   9.40/s            194%
RMGir_index                                       12.9/s            302%
repellent_unpack_opt                              15.1/s            372%
repellent_unpack                                  15.6/s            387%
tassilo_listutils_r                               16.0/s            399%
Tanktalus_AllIndex                                16.0/s            401%
unpack_allindex                                   20.4/s            537%
unpack_allrindex                                  20.6/s            543%
tye1_opt                                          25.9/s            708%
tye0                                              26.0/s            711%
tye2_opt                                          26.0/s            711%
tye0_opt                                          26.2/s            719%
tye2                                              27.3/s            752%
tye1                                              27.3/s            752%
buk_substr                                        63.6/s           1889%
buk                                               67.3/s           2002%
buk2_len                                           852/s          26537%
buk2                                               896/s          27908%
buk3                                              1017/s          31690%
                                                                        
                                                                        
VeryLong                                                                
                                                    Rate     varian_hash
varian_hash                                       8.41/s              --
RMGir_slice                                       10.2/s             21%
tallulah_OriginalPost                              189/s           2152%
Tanktalus_AllRegex_Study                           205/s           2332%
Tanktalus_AllRegex                                 245/s           2810%
moritz_BuildRegex_WithStudy                        346/s           4019%
moritz_BuildRegex                                  508/s           5942%
ysth_loookahead                                    800/s           9411%
Tanktalus_AllIndex                                1090/s          12861%
unpack_allrindex                                  1184/s          13980%
tassilo_listutils_r                               1207/s          14245%
unpack_allindex                                   1230/s          14521%
tye2_opt                                          1267/s          14963%
tye1_opt                                          1268/s          14974%
tye0                                              1279/s          15106%
tye2                                              1280/s          15118%
RMGir_index                                       1280/s          15118%
tye0_opt                                          1292/s          15264%
tye1                                              1297/s          15323%
repellent_unpack                                  1506/s          17801%
repellent_unpack_opt                              1527/s          18058%
buk_substr                                        2634/s          31219%
buk                                               2830/s          33543%
buk2_len                                         24093/s         286343%
buk2                                             25048/s         297687%
buk3                                             28768/s         341917%

I'd be interested in seeing how these numbers change for 5.10, if anyone has an installation handy. The benchmark code is behind the second readmore tag.

#!/usr/bin/perl -w

use strict;
use List::MoreUtils qw(all);
use Benchmark qw(cmpthese);

#use Test::More qw(no_plan);
require Test::More;

my @shortTestCases=(
    # sentence                   wantedChars  result
    [ "abxcd zwe rrv",           "0",         0 ],
    [ "0",                       "0",         1 ],
    [ "abxcd zwe rrv",           "xxv",       1 ],
    [ "abxcd zwe rrv",           "xxvq",      0 ],
    [ "abxcd zwe rrv",           "",          1 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "abcdefghijklmnopqrstuvwxyz",
                                              1 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "abcdefghijklmnopqrstuvwxyzT",
                                              1 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "abcdefghijklmnopqrstuvwxyzTU",
                                              0 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "a",         1 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "",          1 ],
);

# Long sentence, short wantedChars
my @longShortTestCases = (
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "",          1 ],
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "a",          1 ],
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "abcdefghijklmnopqrstuvwxyzT",
                                              1 ],
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "abcdefghijklmnopqrstuvwxyzTU",
                                              0 ],
);

# Short sentence, long wantedChars
my @shortLongTestCases = (
    [ "The quick brown fox jumps over the lazy dog",
                                 "abcdefghijklmnopqrstuvwxyzT"x100,
                                              1 ],
    [ "The quick brown fox jumps over the lazy dog",
                                 "abcdefghijklmnopqrstuvwxyzTU"x100,
                                              0 ],
);

# Long sentence, long wantedChars
my @longLongTestCases = (
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "abcdefghijklmnopqrstuvwxyzT"x100,
                                              1 ],
    [ "The quick brown fox jumps over the lazy dog" x 100,
                                 "abcdefghijklmnopqrstuvwxyzTU"x100,
                                              0 ],
);

# VERY long test case
my $alphabet=(join '' => 'a' .. 'z', 'A' .. 'Z');
my @veryLongTestCase = (
        [
            random(10000, $alphabet),
                                  $alphabet, 1 ],
);

my @testCases = (
    @shortTestCases,
    @shortLongTestCases,
    @longShortTestCases,
    @longLongTestCases,
    @veryLongTestCase
);

sub random {
   my $num = shift;
   my $wantedChars = shift;
   my @chars = ('a' .. 'z', 'A' .. 'Z');
   my $result;
   # make sure result will work
   do  {
       $result = join '' => map $chars[rand @chars], 1 .. $num;
   } until Tanktalus_AllIndex($result, $wantedChars);

   return $result;
}



sub test_routine
{
    my ($testFn, $testName)=@_;

    foreach(@testCases) {
       my ($sentence, $wantedLetters, $expectedResult) = @$_;
       ok (!!($testFn->($sentence, $wantedLetters)) == !!$expectedResu
+lt);
    }
}

sub benchmark_routine
{
    my ($testFn, $testName, $testCases)=@_;

    foreach(@$testCases) {
        my ($sentence, $wantedLetters, $expectedResult) = @$_;
        $testFn->($sentence, $wantedLetters) for 1..20;
    }
}


# [id://707122]
sub tallulah_OriginalPost
{
    my ($sentence, $wantedLetters)=@_;
    my $flag=0;
    my @a = split '',$wantedLetters;
    for( my $i=0; $i<$#a+1; $i++ ) {
        if($sentence !~ /$a[$i]/) {
            $flag=1;last;
        }
    }

    return !$flag;
}

# [id://707123]
sub moritz_BuildRegex
{
    my ($sentence, $wantedLetters)=@_;

    my $re = '^' . join '', map "(?=.*?$_)", map quotemeta, split m//,
             $wantedLetters;

    if ($sentence =~ m/$re/) {
        return 1;
    }

    return 0;
}

# [id://707123]
sub moritz_BuildRegex_WithStudy
{
    my ($sentence, $wantedLetters)=@_;

    my $re = '^' . join '', map "(?=.*?$_)", map quotemeta, split m//,
             $wantedLetters;

    study $sentence;
    if ($sentence =~ m/$re/) {
        return 1;
    }

    return 0;
}

# [id://707124]
sub RMGir_index {
    my ($sentence, $wantedLetters)=@_;

    # don't need this variable (or any of them, in
    # fact -- they're just here for clarity.
    # we could work straight out of @_ if we wanted
    # this terser
    # Also, the $[ check is just pedantic - if someone
    # changes $[, shoot them.
    my $foundLetters=scalar (grep index($sentence,$_)>=$[,
                                  split //,$wantedLetters);

    return length($wantedLetters)==$foundLetters;
}

# [id://707222]
sub Tanktalus_AllRegex {
    my ($sentence, $letters) = @_;

    return 1 unless length($letters);

    # all we're doing is checking for each letter.
    all { $sentence =~ $_ } split //, $letters;
}

# [id://707222]
sub Tanktalus_AllRegex_Study {
    my ($sentence, $letters) = @_;

    return 1 unless length($letters);

    study $sentence;

    # all we're doing is checking for each letter.
    all { $sentence =~ $_ } split //, $letters;

    # same as above, but with index which I think is less readable.
    #all { index($sentence, $_) >= $[ } split //, $letters;
}

# [id://707222]
sub Tanktalus_AllIndex {
    my ($sentence, $letters) = @_;

    return 1 unless length($letters);

    # same as above, but with index which I think is less readable.
    all { index($sentence, $_) >= $[ } split //, $letters;
}

# JavaFan's looks about equivalent to OP approach

# Doesn't have same repeated letter semantics specified in
# OP post.
# [id://707176]
sub oshalla_scan {
    my ($sentence, $wanted) = @_ ;

    while (length($wanted)) {
        return 0 if ($sentence !~ m/([$wanted])/g) ;
        $wanted =~ s/$1// ;
    } ;

    return 1;
}

# [id://707231]
sub varian_hash {
    my ($sentence, $wantedLetters)=@_;

    my %required = map {$_ => 1} split //,$wantedLetters;
    map delete $required{$_}, split //, $sentence;

    if (keys %required) {
        return 0;
    }
    else {
        return 1;
    }
}

# [id://707314]
sub RMGir_slice {
    my ($sentence, $wantedLetters)=@_;

    my %required;
    @required{split //,$wantedLetters}=();
    delete @required{split //, $sentence};

    if (keys %required) {
        return 0;
    }
    else {
        return 1;
    }
}

# [tassilo]'s test cases


sub makeSubRef {
    return @_[1,0];
}

my %testSubroutines=(
    makeSubRef(\&tallulah_OriginalPost,
        "tallulah_OriginalPost"),
    makeSubRef(\&moritz_BuildRegex,
        "moritz_BuildRegex"),
    makeSubRef(\&moritz_BuildRegex_WithStudy,
        "moritz_BuildRegex_WithStudy"),
    makeSubRef(\&RMGir_index,
        "RMGir_index"),
    makeSubRef(\&Tanktalus_AllRegex,
        "Tanktalus_AllRegex"),
    makeSubRef(\&Tanktalus_AllRegex_Study,
        "Tanktalus_AllRegex_Study"),
    makeSubRef(\&Tanktalus_AllIndex,
        "Tanktalus_AllIndex"),
    makeSubRef(\&varian_hash,
        "varian_hash"),
    makeSubRef(\&RMGir_slice,
        "RMGir_slice"),
    tassilo_listutils_r => sub {
        return 1 unless length($_[1]);

        all { rindex($_[0], $_) >= 0 } split //, $_[1];
    },

    repellent_unpack => sub {
        my ($sentence, $wantedLetters)=@_;

        my $foundLetters=scalar (grep index($sentence,$_)>=$[,
            unpack "(a)*", $wantedLetters);

        return length($wantedLetters)==$foundLetters;
    },

    repellent_unpack_opt => sub {
        length($_[1])==scalar (grep index($_[0],$_)>=$[,
            unpack "(a)*", $_[1]);
    },

    unpack_allindex => sub {
        return 1 unless length($_[1]);

        my ($sentence, $wantedLetters)=@_;

        all { index($sentence, $_) >= $[ }
                                    unpack "(a)*", $wantedLetters;
    },

    unpack_allrindex => sub {
        return 1 unless length($_[1]);

        my ($sentence, $wantedLetters)=@_;

        all { rindex($sentence, $_) >= $[ }
            unpack "(a)*", $wantedLetters;
    },


    tye2_opt => sub {
        while( $_[1] =~ /(.)/gs ) {
            return 0 if -1 == index($_[0],$1);
        }
        return 1;
    },

    tye1_opt => sub {
        -1 == index($_[0],$1) && return 0
        while( $_[1] =~ /(.)/gs );
        return 1;
    },

    tye0_opt => sub {
        -1 == index( $_[0], $1 )  &&  return 0
        while( $_[1] =~ /(.)/gs );
        return 1;
    },


    tye2 => sub {
        my( $sentence, $wantedLetters )= @_;
        while( $wantedLetters =~ /(.)/gs ) {
            return 0 if -1 == index($sentence,$1);
        }
        return 1;
    },

    tye1 => sub {
        my( $sentence, $wantedLetters )= @_;
        -1 == index($sentence,$1) && return 0
        while( $wantedLetters =~ /(.)/gs );
        return 1;
    },

    tye0 => sub {
        -1 == index( $_[0], $1 )  &&  return 0
        while( $_[1] =~ /(.)/gs );
        return 1;
    },

    # FAILS the "0" "0" test....
    buk => sub {
        my( $s, $w ) = @_;
        my $c;
        1+index $s, $c or return 0 while $c = chop $w;
        1;
    },

    buk_substr => sub {
        my( $s, $w ) = @_;
        1+index $s, substr($w,$_,1) or return 0 foreach 0..length($w);
        1;
    },

    # FAILS the "0" "0" test....
    buk2 => sub {
        local $_;
        1+index $_[0], $_ or return while $_ = chop $_[1];
        1;
    },

    buk2_len => sub {
        local $_;
        1+index $_[0], $_ or return while length($_ = chop $_[1]);
        1;
    },

    buk3 => sub {
        1+index $_[0], chop $_[ 1 ] or return for 1 .. length $_[ 1 ];
        1;
    },

    buk4 => sub {
        ( -1 != index $_[0], chop $_[ 1 ] ) || return for 1 .. length 
+$_[ 1 ];
        1;
    },

    ysth_loookahead => sub {
        my ($sentence, $wantedLetters) = @_;
        $wantedLetters =~ s/(.)/(?=.*?\Q$1\E)/sg;
        $sentence =~ /^$wantedLetters/s;
    }

#   # These 2 routines need more debugging - they fail the tests
#   tassilo_xor => sub {
#       return 1 unless length($_[1]);
#       my $copy = $_[0];
#       for (split //, $_[1]) {
#           my $mask = ($_) x length $copy;
#           $copy ^= $mask;
#           $copy =~ tr/\000//d;
#           $copy ^= ( ($_) x length $copy );
#       };
#       length($copy) == 0;
#   },
#   tassilo_tr => sub {
#       return 1 unless length($_[1]);
#       my $copy = $_[0];
#       eval "\$copy =~ tr/$_[1]//d";
#       length($copy) == 0;
#   }

);


if(@ARGV && $ARGV[0] eq "-t") {
    use Test::More;

    plan tests => ((scalar keys %testSubroutines) * scalar @testCases)
+;

    print "Testing routines...\n";
    foreach my $name(sort keys %testSubroutines) {
        print "Testing $name\n";
        test_routine($testSubroutines{$name}, $name);
    }

    exit(0);
}


print "Running benchmarks...\n";

my $testsRef = \@shortTestCases;

my $benchmark_routines={
       map { ($_,
              eval qq[sub { benchmark_routine(\$testSubroutines{"$_"},
                                              "$_", \$testsRef); }
              ])
           } keys %testSubroutines
    };

print "Short      \n";
cmpthese(-1, $benchmark_routines);


$testsRef = \@longShortTestCases;
print "\n\n";
print "LongShort   \n";
cmpthese(-1, $benchmark_routines);

$testsRef = \@shortLongTestCases;
print "\n\n";
print "ShortLong   \n";
cmpthese(-1, $benchmark_routines);

$testsRef = \@longLongTestCases;
print "\n\n";
print "LongLong    \n";
cmpthese(-1, $benchmark_routines);

$testsRef = \@veryLongTestCase;
print "\n\n";
print "VeryLong    \n"; 
cmpthese(-1, $benchmark_routines);
[download]

Mike

Comment on Benchmarking "Are all these characters in this sentence?" Download Code

Replies are listed 'Best First'.

Re: Benchmarking "Are all these characters in this sentence?"
by moritz (Cardinal) on Aug 28, 2008 at 23:05 UTC

Currently the comparison is just a random point of data in the big, two-dimensional benchmark space ;-)

[reply]

Re^2: Benchmarking "Are all these characters in this sentence?"

by RMGir (Prior) on Aug 29, 2008 at 00:22 UTC

I've added a few 3k-ish character search strings, and a few 3k-ish character character sets... Doing the long cases moves the results more in favour of Tanktalus_AllIndex:

Short sentence and search set cases

tallulah_OriginalPost            2054/s
Tanktalus_AllRegex               2511/s
Tanktalus_AllRegex_Study         2522/s
moritz_BuildRegex_WithStudy      2595/s
moritz_BuildRegex                2715/s
varian_hash                      2983/s
RMGir_slice                      4035/s
Tanktalus_AllIndex               8219/s
RMGir_index                     12107/s


Long sentence and Short search set cases

varian_hash                      97.2/s
RMGir_slice                       115/s
moritz_BuildRegex_WithStudy      3172/s
tallulah_OriginalPost            3230/s
Tanktalus_AllRegex_Study         3319/s
Tanktalus_AllRegex               4054/s
moritz_BuildRegex                4250/s
Tanktalus_AllIndex              13032/s
RMGir_index                     17612/s

Short sentence and Long search set cases
moritz_BuildRegex_WithStudy      54.1/s
moritz_BuildRegex                54.6/s
tallulah_OriginalPost            63.6/s
Tanktalus_AllRegex               86.9/s
Tanktalus_AllRegex_Study         87.1/s
varian_hash                       161/s
RMGir_index                       285/s
RMGir_slice                       319/s
Tanktalus_AllIndex                320/s

Long sentence and Long search set cases
moritz_BuildRegex_WithStudy      54.1/s
moritz_BuildRegex                54.6/s
tallulah_OriginalPost            63.6/s
varian_hash                      86.9/s
Tanktalus_AllRegex_Study         87.7/s
Tanktalus_AllRegex               87.7/s
RMGir_slice                       135/s
RMGir_index                       250/s
Tanktalus_AllIndex                319/s


Problems? Is your data what you think it is?
	PerlMonks