perlquestion
Limbic~Region
All,
<br />
Tonight on #perl in freenode, a user asked "<i>What is the fastest way to find the number of substrings two strings have in common?</i>"
<p>
The user had working code but wanted to know if it could be made faster without relying on [cpan://Inline::C]. We asked a number of clarifying questions to flesh out the requirements:
</p>
<ul>
<li>Only the count is required, not the substrings</li>
<li>A substring that matches in multiple places counts only once</li>
<li>The substring length is user defined</li>
<li>Substrings, as defined by the requester, are contiguous. All substrings of 'ABC' are A, B, C, AB, BC, ABC</li>
</ul>
My idea was to generate an [doc://unpack] template programmatically that would return all the substrings at once. It would then be a simple matter of returning the count of intersecting substrings. It seemed like this could be done without any explicit loops. My solution follows:
<p>
<spoiler>
<CODE>
#!/usr/bin/perl
use strict;
use warnings;
print common_substr('ABCD', 4, 'BCD', 3, 2);
{
my %seen;
sub common_substr {
my ($str1, $len_str1, $str2, $len_str2, $len_subs) = @_;
my $temp1 = exists $seen{$len_str1}{$len_subs}
? $seen{$len_str1}{$len_subs}
: ($seen{$len_str1}{$len_subs} = ("a${len_subs}X" . ($len_subs - 1)) x ($len_str1 - $len_subs + 1));
my %substr;
@substr{ unpack($temp1, $str1) } = ();
my $temp2 = exists $seen{$len_str2}{$len_subs}
? $seen{$len_str2}{$len_subs}
: ($seen{$len_str2}{$len_subs} = ("a${len_subs}X" . ($len_subs - 1)) x ($len_str2 - $len_subs + 1));
my $count = keys %substr;
delete @substr{ unpack($temp1, $str2) };
return $count - keys %substr;
}
}
</CODE>
</spoiler>
</p>
<p>
It seemed to me that the challenge would be much more interesting if the substring length were allowed to be a range so that's worth bonus points. What is the fastest solution you can come up with in pure perl?
</p>
<p>
<b>Update:</b> Added definition of <i>substring</i> as well as minor code change assigning hash slice to empty list ([f00li5h]++).
</p>
<p>
<b>Update: </b>2007-04-04 08:25 EST - Thanks to everyone who has contributed so far. I will be posting a [cpan://Benchmark] after work so that folks who want to optimize for a given data set may.
</p>
<p>
<b>Update: </b>2007-04-05 08:14 EST - Two days after scheduling to transfer my service to another provider, my ISP mysteriously starts having problems with my account - coincidence? I would appreciate it if someone could post a [cpan://Benchmark] assuming 5,000 pairs of strings ranging in length from 30 to 50 lowercase letters with a desired substring length ranging from 3 to 7.
</p>
<div class="pmsig"><div class="pmsig-180961">
<p>
Cheers - [Limbic~Region|L~R]
</p>
</div></div>