Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I was recently reading an article in DDJ(April 2001) about suffix arrays. As described in the article, the procedure in C is to 1) fill an array with pointers to every position in a string, 2) sort the array, and 3) search the array for long phrases. This is an extremely fast way to find the longest duplicated substrings. The crux of the technique is to create pointers into a string, which in C is really an array, and avoids the problem of copying or moving lots of substrings in memory.
My question is, how could this be duplicated in Perl? Is there a technique for creating references to parts inside a scalar? I know there is a module for this, but it is simply a frontend to a c program. Can this be done strictly in Perl without using a pile of memory?
I took a shot at doing this, based on the DDJ code, but it is slow as bejeezus. Please Perl Monks, grant me your wisdom!!
#!/usr/bin/perl -w
use strict;
my $input;
{ local $/=undef;
$input=<>;
}
my @arr=(0 .. (length($input)-2));
my @arr2=sort {
makestr($a)
cmp
makestr($b)
} @arr;
my $maxlen=0;
my $maxindex=0;
for(my $i=0; $i<(length($input)-2); $i++) {
my $temp=checklen($arr2[$i], $arr2[$i+1], \$input);
if ($temp > $maxlen) {
$maxlen=$temp;
$maxindex=$i;
}
}
print "maxlen:$maxlen index:$maxindex string:",substr($input,0,$maxlen
+),"\n";
sub makestr{
return substr($input, $_[0], (length($input)-$_[0]));
}
sub checklen{
my $aa= $_[0];
my $bb=$_[1];
my $input=$_[2];
my $counter=0;
my $len=length($$input)-1;
while(($aa < $len)&& ($bb < $len)) {
last unless (substr($$input, $aa++,1) eq substr($$input, $bb++,1))
+;
$counter++;
}
return $counter;
}
Re: suffix arrays
by Zaxo (Archbishop) on Jul 18, 2003 at 20:59 UTC
|
A reference to substr's return value has the properties you want,
my $foo = "A string to work on.";
my @substrs = map {\substr $foo, $_} 0 .. length($foo) - 1;
my @sorted = sort { $$a cmp $$b } @substrs;
That the reference is to the original string is shown by:
$ perl -e '$foo = "A short string\n"; $bar = \substr $foo, 0, 1; $$bar
+ = "The"; print $foo'
The short string
$
After Compline, Zaxo | [reply] [d/l] [select] |
|
perl> print join "\n", map{ \substr 'The quick brown fox', $_ } 0 ..
+19
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
LVALUE(0x15d7cf4)
As you can see, each time you take an Lvalue from substr, it re-uses the same address. So you end up with and array of pointers to the last char in the string.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] [d/l] |
|
perl -e'my $foo = "A string"; my @foo = map{\substr $foo, $_} 0..7; pr
+int map {$$_,$/} @foo;'
A string
string
string
tring
ring
ing
ng
g
$
5.8.1 does more of what I mean!
After Compline, Zaxo | [reply] [d/l] |
|
Re: suffix arrays
by BrowserUk (Patriarch) on Jul 18, 2003 at 21:27 UTC
|
You can get away with using the Lvalues from substr in the array as suggested by Zaxo above, but you have to jump through a few hoops to prevent the problem I alluded to.
Basically, there seems to be a single slot allocated for all the Lvalues generated by any given statement that includes a substr. That's a clumsy way to put it, but I can't think of a better way to say it.
To work around the limitation, it is necessary to use a 'different' substr each time, hence the need for the eval in this code. The downside is that the eval markly reduces the efficiency. The algorithm works fine though and this implementation shoudl be a bit quicker than yours.
#! perl -slw
use strict;
sub ATOa{ map{ eval '\substr( $_[0], $_ )' } 0 .. length( $_[0] ); }
my @Aoa = sort{ $$a cmp $$b } ATOa $ARGV[0];
#print $$_ for @Aoa;
my $max = '';
for my $i ( 0 .. ($#Aoa-1) ) {
my $p = 0;
++$p while substr( ${$Aoa[ $i ]}, 0, $p )
eq substr( ${$Aoa[ $i+1 ]}, 0, $p );
$max = substr( ${$Aoa[ $i ]}, 0, $p-1 )
if length $max < $p;
}
print "The longest common substring in\n'$ARGV[0]'\n is '$max'";
__DATA__
P:\test>test "the infernal invention was the source or eternal facinat
+ion for the qualified professional and enthusiastic amateur alike"
The longest common substring in
'the infernal invention was the source of eternal facination for the q
+ualified professional and enthusiastic amateur alike'
is 'ernal '
P:\test>test "Four and twenty virgins came down from Inverness. And wh
+en they went back again there were four and twenty less"
The longest common substring in
'Four and twenty virgins came down from Inverness. And when they went
+back again there were four and twenty less'
is 'our and twenty '
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
| [reply] [d/l] |
|
sub ATOa {
my $code = '('
. join( ',', map { "\\substr(\$_[0], $_)" } 0 .. length(
+$_[0] ) )
. ')';
return eval $code;
}
:-)
-sauoq
"My two cents aren't worth a dime.";
| [reply] [d/l] |
|
my $p = length($max);
- Cees | [reply] [d/l] |
Re: suffix arrays
by sauoq (Abbot) on Jul 18, 2003 at 20:54 UTC
|
I'm a little confused about what you are asking for. You want to search a set of strings and try to find common suffixes? Your implementation doesn't seem to do that.
Your implementation also isn't very efficiently written. Your sort of @arr, for instance, could benefit from a Schwartzian Transform.
Looking into my crystal ball... I have a feeling hashes will play a part in your answer.
Update: (After some research at ddj.com...) Oh, I see. You are talking about doing substring searches using suffix trees... Frankly, I'm not sure why you would bother to do that in Perl where index() and rindex() essentially do it for you. :-)
Update: To be fair, there are some nifty things you can do with suffix trees that aren't as easy as using index(). That doesn't necessarily mean that using suffix trees is the best way to do them in Perl though. Do you have a specific application in mind or were you just hoping to play with suffix trees in Perl?
-sauoq
"My two cents aren't worth a dime.";
| [reply] |
Re: suffix arrays
by jonadab (Parson) on Jul 19, 2003 at 14:45 UTC
|
I took a shot at doing this, based on the DDJ code
This will generally not work very well. You asked for
wisdom, so here is some: Perl is not C. What is fast
in C may be slow in Perl, and vice versa, but even more
importantly, many "algorithms" designed for C are
designed to take detailed control over minutia. In
many cases they are not so much algorithms as
implementations, or perhaps (as in this case) something
of each.
I generally have misgivings when I see people trying to
do line-for-line, function-for-function adaptations
from C to Perl. It's fundamentally a wrong approach
for anything more complex than Hello World.
I have even stronger misgivings about taking code that
uses pointers in C and adapting it to use references
in Perl. While it's true that Perl's references are
conceptually analogous to C's pointers, it is NOT true
that what is done with pointers in C should usually be
done with references in Perl. Occasionally, yes, but
not usually and certainly not always -- and probably
not in this case. A lot of the things that are done
in C with pointers are fiddly low-level details that
in Perl are best left to the optimizer to sort out.
Once you start down the path of trying to do a
line-for-line translation, you will end up using
hashes for structs, using references for pointers,
implementing a linked-list structure with a
$member{next} reference in each anonymous hash,
instead of just using a list for crying out loud.
This is an extreme example, but it demonstrates the
point: structure-for-structure translations from
C to Perl can be very suboptimal, both in terms of
performance and programmer time.
All of that is to say, perhaps you would get further
by asking what is the best or fastest way to find the
longest repeated substrings of a string in Perl. I
personally don't know the answer to this question, but
someone on Perlmonks might, and you're more likely to
get a satisfactory answer to that one than the question
that you posted, IMO.
Quidquid latine dictum sit altum viditur.
| [reply] |
|
|