suffix arrays

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: suffix arrays by Zaxo (Archbishop) on Jul 18, 2003 at 20:59 UTC
A reference to substr's return value has the properties you want, `my $foo = "A string to work on."; my @substrs = map {\substr $foo, $_} 0 .. length($foo) - 1; my @sorted = sort { $$a cmp $$b } @substrs;` [download] That the reference is to the original string is shown by: `$ perl -e '$foo = "A short string\n"; $bar = \substr $foo, 0, 1; $$bar + = "The"; print $foo' The short string $` [download] After Compline, Zaxo	[reply] [d/l] [select]
Re: Re: suffix arrays by BrowserUk (Patriarch) on Jul 18, 2003 at 21:17 UTC
Unfortunately, as of 5.8.0, that still doesn't work properly. `perl> print join "\n", map{ \substr 'The quick brown fox', $_ } 0 .. +19 LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4) LVALUE(0x15d7cf4)` [download] As you can see, each time you take an Lvalue from substr, it re-uses the same address. So you end up with and array of pointers to the last char in the string. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller	[reply] [d/l]
Re: Re: Re: suffix arrays by Zaxo (Archbishop) on Jul 18, 2003 at 21:53 UTC
Hmmm... In 5.8.1-RC2, `perl -e'my $foo = "A string"; my @foo = map{\substr $foo, $_} 0..7; pr +int map {$$_,$/} @foo;' A string string string tring ring ing ng g $` [download] 5.8.1 does more of what I mean! After Compline, Zaxo	[reply] [d/l]
Re: Re: Re: Re: suffix arrays by BrowserUk (Patriarch) on Jul 18, 2003 at 22:13 UTC
Re: suffix arrays by BrowserUk (Patriarch) on Jul 18, 2003 at 21:27 UTC
You can get away with using the Lvalues from substr in the array as suggested by Zaxo above, but you have to jump through a few hoops to prevent the problem I alluded to. Basically, there seems to be a single slot allocated for all the Lvalues generated by any given statement that includes a substr. That's a clumsy way to put it, but I can't think of a better way to say it. To work around the limitation, it is necessary to use a 'different' substr each time, hence the need for the eval in this code. The downside is that the eval markly reduces the efficiency. The algorithm works fine though and this implementation shoudl be a bit quicker than yours. #! perl -slw use strict; sub ATOa{ map{ eval '\substr( $_[0], $_ )' } 0 .. length( $_[0] ); } my @Aoa = sort{ $$a cmp $$b } ATOa $ARGV[0]; #print $$_ for @Aoa; my $max = ''; for my $i ( 0 .. ($#Aoa-1) ) { my $p = 0; ++$p while substr( ${$Aoa[ $i ]}, 0, $p ) eq substr( ${$Aoa[ $i+1 ]}, 0, $p ); $max = substr( ${$Aoa[ $i ]}, 0, $p-1 ) if length $max < $p; } print "The longest common substring in\n'$ARGV[0]'\n is '$max'"; __DATA__ P:\test>test "the infernal invention was the source or eternal facinat +ion for the qualified professional and enthusiastic amateur alike" The longest common substring in 'the infernal invention was the source of eternal facination for the q +ualified professional and enthusiastic amateur alike' is 'ernal ' P:\test>test "Four and twenty virgins came down from Inverness. And wh +en they went back again there were four and twenty less" The longest common substring in 'Four and twenty virgins came down from Inverness. And when they went +back again there were four and twenty less' is 'our and twenty ' [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller	[reply] [d/l]
Re: Re: suffix arrays by sauoq (Abbot) on Jul 19, 2003 at 00:15 UTC
The downside is that the eval markly reduces the efficiency. So, just do one `eval()` . . . `sub ATOa { my $code = '(' . join( ',', map { "\\substr(\$_[0], $_)" } 0 .. length( +$_[0] ) ) . ')'; return eval $code; }` [download] :-) -sauoq "My two cents aren't worth a dime.";	[reply] [d/l]
Re: Re: suffix arrays by cees (Curate) on Jul 19, 2003 at 06:20 UTC
You can improve the performance slightly by only looking for matching strings larger than the current maximum. `my $p = length($max);` [download] - Cees	[reply] [d/l]
Re: suffix arrays by sauoq (Abbot) on Jul 18, 2003 at 20:54 UTC
I'm a little confused about what you are asking for. You want to search a set of strings and try to find common suffixes? Your implementation doesn't seem to do that. Your implementation also isn't very efficiently written. Your sort of `@arr`, for instance, could benefit from a Schwartzian Transform. Looking into my crystal ball... I have a feeling hashes will play a part in your answer. Update: (After some research at ddj.com...) Oh, I see. You are talking about doing substring searches using suffix trees... Frankly, I'm not sure why you would bother to do that in Perl where `index()` and `rindex()` essentially do it for you. :-) Update: To be fair, there are some nifty things you can do with suffix trees that aren't as easy as using `index()`. That doesn't necessarily mean that using suffix trees is the best way to do them in Perl though. Do you have a specific application in mind or were you just hoping to play with suffix trees in Perl? -sauoq "My two cents aren't worth a dime.";	[reply]
Re: suffix arrays by jonadab (Parson) on Jul 19, 2003 at 14:45 UTC
I took a shot at doing this, based on the DDJ code This will generally not work very well. You asked for wisdom, so here is some: Perl is not C. What is fast in C may be slow in Perl, and vice versa, but even more importantly, many "algorithms" designed for C are designed to take detailed control over minutia. In many cases they are not so much algorithms as implementations, or perhaps (as in this case) something of each. I generally have misgivings when I see people trying to do line-for-line, function-for-function adaptations from C to Perl. It's fundamentally a wrong approach for anything more complex than Hello World. I have even stronger misgivings about taking code that uses pointers in C and adapting it to use references in Perl. While it's true that Perl's references are conceptually analogous to C's pointers, it is NOT true that what is done with pointers in C should usually be done with references in Perl. Occasionally, yes, but not usually and certainly not always -- and probably not in this case. A lot of the things that are done in C with pointers are fiddly low-level details that in Perl are best left to the optimizer to sort out. Once you start down the path of trying to do a line-for-line translation, you will end up using hashes for structs, using references for pointers, implementing a linked-list structure with a $member{next} reference in each anonymous hash, instead of just using a list for crying out loud. This is an extreme example, but it demonstrates the point: structure-for-structure translations from C to Perl can be very suboptimal, both in terms of performance and programmer time. All of that is to say, perhaps you would get further by asking what is the best or fastest way to find the longest repeated substrings of a string in Perl. I personally don't know the answer to this question, but someone on Perlmonks might, and you're more likely to get a satisfactory answer to that one than the question that you posted, IMO. Quidquid latine dictum sit altum viditur.	[reply]


XP is just a number
	PerlMonks