Common Substrings

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Common Substrings by Corion (Patriarch) on Nov 15, 2005 at 10:23 UTC
How about `sub common { if ( $_[0] eq $_[1] ) { return( $_[0], "", "" ); } else { use bytes; my $x = $_[0] ^ $_[1]; $x =~ m!^(\x00)!; my $pos = length $1; my ( $len0, $len1 ) = ( length($_[0])-$pos, length($_[1])-$pos ) +; return( unpack("a${pos}", $_[0]), unpack("x${pos}a", $_[0]), unpack("x${pos}a*", $_[1]), ); } }` [download] That way you scan the two strings, but you do it in C instead of doing it in Perl.	[reply] [d/l]
Re^2: Common Substrings by Anonymous Monk on Nov 15, 2005 at 10:31 UTC
Nice use of xor! And should be better than my loop. Still, my concern is breaking multibyte characters.	[reply]
Re: Common Substrings by salva (Canon) on Nov 15, 2005 at 10:48 UTC
and this may be the most efficient way to do it in perl: `sub common { my ($a, $b) = @_; my $min = 0; my $max = length $a < length $b ? length $a : length $b; while ($min < $max) { my $h = ($max - $min + 1) >> 1; my $c = substr($a, $min, $h); my $d = substr($b, $min, $h); if ($c eq $d) { $min += $h; } else { $max = $min + $h - 1; } } return (substr($a, 0, $min), substr($a, $min), substr($b, $min)); }` [download]	[reply] [d/l]
Re^2: Common Substrings by Anonymous Monk on Nov 15, 2005 at 11:19 UTC
Thanks! It took me a bit to understand that you were using a bisection method. But I agree that it should be reasonably efficient. I just have a tendency to avoid substr() since it always feels like a very expensive call. The strings that I am spliting are paths, and because my sub then looks for the place of a slash, which should be 8 bits, I think that it is safe for me to use a binary method.	[reply]
Re^3: Common Substrings by salva (Canon) on Nov 15, 2005 at 11:51 UTC
I just have a tendency to avoid substr() since it always feels like a very expensive call. `substr` is actually quite cheap as it doesn't copy the char data from the the original string, it just makes an alias to it. The strings that I am spliting are paths, and because my sub then looks for the place of a slash, which should be 8 bits I don't think so, on UTF8 strings byte offsets and char offsets can be different!	[reply] [d/l]
Re: Common Substrings by salva (Canon) on Nov 15, 2005 at 10:27 UTC
you should use `substr` instead of `vec` to extract characters from a string: `sub common { my ($a, $b)=@_; return ($a, '', '') if $a eq $b; for (my $i=0;;$i++) { my $c=substr($a, $i, 1); my $d=substr($b, $i, 1); return (substr($a, 0, $i), $c, $d) if $c ne $d; } }` [download]	[reply] [d/l] [select]
Re: Common Substrings by Moron (Curate) on Nov 15, 2005 at 14:04 UTC
A.k.a. "how to handle a non-nested concurrent loop"... `sub common{ my @x = split( "", shift()); my $result = ''; my $x = shift @x; for my $y ( split( "", shift()) ) { ( $x eq $y ) or last; $result .= $y; $x = shift @x or last; } return $result; }` [download] -M Free your mind	[reply] [d/l]
Re: Common Substrings by blazar (Canon) on Nov 15, 2005 at 11:46 UTC
Well, here's a regexp based approach. Of course if you know in advance that a certain charachter or substring won't ever be in your data, you can avoid the `$sep` machinery: `sub common { my $sep=''; $sep \|= $_ for map /:+/g, @_; $sep="<$sep:>"; local $_=join $sep, @_; /^(.)(.)$sep\1(.)/; }` [download] Update:* I hadn't noticed that you were focusing on efficiency. And from that POV I'm quite confident that this solution won't score very well. But as far as your concern about your code looking like C goes, this is certainly more perlish. Incidentally I would have used myself an approach like Corion's one but probably matching on `/[^\0]/`, using pos and substr instead of unpack.	[reply] [d/l] [select]
Re: Common Substrings by Anonymous Monk on Nov 15, 2005 at 15:09 UTC
After some benchmarking... I will use the xor approach (fast!). I stand corrected about substr(), it is much faster than unpack, so I will also use it. Last but not least, using pos() will help as well (even if the tests were ambiguous between length() and pos(), to my mind the code gains in expressiveness). Thanks to you all! For completeness, I post my sub below. It is the backend of another that creates relative paths for a fixlinks util. After reading the chapter on unicode of the camel book, I conclude that this sub will do the rigth thing whatever locale, since Perl strings are either latin1 or utf8 encoded, and these are both ascii transparent, the requirement to look for a slash after deciding $pos does the trick. However, it is a shame that it cannot be generalized to a common substring function like the one I presented above. sub SLASH() { 47 } sub _common_path { if ( $_[0] eq $_[1] ) { return( $_[0], # all path components are common "", # nothing remains of first "", # nothing remains of second ); } else { use bytes; my ( $len0, $len1 ) = ( length($_[0]), length($_[1]) ); # find the offset of the first byte that differs my $pos = $_[0] ^ $_[1]; $pos =~ m/[^\x00]/g; $pos = pos($pos) - 1; # if some bytes are common but the last one wasn't the separator # we must decide which path components are common if ( $pos > 0 && vec($_[0], ($pos - 1), 8) != SLASH ) { # check if first path is just longer than the second if ( $pos == $len1 && vec($_[0], $pos, 8) == SLASH ) { $pos++; return( substr($_[0], 0, $pos), # common path with slash substr($_[0], $pos), # extra in first "", # nothing remains of second ); } # check if second path is just longer than the first if ( $pos == $len0 && vec($_[1], $pos, 8) == SLASH ) { $pos++; return( substr($_[1], 0, $pos), # common path with slash "", # nothing remains of first substr($_[1], $pos), # extra in second ); } # otherwise, rewind until last common path component while ( $pos > 0 ) { $pos--; if ( vec($_[0], $pos, 8) == SLASH ) { $pos++; # and keep the common slash last; } } } return( substr($_[0], 0, $pos), # common path components (with slash) substr($_[0], $pos), # extra in first substr($_[1], $pos), # extra in second ); } } [download]	[reply] [d/l]
Re^2: Common Substrings by salva (Canon) on Nov 15, 2005 at 15:32 UTC
The scalars returned by `substr` inside a block where `use bytes` holds, never have the utf8 flag set. For instance: `$ perl -de 1 ... DB<43> $a="\x{1234}/foo" DB<44> x ord substr $a, 0, 1 0 4660 DB<45> sub bsubstr { use bytes; substr $_[0], $_[1], $_[2] } DB<46> x ord bsubstr $a, 0, 1 0 225 DB<47> x ord bsubstr $a, 1, 1 0 136 DB<48> x ord bsubstr $a, 2, 1 0 180 DB<49> x ord bsubstr $a, 3, 1 0 47` [download]	[reply] [d/l] [select]
Re^3: Common Substrings by Anonymous Monk on Nov 15, 2005 at 15:39 UTC
Does that mean that I simply need `no bytes;` before returning the values?	[reply] [d/l]
Re^4: Common Substrings by salva (Canon) on Nov 15, 2005 at 16:00 UTC
Re: Common Substrings by robin (Chaplain) on Nov 15, 2005 at 21:15 UTC
This code; `sub common_path { my ($p, $q) = ("$_[0]/", "$_[1]/"); (my $pos = ($p ^ $q)) =~ /[^\0]/g; my $i = 1 + rindex($p, "/", pos($pos) - 2); return ( substr($p, 0, $i), substr($p, $i, -1), substr($q, $i, -1) ); }` [download] behaves the same as the last one you posted. It's more efficient as well as simpler (according to my benchmarks). I don't think there's any good reason to `use bytes` for this. Were you doing it for efficiency reasons?	[reply] [d/l]
Re^2: Common Substrings by Anonymous Monk on Nov 16, 2005 at 04:24 UTC
Thank you, that is what I was looking for! It only needs a test for equality at the beginning to behave the same. Otherwise there is no match and `pos()` will return `undef`, which will generate the relevant warning and a zero `$i`. (A test on the match could also resolve this.) Plus, doing it this way makes a separate routine to find the common path redundant for the purposes of generating relative paths, as there are no byte/character issues. In fact, the `use bytes` was only needed to determine `$len0` and `$len1`, and should have been enclosed: `do{use bytes;length $string;}` is recommended in the camel book. Some comments on what I meant by efficient... Needless to say that my program will spend most of its life busy with filesystem interaction, and there isn't much that can be done in this area. In fact, any improvement on the relative path generation may not even be noticed when doing a recursive: `fixlinks /` My first implementation used `split()` in a similar way to the answer given above by Moron. But I wasn't happy with the need to change the structure of the arguments data; i.e. if I receive scalars and compute another scalar, there must be a very good reason to transform the arguments into lists! In my opinion this is a reasonable concern, that is only about taking care in the solution to a given problem. (Maybe I should state that I have never been a C programmer.) Anyway, this got me thinking about the way `coreutils/lib/canonicalize.c` does things, and took me to a `use bytes` solution. If I wasn't under the wrong impression that `substr()` was expensive, I could have written a solution like the method by salva, and after time might have remembered bisection. (The irony being that one is generating a list of numbers that converge to the desired position.) Still, the xor approach by Corion is the best. As you say there is no good reason to `use bytes`, it is just a case (for me, very instructional) when a binary operation helps processing strings. Regards to All!	[reply] [d/l] [select]
Re: Common Substrings by jonadab (Parson) on Nov 15, 2005 at 12:32 UTC
The perlish way is to just `use LCSS;` and let somebody else worry about how it works exactly. As far as the "without using bytes", why does that matter? Is this some kind of contrived CS undergrad assignment or something? Whether the solution uses bytes or not is an irrelevant detail of the implementation, something you would not, in any normal situation, ever care about. Then you qualify your question with "that is efficient", but only a C programmer would ask that question without concrete evidence that the working implementation creates performance problems. Premature optimization is a root of all kinds of evil (for which some have strayed... and pierced themselves through with many sorrows). Get it working, get it working right, and only then worry about efficiency if it becomes a problem.	[reply] [d/l]
Re^2: Common Substrings by tphyahoo (Vicar) on Nov 15, 2005 at 17:06 UTC
What is LCCS? UPDATE: oh, Algorithm::LCSS	[reply]
Re: Common Substrings by Roy Johnson (Monsignor) on Nov 15, 2005 at 13:51 UTC
Update: I basically duplicated blazar's code. Deleted to save your time. Caution: Contents may have been coded under pressure.	[reply]
Re: Common Substrings by murugu (Curate) on Nov 15, 2005 at 15:15 UTC
Hi, Here is a brute force try.... `sub common { my ($a,$b)=@_; my ($long,$short); if (length ($a) <= length ($b)){ ($long,$short) = ($b,$a); } else { ($long,$short) = ($a,$b); } my $i=0; my $common=""; while ($i<length($short)){ my $str = substr($short,$i,1); my $t=$i; $i++; next unless ($long=~/$str/); while ($long=~/$str/) { $t++; last unless (my $uni = substr($short,$t,1)); last unless ($long=~m/$str$uni/); $str.=$uni; } $common = $str if (length($common)<length($str)); } return $common; }` [download] Regards, Murugesan Kandasamy use perl for(;;);	[reply] [d/l]


laziness, impatience, and hubris
	PerlMonks