Re: nested combinations: algorithm advice?

You want to compare each line to every other line and print only those with more than one term in common, right?

Assuming that's what you're looking for, you don't need to worry about combinations. Compare the first element with every other element. Then compare the second with every other element except the first. Repeat until complete.

Here's a possibly cleaner way to the same result. If your data is more complex (it probably is ;) you may need to tweak the regex.

#!/usr/bin/perl

use strict;
use warnings;

chomp (my @lines = <DATA>);
for my $i (0 .. $#lines) {
    (my $re = $lines[$i]) =~ s/_/|/g; # create $re outside the inner l
+oop
    for my $j ($i + 1 .. $#lines) {
        my $count = () = $lines[$j] =~ /$re/g;
        if ($count >= 2) {
            print "$lines[$i] and $lines[$j]\n";
        }
    }
}
__DATA__
one_two
one_three_two
three_one
one_four
four_three_one
[download]

Comment on Re: nested combinations: algorithm advice? Download Code

Replies are listed 'Best First'.
Re^2: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 22, 2004 at 17:11 UTC
Erg, teach me to start optimising in the most stupdi places. I have modded my code to loop like yours, and moved my hash creation outside the inner loop (what a dummy!) and ditched the sub (which was mostly there for clarity, anyway). Then we get: `my $out; for my $i (0 .. $#lines) { my %words1 = map { $_ => 1 } split /_/, $lines[$i]; for my $j($i + 1 .. $#lines) { if (1 < grep $words1{$_}, split /_/, $lines[$j]) { $out .= "$lines[$i] and $lines[$j]\n"; } } } $out;` [download] Faster! Benchmark: `Benchmark: running bmann, jasper, rev for at least 3 CPU seconds... bmann: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 88 +28.71/s (n=27987) jasper: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 10 +860.32/s (n=34210) rev: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 22 +06.94/s (n=6996)` [download]	[reply] [d/l] [select]
Re^2: nested combinations: algorithm advice? by revdiablo (Prior) on Sep 22, 2004 at 16:50 UTC
Excellent! I thought there might be a nicer way than the brute-force combinations. Also, using regexes instead of looping on the word lists is nice -- let the regex engine do the dirty work. Thanks for the reply, I will probably steal your ideas. :)	[reply]
Re^3: nested combinations: algorithm advice? by bmann (Priest) on Sep 22, 2004 at 18:31 UTC
Glad you liked it. One caveat though - craft your regex with care if you use this (or use a hash like jasper's version below, which doesn't suffer from this problem). The regex I used was naive in that it doesn't detect the boundary between the words. Add "twop_threep" to DATA, you'll see a false positive.	[reply]


Perl-Sensitive Sunglasses
	PerlMonks