nested combinations: algorithm advice?

revdiablo has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 22, 2004 at 09:31 UTC
Like Ted said, it's good to tell us what the code is supposed to do, instead of making us guess. Having said that, I did make a guess - the code seems to me to pick unique combinations of words where there are more than one matches? This code produces the same results, anyway, and seems to be quicker. Less looping. I don't know what would happen if you had one_one_one and one_one_one, though. `chomp(my @lines = <DATA>); my %seen; for my $line1 (@lines) { for my $line2 (grep $_ ne $line1, @lines) { if (1 < matches($line1, $line2)) { $seen{ join ' and ', sort $line1, $line2 } = 1; } } } print "$_\n" for keys %seen; sub matches { my ($line1, $line2) = @_; my %words1 = map {$_ => 1} split /_/, $line1; my $matches = 0; $matches += ($words1{$_}\|\|0) for split /_/, $line2; return $matches; }` [download] I'm probably missing something obvious.. edit: s/much quicker/quicker/ :) I made a mistake in my benchmarking.	[reply] [d/l]
Re^2: nested combinations: algorithm advice? by revdiablo (Prior) on Sep 22, 2004 at 16:44 UTC
Thanks for the reply, and good guess! I've updated my original post to explain a bit more about what I'm trying to do. I like your idea, but note one thing about your solution -- it compares lines to eachother twice. When it's on line 1, it compares to line 2, and when it's on line 2, it compares to line 1 (even though it already has). That's why tye's `combinations` sub is really nice, it cleanly eliminates that problem. I'm not sure if using `combinations` is faster or slower (intuition tells me it would be faster, since it's comparing less, but perhaps there's some overhead getting in the way), but the duplication was driving me crazy.	[reply] [d/l] [select]
Re^3: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 22, 2004 at 16:54 UTC
Yes, I tried doing the seen sort join thing before the matches thereby avoiding the 2 - 1, 1 - 2 overhead, but with the data given, there was no benefit (in fact it was slower). I did benchmark your code against mine, and on the DATA, mine was about 40% faster. Possibly on a much more extensive set of data, the difference wouldn't be so great.	[reply]
Re^4: nested combinations: algorithm advice? by revdiablo (Prior) on Sep 22, 2004 at 17:07 UTC
Re: nested combinations: algorithm advice? by BrowserUk (Patriarch) on Sep 22, 2004 at 09:46 UTC
Without further adieu, ... Please! More adieu? Update: My guess as to the relationship between the inputs and the outputs is: Produce 1 random pairing of the inputs for each input; where the pairings contain at least one sub element in common? Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply]
Re^2: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 22, 2004 at 10:50 UTC
Produce 1 random pairing of the inputs for each input; where the pairings contain at least one sub element in common? It's good, but it's not right</roy walker> I don't think there's anything random about it at all. I've added some more lines into the DATA, and still think my guess is right. edit: after a bit of googling for Roy Walker, I changed from "it's close, but it's not right." I really think I should have more important things to worry about.	[reply]
Re: nested combinations: algorithm advice? by bmann (Priest) on Sep 22, 2004 at 15:58 UTC
You want to compare each line to every other line and print only those with more than one term in common, right? Assuming that's what you're looking for, you don't need to worry about combinations. Compare the first element with every other element. Then compare the second with every other element except the first. Repeat until complete. Here's a possibly cleaner way to the same result. If your data is more complex (it probably is ;) you may need to tweak the regex. `#!/usr/bin/perl use strict; use warnings; chomp (my @lines = <DATA>); for my $i (0 .. $#lines) { (my $re = $lines[$i]) =~ s/_/\|/g; # create $re outside the inner l +oop for my $j ($i + 1 .. $#lines) { my $count = () = $lines[$j] =~ /$re/g; if ($count >= 2) { print "$lines[$i] and $lines[$j]\n"; } } } __DATA__ one_two one_three_two three_one one_four four_three_one` [download]	[reply] [d/l]
Re^2: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 22, 2004 at 17:11 UTC
Erg, teach me to start optimising in the most stupdi places. I have modded my code to loop like yours, and moved my hash creation outside the inner loop (what a dummy!) and ditched the sub (which was mostly there for clarity, anyway). Then we get: `my $out; for my $i (0 .. $#lines) { my %words1 = map { $_ => 1 } split /_/, $lines[$i]; for my $j($i + 1 .. $#lines) { if (1 < grep $words1{$_}, split /_/, $lines[$j]) { $out .= "$lines[$i] and $lines[$j]\n"; } } } $out;` [download] Faster! Benchmark: `Benchmark: running bmann, jasper, rev for at least 3 CPU seconds... bmann: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 88 +28.71/s (n=27987) jasper: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 10 +860.32/s (n=34210) rev: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 22 +06.94/s (n=6996)` [download]	[reply] [d/l] [select]
Re^2: nested combinations: algorithm advice? by revdiablo (Prior) on Sep 22, 2004 at 16:50 UTC
Excellent! I thought there might be a nicer way than the brute-force combinations. Also, using regexes instead of looping on the word lists is nice -- let the regex engine do the dirty work. Thanks for the reply, I will probably steal your ideas. :)	[reply]
Re^3: nested combinations: algorithm advice? by bmann (Priest) on Sep 22, 2004 at 18:31 UTC
Glad you liked it. One caveat though - craft your regex with care if you use this (or use a hash like jasper's version below, which doesn't suffer from this problem). The regex I used was naive in that it doesn't detect the boundary between the words. Add "twop_threep" to DATA, you'll see a false positive.	[reply]
Re: nested combinations: algorithm advice? by BrowserUk (Patriarch) on Sep 22, 2004 at 19:00 UTC
If you have thousands of lines to process this may be quicker than some other approaches. It will process 100,000 lines (of 2 to 4 elements each) in under 4 seconds. I won't attempt to categorise the algorithm because it relies on using hashes for lookups, and there is always the argument about whether they are O(1) or 0(logN). Update: On emperical evidence this is O(N) (at ~4secs/100,000) upto the limit of memory. #! perl -slw use strict; our $LINES \|\|= 1000; ## Gen some test data to a simulated file my @a = qw[ one two three four five six seven eight nine ];; my @lines = map{ join '_', @a[ map{ rand @a } 1 .. 2+rand 3 ] } 1 .. $ +LINES; warn "Processing " . @lines . " start: " . localtime; ## Index the lines my %index; for my $idx ( 0 .. $#lines ) { $index{ $_ }{ $idx }++ for split '_', $lines[ $idx ]; } my @keys = keys %index; while( my $first = pop @keys ) { for my $second ( @keys ) { print "$first/$second: ", join "\n\t", @lines[ grep{ exists $index{ $second }{ $_ } } keys %{ $index{ $first } } ]; } } warn "Processing " . @lines . " stop: " . localtime; __END__ P:\test>392861 -LINES=100000 >nul Processing 100000 start: Wed Sep 22 19:58:46 2004 at P:\test\392861.pl + line 10. Processing 100000 stop: Wed Sep 22 19:58:50 2004 at P:\test\392861.pl + line 31. [download] Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon	[reply] [d/l]
Re: nested combinations: algorithm advice? by TedPride (Priest) on Sep 22, 2004 at 07:56 UTC
What's your algorithm supposed to do? It's easier for us if we don't have to guess :)	[reply]
Re: nested combinations: algorithm advice? by perlfan (Vicar) on Sep 22, 2004 at 13:48 UTC
2 nested loops is bad, much less 4. I assure you that rethinking the problem with the proper data structure will make the iterative complexity much less. Since it looks like you are doing something with permuations, check out How do I permute N elements of a list? Provide a description of your problem, and like the monks above said, we may be able to help a bit more...	[reply]
Re^2: nested combinations: algorithm advice? by revdiablo (Prior) on Sep 22, 2004 at 17:39 UTC
Forward: I hope this reply isn't taken the wrong way. I'm not trying to flame, just trying to further explain things. I tried to be as civil as possible, and if I've failed, I'm sure I'll know it by all the downvotes. 8^) I assure you that rethinking the problem with the proper data structure will make the iterative complexity much less. Rethinking the problem is why I posted a question in the first place. I needed a fresh perspective on things. If you have any ideas about what data structure would help, I'm all ears. Since it looks like you are doing something with permuations Actually, I'm working with combinations. And only the 2-element combinations, at that. For all I know, there may be another more appropriate term to describe these things. From my understanding, though, "permutations" is not it. Update: there's a nice description of the difference at Iterating over combinations. Provide a description of your problem, and like the monks above said, we may be able to help a bit more I apologize for this. I posted the original question in haste, and apparently forgot to mention a few important things.	[reply]
Re: nested combinations: algorithm advice? by Limbic~Region (Chancellor) on Sep 22, 2004 at 19:21 UTC
revdiablo, My idea was: For each word, keep track of what line it appears on For each line, iterate over pairs of words (notice I created my own combination iterator) For each each pair, get the intersection of lines If the intersection was more than 1 line, lookup the lines and print them #!/usr/bin/perl use strict; use warnings; my (%word, %seen); chomp ( my @line = <DATA> ); for my $index ( 0 .. $#line ) { $word{ $_ }{ $index } = undef for split /_/ , $line[ $index ]; } for ( @line ) { my $iter = by_two( $_ ); while ( my @comb = $iter->() ) { my @matches = map { exists $word{ $comb[ 0 ] }{ $_ } ? $_ : () + } keys %{ $word{ $comb[ 1 ] } } ; next if @matches < 2; my $output = join ' and ' , map { $line[ $_ ] } sort { $a <=> +$b } @matches; next if $seen{ $output }++; print "$output\n"; } } sub by_two { my @list = split /_/ , shift; return sub { () } if @list < 2; my ($start, $stop, $pos, $done) = (0, $#list, 0, undef); return sub { return () if $done; $pos++; if ( $pos > $stop ) { $start++; $pos = $start + 1; } $done = 1 if $start == $stop - 1; return $list[ $start ], $list[ $pos ]; } } __DATA__ one_two one_three_two three_one one_four four_three_one [download] You will notice I have 3 lines of output instead of 5. That is because instead of breaking 3 matches into pairs, I put all 3 on the same line. If you wanted to force the pair issue you could do so using the by_two iterator routine. Finally, it likely could be made more efficient - but hey, I am on a coding hiatus ATM. Cheers - L~R	[reply] [d/l]
Re: nested combinations: algorithm advice? by Anonymous Monk on Sep 22, 2004 at 16:10 UTC
The word you want is ado, not "adieu". "Adieu" is French for "farewell" (and literally translates to "to/for God").	[reply]
Re: nested combinations: algorithm advice? by Pragma (Scribe) on Sep 23, 2004 at 05:19 UTC
`#!/usr/bin/perl -lanF_ "@F"=~$r[$_-1]&&print"$.,$_"for 1..@r;$_="{@F}.+{@F}";y/ /,/;push@r,jo +in'\|',glob` [download]	[reply] [d/l]
Re^2: nested combinations: algorithm advice? by Jasper (Chaplain) on Sep 23, 2004 at 09:44 UTC
`#!/usr/bin/perl -lapF_ for$p(@p){$;.="$_ and $p "if 1<grep"@F"=~$_,split/_/,$p}push@p,$_}{$_=$` [download]	[reply] [d/l]
Re^3: nested combinations: algorithm advice? by Pragma (Scribe) on Sep 24, 2004 at 07:57 UTC
What?! No glob? Bah! Ameteur... :-) ps. I believe you are missing a semicolon.	[reply]