Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?
 
PerlMonks  

nested combinations: algorithm advice?

by revdiablo (Prior)
on Sep 22, 2004 at 07:24 UTC ( [id://392861]=perlquestion: print w/replies, xml ) Need Help??

revdiablo has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks. I have a pretty simple problem, which I was able to solve without much difficulty. The algorithm I'm using seems a bit clunky, though. I wonder if there might be a cleaner way to do this, or perhaps simply a few tweaks on what I've got.

Update: oops! Apparently I forgot to actually explain anything about the problem. Sorry guys. Jasper's guess was correct -- the code tries to find any combinations of lines that have two or more words in common. The purpose is finding duplicates in a list of names that could have their First/Last and Last/First order rearranged, but also could have the Middle name, or any other mess of additional stuff.

Without further adieu ado, here it is:

#!/usr/bin/perl use strict; use warnings; chomp(my @lines = <DATA>); my @words = map [ split /_/ ], @lines; my %matches; my $comb = combinations( 0 .. $#words ); while (my @comb = $comb->()) { next unless @comb == 2; my ($i, $j) = @comb; for my $wi (@{$words[$i]}) { for my $wj (@{$words[$j]}) { $matches{"$i.$j"}++ if $wi eq $wj; } } } for (grep $matches{$_} > 1, keys %matches) { my ($i, $j) = split /\./; print "$lines[$i] and $lines[$j]\n"; } # from 197008 sub combinations { my @list = @_; my @pick = (0) x @list; return sub { my $i = 0; while (1 < ++$pick[$i]) { $pick[$i] = 0; return if $#pick < ++$i; } return @list[ grep $pick[$_], 0..$#pick ]; }; } __DATA__ one_two one_three_two three_one one_four four_three_one

My initial revision had 4 nested for loops, but I was able to use tye's simple combinations subroutine from (tye)Re2: Finding all Combinations to reduce it to just three. I couldn't think of a nice way to use combinations on the 2nd set of loops, though. Perhaps I'm missing something obvious?

Any ideas will be greatly appreciated.

Replies are listed 'Best First'.
Re: nested combinations: algorithm advice?
by Jasper (Chaplain) on Sep 22, 2004 at 09:31 UTC
    Like Ted said, it's good to tell us what the code is supposed to do, instead of making us guess. Having said that, I did make a guess - the code seems to me to pick unique combinations of words where there are more than one matches?
    This code produces the same results, anyway, and seems to be quicker. Less looping. I don't know what would happen if you had one_one_one and one_one_one, though.
    chomp(my @lines = <DATA>); my %seen; for my $line1 (@lines) { for my $line2 (grep $_ ne $line1, @lines) { if (1 < matches($line1, $line2)) { $seen{ join ' and ', sort $line1, $line2 } = 1; } } } print "$_\n" for keys %seen; sub matches { my ($line1, $line2) = @_; my %words1 = map {$_ => 1} split /_/, $line1; my $matches = 0; $matches += ($words1{$_}||0) for split /_/, $line2; return $matches; }

    I'm probably missing something obvious..
    edit: s/much quicker/quicker/ :) I made a mistake in my benchmarking.

      Thanks for the reply, and good guess! I've updated my original post to explain a bit more about what I'm trying to do.

      I like your idea, but note one thing about your solution -- it compares lines to eachother twice. When it's on line 1, it compares to line 2, and when it's on line 2, it compares to line 1 (even though it already has). That's why tye's combinations sub is really nice, it cleanly eliminates that problem. I'm not sure if using combinations is faster or slower (intuition tells me it would be faster, since it's comparing less, but perhaps there's some overhead getting in the way), but the duplication was driving me crazy.

        Yes, I tried doing the seen sort join thing before the matches thereby avoiding the 2 - 1, 1 - 2 overhead, but with the data given, there was no benefit (in fact it was slower).

        I did benchmark your code against mine, and on the DATA, mine was about 40% faster. Possibly on a much more extensive set of data, the difference wouldn't be so great.
Re: nested combinations: algorithm advice?
by BrowserUk (Patriarch) on Sep 22, 2004 at 09:46 UTC
    Without further adieu, ...

    Please! More adieu?

    Update: My guess as to the relationship between the inputs and the outputs is:

    Produce 1 random pairing of the inputs for each input; where the pairings contain at least one sub element in common?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      Produce 1 random pairing of the inputs for each input; where the pairings contain at least one sub element in common?

      It's good, but it's not right</roy walker>

      I don't think there's anything random about it at all. I've added some more lines into the DATA, and still think my guess is right.

      edit: after a bit of googling for Roy Walker, I changed from "it's close, but it's not right." I really think I should have more important things to worry about.
Re: nested combinations: algorithm advice?
by bmann (Priest) on Sep 22, 2004 at 15:58 UTC
    You want to compare each line to every other line and print only those with more than one term in common, right?

    Assuming that's what you're looking for, you don't need to worry about combinations. Compare the first element with every other element. Then compare the second with every other element except the first. Repeat until complete.

    Here's a possibly cleaner way to the same result. If your data is more complex (it probably is ;) you may need to tweak the regex.

    #!/usr/bin/perl use strict; use warnings; chomp (my @lines = <DATA>); for my $i (0 .. $#lines) { (my $re = $lines[$i]) =~ s/_/|/g; # create $re outside the inner l +oop for my $j ($i + 1 .. $#lines) { my $count = () = $lines[$j] =~ /$re/g; if ($count >= 2) { print "$lines[$i] and $lines[$j]\n"; } } } __DATA__ one_two one_three_two three_one one_four four_three_one
      Erg, teach me to start optimising in the most stupdi places. I have modded my code to loop like yours, and moved my hash creation outside the inner loop (what a dummy!) and ditched the sub (which was mostly there for clarity, anyway). Then we get:
      my $out; for my $i (0 .. $#lines) { my %words1 = map { $_ => 1 } split /_/, $lines[$i]; for my $j($i + 1 .. $#lines) { if (1 < grep $words1{$_}, split /_/, $lines[$j]) { $out .= "$lines[$i] and $lines[$j]\n"; } } } $out;


      Faster! Benchmark:
      Benchmark: running bmann, jasper, rev for at least 3 CPU seconds... bmann: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 88 +28.71/s (n=27987) jasper: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 10 +860.32/s (n=34210) rev: 3 wallclock secs ( 3.17 usr + 0.00 sys = 3.17 CPU) @ 22 +06.94/s (n=6996)

      Excellent! I thought there might be a nicer way than the brute-force combinations. Also, using regexes instead of looping on the word lists is nice -- let the regex engine do the dirty work. Thanks for the reply, I will probably steal your ideas. :)

        Glad you liked it. One caveat though - craft your regex with care if you use this (or use a hash like jasper's version below, which doesn't suffer from this problem). The regex I used was naive in that it doesn't detect the boundary between the words. Add "twop_threep" to DATA, you'll see a false positive.
Re: nested combinations: algorithm advice?
by BrowserUk (Patriarch) on Sep 22, 2004 at 19:00 UTC

    If you have thousands of lines to process this may be quicker than some other approaches. It will process 100,000 lines (of 2 to 4 elements each) in under 4 seconds.

    I won't attempt to categorise the algorithm because it relies on using hashes for lookups, and there is always the argument about whether they are O(1) or 0(logN).

    Update: On emperical evidence this is O(N) (at ~4secs/100,000) upto the limit of memory.

    #! perl -slw use strict; our $LINES ||= 1000; ## Gen some test data to a simulated file my @a = qw[ one two three four five six seven eight nine ];; my @lines = map{ join '_', @a[ map{ rand @a } 1 .. 2+rand 3 ] } 1 .. $ +LINES; warn "Processing " . @lines . " start: " . localtime; ## Index the lines my %index; for my $idx ( 0 .. $#lines ) { $index{ $_ }{ $idx }++ for split '_', $lines[ $idx ]; } my @keys = keys %index; while( my $first = pop @keys ) { for my $second ( @keys ) { print "$first/$second: ", join "\n\t", @lines[ grep{ exists $index{ $second }{ $_ } } keys %{ $index{ $first } } ]; } } warn "Processing " . @lines . " stop: " . localtime; __END__ P:\test>392861 -LINES=100000 >nul Processing 100000 start: Wed Sep 22 19:58:46 2004 at P:\test\392861.pl + line 10. Processing 100000 stop: Wed Sep 22 19:58:50 2004 at P:\test\392861.pl + line 31.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: nested combinations: algorithm advice?
by TedPride (Priest) on Sep 22, 2004 at 07:56 UTC
    What's your algorithm supposed to do? It's easier for us if we don't have to guess :)
Re: nested combinations: algorithm advice?
by perlfan (Vicar) on Sep 22, 2004 at 13:48 UTC
    2 nested loops is bad, much less 4. I assure you that rethinking the problem with the proper data structure will make the iterative complexity much less. Since it looks like you are doing something with permuations, check out How do I permute N elements of a list?

    Provide a description of your problem, and like the monks above said, we may be able to help a bit more...

      Forward: I hope this reply isn't taken the wrong way. I'm not trying to flame, just trying to further explain things. I tried to be as civil as possible, and if I've failed, I'm sure I'll know it by all the downvotes. 8^)

      I assure you that rethinking the problem with the proper data structure will make the iterative complexity much less.

      Rethinking the problem is why I posted a question in the first place. I needed a fresh perspective on things. If you have any ideas about what data structure would help, I'm all ears.

      Since it looks like you are doing something with permuations

      Actually, I'm working with combinations. And only the 2-element combinations, at that. For all I know, there may be another more appropriate term to describe these things. From my understanding, though, "permutations" is not it. Update: there's a nice description of the difference at Iterating over combinations.

      Provide a description of your problem, and like the monks above said, we may be able to help a bit more

      I apologize for this. I posted the original question in haste, and apparently forgot to mention a few important things.

Re: nested combinations: algorithm advice?
by Limbic~Region (Chancellor) on Sep 22, 2004 at 19:21 UTC
    revdiablo,
    My idea was:
    • For each word, keep track of what line it appears on
    • For each line, iterate over pairs of words (notice I created my own combination iterator)
    • For each each pair, get the intersection of lines
    • If the intersection was more than 1 line, lookup the lines and print them
    #!/usr/bin/perl use strict; use warnings; my (%word, %seen); chomp ( my @line = <DATA> ); for my $index ( 0 .. $#line ) { $word{ $_ }{ $index } = undef for split /_/ , $line[ $index ]; } for ( @line ) { my $iter = by_two( $_ ); while ( my @comb = $iter->() ) { my @matches = map { exists $word{ $comb[ 0 ] }{ $_ } ? $_ : () + } keys %{ $word{ $comb[ 1 ] } } ; next if @matches < 2; my $output = join ' and ' , map { $line[ $_ ] } sort { $a <=> +$b } @matches; next if $seen{ $output }++; print "$output\n"; } } sub by_two { my @list = split /_/ , shift; return sub { () } if @list < 2; my ($start, $stop, $pos, $done) = (0, $#list, 0, undef); return sub { return () if $done; $pos++; if ( $pos > $stop ) { $start++; $pos = $start + 1; } $done = 1 if $start == $stop - 1; return $list[ $start ], $list[ $pos ]; } } __DATA__ one_two one_three_two three_one one_four four_three_one
    You will notice I have 3 lines of output instead of 5. That is because instead of breaking 3 matches into pairs, I put all 3 on the same line. If you wanted to force the pair issue you could do so using the by_two iterator routine. Finally, it likely could be made more efficient - but hey, I am on a coding hiatus ATM.

    Cheers - L~R

Re: nested combinations: algorithm advice?
by Anonymous Monk on Sep 22, 2004 at 16:10 UTC
    The word you want is ado, not "adieu". "Adieu" is French for "farewell" (and literally translates to "to/for God").
Re: nested combinations: algorithm advice?
by Pragma (Scribe) on Sep 23, 2004 at 05:19 UTC
    #!/usr/bin/perl -lanF_ "@F"=~$r[$_-1]&&print"$.,$_"for 1..@r;$_="{@F}.+{@F}";y/ /,/;push@r,jo +in'|',glob
      #!/usr/bin/perl -lapF_ for$p(@p){$;.="$_ and $p "if 1<grep"@F"=~$_,split/_/,$p}push@p,$_}{$_=$
        What?! No glob? Bah! Ameteur... :-)

        ps. I believe you are missing a semicolon.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://392861]
Approved by davido
Front-paged by Courage
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (3)
As of 2024-04-19 01:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found