Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

Finding Combinations of Pairs

by zod (Scribe)
on Jan 14, 2009 at 03:21 UTC ( [id://736135]=perlquestion: print w/replies, xml ) Need Help??

zod has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have a file that has a list of words on each line, let's say:

dog monkey cat
cat ball stone
monkey iron cat zoo

What I need to do is find out which two-word combinations (order does not matter) on each line are most common in the file.

So I need to generate all pair combinations for each line and then figure out which pair combo is the most common over all the lines.

So, the first line should generate:

dog monkey
dog cat
monkey cat

I figured I should dump each line into an array and then generate the pair combos for each array. So, I started with tye's Finding all Combinations node:

http://www.perlmonks.org/?node_id=128293

But, alas, I'm stuck already as I'm not sure how to generate only _pair_ combinations rather than every possible combo.

Can anyone point me in the right direction? My spidey sense tells me I should use a hash. But not sure where to start...

Thanks much,

Zod

Replies are listed 'Best First'.
Re: Finding Combinations of Pairs
by kyle (Abbot) on Jan 14, 2009 at 03:46 UTC

    I think your spidey sense serves you well.

    This smells a little like homework, so I'm putting my solution in spoiler tags.

    my %count_of; while (<DATA>) { chomp; # sort the words so I don't have "cat-dog" and "dog-cat" my @words = sort split; foreach my $i ( 0 .. $#words - 1 ) { foreach my $j ( $i+1 .. $#words ) { $count_of{ $words[$i] . q{-} . $words[$j] }++; } } } my $top_pair = ( sort { $count_of{$a} <=> $count_of{$b} } keys %count_ +of )[-1]; print "$top_pair ($count_of{$top_pair})\n"; __DATA__ dog monkey cat cat ball stone monkey iron cat zoo

    You should come up with your pairs, and for each pair keep a count (in a hash) of how many times that pair has been seen. Each key of the hash will be a pair, and each value will be the count of occurrences. Once you have that, you can sort the keys by the values and come up with the pair that appears the most. I suspect every one of these tasks has been the subject of some post you could find with Super Search.

      Tradition says that when you answer homework you obfuscate your answer somewhat so that it won't get a good mark. Like this:
      #! /usr/bin/perl -w use strict; my %pair_count; for (<>) { $pair_count{$_}++ for get_pairs(grep length $_, split /\s/, $_); } my $comp = sub ($$) {-($pair_count{$_[0]} <=> $pair_count{$_[1]})}; print((sort $comp keys %pair_count)[0], $/); sub get_range_iterator { my ($start, $end) = @_; return sub { if ($start <= $end) { return $start++; } else { return; } } } sub get_pairs { my $outer_iter = get_range_iterator(0, $#_); my $i = $outer_iter->(); my @answer; while (defined($i)) { my $inner_iter = get_range_iterator($i+1, $#_); my $j = $inner_iter->(); while (defined $j) { push @answer, join " ", sort @_[$i, $j]; $j = $inner_iter->(); } $i = $outer_iter->(); } return @answer; }
        ...when you answer homework...

        Or else you offer an answer that appears to be well laid out and straight-forward, but scales abysmally...

        • another intruder with the mooring in the heart of the Perl

        tilly,
        Tradition says that when you answer homework you obfuscate your answer somewhat so that it won't get a good mark.

        It doesn't look too much like homework to me but another idea would be to use code obviously beyond that of the course?

        Cheers - L~R

        I like this solution, because it's so simple and efficient:

        #!/usr/bin/perl -w use strict; my @lines = <DATA>; my (%words, %count); @words{map {split ' '} @lines} =1; my @uniq = keys %words; # create list of unique words foreach my $i (0 .. $#uniq - 1) { foreach my $j ($i+1 .. $#uniq) { # count the pairs do { $count{"$uniq[$i] $uniq[$j]"}++ if /$uniq[$i]/ an +d /$uniq[$j]/ } foreach @lines; } } print map "$_ : $count{$_}\n", # print the pairs in s +orted order sort {$count{$a} <=> $count{$b}} keys %count; __DATA__ dog monkey cat cat ball stone monkey iron cat zoo

        (NB see parent node)

      Thanks for the reply. I didn't consider combining the pairs into a single value. Next time I will. Thanks.
        BTW - and being serious now - you don't actually need to combine them into a single value.

        If you do something like:

        $h->{$word1}->{$word2}++
        instead of
        $h->{"$word1 $word2"}++
        in the centre of the loops, then you end up with a useful tree-like structure, which gives you a quick index of, for each word, how many times each other word appears in combination with it.

        Try it and use Data::Dumper to print the output, and you'll see what I mean.

        Although I appreciate that may not be what the homework called for. <grin>

        Best wishes, andy.

Re: Finding Combinations of Pairs
by BrowserUk (Patriarch) on Jan 14, 2009 at 08:52 UTC

    #! perl -slw use strict; sub pairs { map { my $i = $_; map [ @_[ $i, $_ ] ], $i+1 .. $#_; } 0 .. $#_; } my %pairCounts; ++$pairCounts{ "@$_" } for map pairs( sort split ), <DATA>; print "@$_" while @{ $_ = [ each %pairCounts ] }; __DATA__ dog monkey cat cat ball stone monkey iron cat zoo

    Gives:

    C:\test>junk4 cat stone 1 cat monkey 2 iron monkey 1 cat zoo 1 dog monkey 1 monkey zoo 1 cat dog 1 ball cat 1 cat iron 1 iron zoo 1 ball stone 1

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding Combinations of Pairs
by repellent (Priest) on Jan 14, 2009 at 18:23 UTC

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://736135]
Approved by kyle
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2024-04-19 09:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found