Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Finding all connected nodes in an all-against-all comparison

by Anonymous Monk
on May 06, 2010 at 19:43 UTC ( [id://838787]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I have a dataset that consists of a bunch of elements (DNA sequences) that are compared in an all-against-all manner. In case it is relevant, the matches are not necessarily symmetric, so just because A matches B does not guarantee that B will match A. The output looks like this, where each line indicates a match between the sequence in column A and the sequence in column B based on my search criteria:

Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10

I want to divide this list into groups where a group consists of all the elements that are connected by at least one edge (including non-reciprocal edges). So in the example data, I would have three groups:

Group1: Contig1 Contig2 Contig3 Contig4 Contig5 Group2: Contig6 Contig7 Group3: Contig8 Contig9 Contig10 Contig11

I am at a lost to figure out an efficient way to do this in Perl, and any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: Finding all connected nodes in an all-against-all comparison
by lamprecht (Friar) on May 06, 2010 at 22:00 UTC
    Hi,

    take a look at the Graph module:

    use warnings; use strict; use Data::Dumper; use Graph; my $g = Graph->new(directed => 1); $g->add_vertex("Contig$_") for (1..11); while (my $line = <DATA>){ my @nodes = split( /\s+/,$line ); $g->add_edge( @nodes ); } print Dumper $g->weakly_connected_components; __DATA__ Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10
    Cheers, Christoph
      UPDATE: ARGH sorry this reply was meant for the OP!

      Thats really not trivial and more complicated than I thought because of edge cases (not covered in your sample code)

      So I can fully recommend using an already tested and optimized graph module because all solutions I had so far need to be rewritten for performance reasons...

      Cheers Rolf

Re: Finding all connected nodes in an all-against-all comparison
by rubasov (Friar) on May 06, 2010 at 21:26 UTC
    I don't really know what you mean by "non-reciprocal edges", but if I get it right, then you want want to find all the connected components in a directed graph, so try this:
    use strict; use warnings; use Data::Dump qw( pp ); sub find_parts { my %graph = %{ shift() }; my %seen; my @parts; my $i = 0; my $helper; $helper = sub { my $start = shift; return if $seen{$start}++; push @{ $parts[$i] }, $start; $helper->($_) for @{ $graph{$start} }; }; for ( keys %graph ) { $helper->($_); $i = $#parts + 1; } undef $helper; return @parts; } my %graph; while (<DATA>) { my ( $src, $dst ) = split; push @{ $graph{$src} }, $dst; } pp \%graph; pp find_parts( \%graph ); __DATA__ Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10 foo bar bar foo quux quux
    I haven't tested all the edge cases, but this will give you the idea.

    Hope that helps.

    update: LanX is right, my code above does not work for the case he has shown. Something like this would have been better to implement: Tarjan's strongly connected components algorithm

Re: Finding all connected nodes in an all-against-all comparison
by BrowserUk (Patriarch) on May 06, 2010 at 22:52 UTC

    #! perl -slw use strict; use Data::Dump qw[ pp ]; my %h; while( <DATA> ) { chomp; my( $k, $v ) = split; push @{ $h{ $k } }, $v; push @{ $h{ $v } }, $k; } my @keys = sort{ substr( $a, 6 ) <=> substr( $b, 6 ) } keys %h; my $n = 0; my %offsets = map{ $_ => $n++ } @keys; my %masks; for my $k ( @keys ) { $masks{ $k } //= chr(0)x2; vec( $masks{ $k }, $offsets{ $_ }, 1 ) = 1 for $k, @{ $h{ $k } }; } for my $i ( 0 .. $#keys ) { for my $j ( 0 .. $#keys ) { if( ( $masks{ $keys[ $i ] } & $masks{ $keys[ $j ] } ) ne chr(0 +)x2 ) { $masks{ $keys[ $i ] } |= $masks{ $keys[ $j ] }; } } } my %uniq; $uniq{ $_ } = 1 for values %masks; $n = 0; for my $group ( keys %uniq ) { printf "Group %d : ", ++$n; print join ' ', map{ $keys[ $_ ] } grep{ vec( $group, $_, 1 ) } 0 .. $#keys; } __DATA__ Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10

    Gives:

    c:\test>838787 Group 1 : Contig8 Contig9 Contig10 Contig11 Group 2 : Contig6 Contig7 Group 3 : Contig1 Contig2 Contig3 Contig4 Contig5

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Adding:

      C12 Contig11 C12 Contig5

      to the test data prints:

      Group 1 : Contig6 Contig7 Group 2 : C12 Contig3 Contig4 Contig5 Contig8 Contig9 Contig10 Contig1 +1 Group 3 : C12 Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9 +Contig10 Contig11

      where something like:

      Group 1 : Contig6 Contig7 Group 2 : C12 Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9 +Contig10 Contig11

      is expected, at least according to my understanding of the OP's 'connected by at least one edge (including non-reciprocal edges)' criteria.

      True laziness is hard work

        Indeed. I need to |= the sets both ways:

        (Note: I've changed your C12 to Config12 because it was easier than re-writing the sort Which isn't really necessary anyway, but makes the output nicer.)

        #! perl -slw use strict; use Data::Dump qw[ pp ]; my %h; while( <DATA> ) { chomp; my( $k, $v ) = split; push @{ $h{ $k } }, $v; push @{ $h{ $v } }, $k; } my @keys = sort{ substr( $a, 6 ) <=> substr( $b, 6 ) } keys %h; my $n = 0; my %offsets = map{ $_ => $n++ } @keys; my %masks; for my $k ( @keys ) { $masks{ $k } //= chr(0)x2; vec( $masks{ $k }, $offsets{ $_ }, 1 ) = 1 for $k, @{ $h{ $k } }; } for my $i ( 0 .. $#keys ) { for my $j ( 0 .. $#keys ) { if( ( $masks{ $keys[ $i ] } & $masks{ $keys[ $j ] } ) ne chr( +0)x2 ) { $masks{ $keys[ $i ] } |= $masks{ $keys[ $j ] }; $masks{ $keys[ $j ] } |= $masks{ $keys[ $i ] }; } } } my %uniq; $uniq{ $_ } = 1 for values %masks; $n = 0; for my $group ( keys %uniq ) { printf "Group %d : ", ++$n; print join ' ', map{ $keys[ $_ ] } grep{ vec( $group, $_, 1 ) } 0 .. $#keys; } __DATA__ Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10 Contig12 Contig11 Contig12 Contig5

        Gives:

        c:\test>838787 Group 1 : Contig6 Contig7 Group 2 : Contig1 Contig2 Contig3 Contig4 Contig5 Contig8 Contig9 Cont +ig10 Contig11 Contig12

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Finding all connected nodes in an all-against-all comparison
by GrandFather (Saint) on May 08, 2010 at 02:24 UTC

    This is a somewhat tricky problem because later data may require that separate groups that were formed by earlier data need to be merged. Where the code is a little tricky it's worth being a little long winded, use explicit identifier names and comment each case. Consider:

    use strict; use List::Util; my %nodes; my @groups; while (<DATA>) { chomp; my ($node1, $node2) = split; if (! exists $nodes{$node1} && ! exists $nodes{$node2}) { # New group $nodes{$node1} = @groups; $nodes{$node2} = @groups; push @groups, [$node1, $node2]; next; } if (! exists $nodes{$node1}) { # node1 is part of node2's group push @{$groups[$nodes{$node2}]}, $node1; $nodes{$node1} = $nodes{$node2}; } if (! exists $nodes{$node2}) { # node2 is part of node1's group push @{$groups[$nodes{$node1}]}, $node2; $nodes{$node2} = $nodes{$node1}; } next # Already met this pairing if $nodes{$node1} == $nodes{$node2}; # node1 and node2 are in different groups. Merge the groups my ($group, $nulGroup) = ($nodes{$node1}, $nodes{$node2}); push @{$groups[$group]}, @{$groups[$nulGroup]}; $nodes{$_} = $group for @{$groups[$nulGroup]}; $groups[$nulGroup] = undef; } @groups = grep {defined} @groups; for my $group (0 .. $#groups) { print 'Group', $group + 1, ': ', join (', ', @{$groups[$group]}), +"\n"; } __DATA__ Contig1 Contig2 Contig1 Contig3 Contig2 Contig1 Contig2 Contig3 Contig3 Contig1 Contig3 Contig2 Contig3 Contig4 Contig4 Contig3 Contig4 Contig5 Contig6 Contig7 Contig7 Contig6 Contig8 Contig9 Contig9 Contig10 Contig10 Contig8 Contig10 Contig11 Contig11 Contig10 C12 Contig11 C12 Contig5

    Prints:

    Group1: Contig6, Contig7 Group2: Contig8, Contig9, Contig10, Contig11, C12, Contig1, Contig2, C +ontig3, Contig4, Contig5
    True laziness is hard work
Re: Finding all connected nodes in an all-against-all comparison
by LanX (Saint) on May 07, 2010 at 14:52 UTC
    So which solution do you expect with this data?
    A C B C

    There is no path from A to B, so do you consider them in the same or different groups?

    Or do you have a guaranty that these intersections are not possible?

    Cheers Rolf

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://838787]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (6)
As of 2024-03-29 05:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found