http://qs321.pair.com?node_id=447500

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have a newbie array question. I have two arrays: the first contains lots of sequences, each of which may appear more than once, and the second contains numbers which relate to the sequences. Both arrays are the same size, and $sequence[0] relates to $numbers[0], etc.

What I want to do is, for each copy of a certain sequence in @sequences, get the average of the corresponding values in @numbers.

Hope you can help me out!

    #e.g.
    @sequence = ('acgt','actg','cggt','cggt');
    @numbers  = ('1234','2345','3244','3455');

    # output
    # sequence: actg = 1789.5
    # sequence: cggt = 3349.5

Replies are listed 'Best First'.
Re: finding matches in the same array
by polettix (Vicar) on Apr 13, 2005 at 17:02 UTC
    If you have to evaluate the average for all sequences, you can build the following hash:
        my %occurrences;
        push @{$occurrences{$sequence[$_]}}, $numbers[$_] foreach (0 .. $#sequence);
    At this point, each entry in the hash has a "sequence" for key and a reference to an array containing the "numbers" for that sequence as value - computing the average should be pretty easy, e.g.:
        my $sum = 0;
        $sum += $_ foreach (@{$occurrences{'atcg'}});
        my $average = $sum / scalar(@{$occurrences{'atcg'}});
    I leave the union of the two snippets as an exercise, just in case it's homework :)
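One way the two snippets might be combined is sketched below. It assumes the sample data from the question; `%average` is a name introduced only for this illustration and is not part of the original reply.

```perl
use strict;
use warnings;

# Sample data copied from the original question.
my @sequence = ('acgt', 'actg', 'cggt', 'cggt');
my @numbers  = (1234, 2345, 3244, 3455);

# Step 1: group the numbers by sequence (each hash value is an array reference).
my %occurrences;
push @{ $occurrences{ $sequence[$_] } }, $numbers[$_] for 0 .. $#sequence;

# Step 2: average each group. %average is a hypothetical name for this sketch.
my %average;
for my $seq (sort keys %occurrences) {
    my @values = @{ $occurrences{$seq} };
    my $sum = 0;
    $sum += $_ for @values;
    $average{$seq} = $sum / @values;    # @values in numeric context is the count
    print "sequence: $seq = $average{$seq}\n";
}
```

Keeping the grouped values around costs more memory than a running total, but it leaves the door open for other statistics (median, minimum, maximum) later on.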

    Update: fixed a typo in the first snippet, thanks to Postular Postulant.

    Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")

    Don't fool yourself.
      Let's reduce the loops by calculating the average as we move along:
          my %oc;
          for (0 .. $#sequence) {
              $oc{$sequence[$_]}[0] = ($oc{$sequence[$_]}[0]+$numbers[$_])/(++$oc{$sequence[$_]}[1]);
          }
          foreach my $seq (keys %oc) {
              print "The average for $seq is ".$oc{$seq}[0].$/;
          }
        This does not seem to be an average :) First item is divided by 1, second divided by 2, third by 3... - maybe it's better to make the division at the end:
            my %oc;
            for (0 .. $#sequence) {
                $oc{$sequence[$_]}[0] += $numbers[$_];
                ++$oc{$sequence[$_]}[1];
            }
            foreach my $seq (keys %oc) {
                print "The average for $seq is ".($oc{$seq}[0] / $oc{$seq}[1]).$/;
            }
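If a running average really is wanted, so that no totals are kept, the usual incremental update is new_mean = old_mean + (x - old_mean) / n. The sketch below is only an illustration of that formula: `%mean` and `%count` are names chosen here, and the fifth data entry is invented so that one sequence has three values.

```perl
use strict;
use warnings;

# Data from the question, plus one invented 'cggt' entry (3301) so that a
# group of three exists; groups of two do not expose the difference.
my @sequence = ('acgt', 'actg', 'cggt', 'cggt', 'cggt');
my @numbers  = (1234, 2345, 3244, 3455, 3301);

# %mean and %count are hypothetical names for this sketch.
my (%mean, %count);
for my $i (0 .. $#sequence) {
    my $seq = $sequence[$i];
    my $n   = ++$count{$seq};
    $mean{$seq} = 0 unless defined $mean{$seq};
    # Incremental update: move the mean towards the new value by 1/n of the gap.
    $mean{$seq} += ($numbers[$i] - $mean{$seq}) / $n;
}

printf "The average for %s is %.4f\n", $_, $mean{$_} for sort keys %mean;
```

For data of this size, summing and dividing at the end is simpler; the incremental form mainly helps when the values arrive as a stream.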

        Flavio (perl -e "print(scalar(reverse('ti.xittelop@oivalf')))")

        Don't fool yourself.
Re: finding matches in the same array
by Random_Walk (Prior) on Apr 13, 2005 at 17:03 UTC

    I guess this is not homework, as the source data is presented as two synchronised arrays, not a file to read or an array of arrays. If it is homework, then I reckon the OP already did some work to get from a file of data to two arrays.

    My output disagrees with the OP's (is actg == acgt ??), and you would be better off storing the source data in an array of arrays. I have dumped out the hash of hashes I build to collate the data so you can see clearly what is going on. Data::Dumper is fantastic when you are developing any sort of interesting data structure.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        #e.g.
        my @sequence = ('acgt','actg','cggt','cggt');
        my @numbers  = ('1234','2345','3244','3455');

        my %collated;
        for (0..$#sequence) {
            $collated{$sequence[$_]}{total} += $numbers[$_];
            $collated{$sequence[$_]}{number}++;
        }

        print Dumper(\%collated);

        for (sort keys %collated) {
            print "sequence: $_ = ";
            # No parenthesis directly after print: "print (...) , "\n"" would be
            # parsed as print(...) and the newline would never be printed.
            print $collated{$_}{total} / $collated{$_}{number}, "\n";
        }

        __END__
        # my output
        $VAR1 = {
                  'acgt' => {
                              'number' => 1,
                              'total' => 1234
                            },
                  'cggt' => {
                              'number' => 2,
                              'total' => 6699
                            },
                  'actg' => {
                              'number' => 1,
                              'total' => 2345
                            }
                };
        sequence: acgt = 1234
        sequence: actg = 2345
        sequence: cggt = 3349.5
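The array-of-arrays suggestion could look roughly like the sketch below. The `[sequence, number]` record layout and the name `@records` are assumptions made for illustration; the thread does not say how the data would actually be stored.

```perl
use strict;
use warnings;

# Each record pairs a sequence with its number, so the two can never
# drift out of step. @records and its layout are hypothetical.
my @records = (
    [ 'acgt', 1234 ],
    [ 'actg', 2345 ],
    [ 'cggt', 3244 ],
    [ 'cggt', 3455 ],
);

# Collate totals and counts per sequence, as in the reply above.
my %collated;
for my $record (@records) {
    my ($seq, $number) = @$record;
    $collated{$seq}{total} += $number;
    $collated{$seq}{number}++;
}

for my $seq (sort keys %collated) {
    print "sequence: $seq = ", $collated{$seq}{total} / $collated{$seq}{number}, "\n";
}
```

With one record per row there is no index to keep in sync, which removes a whole class of off-by-one mistakes.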

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: finding matches in the same array
by tlm (Prior) on Apr 13, 2005 at 17:07 UTC

    How's this:

        use List::Util 'sum';

        my %collect;
        # $#sequences is the last index of @sequences
        for my $i ( 0 .. $#sequences ) {
            push @{ $collect{ $sequences[ $i ] } }, $numbers[ $i ];
        }

        for my $sequence ( keys %collect ) {
            my $avg = ( sum @{$collect{$sequence}} )/@{$collect{$sequence}};
            printf "sequence: $sequence = %.1f\n", $avg;
        }

    the lowliest monk

Re: finding matches in the same array
by RazorbladeBidet (Friar) on Apr 13, 2005 at 17:06 UTC
    Here's another idea... a little convoluted... there's probably a nicer way to write it:
        my %hash;
        my $i = 0;
        # each value is [ running total, count ]
        $hash{$_} = [ ( $hash{$_}->[0] || 0 ) + $numbers[$i++],
                      ( $hash{$_}->[1] || 0 ) + 1 ] foreach @sequence;
        print "$_ = ", ( $hash{$_}->[0] / $hash{$_}->[1] ), "\n" foreach keys %hash;
    --------------
    "But what of all those sweet words you spoke in private?"
    "Oh that's just what we call pillow talk, baby, that's all."
Re: finding matches in the same array
by sh1tn (Priest) on Apr 13, 2005 at 17:28 UTC
        @seq = ('acgt','actg','cggt','cggt','actg');
        @num = ('1234','2345','3244','3455','5230');

        # Wrong algorithm:
        #for( 0..$#seq ){
        #    $struct{$seq[$_]} ||= $num[$_];
        #    $struct{$seq[$_]} = ($struct{$seq[$_]}+$num[$_]) / 2
        #}


      This gives more weight to the last number (if there are more than 2):
          my @sequence = ('acgt','actg','cggt','cggt', 'actg', 'actg');
          my @numbers  = ('1234','2345','3244','3455', '5230', '100000' );
      gives 51893.75 for actg when it should be 35858.3333333333
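To see where the extra weight comes from, the repeated halving can be compared with the arithmetic mean on the three 'actg' values alone. This is a small sketch of that comparison, not code from the thread; `$halved` and `$mean` are names chosen here.

```perl
use strict;
use warnings;

# The three values that belong to 'actg' in the example data.
my @values = (2345, 5230, 100000);

# Repeated halving, mirroring the commented-out loop: seed with the
# first value, then average the running result with each value in turn.
my $halved = $values[0];
$halved = ($halved + $_) / 2 for @values;

# Plain arithmetic mean for comparison.
my $sum = 0;
$sum += $_ for @values;
my $mean = $sum / @values;

print "repeated halving: $halved\n";
print "arithmetic mean:  $mean\n";
```

Unrolled, the halving computes 2345/4 + 5230/4 + 100000/2, so the last value carries a weight of 1/2 instead of 1/3, and earlier values shrink further with every extra element.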
      --------------
      "But what of all those sweet words you spoke in private?"
      "Oh that's just what we call pillow talk, baby, that's all."
        You are right, it's an utterly different algorithm.
        The correct one is as follows:
            @seq = ('acgt','actg','cggt','cggt', 'actg', 'actg');
            @num = ('1234','2345','3244','3455', '5230', '100000' );

            for( 0..$#seq ){
                # two separate statements, so a running total of 0 cannot skip the count
                $struct{$seq[$_]}->[0] += $num[$_];
                $struct{$seq[$_]}->[1]++;
            }

            for( keys %struct ){
                print "$_\t", $struct{$_}->[0] / $struct{$_}->[1], $/
            }

            __END__
            STDOUT:
            acgt    1234
            cggt    3349.5
            actg    35858.3333333333


Re: finding matches in the same array
by salva (Canon) on Apr 14, 2005 at 22:19 UTC
    Use two hashes, one for totals and another for counters, then use two loops: one to populate these hashes and the other to calculate the averages as total/counter:
        my @sequence = ('actg','actg','cggt','cggt');
        my @numbers  = ('1234','2345','3244','3455');

        my (%total, %count, $i);
        for ($i=0; $i<@sequence; $i++) {
            $total{$sequence[$i]} += $numbers[$i];
            $count{$sequence[$i]}++;
        }

        for (sort keys %total) {
            my $avg = $total{$_}/$count{$_};
            print "$_ avg: $avg\n";
        }