Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Find duplicate based on specific fields while allowing 2 mismatch

by amitgsir (Novice)
on Aug 28, 2017 at 06:53 UTC ( [id://1198131]=perlquestion: print w/replies, xml ) Need Help??

amitgsir has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file like this.

chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129

Update

Let me put my question in more correct.

I have table with 4 columns (there are more columns in the file but all the important information has been extracted in these 4 columns to complete the task) like below:

Column1: It contains chromosome positions e.g. chrM:307, chr1:11888, etc. This column is the base of first sorting. So, assumes that all the 10 lines with chrM:307 positions makes 1 cluster which must be divided into multiple cluster based on the value in column3 (UMI).

Column2: This value is the NM tag in the (NGS data) SAM file which corresponds to the error in pcr/sequencing step. So, minimum the column value is good.

Column3: This columns shows the UMI (Unique Molecular Identifier) sequence which has been used during the PCR step. During PCR step there might be mutation in the UMI as well as the original DNA sequence (which is in the further columns in the file, not showed here).

Therefore, I want to allow 2 mismatch in the UMI column to consider the PCR error.

So, if some raw has same chr position and similar UMI (e.g. TCAGGGTG, TCAGAATG, TCAGGGAG) these represents that the reads are from the same original DNA fragment. However, its impossible to decide which one is the starting UMI sequence to make the pattern with 2 mismatch. So, if we take TCAGGGTG as starting UMI, TCAAAATG can not be clustered together with it. However, if we take TCAGGGAG as starting UMI TCAAAATG can make same cluster.

Therefore, selection of first read is important. I missed to explain this before in detail and I couldn't make any code for this. So, let's see, how to decide first read in below part(after column 4 info).

Column4: Column 4 is the length of the sequence in the SAM file, i.e. length of 15th column (not shown here since only length value is important nor sequence for my objective) in each line.

How to choose first line to select UMI for allowing 2 mismatch in next step:

Step1. Find all lines starting at same positions and make group. Now sort all lines by columns 3 (first sorting), then select line with minimum column 2 value (Second sorting) and maximum column 4 value (Third sorting).

Step2. Choose above selected line and decide as representative UMI (column 3). Now check remaining lines (with same starting positions) if have similar UMI (UMI upto 2 mismatch to selected representative UMI) make all lines as a single cluster.

Step3. Select line not having similar UMI in step two, and loop step 1 and 2 again, i.e. find minimum col2 and maximum col4 read, get representative UMI and find line with similar UMI to make new cluster. Loop through again and again until all lines are not used to make a cluster.

Step4. Select next group of lines, starting with next available positions and loop step 1,2 and 3.

It will also change the output how it looks like, I tried to make some artificial data for demonstration but missed the value of column 2 and 4.

Let's understand the final output in two step:

Step 1: This is what we will receive after sorting all line starting with chrM:307

chrM:307 0 AGCGGGGA 130 #-->Line1 chrM:307 0 TCAAAATG 130 #-->Line2 chrM:307 0 TCACGGTG 130 #-->Line3 chrM:307 0 TCAGGGAG 130 #-->Line4 chrM:307 2 TCAGGGTG 130 #-->Line5 chrM:307 0 AGCGGGGA 129 #-->Line6 chrM:307 0 AGCGGGGA 129 #-->Line7 chrM:307 0 TCAGCCTG 129 #-->Line8 chrM:307 2 TCAGGGTG 129 #-->Line9 chrM:307 1 TCAGGGTG 106 #-->Line10

Step 1: This is what we will receive after final clustering for the lines at position chrM:307

CLUSTER chrM:307 0 AGCGGGGA 130 #--> Select, first line sorted by + col3, col2, col4(rev sort) # Line1 chrM:307 0 AGCGGGGA 129 #--> Select line with similar UMI + (NO mismatch) # Line6 chrM:307 0 AGCGGGGA 129 #--> Select line with similar UMI + (NO mismatch) # Line7 CLUSTER chrM:307 0 TCAAAATG 130 #--> Now selected, Second line so +rted by col3, col2, col4(rev sort) #Line2 # NO similar UMI, 1 line c +luster CLUSTER chrM:307 0 TCACGGTG 130 #--> Now selected, Third line sor +ted by col3, col2, col4(rev sort) # Line3 chrM:307 0 TCAGGGAG 130 #--> Select line with similar UMI + (Two mismatch with line3 UMI) # Line4 chrM:307 2 TCAGGGTG 130 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line5 chrM:307 2 TCAGGGTG 129 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line9 chrM:307 1 TCAGGGTG 106 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line10 CLUSTER chrM:307 0 TCAGCCTG 129 #--> Now selected, line#8 sorted +by col3, col2, col4(rev sort) #Line8, make single line cluster CLUSTER ....and so on for next chr positions...

I hope above example make it more clear

I want to cluster the lines based on two values:

1. First column have same value, i.e. the chromosome position must be same. (EDITED)

2. Column 3 must also be similar, i.e. in each cluster lines with similar UMI allowing 2 mismatch will be clustered together. Allowing two mismatch must be done corresponding to the UMI (column3) infirst line. e.g. if UMI is AAAAAAAA than AAAATTAA and AAAAAAGG must be in the same cluster, but not the line with AAAATTGG. (However, AAAATTGG is also have two mismatch to AAAATTAA and AAAAAAGG, so using first line will be key here)

And finally, differentiate each cluster by line delimiter as CLUSTER(or any string line DELIMIT). So, my output should looks like below.

CLUSTER chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCACGGTG 130 CLUSTER chrM:307 0 TCAAAATG 130 CLUSTER chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCACGGTG 130 CLUSTER chrM:308 0 TCAAAATG 130 CLUSTER chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCACGGTG 130 CLUSTER chrM:309 0 TCAAAATG 130 CLUSTER chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 CLUSTER

Please note: Similar UMI makes single cluster, e.g. first cluster line 1-5 and 7, second cluster line 6 and third cluster line 8-10 and so on.

I tried the below code, we try to first cluster by column1 and then, makes all possible pattern using function "@subpats" and match the next line in the file one by one. But, I guess I am having big issue in the looping. Below is my code:

#!/usr/bin/env/perl #use strict; #use warnings; my $InSam = $ARGV[0]; my $SampleID = $ARGV[1]; chomp $InSam; chomp $SampleID; open(DATA, "$InSam") || die "Can't open $InSam: $!\n"; open my $OFILE, '>', $SampleID.'.cluster_ChrPos.txt' or die "Cannot cr +eate file for output: $!"; use strict; my %seen=(); my @flds; my $UMI; my @cluster_ChrPos; while (<DATA>) { chomp; @flds=split /\t/; my $POS; if ($seen{$flds[0]}++) { $UMI=$flds[2]; my @subpats; for my $i (0..length($UMI)-1) { for my $j ($i+1..length($UMI)-1) { my $subpat = join('', substr($UMI, 0, $i), '.', # or '\\w' substr($UMI, $i+1, $j-$i-1), '.', # or '\\w' substr($UMI, $j+1), ); push @subpats, $subpat; } } my $pat = join('|', @subpats); if($flds[0] =~ m/"$pat"/) { #print "$pat\t$flds[0]"; push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } else { #push (@cluster_ChrPos, "DELIMIT\n"); push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } } else { push (@cluster_ChrPos, "DELIMIT\n"); push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } } push (@cluster_ChrPos, "DELIMIT"); print $OFILE "@cluster_ChrPos\n"; print "@cluster_ChrPos\n"; close $OFILE; close DATA;
Please help me out. Any suggestion is recommended.

Replies are listed 'Best First'.
Re: Find duplicate based on specific fields while allowing 2 mismatch
by kcott (Archbishop) on Aug 28, 2017 at 09:17 UTC

    G'day amitgsir,

    There are a number of issues with your post:

    • What does "UMI" mean? I'm guessing that's domain-specific jargon. You should explain this sort of thing in your post or, at the very least, provide a link to a definition.
    • What does "2 mismatch" mean? Perhaps that's meaningful when "UMI" is explained.
    • CLUSTER and DELIMIT:
      • Both are mentioned in your textual description.
      • Only CLUSTER appears in what I assume is your expected output.
      • Only DELIMIT appears in your code.
      • Only DELIMIT appears in your actual output (as a result of the last point).
    • Using package variables for filehandles is a bad choice. Use lexical variables as shown in the open documentation.
    • The token 'DATA' is special to Perl; see "perldata: Special Literals". Using this (or any other special token) in your code for something other than its intended purpose should be avoided. I've used it in my code example below so you can see normal usage.
    • Don't comment out strict and warnings. Leave them in and fix the problems they highlight. If you don't understand the problem, or how to fix it, you can always ask here.
    • I'm pretty sure chomping elements of the @ARGV array is unnecessary. You can test that with something like 'print "|$var|"', which will show terminal newlines (and any other unexpected or unwanted characters). Also note that, in my example code below, I haven't used chomp at all: consider whether removing newlines from input, then adding them back for output, is actually gaining you anything — it's often just a lot of redundant processing.
    • Your code layout leaves much to be desired. Inconsistent indentation, and swathes of unnecessary whitespace, reduce readability. Choose a coding style that suits you and stick to it: "perlstyle - Perl style guide" has tips on how to do this.

    The following code produces what I'm guessing is, at least, close to what you want. It produces the output in the same groupings you showshowed (see Update below), and it includes both the CLUSTER and DELIMIT tokens. Given the points I've already made regarding those tokens, this may not be exactly what you want: as stated, this is a guess on my part; modify to suit your needs.

    #!/usr/bin/env perl use strict; use warnings; use constant { COL0 => 0, COL2 => 1, ALL => 2 }; my ($col0, $col2) = ('', ''); print map { if ($col0 ne $_->[COL0]) { ($col0, $col2) = @{$_}[COL0,COL2]; ("CLUSTER\n", $_->[ALL]); } elsif (substr($col2, 0, 3) ne substr($_->[COL2], 0, 3)) { $col2 = $_->[COL2]; ("DELIMIT\n", $_->[ALL]); } else { $_->[ALL]; } } sort { $a->[COL0] cmp $b->[COL0] || $b->[COL2] cmp $a->[COL2] } map { [ (split)[0,2], $_ ] } <DATA>; __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129

    Output:

    CLUSTER chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCACGGTG 130 chrM:307 0 TCAAAATG 130 DELIMIT chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCACGGTG 130 chrM:308 0 TCAAAATG 130 DELIMIT chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCACGGTG 130 chrM:309 0 TCAAAATG 130 DELIMIT chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129

    By the way, the construct I've used in the form of

    ... map { ... } sort { ... } map { ... } ...

    is known as a Schwartzian Transform. It is particularly useful for sorting based on parts of a string while retaining the entire string, in an unmodified form, for eventual output.

    Update: The original, expected output has changed. There is no notification of what changed or even that a change has been made. Here's the original (in the spoiler) that my response is based on:

    CLUSTER chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 CLUSTER chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCAAAATG 130 chrM:308 0 TCACGGTG 130 CLUSTER chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCAAAATG 130 chrM:309 0 TCACGGTG 130 CLUSTER chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 CLUSTER

    — Ken

      Thanks for your time and sorry I missed to update without keeping spoiler tag.

      I tried to explain the question again with more detail.

      It is interesting to learn the map contruct, I was trying to do with just using Arrays which was I guess not efficient. I am learning more about 'hash' and 'map' using perlmonks suggestion. Thanks!

Re: Find duplicate based on specific fields while allowing 2 mismatch
by shadowsong (Pilgrim) on Aug 28, 2017 at 11:05 UTC

    Hi amitgsir

    Not wanting to assume what your criteria is for UMIs that comprise a cluster; could you expand on what you mean by:

    2. Column 3 must also be similar, i.e. in each cluster lines with similar UMI allowing 2 mismatch will be clustered together

    We need a bit more to go on:

    1. What do the [up to 3] UMIs need to have in common to be considered similar?
    2. Why is TCACGGTG in the first cluster instead of TCAAAATG?

    If it's all the same and it doesn't matter what the UMIs are as long as you have them in sets of 3s; you could try this:

    #!perl -slw use strict; my ($chromosomes,$DELIMITER) = (undef,'CLUSTER'); while ( <DATA> ) { s/\R//g; # remove line breaks; my $record = [split /\s+/]; push @{$chromosomes->{$record->[0]}->{$record->[2]}},[$record->[1] +,$record->[3]]; } foreach my $chrM (sort keys %{$chromosomes}) { my $cnt = 0; # used to print delimiter foreach my $UMI (sort {$a cmp $b} keys %{$chromosomes->{$chrM}}) { print $DELIMITER unless $cnt++ % 3; print "$chrM\t$_->[0]\t$UMI\t$_->[1]" foreach (sort {$a->[0] <=> $b->[0] or $a->[1] <=> $b->[1]} @{$chromosomes->{$chrM}->{$UMI +}}); } } __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129

    Output

    C:\code\perlmonks>perl pm_1198131.pl CLUSTER chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 CLUSTER chrM:307 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 129 chrM:307 2 TCAGGGTG 130 CLUSTER chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 TCAAAATG 130 chrM:308 0 TCACGGTG 130 CLUSTER chrM:308 0 TCAGCCTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 129 chrM:308 2 TCAGGGTG 130 CLUSTER chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 TCAAAATG 130 chrM:309 0 TCACGGTG 130 CLUSTER chrM:309 0 TCAGCCTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 129 chrM:309 2 TCAGGGTG 130

    Cheers
    Shadowsong

Re: Find duplicate based on specific fields while allowing 2 mismatch
by tybalt89 (Monsignor) on Aug 28, 2017 at 18:00 UTC

    This gives your "expected output" but there's a conflict between my sort and your sort description.

    #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1198131 use strict; use warnings; my %one; while(<DATA>) { my ($one, $two, $three, $four) = split; push @{ $one{$one} }, [ $two, $three, $four ]; } for my $one (sort keys %one) { my %groups; LOOP: for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] } @{ $one{$one} } ) { my ( $two, $three, $four ) = @$_; if( keys %groups ) { for ( sort keys %groups ) { if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) { push @{ $groups{$_} }, [ $two, $three, $four ]; next LOOP; } } push @{ $groups{$three} }, [ $two, $three, $four ]; } else { push @{ $groups{$three} }, [ $two, $three, $four ]; } } for ( sort keys %groups ) { print 'CLUSTER'; for ( values @{ $groups{$_} } ) { print join "\t", $one, @$_; } } } __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129

    Output:

    CLUSTER chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:307 0 TCAAAATG 130 CLUSTER chrM:307 0 TCACGGTG 130 chrM:307 0 TCAGGGAG 130 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 1 TCAGGGTG 106 CLUSTER chrM:307 0 TCAGCCTG 129 CLUSTER chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:308 0 TCAAAATG 130 CLUSTER chrM:308 0 TCACGGTG 130 chrM:308 0 TCAGGGAG 130 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 CLUSTER chrM:308 0 TCAGCCTG 129 CLUSTER chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 CLUSTER chrM:309 0 TCAAAATG 130 CLUSTER chrM:309 0 TCACGGTG 130 chrM:309 0 TCAGGGAG 130 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 CLUSTER chrM:309 0 TCAGCCTG 129

      Hi, thanks for your time! It is providing the expected output.

      Finally, I am wordering about two more thing:

      1. How to get the sorted results by col1. When I tested on large dataset output is not sorted.

      2. How to print additional column after col.4 in output results. There may be more columns in the line after these 4 important columns. Also, #of clumns is not fixed in each line. So, after first column some lines may have 15 columns while some might have upto ~20.

      For example: When I am passing the below data as input:

      Out is below:

        Like this? (Just a few minor tweaks :)

        BTW, it was sorting, just not numerically.

        #!/usr/bin/perl # http://perlmonks.org/?node_id=1198131 use strict; use warnings; my %one; while(<DATA>) { my ($one, $two, $three, $four) = split; /(\d+)/; # get number for later sort push @{ $one{$1} }, [ $two, $three, $four, $_ ]; } for my $one (sort { $a <=> $b } keys %one) { my %groups; LOOP: for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] } @{ $one{$one} } ) { my ( $two, $three, $four, $all ) = @$_; if( keys %groups ) { for ( sort keys %groups ) { if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) { push @{ $groups{$_} }, [ $two, $three, $four, $all ]; next LOOP; } } push @{ $groups{$three} }, [ $two, $three, $four, $all ]; } else { push @{ $groups{$three} }, [ $two, $three, $four, $all ]; } } for ( sort keys %groups ) { print "CLUSTER\n"; print $_->[3] for values @{ $groups{$_} }; } } __DATA__ chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 107 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 130 chrM:307 0 TCAGGGTG 114 chrM:307 1 TCAGGGTG 106 chrM:310 0 TCAGGGTG 99 chrM:392 2 CCTCTTAT 130 chrM:396 2 AGTTACTA 129 chrM:443 0 ATTATCAA 130 extra columns chrM:542 2 AATCCAAA 129 chrM:542 0 AATCCAAA 129 chrM:934 0 CATTCGCT 129 chrM:934 1 CATTCGCT 129 chrM:1001 0 CGTGACAT 129 chrM:1127 0 GATACTAA 130 chrM:1257 1 TGGAAATC 129 chrM:1262 0 CGGGAAGC 129 chrM:1262 0 AGGGAAGG 129 chrM:1603 0 GTATCGGA 130 chrM:1603 1 GTATCGGA 130

        Outputs:

        CLUSTER chrM:307 0 TCAGGGTG 130 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 114 chrM:307 0 TCAGGGTG 107 chrM:307 1 TCAGGGTG 106 CLUSTER chrM:310 0 TCAGGGTG 99 CLUSTER chrM:392 2 CCTCTTAT 130 CLUSTER chrM:396 2 AGTTACTA 129 CLUSTER chrM:443 0 ATTATCAA 130 extra columns CLUSTER chrM:542 0 AATCCAAA 129 chrM:542 2 AATCCAAA 129 CLUSTER chrM:934 0 CATTCGCT 129 chrM:934 1 CATTCGCT 129 CLUSTER chrM:1001 0 CGTGACAT 129 CLUSTER chrM:1127 0 GATACTAA 130 CLUSTER chrM:1257 1 TGGAAATC 129 CLUSTER chrM:1262 0 AGGGAAGG 129 chrM:1262 0 CGGGAAGC 129 CLUSTER chrM:1603 0 GTATCGGA 130 chrM:1603 1 GTATCGGA 130

      Hi, thanks for your time! It is providing the expected output.

      Sorry I made the post without login and missed the formating for the output. I Can't modify it so I am posting here again.

      Finally, I am wordering about two more thing:

      1. How to get the sorted results by col1. When I tested on large dataset output is not sorted.

      2. How to print additional column after col.4 in output results. There may be more columns in the line after these 4 important columns. Also, #of clumns is not fixed in each line. So, after first column some lines may have 15 columns while some might have upto ~20.

      For example: When I am passing the below data as input:

Re: Find duplicate based on specific fields while allowing 2 mismatch
by hdb (Monsignor) on Aug 28, 2017 at 07:59 UTC

    Can you please clarify your second requirement? The relation "two mismatches" is not transitive, e.g. there are two mismatches between AAAA and AAGG and two mismatches between AAGG and TTGG but four mismatches between AAAA and TTGG. Would you consider all three to belong to one cluster?

      I think, I need to take the first entry as the reference and allow the two possible mismatch using first line's UMI tag to make one cluster. Remaning lines at the same start positions, can be looped again similarly. So, if the first line have AAAA then AAGG or TTAA, etc can be merged into single cluster, But, TTGG will make separate cluster. I have edited the question for the same! Amit
        "I have edited the question for the same!"

        Do not just edit your question without showing very clearly what you've changed: it often means that answers to your original post no longer make sense.

        You can do this in a number of ways:

        • For a small change within a sentence: <del>old text</del><ins>new text</ins>.
        • To remove a large amount of text, code, or data: add an Update comment explaining what you're doing, then <strike>... part to remove ...</strike>.
        • To add new content: use an Update comment explaining what you're doing, then add the new content.
        • To change a large amount of text, code, or data: add an Update comment explaining what you're doing, then <strike>...</strike> (as per the lastearlier point); then add replacement text, code, or data.
        • If you're striking out significant sections of your post, and the result ends up difficult to read, consider putting those parts inside <spoiler>...</spoiler> or <readmore>...</readmore> tags; however, ensure the Update comment is visible.

        Never just delete some, or all, of your post!

        See "How do I change/delete my post?" for details and further discussion.

        — Ken

Re: Find duplicate based on specific fields while allowing 2 mismatch
by hexcoder (Curate) on Aug 28, 2017 at 15:30 UTC
Re: Find duplicate based on specific fields while allowing 2 mismatch
by amitgsir (Novice) on Aug 28, 2017 at 06:54 UTC
    This is the output I am getting.
    DELIMIT chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 DELIMIT chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCAAAATG 130 chrM:308 0 TCACGGTG 130 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 DELIMIT chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCAAAATG 130 chrM:309 0 TCACGGTG 130 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 DELIMIT

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1198131]
Approved by beech
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (7)
As of 2024-03-28 11:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found