Find duplicate based on specific fields while allowing 2 mismatch

amitgsir has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file like this.

chrM:307    0    AGCGGGGA    129
chrM:307    0    AGCGGGGA    130
chrM:307    0    AGCGGGGA    129
chrM:308    0    AGCGGGGA    129
chrM:308    0    AGCGGGGA    130
chrM:308    0    AGCGGGGA    129
chrM:309    0    AGCGGGGA    129
chrM:309    0    AGCGGGGA    130
chrM:309    0    AGCGGGGA    129
chrM:307    0    TCAAAATG    130
chrM:308    0    TCAAAATG    130
chrM:309    0    TCAAAATG    130
chrM:307    0    TCACGGTG    130
chrM:308    0    TCACGGTG    130
chrM:309    0    TCACGGTG    130
chrM:307    0    TCAGCCTG    129
chrM:308    0    TCAGCCTG    129
chrM:309    0    TCAGCCTG    129
chrM:307    0    TCAGGGAG    130
chrM:308    0    TCAGGGAG    130
chrM:309    0    TCAGGGAG    130
chrM:307    1    TCAGGGTG    106
chrM:307    2    TCAGGGTG    130
chrM:307    2    TCAGGGTG    129
chrM:308    1    TCAGGGTG    106
chrM:308    2    TCAGGGTG    130
chrM:308    2    TCAGGGTG    129
chrM:309    1    TCAGGGTG    106
chrM:309    2    TCAGGGTG    130
chrM:309    2    TCAGGGTG    129
[download]

Update

Let me put my question in more correct.

I have table with 4 columns (there are more columns in the file but all the important information has been extracted in these 4 columns to complete the task) like below:

Column1: It contains chromosome positions e.g. chrM:307, chr1:11888, etc. This column is the base of first sorting. So, assumes that all the 10 lines with chrM:307 positions makes 1 cluster which must be divided into multiple cluster based on the value in column3 (UMI).

Column2: This value is the NM tag in the (NGS data) SAM file which corresponds to the error in pcr/sequencing step. So, minimum the column value is good.

Column3: This columns shows the UMI (Unique Molecular Identifier) sequence which has been used during the PCR step. During PCR step there might be mutation in the UMI as well as the original DNA sequence (which is in the further columns in the file, not showed here).

Therefore, I want to allow 2 mismatch in the UMI column to consider the PCR error.

So, if some raw has same chr position and similar UMI (e.g. TCAGGGTG, TCAGAATG, TCAGGGAG) these represents that the reads are from the same original DNA fragment. However, its impossible to decide which one is the starting UMI sequence to make the pattern with 2 mismatch. So, if we take TCAGGGTG as starting UMI, TCAAAATG can not be clustered together with it. However, if we take TCAGGGAG as starting UMI TCAAAATG can make same cluster.

Therefore, selection of first read is important. I missed to explain this before in detail and I couldn't make any code for this. So, let's see, how to decide first read in below part(after column 4 info).

Column4: Column 4 is the length of the sequence in the SAM file, i.e. length of 15th column (not shown here since only length value is important nor sequence for my objective) in each line.

How to choose first line to select UMI for allowing 2 mismatch in next step:

Step1. Find all lines starting at same positions and make group. Now sort all lines by columns 3 (first sorting), then select line with minimum column 2 value (Second sorting) and maximum column 4 value (Third sorting).

Step2. Choose above selected line and decide as representative UMI (column 3). Now check remaining lines (with same starting positions) if have similar UMI (UMI upto 2 mismatch to selected representative UMI) make all lines as a single cluster.

Step3. Select line not having similar UMI in step two, and loop step 1 and 2 again, i.e. find minimum col2 and maximum col4 read, get representative UMI and find line with similar UMI to make new cluster. Loop through again and again until all lines are not used to make a cluster.

Step4. Select next group of lines, starting with next available positions and loop step 1,2 and 3.

It will also change the output how it looks like, I tried to make some artificial data for demonstration but missed the value of column 2 and 4.

Let's understand the final output in two step:

Step 1: This is what we will receive after sorting all line starting with chrM:307

chrM:307    0    AGCGGGGA    130     #-->Line1
chrM:307    0    TCAAAATG    130     #-->Line2
chrM:307    0    TCACGGTG    130     #-->Line3
chrM:307    0    TCAGGGAG    130     #-->Line4
chrM:307    2    TCAGGGTG    130     #-->Line5
chrM:307    0    AGCGGGGA    129     #-->Line6
chrM:307    0    AGCGGGGA    129     #-->Line7
chrM:307    0    TCAGCCTG    129     #-->Line8
chrM:307    2    TCAGGGTG    129     #-->Line9
chrM:307    1    TCAGGGTG    106     #-->Line10
[download]

Step 1: This is what we will receive after final clustering for the lines at position chrM:307

CLUSTER
chrM:307    0    AGCGGGGA    130     #--> Select, first line sorted by
+ col3, col2, col4(rev sort)    # Line1
chrM:307    0    AGCGGGGA    129     #--> Select line with similar UMI
+  (NO mismatch)                # Line6
chrM:307    0    AGCGGGGA    129     #--> Select line with similar UMI
+  (NO mismatch)                # Line7
CLUSTER
chrM:307    0    TCAAAATG    130     #--> Now selected, Second line so
+rted by col3, col2, col4(rev sort)  #Line2 # NO similar UMI, 1 line c
+luster
CLUSTER
chrM:307    0    TCACGGTG    130     #--> Now selected, Third line sor
+ted by col3, col2, col4(rev sort)  # Line3
chrM:307    0    TCAGGGAG    130     #--> Select line with similar UMI
+  (Two mismatch with line3 UMI)    # Line4
chrM:307    2    TCAGGGTG    130     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line5
chrM:307    2    TCAGGGTG    129     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line9
chrM:307    1    TCAGGGTG    106     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line10
CLUSTER
chrM:307    0    TCAGCCTG    129     #--> Now selected, line#8 sorted 
+by col3, col2, col4(rev sort)      #Line8, make single line cluster
CLUSTER
....and so on for next chr positions...
[download]

I hope above example make it more clear

I want to cluster the lines based on two values:

1. First column have same value, i.e. the chromosome position must be same. (EDITED)

2. Column 3 must also be similar, i.e. in each cluster lines with similar UMI allowing 2 mismatch will be clustered together. Allowing two mismatch must be done corresponding to the UMI (column3) infirst line. e.g. if UMI is AAAAAAAA than AAAATTAA and AAAAAAGG must be in the same cluster, but not the line with AAAATTGG. (However, AAAATTGG is also have two mismatch to AAAATTAA and AAAAAAGG, so using first line will be key here)

And finally, differentiate each cluster by line delimiter as CLUSTER(or any string line DELIMIT). So, my output should looks like below.

CLUSTER
chrM:307    1    TCAGGGTG    106
chrM:307    2    TCAGGGTG    130
chrM:307    2    TCAGGGTG    129
chrM:307    0    TCAGGGAG    130
chrM:307    0    TCAGCCTG    129
chrM:307    0    TCACGGTG    130
CLUSTER
chrM:307    0    TCAAAATG    130
CLUSTER
chrM:307    0    AGCGGGGA    129
chrM:307    0    AGCGGGGA    130
chrM:307    0    AGCGGGGA    129
CLUSTER
chrM:308    1    TCAGGGTG    106
chrM:308    2    TCAGGGTG    130
chrM:308    2    TCAGGGTG    129
chrM:308    0    TCAGGGAG    130
chrM:308    0    TCAGCCTG    129
chrM:308    0    TCACGGTG    130
CLUSTER
chrM:308    0    TCAAAATG    130
CLUSTER
chrM:308    0    AGCGGGGA    129
chrM:308    0    AGCGGGGA    130
chrM:308    0    AGCGGGGA    129
CLUSTER
chrM:309    1    TCAGGGTG    106
chrM:309    2    TCAGGGTG    130
chrM:309    2    TCAGGGTG    129
chrM:309    0    TCAGGGAG    130
chrM:309    0    TCAGCCTG    129
chrM:309    0    TCACGGTG    130
CLUSTER
chrM:309    0    TCAAAATG    130
CLUSTER
chrM:309    0    AGCGGGGA    129
chrM:309    0    AGCGGGGA    130
chrM:309    0    AGCGGGGA    129
CLUSTER
[download]

Please note: Similar UMI makes single cluster, e.g. first cluster line 1-5 and 7, second cluster line 6 and third cluster line 8-10 and so on.

I tried the below code, we try to first cluster by column1 and then, makes all possible pattern using function "@subpats" and match the next line in the file one by one. But, I guess I am having big issue in the looping. Below is my code:

#!/usr/bin/env/perl

#use strict;
#use warnings;

my $InSam = $ARGV[0];
my $SampleID = $ARGV[1];
chomp $InSam;
chomp $SampleID;


open(DATA, "$InSam") || die "Can't open $InSam: $!\n";
open my $OFILE, '>', $SampleID.'.cluster_ChrPos.txt' or die "Cannot cr
+eate file for output: $!";

use strict;
my %seen=();
my @flds;
my $UMI;

my @cluster_ChrPos;
while (<DATA>)
{
chomp;
@flds=split /\t/;

my $POS;




    if ($seen{$flds[0]}++)
        {
                $UMI=$flds[2];
                my @subpats;
                for my $i (0..length($UMI)-1) {
                for my $j ($i+1..length($UMI)-1) {
                my $subpat = join('',
                substr($UMI, 0, $i),
                '.',  # or '\\w'
                substr($UMI, $i+1, $j-$i-1),
                '.',  # or '\\w'
                substr($UMI, $j+1),
                );
                push @subpats, $subpat;
                }
                }

                my $pat = join('|', @subpats);
                        
        if($flds[0] =~ m/"$pat"/)
            {
            #print "$pat\t$flds[0]";
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");
            }
        else
            {
            #push (@cluster_ChrPos, "DELIMIT\n");
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");    
            }
        }    
        else
            {
            push (@cluster_ChrPos, "DELIMIT\n");
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");
            }
}
push (@cluster_ChrPos, "DELIMIT");

print  $OFILE  "@cluster_ChrPos\n";
print  "@cluster_ChrPos\n";
close $OFILE;
close DATA;
[download]

Please help me out. Any suggestion is recommended.

Comment on Find duplicate based on specific fields while allowing 2 mismatch Select or Download Code

Replies are listed 'Best First'.
Re: Find duplicate based on specific fields while allowing 2 mismatch by kcott (Archbishop) on Aug 28, 2017 at 09:17 UTC
G'day amitgsir, There are a number of issues with your post: What does "UMI" mean? I'm guessing that's domain-specific jargon. You should explain this sort of thing in your post or, at the very least, provide a link to a definition. What does "2 mismatch" mean? Perhaps that's meaningful when "UMI" is explained. `CLUSTER` and `DELIMIT`: Both are mentioned in your textual description. Only `CLUSTER` appears in what I assume is your expected output. Only `DELIMIT` appears in your code. Only `DELIMIT` appears in your actual output (as a result of the last point). Using package variables for filehandles is a bad choice. Use lexical variables as shown in the open documentation. The token '`DATA`' is special to Perl; see "perldata: Special Literals". Using this (or any other special token) in your code for something other than its intended purpose should be avoided. I've used it in my code example below so you can see normal usage. Don't comment out `strict` and `warnings`. Leave them in and fix the problems they highlight. If you don't understand the problem, or how to fix it, you can always ask here. I'm pretty sure chomping elements of the `@ARGV` array is unnecessary. You can test that with something like '`print "\|$var\|"`', which will show terminal newlines (and any other unexpected or unwanted characters). Also note that, in my example code below, I haven't used `chomp` at all: consider whether removing newlines from input, then adding them back for output, is actually gaining you anything — it's often just a lot of redundant processing. Your code layout leaves much to be desired. Inconsistent indentation, and swathes of unnecessary whitespace, reduce readability. Choose a coding style that suits you and stick to it: "perlstyle - Perl style guide" has tips on how to do this. The following code produces what I'm guessing is, at least, close to what you want. It produces the output in the same groupings you ~~show~~showed (see Update below), and it includes both the `CLUSTER` and `DELIMIT` tokens. Given the points I've already made regarding those tokens, this may not be exactly what you want: as stated, this is a guess on my part; modify to suit your needs. #!/usr/bin/env perl use strict; use warnings; use constant { COL0 => 0, COL2 => 1, ALL => 2 }; my ($col0, $col2) = ('', ''); print map { if ($col0 ne $_->[COL0]) { ($col0, $col2) = @{$_}[COL0,COL2]; ("CLUSTER\n", $_->[ALL]); } elsif (substr($col2, 0, 3) ne substr($_->[COL2], 0, 3)) { $col2 = $_->[COL2]; ("DELIMIT\n", $_->[ALL]); } else { $_->[ALL]; } } sort { $a->[COL0] cmp $b->[COL0] \|\| $b->[COL2] cmp $a->[COL2] } map { [ (split)[0,2], $_ ] } <DATA>; __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 [download] Output: CLUSTER chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCACGGTG 130 chrM:307 0 TCAAAATG 130 DELIMIT chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCACGGTG 130 chrM:308 0 TCAAAATG 130 DELIMIT chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCACGGTG 130 chrM:309 0 TCAAAATG 130 DELIMIT chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 [download] By the way, the construct I've used in the form of `... map { ... } sort { ... } map { ... } ...` [download] is known as a Schwartzian Transform. It is particularly useful for sorting based on parts of a string while retaining the entire string, in an unmodified form, for eventual output. Update: The original, expected output has changed. There is no notification of what changed or even that a change has been made. Here's the original (in the spoiler) that my response is based on: <Reveal this spoiler or all in this thread> — Ken	[reply] [d/l] [select]
Re^2: Find duplicate based on specific fields while allowing 2 mismatch by amitgsir (Novice) on Aug 29, 2017 at 00:51 UTC
Thanks for your time and sorry I missed to update without keeping spoiler tag. I tried to explain the question again with more detail. It is interesting to learn the map contruct, I was trying to do with just using Arrays which was I guess not efficient. I am learning more about 'hash' and 'map' using perlmonks suggestion. Thanks!	[reply]
Re: Find duplicate based on specific fields while allowing 2 mismatch by shadowsong (Pilgrim) on Aug 28, 2017 at 11:05 UTC
Hi amitgsir Not wanting to assume what your criteria is for UMIs that comprise a cluster; could you expand on what you mean by: 2. Column 3 must also be similar, i.e. in each cluster lines with similar UMI allowing 2 mismatch will be clustered together We need a bit more to go on: What do the [up to 3] UMIs need to have in common to be considered similar? Why is TCACGGTG in the first cluster instead of TCAAAATG? If it's all the same and it doesn't matter what the UMIs are as long as you have them in sets of 3s; you could try this: #!perl -slw use strict; my ($chromosomes,$DELIMITER) = (undef,'CLUSTER'); while ( <DATA> ) { s/\R//g; # remove line breaks; my $record = [split /\s+/]; push @{$chromosomes->{$record->[0]}->{$record->[2]}},[$record->[1] +,$record->[3]]; } foreach my $chrM (sort keys %{$chromosomes}) { my $cnt = 0; # used to print delimiter foreach my $UMI (sort {$a cmp $b} keys %{$chromosomes->{$chrM}}) { print $DELIMITER unless $cnt++ % 3; print "$chrM\t$_->[0]\t$UMI\t$_->[1]" foreach (sort {$a->[0] <=> $b->[0] or $a->[1] <=> $b->[1]} @{$chromosomes->{$chrM}->{$UMI +}}); } } __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 [download] Output C:\code\perlmonks>perl pm_1198131.pl CLUSTER chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 CLUSTER chrM:307 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 129 chrM:307 2 TCAGGGTG 130 CLUSTER chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 TCAAAATG 130 chrM:308 0 TCACGGTG 130 CLUSTER chrM:308 0 TCAGCCTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 129 chrM:308 2 TCAGGGTG 130 CLUSTER chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 TCAAAATG 130 chrM:309 0 TCACGGTG 130 CLUSTER chrM:309 0 TCAGCCTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 129 chrM:309 2 TCAGGGTG 130 [download] Cheers Shadowsong	[reply] [d/l] [select]
Re: Find duplicate based on specific fields while allowing 2 mismatch by tybalt89 (Monsignor) on Aug 28, 2017 at 18:00 UTC
This gives your "expected output" but there's a conflict between my sort and your sort description. #!/usr/bin/perl -l # http://perlmonks.org/?node_id=1198131 use strict; use warnings; my %one; while(<DATA>) { my ($one, $two, $three, $four) = split; push @{ $one{$one} }, [ $two, $three, $four ]; } for my $one (sort keys %one) { my %groups; LOOP: for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] } @{ $one{$one} } ) { my ( $two, $three, $four ) = @$_; if( keys %groups ) { for ( sort keys %groups ) { if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) { push @{ $groups{$_} }, [ $two, $three, $four ]; next LOOP; } } push @{ $groups{$three} }, [ $two, $three, $four ]; } else { push @{ $groups{$three} }, [ $two, $three, $four ]; } } for ( sort keys %groups ) { print 'CLUSTER'; for ( values @{ $groups{$_} } ) { print join "\t", $one, @$_; } } } __DATA__ chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 [download] Output: CLUSTER chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 129 CLUSTER chrM:307 0 TCAAAATG 130 CLUSTER chrM:307 0 TCACGGTG 130 chrM:307 0 TCAGGGAG 130 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 1 TCAGGGTG 106 CLUSTER chrM:307 0 TCAGCCTG 129 CLUSTER chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 CLUSTER chrM:308 0 TCAAAATG 130 CLUSTER chrM:308 0 TCACGGTG 130 chrM:308 0 TCAGGGAG 130 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 CLUSTER chrM:308 0 TCAGCCTG 129 CLUSTER chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 CLUSTER chrM:309 0 TCAAAATG 130 CLUSTER chrM:309 0 TCACGGTG 130 chrM:309 0 TCAGGGAG 130 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 CLUSTER chrM:309 0 TCAGCCTG 129 [download]	[reply] [d/l] [select]
Re^2: Find duplicate based on specific fields while allowing 2 mismatch by Anonymous Monk on Aug 29, 2017 at 00:38 UTC
Hi, thanks for your time! It is providing the expected output. Finally, I am wordering about two more thing: 1. How to get the sorted results by col1. When I tested on large dataset output is not sorted. 2. How to print additional column after col.4 in output results. There may be more columns in the line after these 4 important columns. Also, #of clumns is not fixed in each line. So, after first column some lines may have 15 columns while some might have upto ~20. For example: When I am passing the below data as input: Read more... (70 kB) Out is below: Read more... (12 kB)	[reply] [d/l] [select]
Re^3: Find duplicate based on specific fields while allowing 2 mismatch by tybalt89 (Monsignor) on Aug 29, 2017 at 01:16 UTC
Like this? (Just a few minor tweaks :) BTW, it was sorting, just not numerically. #!/usr/bin/perl # http://perlmonks.org/?node_id=1198131 use strict; use warnings; my %one; while(<DATA>) { my ($one, $two, $three, $four) = split; /(\d+)/; # get number for later sort push @{ $one{$1} }, [ $two, $three, $four, $_ ]; } for my $one (sort { $a <=> $b } keys %one) { my %groups; LOOP: for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] } @{ $one{$one} } ) { my ( $two, $three, $four, $all ) = @$_; if( keys %groups ) { for ( sort keys %groups ) { if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) { push @{ $groups{$_} }, [ $two, $three, $four, $all ]; next LOOP; } } push @{ $groups{$three} }, [ $two, $three, $four, $all ]; } else { push @{ $groups{$three} }, [ $two, $three, $four, $all ]; } } for ( sort keys %groups ) { print "CLUSTER\n"; print $_->[3] for values @{ $groups{$_} }; } } __DATA__ chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 107 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 130 chrM:307 0 TCAGGGTG 114 chrM:307 1 TCAGGGTG 106 chrM:310 0 TCAGGGTG 99 chrM:392 2 CCTCTTAT 130 chrM:396 2 AGTTACTA 129 chrM:443 0 ATTATCAA 130 extra columns chrM:542 2 AATCCAAA 129 chrM:542 0 AATCCAAA 129 chrM:934 0 CATTCGCT 129 chrM:934 1 CATTCGCT 129 chrM:1001 0 CGTGACAT 129 chrM:1127 0 GATACTAA 130 chrM:1257 1 TGGAAATC 129 chrM:1262 0 CGGGAAGC 129 chrM:1262 0 AGGGAAGG 129 chrM:1603 0 GTATCGGA 130 chrM:1603 1 GTATCGGA 130 [download] Outputs: CLUSTER chrM:307 0 TCAGGGTG 130 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 115 chrM:307 0 TCAGGGTG 114 chrM:307 0 TCAGGGTG 107 chrM:307 1 TCAGGGTG 106 CLUSTER chrM:310 0 TCAGGGTG 99 CLUSTER chrM:392 2 CCTCTTAT 130 CLUSTER chrM:396 2 AGTTACTA 129 CLUSTER chrM:443 0 ATTATCAA 130 extra columns CLUSTER chrM:542 0 AATCCAAA 129 chrM:542 2 AATCCAAA 129 CLUSTER chrM:934 0 CATTCGCT 129 chrM:934 1 CATTCGCT 129 CLUSTER chrM:1001 0 CGTGACAT 129 CLUSTER chrM:1127 0 GATACTAA 130 CLUSTER chrM:1257 1 TGGAAATC 129 CLUSTER chrM:1262 0 AGGGAAGG 129 chrM:1262 0 CGGGAAGC 129 CLUSTER chrM:1603 0 GTATCGGA 130 chrM:1603 1 GTATCGGA 130 [download]	[reply] [d/l] [select]
Re^4: Find duplicate based on specific fields while allowing 2 mismatch by amitgsir (Novice) on Aug 29, 2017 at 01:34 UTC
Re^5: Find duplicate based on specific fields while allowing 2 mismatch by tybalt89 (Monsignor) on Aug 29, 2017 at 01:52 UTC
Some notes below your chosen depth have not been shown here
Re^2: Find duplicate based on specific fields while allowing 2 mismatch by amitgsir (Novice) on Aug 29, 2017 at 00:47 UTC
Hi, thanks for your time! It is providing the expected output. Sorry I made the post without login and missed the formating for the output. I Can't modify it so I am posting here again. Finally, I am wordering about two more thing: 1. How to get the sorted results by col1. When I tested on large dataset output is not sorted. 2. How to print additional column after col.4 in output results. There may be more columns in the line after these 4 important columns. Also, #of clumns is not fixed in each line. So, after first column some lines may have 15 columns while some might have upto ~20. For example: When I am passing the below data as input: Read more... (81 kB)	[reply] [d/l] [select]
Re^3: Find duplicate based on specific fields while allowing 2 mismatch by tybalt89 (Monsignor) on Aug 29, 2017 at 01:24 UTC
See Re^3: Find duplicate based on specific fields while allowing 2 mismatch	[reply]
Re: Find duplicate based on specific fields while allowing 2 mismatch by hdb (Monsignor) on Aug 28, 2017 at 07:59 UTC
Can you please clarify your second requirement? The relation "two mismatches" is not transitive, e.g. there are two mismatches between `AAAA` and `AAGG` and two mismatches between `AAGG` and `TTGG` but four mismatches between `AAAA` and `TTGG`. Would you consider all three to belong to one cluster?	[reply] [d/l] [select]
Re^2: Find duplicate based on specific fields while allowing 2 mismatch by amitgsir (Novice) on Aug 28, 2017 at 08:21 UTC
I think, I need to take the first entry as the reference and allow the two possible mismatch using first line's UMI tag to make one cluster. Remaning lines at the same start positions, can be looped again similarly. So, if the first line have AAAA then AAGG or TTAA, etc can be merged into single cluster, But, TTGG will make separate cluster. I have edited the question for the same! Amit	[reply]
Re^3: Find duplicate based on specific fields while allowing 2 mismatch by kcott (Archbishop) on Aug 28, 2017 at 10:21 UTC
"I have edited the question for the same!" Do not just edit your question without showing very clearly what you've changed: it often means that answers to your original post no longer make sense. You can do this in a number of ways: For a small change within a sentence: `<del>old text</del><ins>new text</ins>`. To remove a large amount of text, code, or data: add an Update comment explaining what you're doing, then `<strike>... part to remove ...</strike>`. To add new content: use an Update comment explaining what you're doing, then add the new content. To change a large amount of text, code, or data: add an Update comment explaining what you're doing, then `<strike>...</strike>` (as per the ~~last~~earlier point); then add replacement text, code, or data. If you're striking out significant sections of your post, and the result ends up difficult to read, consider putting those parts inside `<spoiler>...</spoiler>` or `<readmore>...</readmore>` tags; however, ensure the Update comment is visible. Never just delete some, or all, of your post! See "How do I change/delete my post?" for details and further discussion. — Ken	[reply] [d/l] [select]
Re: Find duplicate based on specific fields while allowing 2 mismatch by hexcoder (Curate) on Aug 28, 2017 at 15:30 UTC
UMI (Unique molecular identifier) are the values in the third column, I assume.	[reply]
Re: Find duplicate based on specific fields while allowing 2 mismatch by amitgsir (Novice) on Aug 28, 2017 at 06:54 UTC
This is the output I am getting. DELIMIT chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:307 0 TCAGGGAG 130 chrM:307 0 TCAGCCTG 129 chrM:307 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 DELIMIT chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:308 0 TCAGGGAG 130 chrM:308 0 TCAGCCTG 129 chrM:308 0 TCAAAATG 130 chrM:308 0 TCACGGTG 130 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 DELIMIT chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129 chrM:309 0 TCAGGGAG 130 chrM:309 0 TCAGCCTG 129 chrM:309 0 TCAAAATG 130 chrM:309 0 TCACGGTG 130 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 DELIMIT [download]	[reply] [d/l]