Find duplicate based on specific fields while allowing 2 mismatch

amitgsir has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file like this.

chrM:307    0    AGCGGGGA    129
chrM:307    0    AGCGGGGA    130
chrM:307    0    AGCGGGGA    129
chrM:308    0    AGCGGGGA    129
chrM:308    0    AGCGGGGA    130
chrM:308    0    AGCGGGGA    129
chrM:309    0    AGCGGGGA    129
chrM:309    0    AGCGGGGA    130
chrM:309    0    AGCGGGGA    129
chrM:307    0    TCAAAATG    130
chrM:308    0    TCAAAATG    130
chrM:309    0    TCAAAATG    130
chrM:307    0    TCACGGTG    130
chrM:308    0    TCACGGTG    130
chrM:309    0    TCACGGTG    130
chrM:307    0    TCAGCCTG    129
chrM:308    0    TCAGCCTG    129
chrM:309    0    TCAGCCTG    129
chrM:307    0    TCAGGGAG    130
chrM:308    0    TCAGGGAG    130
chrM:309    0    TCAGGGAG    130
chrM:307    1    TCAGGGTG    106
chrM:307    2    TCAGGGTG    130
chrM:307    2    TCAGGGTG    129
chrM:308    1    TCAGGGTG    106
chrM:308    2    TCAGGGTG    130
chrM:308    2    TCAGGGTG    129
chrM:309    1    TCAGGGTG    106
chrM:309    2    TCAGGGTG    130
chrM:309    2    TCAGGGTG    129
[download]

Update

Let me put my question in more correct.

I have table with 4 columns (there are more columns in the file but all the important information has been extracted in these 4 columns to complete the task) like below:

Column1: It contains chromosome positions e.g. chrM:307, chr1:11888, etc. This column is the base of first sorting. So, assumes that all the 10 lines with chrM:307 positions makes 1 cluster which must be divided into multiple cluster based on the value in column3 (UMI).

Column2: This value is the NM tag in the (NGS data) SAM file which corresponds to the error in pcr/sequencing step. So, minimum the column value is good.

Column3: This columns shows the UMI (Unique Molecular Identifier) sequence which has been used during the PCR step. During PCR step there might be mutation in the UMI as well as the original DNA sequence (which is in the further columns in the file, not showed here).

Therefore, I want to allow 2 mismatch in the UMI column to consider the PCR error.

So, if some raw has same chr position and similar UMI (e.g. TCAGGGTG, TCAGAATG, TCAGGGAG) these represents that the reads are from the same original DNA fragment. However, its impossible to decide which one is the starting UMI sequence to make the pattern with 2 mismatch. So, if we take TCAGGGTG as starting UMI, TCAAAATG can not be clustered together with it. However, if we take TCAGGGAG as starting UMI TCAAAATG can make same cluster.

Therefore, selection of first read is important. I missed to explain this before in detail and I couldn't make any code for this. So, let's see, how to decide first read in below part(after column 4 info).

Column4: Column 4 is the length of the sequence in the SAM file, i.e. length of 15th column (not shown here since only length value is important nor sequence for my objective) in each line.

How to choose first line to select UMI for allowing 2 mismatch in next step:

Step1. Find all lines starting at same positions and make group. Now sort all lines by columns 3 (first sorting), then select line with minimum column 2 value (Second sorting) and maximum column 4 value (Third sorting).

Step2. Choose above selected line and decide as representative UMI (column 3). Now check remaining lines (with same starting positions) if have similar UMI (UMI upto 2 mismatch to selected representative UMI) make all lines as a single cluster.

Step3. Select line not having similar UMI in step two, and loop step 1 and 2 again, i.e. find minimum col2 and maximum col4 read, get representative UMI and find line with similar UMI to make new cluster. Loop through again and again until all lines are not used to make a cluster.

Step4. Select next group of lines, starting with next available positions and loop step 1,2 and 3.

It will also change the output how it looks like, I tried to make some artificial data for demonstration but missed the value of column 2 and 4.

Let's understand the final output in two step:

Step 1: This is what we will receive after sorting all line starting with chrM:307

chrM:307    0    AGCGGGGA    130     #-->Line1
chrM:307    0    TCAAAATG    130     #-->Line2
chrM:307    0    TCACGGTG    130     #-->Line3
chrM:307    0    TCAGGGAG    130     #-->Line4
chrM:307    2    TCAGGGTG    130     #-->Line5
chrM:307    0    AGCGGGGA    129     #-->Line6
chrM:307    0    AGCGGGGA    129     #-->Line7
chrM:307    0    TCAGCCTG    129     #-->Line8
chrM:307    2    TCAGGGTG    129     #-->Line9
chrM:307    1    TCAGGGTG    106     #-->Line10
[download]

Step 1: This is what we will receive after final clustering for the lines at position chrM:307

CLUSTER
chrM:307    0    AGCGGGGA    130     #--> Select, first line sorted by
+ col3, col2, col4(rev sort)    # Line1
chrM:307    0    AGCGGGGA    129     #--> Select line with similar UMI
+  (NO mismatch)                # Line6
chrM:307    0    AGCGGGGA    129     #--> Select line with similar UMI
+  (NO mismatch)                # Line7
CLUSTER
chrM:307    0    TCAAAATG    130     #--> Now selected, Second line so
+rted by col3, col2, col4(rev sort)  #Line2 # NO similar UMI, 1 line c
+luster
CLUSTER
chrM:307    0    TCACGGTG    130     #--> Now selected, Third line sor
+ted by col3, col2, col4(rev sort)  # Line3
chrM:307    0    TCAGGGAG    130     #--> Select line with similar UMI
+  (Two mismatch with line3 UMI)    # Line4
chrM:307    2    TCAGGGTG    130     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line5
chrM:307    2    TCAGGGTG    129     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line9
chrM:307    1    TCAGGGTG    106     #--> Select line with similar UMI
+  (One mismatch with line3 UMI)    # Line10
CLUSTER
chrM:307    0    TCAGCCTG    129     #--> Now selected, line#8 sorted 
+by col3, col2, col4(rev sort)      #Line8, make single line cluster
CLUSTER
....and so on for next chr positions...
[download]

I hope above example make it more clear

I tried the below code, we try to first cluster by column1 and then, makes all possible pattern using function "@subpats" and match the next line in the file one by one. But, I guess I am having big issue in the looping. Below is my code:

#!/usr/bin/env/perl

#use strict;
#use warnings;

my $InSam = $ARGV[0];
my $SampleID = $ARGV[1];
chomp $InSam;
chomp $SampleID;


open(DATA, "$InSam") || die "Can't open $InSam: $!\n";
open my $OFILE, '>', $SampleID.'.cluster_ChrPos.txt' or die "Cannot cr
+eate file for output: $!";

use strict;
my %seen=();
my @flds;
my $UMI;

my @cluster_ChrPos;
while (<DATA>)
{
chomp;
@flds=split /\t/;

my $POS;




    if ($seen{$flds[0]}++)
        {
                $UMI=$flds[2];
                my @subpats;
                for my $i (0..length($UMI)-1) {
                for my $j ($i+1..length($UMI)-1) {
                my $subpat = join('',
                substr($UMI, 0, $i),
                '.',  # or '\\w'
                substr($UMI, $i+1, $j-$i-1),
                '.',  # or '\\w'
                substr($UMI, $j+1),
                );
                push @subpats, $subpat;
                }
                }

                my $pat = join('|', @subpats);
                        
        if($flds[0] =~ m/"$pat"/)
            {
            #print "$pat\t$flds[0]";
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");
            }
        else
            {
            #push (@cluster_ChrPos, "DELIMIT\n");
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");    
            }
        }    
        else
            {
            push (@cluster_ChrPos, "DELIMIT\n");
            push (@cluster_ChrPos, "$_");
            push (@cluster_ChrPos, "\n");
            }
}
push (@cluster_ChrPos, "DELIMIT");

print  $OFILE  "@cluster_ChrPos\n";
print  "@cluster_ChrPos\n";
close $OFILE;
close DATA;
[download]

Please help me out. Any suggestion is recommended.

Back to Seekers of Perl Wisdom