http://qs321.pair.com?node_id=1198131

amitgsir has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a file like this.

chrM:307 0 AGCGGGGA 129 chrM:307 0 AGCGGGGA 130 chrM:307 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 129 chrM:308 0 AGCGGGGA 130 chrM:308 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 129 chrM:309 0 AGCGGGGA 130 chrM:309 0 AGCGGGGA 129 chrM:307 0 TCAAAATG 130 chrM:308 0 TCAAAATG 130 chrM:309 0 TCAAAATG 130 chrM:307 0 TCACGGTG 130 chrM:308 0 TCACGGTG 130 chrM:309 0 TCACGGTG 130 chrM:307 0 TCAGCCTG 129 chrM:308 0 TCAGCCTG 129 chrM:309 0 TCAGCCTG 129 chrM:307 0 TCAGGGAG 130 chrM:308 0 TCAGGGAG 130 chrM:309 0 TCAGGGAG 130 chrM:307 1 TCAGGGTG 106 chrM:307 2 TCAGGGTG 130 chrM:307 2 TCAGGGTG 129 chrM:308 1 TCAGGGTG 106 chrM:308 2 TCAGGGTG 130 chrM:308 2 TCAGGGTG 129 chrM:309 1 TCAGGGTG 106 chrM:309 2 TCAGGGTG 130 chrM:309 2 TCAGGGTG 129

Update

Let me put my question in more correct.

I have table with 4 columns (there are more columns in the file but all the important information has been extracted in these 4 columns to complete the task) like below:

Column1: It contains chromosome positions e.g. chrM:307, chr1:11888, etc. This column is the base of first sorting. So, assumes that all the 10 lines with chrM:307 positions makes 1 cluster which must be divided into multiple cluster based on the value in column3 (UMI).

Column2: This value is the NM tag in the (NGS data) SAM file which corresponds to the error in pcr/sequencing step. So, minimum the column value is good.

Column3: This columns shows the UMI (Unique Molecular Identifier) sequence which has been used during the PCR step. During PCR step there might be mutation in the UMI as well as the original DNA sequence (which is in the further columns in the file, not showed here).

Therefore, I want to allow 2 mismatch in the UMI column to consider the PCR error.

So, if some raw has same chr position and similar UMI (e.g. TCAGGGTG, TCAGAATG, TCAGGGAG) these represents that the reads are from the same original DNA fragment. However, its impossible to decide which one is the starting UMI sequence to make the pattern with 2 mismatch. So, if we take TCAGGGTG as starting UMI, TCAAAATG can not be clustered together with it. However, if we take TCAGGGAG as starting UMI TCAAAATG can make same cluster.

Therefore, selection of first read is important. I missed to explain this before in detail and I couldn't make any code for this. So, let's see, how to decide first read in below part(after column 4 info).

Column4: Column 4 is the length of the sequence in the SAM file, i.e. length of 15th column (not shown here since only length value is important nor sequence for my objective) in each line.

How to choose first line to select UMI for allowing 2 mismatch in next step:

Step1. Find all lines starting at same positions and make group. Now sort all lines by columns 3 (first sorting), then select line with minimum column 2 value (Second sorting) and maximum column 4 value (Third sorting).

Step2. Choose above selected line and decide as representative UMI (column 3). Now check remaining lines (with same starting positions) if have similar UMI (UMI upto 2 mismatch to selected representative UMI) make all lines as a single cluster.

Step3. Select line not having similar UMI in step two, and loop step 1 and 2 again, i.e. find minimum col2 and maximum col4 read, get representative UMI and find line with similar UMI to make new cluster. Loop through again and again until all lines are not used to make a cluster.

Step4. Select next group of lines, starting with next available positions and loop step 1,2 and 3.

It will also change the output how it looks like, I tried to make some artificial data for demonstration but missed the value of column 2 and 4.

Let's understand the final output in two step:

Step 1: This is what we will receive after sorting all line starting with chrM:307

chrM:307 0 AGCGGGGA 130 #-->Line1 chrM:307 0 TCAAAATG 130 #-->Line2 chrM:307 0 TCACGGTG 130 #-->Line3 chrM:307 0 TCAGGGAG 130 #-->Line4 chrM:307 2 TCAGGGTG 130 #-->Line5 chrM:307 0 AGCGGGGA 129 #-->Line6 chrM:307 0 AGCGGGGA 129 #-->Line7 chrM:307 0 TCAGCCTG 129 #-->Line8 chrM:307 2 TCAGGGTG 129 #-->Line9 chrM:307 1 TCAGGGTG 106 #-->Line10

Step 1: This is what we will receive after final clustering for the lines at position chrM:307

CLUSTER chrM:307 0 AGCGGGGA 130 #--> Select, first line sorted by + col3, col2, col4(rev sort) # Line1 chrM:307 0 AGCGGGGA 129 #--> Select line with similar UMI + (NO mismatch) # Line6 chrM:307 0 AGCGGGGA 129 #--> Select line with similar UMI + (NO mismatch) # Line7 CLUSTER chrM:307 0 TCAAAATG 130 #--> Now selected, Second line so +rted by col3, col2, col4(rev sort) #Line2 # NO similar UMI, 1 line c +luster CLUSTER chrM:307 0 TCACGGTG 130 #--> Now selected, Third line sor +ted by col3, col2, col4(rev sort) # Line3 chrM:307 0 TCAGGGAG 130 #--> Select line with similar UMI + (Two mismatch with line3 UMI) # Line4 chrM:307 2 TCAGGGTG 130 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line5 chrM:307 2 TCAGGGTG 129 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line9 chrM:307 1 TCAGGGTG 106 #--> Select line with similar UMI + (One mismatch with line3 UMI) # Line10 CLUSTER chrM:307 0 TCAGCCTG 129 #--> Now selected, line#8 sorted +by col3, col2, col4(rev sort) #Line8, make single line cluster CLUSTER ....and so on for next chr positions...

I hope above example make it more clear

I tried the below code, we try to first cluster by column1 and then, makes all possible pattern using function "@subpats" and match the next line in the file one by one. But, I guess I am having big issue in the looping. Below is my code:

#!/usr/bin/env/perl #use strict; #use warnings; my $InSam = $ARGV[0]; my $SampleID = $ARGV[1]; chomp $InSam; chomp $SampleID; open(DATA, "$InSam") || die "Can't open $InSam: $!\n"; open my $OFILE, '>', $SampleID.'.cluster_ChrPos.txt' or die "Cannot cr +eate file for output: $!"; use strict; my %seen=(); my @flds; my $UMI; my @cluster_ChrPos; while (<DATA>) { chomp; @flds=split /\t/; my $POS; if ($seen{$flds[0]}++) { $UMI=$flds[2]; my @subpats; for my $i (0..length($UMI)-1) { for my $j ($i+1..length($UMI)-1) { my $subpat = join('', substr($UMI, 0, $i), '.', # or '\\w' substr($UMI, $i+1, $j-$i-1), '.', # or '\\w' substr($UMI, $j+1), ); push @subpats, $subpat; } } my $pat = join('|', @subpats); if($flds[0] =~ m/"$pat"/) { #print "$pat\t$flds[0]"; push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } else { #push (@cluster_ChrPos, "DELIMIT\n"); push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } } else { push (@cluster_ChrPos, "DELIMIT\n"); push (@cluster_ChrPos, "$_"); push (@cluster_ChrPos, "\n"); } } push (@cluster_ChrPos, "DELIMIT"); print $OFILE "@cluster_ChrPos\n"; print "@cluster_ChrPos\n"; close $OFILE; close DATA;
Please help me out. Any suggestion is recommended.