|Syntactic Confectionery Delight|
Find duplicate based on specific fields while allowing 2 mismatchby amitgsir (Novice)
|on Aug 28, 2017 at 06:53 UTC||Need Help??|
amitgsir has asked for the wisdom of the Perl Monks concerning the following question:
Hi, I have a file like this.
Let me put my question in more correct.
I have table with 4 columns (there are more columns in the file but all the important information has been extracted in these 4 columns to complete the task) like below:
Column1: It contains chromosome positions e.g. chrM:307, chr1:11888, etc. This column is the base of first sorting. So, assumes that all the 10 lines with chrM:307 positions makes 1 cluster which must be divided into multiple cluster based on the value in column3 (UMI).
Column2: This value is the NM tag in the (NGS data) SAM file which corresponds to the error in pcr/sequencing step. So, minimum the column value is good.
Column3: This columns shows the UMI (Unique Molecular Identifier) sequence which has been used during the PCR step. During PCR step there might be mutation in the UMI as well as the original DNA sequence (which is in the further columns in the file, not showed here).
Therefore, I want to allow 2 mismatch in the UMI column to consider the PCR error.
So, if some raw has same chr position and similar UMI (e.g. TCAGGGTG, TCAGAATG, TCAGGGAG) these represents that the reads are from the same original DNA fragment. However, its impossible to decide which one is the starting UMI sequence to make the pattern with 2 mismatch. So, if we take TCAGGGTG as starting UMI, TCAAAATG can not be clustered together with it. However, if we take TCAGGGAG as starting UMI TCAAAATG can make same cluster.
Therefore, selection of first read is important. I missed to explain this before in detail and I couldn't make any code for this. So, let's see, how to decide first read in below part(after column 4 info).
Column4: Column 4 is the length of the sequence in the SAM file, i.e. length of 15th column (not shown here since only length value is important nor sequence for my objective) in each line.
How to choose first line to select UMI for allowing 2 mismatch in next step:
Step1. Find all lines starting at same positions and make group. Now sort all lines by columns 3 (first sorting), then select line with minimum column 2 value (Second sorting) and maximum column 4 value (Third sorting).
Step2. Choose above selected line and decide as representative UMI (column 3). Now check remaining lines (with same starting positions) if have similar UMI (UMI upto 2 mismatch to selected representative UMI) make all lines as a single cluster.
Step3. Select line not having similar UMI in step two, and loop step 1 and 2 again, i.e. find minimum col2 and maximum col4 read, get representative UMI and find line with similar UMI to make new cluster. Loop through again and again until all lines are not used to make a cluster.
Step4. Select next group of lines, starting with next available positions and loop step 1,2 and 3.
It will also change the output how it looks like, I tried to make some artificial data for demonstration but missed the value of column 2 and 4.
Let's understand the final output in two step:
Step 1: This is what we will receive after sorting all line starting with chrM:307
Step 1: This is what we will receive after final clustering for the lines at position chrM:307
I hope above example make it more clearI tried the below code, we try to first cluster by column1 and then, makes all possible pattern using function "@subpats" and match the next line in the file one by one. But, I guess I am having big issue in the looping. Below is my code:
Please help me out. Any suggestion is recommended.