in reply to Find duplicate based on specific fields while allowing 2 mismatch
This gives your "expected output" but there's a conflict between my sort and your sort description.
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1198131
use strict;
use warnings;

# group records by column 1; keep the other three columns per record
my %one;
while(<DATA>)
{
  my ($one, $two, $three, $four) = split;
  push @{ $one{$one} }, [ $two, $three, $four ];
}

for my $one (sort keys %one)
{
  my %groups;
  LOOP:
  # order: column 3 ascending, column 4 descending, column 2 ascending
  for ( sort {
      $a->[1] cmp $b->[1] or
      $b->[2] <=> $a->[2] or
      $a->[0] <=> $b->[0] }
    @{ $one{$one} } )
  {
    my ( $two, $three, $four ) = @$_;
    if( keys %groups )
    {
      for ( sort keys %groups )
      {
        # string XOR leaves \0 where characters match; count the rest
        if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 )
        {
          push @{ $groups{$_} }, [ $two, $three, $four ];
          next LOOP;
        }
      }
      # no existing group within 2 mismatches: start a new one
      push @{ $groups{$three} }, [ $two, $three, $four ];
    }
    else
    {
      push @{ $groups{$three} }, [ $two, $three, $four ];
    }
  }
  for ( sort keys %groups )
  {
    print 'CLUSTER';
    for ( values @{ $groups{$_} } )
    {
      print join "\t", $one, @$_;
    }
  }
}
__DATA__
chrM:307 0 AGCGGGGA 129
chrM:307 0 AGCGGGGA 130
chrM:307 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 130
chrM:308 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 130
chrM:309 0 AGCGGGGA 129
chrM:307 0 TCAAAATG 130
chrM:308 0 TCAAAATG 130
chrM:309 0 TCAAAATG 130
chrM:307 0 TCACGGTG 130
chrM:308 0 TCACGGTG 130
chrM:309 0 TCACGGTG 130
chrM:307 0 TCAGCCTG 129
chrM:308 0 TCAGCCTG 129
chrM:309 0 TCAGCCTG 129
chrM:307 0 TCAGGGAG 130
chrM:308 0 TCAGGGAG 130
chrM:309 0 TCAGGGAG 130
chrM:307 1 TCAGGGTG 106
chrM:307 2 TCAGGGTG 130
chrM:307 2 TCAGGGTG 129
chrM:308 1 TCAGGGTG 106
chrM:308 2 TCAGGGTG 130
chrM:308 2 TCAGGGTG 129
chrM:309 1 TCAGGGTG 106
chrM:309 2 TCAGGGTG 130
chrM:309 2 TCAGGGTG 129
Output:
CLUSTER
chrM:307 0 AGCGGGGA 130
chrM:307 0 AGCGGGGA 129
chrM:307 0 AGCGGGGA 129
CLUSTER
chrM:307 0 TCAAAATG 130
CLUSTER
chrM:307 0 TCACGGTG 130
chrM:307 0 TCAGGGAG 130
chrM:307 2 TCAGGGTG 130
chrM:307 2 TCAGGGTG 129
chrM:307 1 TCAGGGTG 106
CLUSTER
chrM:307 0 TCAGCCTG 129
CLUSTER
chrM:308 0 AGCGGGGA 130
chrM:308 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 129
CLUSTER
chrM:308 0 TCAAAATG 130
CLUSTER
chrM:308 0 TCACGGTG 130
chrM:308 0 TCAGGGAG 130
chrM:308 2 TCAGGGTG 130
chrM:308 2 TCAGGGTG 129
chrM:308 1 TCAGGGTG 106
CLUSTER
chrM:308 0 TCAGCCTG 129
CLUSTER
chrM:309 0 AGCGGGGA 130
chrM:309 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 129
CLUSTER
chrM:309 0 TCAAAATG 130
CLUSTER
chrM:309 0 TCACGGTG 130
chrM:309 0 TCAGGGAG 130
chrM:309 2 TCAGGGTG 130
chrM:309 2 TCAGGGTG 129
chrM:309 1 TCAGGGTG 106
CLUSTER
chrM:309 0 TCAGCCTG 129
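The mismatch test in the script above, `( $_ ^ "$three" ) =~ tr/\0//c <= 2`, may look cryptic, so here is a minimal standalone sketch of the same idea. XOR-ing two strings gives a NUL byte wherever the characters agree, and `tr/\0//c` counts the bytes that are not NUL. The helper name `mismatches` is mine, not from the script, and the sketch assumes both strings have the same length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the positions at which two equal-length strings differ.
# String XOR yields "\0" wherever the characters agree, so the
# number of non-NUL bytes is the number of mismatches.
sub mismatches {
    my ($left, $right) = @_;
    my $xor = "$left" ^ "$right";   # quoting forces string (not numeric) XOR
    return $xor =~ tr/\0//c;        # count every byte that is not NUL
}

print mismatches('TCAGGGTG', 'TCAGGGAG'), "\n";   # prints 1
print mismatches('TCAGGGTG', 'TCACGGTG'), "\n";   # prints 1
print mismatches('TCAGGGTG', 'TCAGCCTG'), "\n";   # prints 2
```

Note that the script compares each record only against the key of an existing group (the first value that opened it), so which records end up together depends on the sort order.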
Re^2: Find duplicate based on specific fields while allowing 2 mismatch
by Anonymous Monk on Aug 29, 2017 at 00:38 UTC
Hi, thanks for your time! It produces the expected output.
Finally, I am wondering about two more things:
1. How do I get the results sorted by col1? When I tested on a large dataset, the output was not sorted.
2. How do I print the additional columns after col4 in the output? There may be more columns in a line after these 4 important columns, and the number of columns is not fixed: after the first column, some lines may have 15 columns while others may have up to ~20.
For example, when I pass the data below as input:
The output is below:
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1198131
use strict;
use warnings;

my %one;
while(<DATA>)
{
  my ($one, $two, $three, $four) = split;
  /(\d+)/; # get number for later sort (first run of digits in the line)
  push @{ $one{$1} }, [ $two, $three, $four, $_ ]; # keep whole line for output
}

for my $one (sort { $a <=> $b } keys %one)
{
  my %groups;
  LOOP:
  # order: column 3 ascending, column 4 descending, column 2 ascending
  for ( sort {
      $a->[1] cmp $b->[1] or
      $b->[2] <=> $a->[2] or
      $a->[0] <=> $b->[0] }
    @{ $one{$one} } )
  {
    my ( $two, $three, $four, $all ) = @$_;
    if( keys %groups )
    {
      for ( sort keys %groups )
      {
        # string XOR leaves \0 where characters match; count the rest
        if( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 )
        {
          push @{ $groups{$_} }, [ $two, $three, $four, $all ];
          next LOOP;
        }
      }
      # no existing group within 2 mismatches: start a new one
      push @{ $groups{$three} }, [ $two, $three, $four, $all ];
    }
    else
    {
      push @{ $groups{$three} }, [ $two, $three, $four, $all ];
    }
  }
  for ( sort keys %groups )
  {
    print "CLUSTER\n";
    print $_->[3] for values @{ $groups{$_} }; # original line, extra columns included
  }
}
__DATA__
chrM:307 0 TCAGGGTG 115
chrM:307 0 TCAGGGTG 107
chrM:307 0 TCAGGGTG 115
chrM:307 0 TCAGGGTG 130
chrM:307 0 TCAGGGTG 114
chrM:307 1 TCAGGGTG 106
chrM:310 0 TCAGGGTG 99
chrM:392 2 CCTCTTAT 130
chrM:396 2 AGTTACTA 129
chrM:443 0 ATTATCAA 130 extra columns
chrM:542 2 AATCCAAA 129
chrM:542 0 AATCCAAA 129
chrM:934 0 CATTCGCT 129
chrM:934 1 CATTCGCT 129
chrM:1001 0 CGTGACAT 129
chrM:1127 0 GATACTAA 130
chrM:1257 1 TGGAAATC 129
chrM:1262 0 CGGGAAGC 129
chrM:1262 0 AGGGAAGG 129
chrM:1603 0 GTATCGGA 130
chrM:1603 1 GTATCGGA 130
Outputs:
CLUSTER
chrM:307 0 TCAGGGTG 130
chrM:307 0 TCAGGGTG 115
chrM:307 0 TCAGGGTG 115
chrM:307 0 TCAGGGTG 114
chrM:307 0 TCAGGGTG 107
chrM:307 1 TCAGGGTG 106
CLUSTER
chrM:310 0 TCAGGGTG 99
CLUSTER
chrM:392 2 CCTCTTAT 130
CLUSTER
chrM:396 2 AGTTACTA 129
CLUSTER
chrM:443 0 ATTATCAA 130 extra columns
CLUSTER
chrM:542 0 AATCCAAA 129
chrM:542 2 AATCCAAA 129
CLUSTER
chrM:934 0 CATTCGCT 129
chrM:934 1 CATTCGCT 129
CLUSTER
chrM:1001 0 CGTGACAT 129
CLUSTER
chrM:1127 0 GATACTAA 130
CLUSTER
chrM:1257 1 TGGAAATC 129
CLUSTER
chrM:1262 0 AGGGAAGG 129
chrM:1262 0 CGGGAAGC 129
CLUSTER
chrM:1603 0 GTATCGGA 130
chrM:1603 1 GTATCGGA 130
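The script above answers the extra-columns question by storing the untouched input line next to the parsed fields and printing that line at the end. If the trailing columns are wanted as a separate value instead, one alternative (not what the script does) is `split` with a limit, as in this rough sketch. The name `parse_line` and the `col1`..`col4` naming are placeholders of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a line into the four key columns plus "everything else".
# A limit of 5 makes split stop after four fields, so the fifth
# value holds all remaining columns as one string, however many
# there are.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    my ($col1, $col2, $col3, $col4, $rest) = split ' ', $line, 5;
    $rest = '' unless defined $rest;   # line had only four columns
    return ($col1, $col2, $col3, $col4, $rest);
}

my @fields = parse_line("chrM:443 0 ATTATCAA 130 extra columns\n");
print join('|', @fields), "\n";   # prints chrM:443|0|ATTATCAA|130|extra columns
```

Keeping the whole line, as the script does, is simpler when the output should reproduce the input exactly, including its original spacing.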
The final sorting looks wrong now:
even the clusters are mixed in the results for the large dataset.
E.g., this is what the output looks like now.
Different chr_POS values are listed in the same cluster.
CLUSTER
chr1:96495 1 AAACAAAG 129 83A45 NB501670:42:HJL7WAFXX:4
+:21405:10379:16676 83 chr1 96495 24 129M = 96409
+ -215 AACGAATGGGTGATTTCCCTAGTCACTGCAGTGTGAGGAAATCTACAAAATTAATTT
+CACAATACGCTTTACAGGATAGGTGGTGAAACACATGAAGTACAACTGCAGTGGGTTATAAAAAACGGC
+CTT EA<A/<E<A</EEEEEEEEEEEEEAEEEEEEEEEEA<AEEEE<EAE/EE<EAEEEEEEEEEE
+/AEEEEEEEEEEEEEEEE/<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE
+ MD:Z:83A45 RG:Z:Sample NM:i:1 AS:i:124 XS:i:119 RX:Z
+:AAACAAAG
chr1:948997 0 AAACATCG 34 25 NB501670:42:HJL7WAFXX:4:21
+409:13718:4079 99 chr1 948997 57 25M9S = 948997
+ 25 CCAGCCAATTTTCGTCTCCCTCCCCTGCCATTTT EEEEEEEEEEEEEEEEEEEEEE
+EEEEEEEEEEEE MD:Z:25 RG:Z:Sample NM:i:0 AS:i:25 XS:i:2
+0 RX:Z:AAACATCG
chr1:991640 3 GTACAAAG 127 6T12^C64C25 NB501670:42:HJL7
+WAFXX:2:21310:15766:15692 83 chr1 991640 60 18S19M1D90
+M = 991640 -110 CATGTCTGAACTCAAAGTCCTGAGGGGGGAGCACACATGCT
+GAGCACTGTGGGAGGCGGGGCCGTGGAGGCAGGAGGCTCTCTGGCGTGCACGTGTGGGTGTGTGTACGT
+GTGGGGGTGTGTGTGTG A<EE/EAEEA/E/A/EAEAEEAEE/EEAEEEE<EEAEEEEEEEEEEEE
+EAEEEAEAEAEEEEEAEEAEE<EAEEEEEE/EAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE
+EEEEEEEEEE MD:Z:6T12^C64C25 RG:Z:Sample NM:i:3 AS:i:92
+ XS:i:25 RX:Z:GTACAAAG
chr1:991640 2 GTACAAAG 126 19^C64C25 NB501670:42:HJL7WA
+FXX:1:11311:17221:7023 83 chr1 991640 60 17S19M1D90M
+ = 991640 -110 ATGTCTGAACTCAAAGTCCTGAGTGGGGAGCACACATGCTGAGC
+ACTGTGGGAGGCGGGGCCGTGGAGGCAGGAGGCTCTCTGGCGTGCACGTGTGGGTGTGTGTACGTGTGG
+GGGTGTGTGTGTG <EAAAEAAA/</AEAE<A<EEEA/EAEEAE/E<EEE/EEAEEEAE<EA/EEE
+EA/EAAEEEAEAEEEEAEEEEEEEEAEEEEEEEE/EEEEE/EEEEEEEEEEEEEEEEEAEEEEEAEEEE
+EEEEE MD:Z:19^C64C25 RG:Z:Sample NM:i:2 AS:i:97 XS:i:2
+5 RX:Z:GTACAAAG
CLUSTER
chr1:953842 1 AAAGCCTA 89 87G1 NB501670:42:HJL7WAFXX:4:
+11410:9709:15037 99 chr1 953842 60 89M = 953842
+ 89 AAGGCAGCTAAGGCCTGGCGAGTAATCGAGTGCAGCGCCAGTGGGCTGGCACTGCTGGGG
+GACCCACTACACCCTCCGCAGCCGCTGTC EEEEEEEEEAEEEEEEEEEEEEEE6EEEEEEEEEEE
+EEEEEEEEEEEEEEAEEEEEEEEEE/EEE/EEAE/EEEEEEEEAEEAAEEE/E MD:Z:87G1
+ RG:Z:Sample NM:i:1 AS:i:87 XS:i:52 RX:Z:AAAGCCTA
chr1:953842 2 AAAGCCTA 89 66C20G1 NB501670:42:HJL7WAFXX
+:3:11604:9073:7007 99 chr1 953842 60 89M = 95384
+2 89 AAGGCAGCTAAGGCCTGGCGAGTAATCGAGTGCAGCGCCAGTGGGCTGGCACTGCTGG
+GGGACCCATTACACCCTCCGCAGCCGCTGTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
+EEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEEEE<EEEEE MD:Z:66C20
+G1 RG:Z:Sample NM:i:2 AS:i:82 XS:i:62 RX:Z:AAAGCCTA
CLUSTER
chr1:1082375 1 AAATACAC 128 95A32 NB501670:42:HJL7WAFXX
+:2:21207:10611:7181 99 chr1 1082375 60 128M = 10
+82386 155 GTGGCCCCTGGCCACTTGCACTTGCAGAGGGCGTTAGAGCCTAGGGACCAGGT
+GACACCAAGGACAGCCCTGGGGCGGTGGGTTCAGAGGTCAGAGCAGGAGGGGCCAGAAAAGGAGCCACC
+AGGGGC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEEEEEEEEEE
+EEEEEEEEEEEEEEEEEEEAEEEAAEEAEEEEEEE/EEEEEEEEEEAEEEEEEEAEAAEE/EEA6EEEE
+ MD:Z:95A32 RG:Z:Sample NM:i:1 AS:i:123 XS:i:22 RX:
+Z:AAATACAC
chr1:1610531 0 AAATTCTC 130 130 NB501670:42:HJL7WAFXX:2
+:21104:13686:5877 99 chr1 1610531 60 130M = 1610
+680 284 GCTGTCACAGCACCCGCTACACAGGCTCTGCCACCACCAGCGAGTTTCTAAAACC
+AAATTCATTTACATGGCAAGGAGGCCACGCTCAAGAAACCCCTCCAGGAGCAAGGAACAGCACGTGGGC
+TCGGGC <EEEEEEEEAEEEAEEEEAEEEEEEEEEEEEEAEAEEEEEEEEEE<<AAEEEEEEEEEE
+EE/<EEAEE<AEEEEAEEEEEEEEEEEEEEEAA/EE<AEAAAA<EEEEE//E<E<EE<EAAAAAE<EEA
+<< MD:Z:130 RG:Z:Sample NM:i:0 AS:i:130 XS:i:20 RX:
+Z:AAATTCTC
chr1:1610531 0 AAATTCTC 127 127 NB501670:42:HJL7WAFXX:4
+:11607:15493:16843 99 chr1 1610531 60 127M = 161
+0680 284 GCTGTCACAGCACCCGCTACACAGGCTCTGCCACCACCAGCGAGTTTCTAAAAC
+CAAATTCATTTACATGGCAAGGAGGCCACGCTCAAGAAACCCCTCCAGGAGCAAGGAACAGCACGTGGG
+CTCG EEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEE
+EEEEEEEEEEAEAEE/EEEEEEEAAEEEEE</EA<<<AEE/A/E<EE//EAAEA<AEEEAE/EEE<
+ MD:Z:127 RG:Z:Sample NM:i:0 AS:i:127 XS:i:20 RX:Z:AAA
+TTCTC
chr1:1530847 3 GAATAAAC 92 18^AC3C37 NB501670:42:HJL7WA
+FXX:3:21601:25889:16656 99 chr1 1530847 0 18S18M2D41M1
+5S = 1530847 61 AGAGAGCAGAACGGGGAGAGACAGAGAGAGAGAGAGAGAGA
+CAGAGAGAGCAGAACAGGGAGAAACAGAGAGACAGAAGCAGGGAGGAGAGA EEEEEEEEEEEEEE
+EEEEEEEEEEEEEEEEEEEEEEEEAEEEE/E//AE/<EAAEEEA/AE/AAEEEEEEA/AEA//E//EE<
+<EE/EEEEE XA:Z:chr1,+1530495,16S61M15S,3;chr1,+1531341,22S49M21S,2
+;chr1,+1531035,18S57M17S,4;chr8,-90728798,40S37M15S,0;chr4,-176905131
+,40S37M15S,0; MD:Z:18^AC3C37 RG:Z:Sample NM:i:3 AS:i:46
+ XS:i:46 RX:Z:GAATAAAC
CLUSTER
chr1:824050 0 AAATTAGC 84 47 NB501670:42:HJL7WAFXX:4:11
+610:4671:5743 83 chr1 824050 60 37S47M = 823980
+ -117 GGGTGGGTGGGAACGGCGACTGGGTGGGTGAGCGGGCGGGAGGGAGGAAAAGAAAGAG
+AGAAAGGTGAAAGGTGGGGACGGGAA E/E/EE</E//////////A/AE//EE//A/6/EEE/EE
+//EEEEE<EA<EEEEEEEE/<EEAEEAEE/EEEEEEEEAAEEEEE MD:Z:47 RG:Z:Samp
+le NM:i:0 AS:i:47 XS:i:34 RX:Z:AAATTAGC
chr1:565747 0 AAATTAGG 127 110 NB501670:42:HJL7WAFXX:4:
+21612:6144:5953 83 chr1 565747 0 17S110M = 56574
+7 -110 GTTCAAACCGCCAGGAGTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCC
+CGCTAACCGGCTTTTTGCCCAAATGGGCCATTATCGAAGAATTCACAAAAAACAATAGCCTCATCATCC
+CC EEAEEEEEE<EEEEEEEEEEAEEEEEAAAEEEEEEEEEEEAAE/AEAE<E<AAAEA/E<EEEE
+AEEEA6EEEEEEEEEEEEEE<EAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE X
+A:Z:chrM,-5198,17S110M,0; MD:Z:110 RG:Z:Sample NM:i:0 AS:
+i:110 XS:i:110 RX:Z:AAATTAGG
chr1:1578228 5 ATATTATC 129 32C45T9G14C15T9 NB501670:42
+:HJL7WAFXX:1:21305:8626:20091 99 chr1 1578228 0 129M
+ = 1578233 148 GAGGTGCTCTGGAGTCTACTGAAGGTTTGCAAATTCAGGGGGAA
+TCTTGGAGAGTAAACTGTGATTCATTAATCAACGCCACCGCTTCTCACATTAGTGGCTCACACCTCACT
+CCCCGCAGGCAGGCAG E/6E///AE/E///E//E/EE///E/<EE/E//EE///EE/AE/E/EE6
+/////AE/EE/E</E///EE6AEEEE//E////E/AA<6/6E//EEE/A/EE6E/A//EEAE///EE/<
+/////A/AA/< XA:Z:chr1,+1641438,129M,5; MD:Z:32C45T9G14C15T9
+RG:Z:Sample NM:i:5 AS:i:104 XS:i:104 RX:Z:ATATTATC
CLUSTER
chr1:116740 0 AACCAAGT 130 130 NB501670:42:HJL7WAFXX:4:
+11602:6597:2684 83 chr1 116740 0 130M = 116716
+ -154 ATTGCTCTTGCCTGTCCTTCAAGTCTATTCTTAAATGTCCCATTCTCTGTGAAGCTTTC
+CTGCCCACCCTATTTAAATTACAGACTTCACTCCCAATTCCCCATCTACTTTAAGAGTCTTCATTTATC
+AT <AAA/E/EAA<EEAEE<EEAAA<EEAEEEAEA<EEEEEE<EEEEEEEEEE/AE6EEEEEEEE/
+/EEEA/EAEEEEEEEEEEEEEEEEE/EEEEEEEEEE/E//EEEEEEEEEAEEEEAEE//EEAEEEEE
+ MD:Z:130 RG:Z:Sample NM:i:0 AS:i:130 XS:i:130 RX:Z:A
+ACCAAGT
chr1:1401547 1 ACCCACGT 90 12T72 NB501670:42:HJL7WAFXX:
+1:21212:19912:16100 83 chr1 1401547 60 5S85M = 1
+401547 -85 TCAGAGGCAAGCAGAGGCTGCGGTGAGCCGAGATCCTGCCATTGCACCCCAG
+CCTGGGCAAGAAGAGCAAACTTCTGTCTCAAAAAAAAA <A6/A<EEEEAEA<E<E<EE//AE/A6
+EAEEEEEEAEEEEEAEEEEEEEEEEEEAAEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEE MD
+:Z:12T72 RG:Z:Sample NM:i:1 AS:i:80 XS:i:37 RX:Z:ACCCA
+CGT
CLUSTER
chr1:1275215 2 AACCGTAA 60 49A7^A3 NB501670:42:HJL7WAFX
+X:4:21512:21580:5204 99 chr1 1275215 60 57M1D3M =
+ 1275322 135 GGTGCAGGCAGGATGTGCAGCTCAGTCCACCGCCCCCGCAGACCCACCC
+GCAGCCGCTGT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EAAEEE/EAE/E<A/EAAE
+EAE//< MD:Z:49A7^A3 RG:Z:Sample NM:i:2 AS:i:52 XS:i:26
+ RX:Z:AACCGTAA
chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:2:1
+1309:10764:17151 99 chr1 1072050 60 92M6S = 1072
+050 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTT
+CATTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT EEEEEEEEEEEEEEEAEEEEEEE
+EEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEAEEEE<E/EEAEEAEEEEEEEEEEEEEEEEEE<E
+EEAEAA MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX
+:Z:AAGCGCAA
chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:3:1
+1402:26720:3870 99 chr1 1072050 60 92M6S = 10720
+50 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTC
+ATTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT AAEEEAEAAEEEE/EEEEEE/EEE
+EEAEEEEEEEAEEEEEEEEEEEEEEAEAEEEEEEEEEEEAEE/EEE/EAEEEEEAAEEAEEEEEEEEEE
+AEEE< MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX:
+Z:AAGCGCAA
chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:3:1
+1403:1261:3853 99 chr1 1072050 60 92M6S = 107205
+0 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTCA
+TTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT EEE/<EEEAEEEE/EEE6EE/EEEE
+EEEAEE</<EEAE//AEEEE6AAA6/A/A<AAE//<EEEAAEE/EEEAA/EAEEEEAEE/EE/AE<<A6
+EA<A MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX:Z
+:AAGCGCAA
chr1:1072050 0 AAGCGCAA 83 67 NB501670:42:HJL7WAFXX:2:2
+1307:15105:17182 99 chr1 1072050 60 67M16S = 107
+2050 67 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACT
+TCATTTGCCCACATCGTTGTAAGCCTTA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
+EEEEEEEEEEEEEEE/EEEEEEEAEEEEEAEEEEAEEEEAEE/<<E MD:Z:67 RG:Z:Sam
+ple NM:i:0 AS:i:67 XS:i:19 RX:Z:AAGCGCAA
CLUSTER
chr1:930467 3 AAGACGGT 130 66C33A12A16 NB501670:42:HJL7
+WAFXX:1:21101:11064:7087 83 chr1 930467 60 130M =
+ 930292 -305 AGCCTGTAATCCCAGCACTTTGGGAGGCCAAGACAGGCAGATCACTTGA
+GGTCAGAAGTTCGAGACGAGCCTAGCTTCAACAAGGTGAAACCCCGTCTCTGCTAAAAATACAAGAATT
+AGCCAGGCACGA AAAEAAEEEEAEEEEEEEEEEAEEEA<AEEEEAEEEEEEEAEEEEEEEAEE/E
+EEAEE/EEEAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
+EEEEEEEE MD:Z:66C33A12A16 RG:Z:Sample NM:i:3 AS:i:115
+XS:i:79 RX:Z:AAGACGGT
chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:1
+1101:17835:9283 83 chr1 1714005 60 4S60M = 17140
+05 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCG
+CGCAAGCG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE
+AEEEAEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 R
+X:Z:AATGCGGT
chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:1
+1103:25947:7125 83 chr1 1714005 60 4S60M = 17140
+05 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCG
+CGCAAGCG EAAEEEEEEEEEEE6/EEEAEAE6EEAEEEEE//EEEEE/EEEEEEAEEAAEEEEEA
+EEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 R
+X:Z:AATGCGGT
chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:1
+1201:14812:3768 83 chr1 1714005 60 4S60M = 17140
+05 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCG
+CGCAAGCG AEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE
+EEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 R
+X:Z:AATGCGGT
chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:1
+1204:12625:5665 83 chr1 1714005 60 4S60M = 17140
+05 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCG
+CGCAAGCG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
+EEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 R
+X:Z:AATGCGGT
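A likely cause of the mixed clusters, judging from the script in the previous reply: `/(\d+)/` captures the first run of digits anywhere in the line. For `chrM:307` that is `307`, but for `chr1:96495` it is the `1` in `chr1`, so every chr1 record would land under the same hash key and be clustered together. Below is a hedged sketch of one way around it: key the hash on the full first column again and sort those keys by chromosome name, then by numeric position. The names `parse_key` and `by_chrom_then_pos` are mine, and the sketch assumes column 1 always looks like `name:digits`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a first-column value such as "chr1:96495" into name and position.
sub parse_key {
    my ($col1) = @_;
    my ($chrom, $pos) = $col1 =~ /^([^:]+):(\d+)$/
        or die "unexpected first column: $col1\n";
    return ($chrom, $pos);
}

# Sort comparator: chromosome name as a string, then position as a number.
sub by_chrom_then_pos {
    my ($chrom_a, $pos_a) = parse_key($a);
    my ($chrom_b, $pos_b) = parse_key($b);
    return $chrom_a cmp $chrom_b || $pos_a <=> $pos_b;
}

my @keys = qw(chr1:991640 chrM:307 chr1:96495 chr1:948997);
print "$_\n" for sort by_chrom_then_pos @keys;
# prints chr1:96495, chr1:948997, chr1:991640, chrM:307 (one per line)
```

In the script this would mean `push @{ $one{$one} }, ...` instead of `$one{$1}`, and `sort by_chrom_then_pos keys %one` for the outer loop. Chromosome names compare as plain strings here, so `chr10` sorts before `chr2`; a natural ordering would need extra handling.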
Re^2: Find duplicate based on specific fields while allowing 2 mismatch
by amitgsir (Novice) on Aug 29, 2017 at 00:47 UTC
Hi, thanks for your time! It produces the expected output.
Sorry, I made the previous post without logging in and lost the formatting of the output. I can't modify it, so I am posting here again.
Finally, I am wondering about two more things:
1. How do I get the results sorted by col1? When I tested on a large dataset, the output was not sorted.
2. How do I print the additional columns after col4 in the output? There may be more columns in a line after these 4 important columns, and the number of columns is not fixed: after the first column, some lines may have 15 columns while others may have up to ~20.
For example, when I pass the data below as input: