PerlMonks  

Re: Find duplicate based on specific fields while allowing 2 mismatch

by tybalt89 (Monsignor)
on Aug 28, 2017 at 18:00 UTC


in reply to Find duplicate based on specific fields while allowing 2 mismatch

This gives your "expected output" but there's a conflict between my sort and your sort description.

#!/usr/bin/perl -l

# http://perlmonks.org/?node_id=1198131

use strict;
use warnings;

my %one;
while (<DATA>) {
    my ($one, $two, $three, $four) = split;
    push @{ $one{$one} }, [ $two, $three, $four ];
}

for my $one (sort keys %one) {
    my %groups;
    LOOP:
    for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] }
          @{ $one{$one} } ) {
        my ( $two, $three, $four ) = @$_;
        if ( keys %groups ) {
            for ( sort keys %groups ) {
                if ( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) {
                    push @{ $groups{$_} }, [ $two, $three, $four ];
                    next LOOP;
                }
            }
            push @{ $groups{$three} }, [ $two, $three, $four ];
        }
        else {
            push @{ $groups{$three} }, [ $two, $three, $four ];
        }
    }
    for ( sort keys %groups ) {
        print 'CLUSTER';
        for ( values @{ $groups{$_} } ) {
            print join "\t", $one, @$_;
        }
    }
}

__DATA__
chrM:307 0 AGCGGGGA 129
chrM:307 0 AGCGGGGA 130
chrM:307 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 130
chrM:308 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 130
chrM:309 0 AGCGGGGA 129
chrM:307 0 TCAAAATG 130
chrM:308 0 TCAAAATG 130
chrM:309 0 TCAAAATG 130
chrM:307 0 TCACGGTG 130
chrM:308 0 TCACGGTG 130
chrM:309 0 TCACGGTG 130
chrM:307 0 TCAGCCTG 129
chrM:308 0 TCAGCCTG 129
chrM:309 0 TCAGCCTG 129
chrM:307 0 TCAGGGAG 130
chrM:308 0 TCAGGGAG 130
chrM:309 0 TCAGGGAG 130
chrM:307 1 TCAGGGTG 106
chrM:307 2 TCAGGGTG 130
chrM:307 2 TCAGGGTG 129
chrM:308 1 TCAGGGTG 106
chrM:308 2 TCAGGGTG 130
chrM:308 2 TCAGGGTG 129
chrM:309 1 TCAGGGTG 106
chrM:309 2 TCAGGGTG 130
chrM:309 2 TCAGGGTG 129

Output:

CLUSTER
chrM:307 0 AGCGGGGA 130
chrM:307 0 AGCGGGGA 129
chrM:307 0 AGCGGGGA 129
CLUSTER
chrM:307 0 TCAAAATG 130
CLUSTER
chrM:307 0 TCACGGTG 130
chrM:307 0 TCAGGGAG 130
chrM:307 2 TCAGGGTG 130
chrM:307 2 TCAGGGTG 129
chrM:307 1 TCAGGGTG 106
CLUSTER
chrM:307 0 TCAGCCTG 129
CLUSTER
chrM:308 0 AGCGGGGA 130
chrM:308 0 AGCGGGGA 129
chrM:308 0 AGCGGGGA 129
CLUSTER
chrM:308 0 TCAAAATG 130
CLUSTER
chrM:308 0 TCACGGTG 130
chrM:308 0 TCAGGGAG 130
chrM:308 2 TCAGGGTG 130
chrM:308 2 TCAGGGTG 129
chrM:308 1 TCAGGGTG 106
CLUSTER
chrM:308 0 TCAGCCTG 129
CLUSTER
chrM:309 0 AGCGGGGA 130
chrM:309 0 AGCGGGGA 129
chrM:309 0 AGCGGGGA 129
CLUSTER
chrM:309 0 TCAAAATG 130
CLUSTER
chrM:309 0 TCACGGTG 130
chrM:309 0 TCAGGGAG 130
chrM:309 2 TCAGGGTG 130
chrM:309 2 TCAGGGTG 129
chrM:309 1 TCAGGGTG 106
CLUSTER
chrM:309 0 TCAGCCTG 129
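
The mismatch test in the script above, `( $_ ^ "$three" ) =~ tr/\0//c <= 2`, may be easier to follow in isolation. The sketch below is only an illustration: the helper name `mismatches` is hypothetical and does not appear in the original post. XOR-ing two strings yields a NUL byte wherever the characters agree, and `tr/\0//c` counts the bytes that are not NUL, which for equal-length strings is the number of differing positions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count the positions
# at which two equal-length strings differ.
sub mismatches {
    my ($x, $y) = @_;
    # Quoting forces string (bytewise) XOR; identical bytes become "\0".
    # tr/\0//c counts every byte that is NOT "\0" without modifying anything.
    return ( "$x" ^ "$y" ) =~ tr/\0//c;
}

print mismatches('TCAGGGTG', 'TCAGGGAG'), "\n";   # 1
print mismatches('TCAAAATG', 'TCACGGTG'), "\n";   # 3
print mismatches('AGCGGGGA', 'AGCGGGGA'), "\n";   # 0
```

If the two strings differ in length, the shorter one is padded with NUL bytes, so every extra character in the longer string counts as a mismatch.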

Replies are listed 'Best First'.
Re^2: Find duplicate based on specific fields while allowing 2 mismatch
by Anonymous Monk on Aug 29, 2017 at 00:38 UTC

    Hi, thanks for your time! It is providing the expected output.

    Finally, I am wondering about two more things:

    1. How can I get the results sorted by col1? When I tested on a large dataset, the output was not sorted.

    2. How can I print the additional columns after col4 in the output? There may be more columns in the line after these 4 important columns. Also, the number of columns is not fixed: after the first column, some lines may have 15 columns while others may have up to ~20.

    For example, when I pass the data below as input:

    The output is below:

      Like this? (Just a few minor tweaks :)

      BTW, it was sorting, just not numerically.

      #!/usr/bin/perl

      # http://perlmonks.org/?node_id=1198131

      use strict;
      use warnings;

      my %one;
      while (<DATA>) {
          my ($one, $two, $three, $four) = split;
          /(\d+)/; # get number for later sort
          push @{ $one{$1} }, [ $two, $three, $four, $_ ];
      }

      for my $one (sort { $a <=> $b } keys %one) {
          my %groups;
          LOOP:
          for ( sort { $a->[1] cmp $b->[1] or $b->[2] <=> $a->[2] or $a->[0] <=> $b->[0] }
                @{ $one{$one} } ) {
              my ( $two, $three, $four, $all ) = @$_;
              if ( keys %groups ) {
                  for ( sort keys %groups ) {
                      if ( ( $_ ^ "$three" ) =~ tr/\0//c <= 2 ) {
                          push @{ $groups{$_} }, [ $two, $three, $four, $all ];
                          next LOOP;
                      }
                  }
                  push @{ $groups{$three} }, [ $two, $three, $four, $all ];
              }
              else {
                  push @{ $groups{$three} }, [ $two, $three, $four, $all ];
              }
          }
          for ( sort keys %groups ) {
              print "CLUSTER\n";
              print $_->[3] for values @{ $groups{$_} };
          }
      }

      __DATA__
      chrM:307 0 TCAGGGTG 115
      chrM:307 0 TCAGGGTG 107
      chrM:307 0 TCAGGGTG 115
      chrM:307 0 TCAGGGTG 130
      chrM:307 0 TCAGGGTG 114
      chrM:307 1 TCAGGGTG 106
      chrM:310 0 TCAGGGTG 99
      chrM:392 2 CCTCTTAT 130
      chrM:396 2 AGTTACTA 129
      chrM:443 0 ATTATCAA 130 extra columns
      chrM:542 2 AATCCAAA 129
      chrM:542 0 AATCCAAA 129
      chrM:934 0 CATTCGCT 129
      chrM:934 1 CATTCGCT 129
      chrM:1001 0 CGTGACAT 129
      chrM:1127 0 GATACTAA 130
      chrM:1257 1 TGGAAATC 129
      chrM:1262 0 CGGGAAGC 129
      chrM:1262 0 AGGGAAGG 129
      chrM:1603 0 GTATCGGA 130
      chrM:1603 1 GTATCGGA 130

      Outputs:

      CLUSTER
      chrM:307 0 TCAGGGTG 130
      chrM:307 0 TCAGGGTG 115
      chrM:307 0 TCAGGGTG 115
      chrM:307 0 TCAGGGTG 114
      chrM:307 0 TCAGGGTG 107
      chrM:307 1 TCAGGGTG 106
      CLUSTER
      chrM:310 0 TCAGGGTG 99
      CLUSTER
      chrM:392 2 CCTCTTAT 130
      CLUSTER
      chrM:396 2 AGTTACTA 129
      CLUSTER
      chrM:443 0 ATTATCAA 130 extra columns
      CLUSTER
      chrM:542 0 AATCCAAA 129
      chrM:542 2 AATCCAAA 129
      CLUSTER
      chrM:934 0 CATTCGCT 129
      chrM:934 1 CATTCGCT 129
      CLUSTER
      chrM:1001 0 CGTGACAT 129
      CLUSTER
      chrM:1127 0 GATACTAA 130
      CLUSTER
      chrM:1257 1 TGGAAATC 129
      CLUSTER
      chrM:1262 0 AGGGAAGG 129
      chrM:1262 0 CGGGAAGC 129
      CLUSTER
      chrM:1603 0 GTATCGGA 130
      chrM:1603 1 GTATCGGA 130
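
The remark that the first version "was sorting, just not numerically" comes down to Perl's default `sort`, which compares its arguments as strings. A minimal sketch of the difference, using a few positions taken from the sample data (the variable names are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @pos = qw(1001 307 934 1127);

# Default sort compares as strings, so "1001" sorts before "307".
my @by_string = sort @pos;

# An explicit numeric comparison gives the order most people expect.
my @by_number = sort { $a <=> $b } @pos;

print "string order:  @by_string\n";   # 1001 1127 307 934
print "numeric order: @by_number\n";   # 307 934 1001 1127
```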

        The final sorting looks wrong now.

        Even the clusters are mixed in the results for the large dataset.

        E.g., this is what the output looks like now.

        Different chr_POS values are now listed in the same cluster:

        CLUSTER
        chr1:96495 1 AAACAAAG 129 83A45 NB501670:42:HJL7WAFXX:4:21405:10379:16676 83 chr1 96495 24 129M = 96409 -215 AACGAATGGGTGATTTCCCTAGTCACTGCAGTGTGAGGAAATCTACAAAATTAATTTCACAATACGCTTTACAGGATAGGTGGTGAAACACATGAAGTACAACTGCAGTGGGTTATAAAAAACGGCCTT EA<A/<E<A</EEEEEEEEEEEEEAEEEEEEEEEEA<AEEEE<EAE/EE<EAEEEEEEEEEE/AEEEEEEEEEEEEEEEE/<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEE MD:Z:83A45 RG:Z:Sample NM:i:1 AS:i:124 XS:i:119 RX:Z:AAACAAAG
        chr1:948997 0 AAACATCG 34 25 NB501670:42:HJL7WAFXX:4:21409:13718:4079 99 chr1 948997 57 25M9S = 948997 25 CCAGCCAATTTTCGTCTCCCTCCCCTGCCATTTT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:25 RG:Z:Sample NM:i:0 AS:i:25 XS:i:20 RX:Z:AAACATCG
        chr1:991640 3 GTACAAAG 127 6T12^C64C25 NB501670:42:HJL7WAFXX:2:21310:15766:15692 83 chr1 991640 60 18S19M1D90M = 991640 -110 CATGTCTGAACTCAAAGTCCTGAGGGGGGAGCACACATGCTGAGCACTGTGGGAGGCGGGGCCGTGGAGGCAGGAGGCTCTCTGGCGTGCACGTGTGGGTGTGTGTACGTGTGGGGGTGTGTGTGTG A<EE/EAEEA/E/A/EAEAEEAEE/EEAEEEE<EEAEEEEEEEEEEEEEAEEEAEAEAEEEEEAEEAEE<EAEEEEEE/EAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:6T12^C64C25 RG:Z:Sample NM:i:3 AS:i:92 XS:i:25 RX:Z:GTACAAAG
        chr1:991640 2 GTACAAAG 126 19^C64C25 NB501670:42:HJL7WAFXX:1:11311:17221:7023 83 chr1 991640 60 17S19M1D90M = 991640 -110 ATGTCTGAACTCAAAGTCCTGAGTGGGGAGCACACATGCTGAGCACTGTGGGAGGCGGGGCCGTGGAGGCAGGAGGCTCTCTGGCGTGCACGTGTGGGTGTGTGTACGTGTGGGGGTGTGTGTGTG <EAAAEAAA/</AEAE<A<EEEA/EAEEAE/E<EEE/EEAEEEAE<EA/EEEEA/EAAEEEAEAEEEEAEEEEEEEEAEEEEEEEE/EEEEE/EEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEE MD:Z:19^C64C25 RG:Z:Sample NM:i:2 AS:i:97 XS:i:25 RX:Z:GTACAAAG
        CLUSTER
        chr1:953842 1 AAAGCCTA 89 87G1 NB501670:42:HJL7WAFXX:4:11410:9709:15037 99 chr1 953842 60 89M = 953842 89 AAGGCAGCTAAGGCCTGGCGAGTAATCGAGTGCAGCGCCAGTGGGCTGGCACTGCTGGGGGACCCACTACACCCTCCGCAGCCGCTGTC EEEEEEEEEAEEEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE/EEE/EEAE/EEEEEEEEAEEAAEEE/E MD:Z:87G1 RG:Z:Sample NM:i:1 AS:i:87 XS:i:52 RX:Z:AAAGCCTA
        chr1:953842 2 AAAGCCTA 89 66C20G1 NB501670:42:HJL7WAFXX:3:11604:9073:7007 99 chr1 953842 60 89M = 953842 89 AAGGCAGCTAAGGCCTGGCGAGTAATCGAGTGCAGCGCCAGTGGGCTGGCACTGCTGGGGGACCCATTACACCCTCCGCAGCCGCTGTC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAEEEEEEEEEE<EEEEE MD:Z:66C20G1 RG:Z:Sample NM:i:2 AS:i:82 XS:i:62 RX:Z:AAAGCCTA
        CLUSTER
        chr1:1082375 1 AAATACAC 128 95A32 NB501670:42:HJL7WAFXX:2:21207:10611:7181 99 chr1 1082375 60 128M = 1082386 155 GTGGCCCCTGGCCACTTGCACTTGCAGAGGGCGTTAGAGCCTAGGGACCAGGTGACACCAAGGACAGCCCTGGGGCGGTGGGTTCAGAGGTCAGAGCAGGAGGGGCCAGAAAAGGAGCCACCAGGGGC EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEAAEEAEEEEEEE/EEEEEEEEEEAEEEEEEEAEAAEE/EEA6EEEE MD:Z:95A32 RG:Z:Sample NM:i:1 AS:i:123 XS:i:22 RX:Z:AAATACAC
        chr1:1610531 0 AAATTCTC 130 130 NB501670:42:HJL7WAFXX:2:21104:13686:5877 99 chr1 1610531 60 130M = 1610680 284 GCTGTCACAGCACCCGCTACACAGGCTCTGCCACCACCAGCGAGTTTCTAAAACCAAATTCATTTACATGGCAAGGAGGCCACGCTCAAGAAACCCCTCCAGGAGCAAGGAACAGCACGTGGGCTCGGGC <EEEEEEEEAEEEAEEEEAEEEEEEEEEEEEEAEAEEEEEEEEEE<<AAEEEEEEEEEEEE/<EEAEE<AEEEEAEEEEEEEEEEEEEEEAA/EE<AEAAAA<EEEEE//E<E<EE<EAAAAAE<EEA<< MD:Z:130 RG:Z:Sample NM:i:0 AS:i:130 XS:i:20 RX:Z:AAATTCTC
        chr1:1610531 0 AAATTCTC 127 127 NB501670:42:HJL7WAFXX:4:11607:15493:16843 99 chr1 1610531 60 127M = 1610680 284 GCTGTCACAGCACCCGCTACACAGGCTCTGCCACCACCAGCGAGTTTCTAAAACCAAATTCATTTACATGGCAAGGAGGCCACGCTCAAGAAACCCCTCCAGGAGCAAGGAACAGCACGTGGGCTCG EEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEEEEEEEEEEAEAEE/EEEEEEEAAEEEEE</EA<<<AEE/A/E<EE//EAAEA<AEEEAE/EEE< MD:Z:127 RG:Z:Sample NM:i:0 AS:i:127 XS:i:20 RX:Z:AAATTCTC
        chr1:1530847 3 GAATAAAC 92 18^AC3C37 NB501670:42:HJL7WAFXX:3:21601:25889:16656 99 chr1 1530847 0 18S18M2D41M15S = 1530847 61 AGAGAGCAGAACGGGGAGAGACAGAGAGAGAGAGAGAGAGACAGAGAGAGCAGAACAGGGAGAAACAGAGAGACAGAAGCAGGGAGGAGAGA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEE/E//AE/<EAAEEEA/AE/AAEEEEEEA/AEA//E//EE<<EE/EEEEE XA:Z:chr1,+1530495,16S61M15S,3;chr1,+1531341,22S49M21S,2;chr1,+1531035,18S57M17S,4;chr8,-90728798,40S37M15S,0;chr4,-176905131,40S37M15S,0; MD:Z:18^AC3C37 RG:Z:Sample NM:i:3 AS:i:46 XS:i:46 RX:Z:GAATAAAC
        CLUSTER
        chr1:824050 0 AAATTAGC 84 47 NB501670:42:HJL7WAFXX:4:11610:4671:5743 83 chr1 824050 60 37S47M = 823980 -117 GGGTGGGTGGGAACGGCGACTGGGTGGGTGAGCGGGCGGGAGGGAGGAAAAGAAAGAGAGAAAGGTGAAAGGTGGGGACGGGAA E/E/EE</E//////////A/AE//EE//A/6/EEE/EE//EEEEE<EA<EEEEEEEE/<EEAEEAEE/EEEEEEEEAAEEEEE MD:Z:47 RG:Z:Sample NM:i:0 AS:i:47 XS:i:34 RX:Z:AAATTAGC
        chr1:565747 0 AAATTAGG 127 110 NB501670:42:HJL7WAFXX:4:21612:6144:5953 83 chr1 565747 0 17S110M = 565747 -110 GTTCAAACCGCCAGGAGTAATTCCATCCACCCTCCTCTCCCTAGGAGGCCTGCCCCCGCTAACCGGCTTTTTGCCCAAATGGGCCATTATCGAAGAATTCACAAAAAACAATAGCCTCATCATCCCC EEAEEEEEE<EEEEEEEEEEAEEEEEAAAEEEEEEEEEEEAAE/AEAE<E<AAAEA/E<EEEEAEEEA6EEEEEEEEEEEEEE<EAEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE XA:Z:chrM,-5198,17S110M,0; MD:Z:110 RG:Z:Sample NM:i:0 AS:i:110 XS:i:110 RX:Z:AAATTAGG
        chr1:1578228 5 ATATTATC 129 32C45T9G14C15T9 NB501670:42:HJL7WAFXX:1:21305:8626:20091 99 chr1 1578228 0 129M = 1578233 148 GAGGTGCTCTGGAGTCTACTGAAGGTTTGCAAATTCAGGGGGAATCTTGGAGAGTAAACTGTGATTCATTAATCAACGCCACCGCTTCTCACATTAGTGGCTCACACCTCACTCCCCGCAGGCAGGCAG E/6E///AE/E///E//E/EE///E/<EE/E//EE///EE/AE/E/EE6/////AE/EE/E</E///EE6AEEEE//E////E/AA<6/6E//EEE/A/EE6E/A//EEAE///EE/</////A/AA/< XA:Z:chr1,+1641438,129M,5; MD:Z:32C45T9G14C15T9 RG:Z:Sample NM:i:5 AS:i:104 XS:i:104 RX:Z:ATATTATC
        CLUSTER
        chr1:116740 0 AACCAAGT 130 130 NB501670:42:HJL7WAFXX:4:11602:6597:2684 83 chr1 116740 0 130M = 116716 -154 ATTGCTCTTGCCTGTCCTTCAAGTCTATTCTTAAATGTCCCATTCTCTGTGAAGCTTTCCTGCCCACCCTATTTAAATTACAGACTTCACTCCCAATTCCCCATCTACTTTAAGAGTCTTCATTTATCAT <AAA/E/EAA<EEAEE<EEAAA<EEAEEEAEA<EEEEEE<EEEEEEEEEE/AE6EEEEEEEE//EEEA/EAEEEEEEEEEEEEEEEEE/EEEEEEEEEE/E//EEEEEEEEEAEEEEAEE//EEAEEEEE MD:Z:130 RG:Z:Sample NM:i:0 AS:i:130 XS:i:130 RX:Z:AACCAAGT
        chr1:1401547 1 ACCCACGT 90 12T72 NB501670:42:HJL7WAFXX:1:21212:19912:16100 83 chr1 1401547 60 5S85M = 1401547 -85 TCAGAGGCAAGCAGAGGCTGCGGTGAGCCGAGATCCTGCCATTGCACCCCAGCCTGGGCAAGAAGAGCAAACTTCTGTCTCAAAAAAAAA <A6/A<EEEEAEA<E<E<EE//AE/A6EAEEEEEEAEEEEEAEEEEEEEEEEEEAAEEEEEEEEEEAEEEEEEEEEAEEEEEEEEEEEEE MD:Z:12T72 RG:Z:Sample NM:i:1 AS:i:80 XS:i:37 RX:Z:ACCCACGT
        CLUSTER
        chr1:1275215 2 AACCGTAA 60 49A7^A3 NB501670:42:HJL7WAFXX:4:21512:21580:5204 99 chr1 1275215 60 57M1D3M = 1275322 135 GGTGCAGGCAGGATGTGCAGCTCAGTCCACCGCCCCCGCAGACCCACCCGCAGCCGCTGT EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EAAEEE/EAE/E<A/EAAEEAE//< MD:Z:49A7^A3 RG:Z:Sample NM:i:2 AS:i:52 XS:i:26 RX:Z:AACCGTAA
        chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:2:11309:10764:17151 99 chr1 1072050 60 92M6S = 1072050 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTCATTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT EEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEAAEAEEEE<E/EEAEEAEEEEEEEEEEEEEEEEEE<EEEAEAA MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX:Z:AAGCGCAA
        chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:3:11402:26720:3870 99 chr1 1072050 60 92M6S = 1072050 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTCATTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT AAEEEAEAAEEEE/EEEEEE/EEEEEAEEEEEEEAEEEEEEEEEEEEEEAEAEEEEEEEEEEEAEE/EEE/EAEEEEEAAEEAEEEEEEEEEEAEEE< MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX:Z:AAGCGCAA
        chr1:1072050 0 AAGCGCAA 98 92 NB501670:42:HJL7WAFXX:3:11403:1261:3853 99 chr1 1072050 60 92M6S = 1072050 92 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTCATTTGCCCACGTTAATTGGGGCGCAGGTGCAGCTCGCGGAT EEE/<EEEAEEEE/EEE6EE/EEEEEEEAEE</<EEAE//AEEEE6AAA6/A/A<AAE//<EEEAAEE/EEEAA/EAEEEEAEE/EE/AE<<A6EA<A MD:Z:92 RG:Z:Sample NM:i:0 AS:i:92 XS:i:0 RX:Z:AAGCGCAA
        chr1:1072050 0 AAGCGCAA 83 67 NB501670:42:HJL7WAFXX:2:21307:15105:17182 99 chr1 1072050 60 67M16S = 1072050 67 CGGGCCGCCTGGCACACAGGAGGGCGGTTCCTTTCCTGTTGGACCCGGTCTCACTTCATTTGCCCACATCGTTGTAAGCCTTA EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEAEEEEEAEEEEAEEEEAEE/<<E MD:Z:67 RG:Z:Sample NM:i:0 AS:i:67 XS:i:19 RX:Z:AAGCGCAA
        CLUSTER
        chr1:930467 3 AAGACGGT 130 66C33A12A16 NB501670:42:HJL7WAFXX:1:21101:11064:7087 83 chr1 930467 60 130M = 930292 -305 AGCCTGTAATCCCAGCACTTTGGGAGGCCAAGACAGGCAGATCACTTGAGGTCAGAAGTTCGAGACGAGCCTAGCTTCAACAAGGTGAAACCCCGTCTCTGCTAAAAATACAAGAATTAGCCAGGCACGA AAAEAAEEEEAEEEEEEEEEEAEEEA<AEEEEAEEEEEEEAEEEEEEEAEE/EEEAEE/EEEAEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:66C33A12A16 RG:Z:Sample NM:i:3 AS:i:115 XS:i:79 RX:Z:AAGACGGT
        chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:11101:17835:9283 83 chr1 1714005 60 4S60M = 1714005 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCGCGCAAGCG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEAEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 RX:Z:AATGCGGT
        chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:11103:25947:7125 83 chr1 1714005 60 4S60M = 1714005 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCGCGCAAGCG EAAEEEEEEEEEEE6/EEEAEAE6EEAEEEEE//EEEEE/EEEEEEAEEAAEEEEEAEEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 RX:Z:AATGCGGT
        chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:11201:14812:3768 83 chr1 1714005 60 4S60M = 1714005 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCGCGCAAGCG AEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 RX:Z:AATGCGGT
        chr1:1714005 0 AATGCGGT 64 60 NB501670:42:HJL7WAFXX:1:11204:12625:5665 83 chr1 1714005 60 4S60M = 1714005 -60 ATCCGCGAGCTGCAGGTCACTCCACTGCCTGTGTCCACCTGCGACAGGTGCGCCCGCGCAAGCG EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE MD:Z:60 RG:Z:Sample NM:i:0 AS:i:60 XS:i:0 RX:Z:AATGCGGT
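
One possible explanation for the mixed clusters, offered as a guess rather than a diagnosis: the revised script keys its hash on `$1` from `/(\d+)/`, which captures the first run of digits on the line. For the `chrM:307` sample data that is the position, but for keys like `chr1:96495` it is the `1` in `chr1`, so every `chr1` line would share one key. The sketch below assumes the first column always has the form `name:position`; the helper `chr_pos` and the sample keys are illustrative and not part of the original scripts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first run of digits in "chr1:96495" is the "1" of "chr1".
my ($first_digits) = 'chr1:96495' =~ /(\d+)/;    # "1"
# Anchoring on the colon picks up the position instead.
my ($position)     = 'chr1:96495' =~ /:(\d+)/;   # "96495"
print "first digits: $first_digits, position: $position\n";

# Hypothetical helper: split "chr1:96495" into ("chr1", 96495).
# Assumes the key always looks like name:digits.
sub chr_pos {
    my ($key) = @_;
    my ($chr, $pos) = $key =~ /^(.+):(\d+)$/
        or die "unexpected key format: $key";
    return ($chr, $pos);
}

my @keys = qw(chr1:953842 chrM:307 chr1:96495 chr1:1082375);

# Keep the whole first column as the key, then order the keys by
# chromosome name (string compare) and position (numeric compare).
my @sorted = map  { $_->[0] }
             sort { $a->[1] cmp $b->[1] or $a->[2] <=> $b->[2] }
             map  { [ $_, chr_pos($_) ] } @keys;

print "$_\n" for @sorted;
```

Under that assumption, grouping on the full first column (as the first script did) and sorting the keys this way would keep each chr_POS in its own set of clusters while still ordering positions numerically.
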
Re^2: Find duplicate based on specific fields while allowing 2 mismatch
by amitgsir (Novice) on Aug 29, 2017 at 00:47 UTC

    Hi, thanks for your time! It is providing the expected output.

    Sorry, I made the post without logging in and missed the formatting for the output. I can't modify it, so I am posting here again.

    Finally, I am wondering about two more things:

    1. How can I get the results sorted by col1? When I tested on a large dataset, the output was not sorted.

    2. How can I print the additional columns after col4 in the output? There may be more columns in the line after these 4 important columns. Also, the number of columns is not fixed: after the first column, some lines may have 15 columns while others may have up to ~20.

    For example, when I pass the data below as input:
