My general approach in this type of situation is to build a hash, or more specifically a hash of lists. I am not familiar with your particular input format, and working out the patterns and how to parse them is usually where most of the effort in a script like this goes. Taking the input from your original post:
HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/1 - 1449586 1449619
HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/2 - 1449544 1449577
HWUSI-EAS95L_0025_FC:3:1:6417:1078#0/1 - 4744083 4744113
HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/1 - 4867122 4867157
HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/2 - 4866942 4866977
HWUSI-EAS95L_0025_FC:3:1:10260:1083#0/1 + 1930232 1930266
HWUSI-EAS95L_0025_FC:3:1:10260:1083#0/2 + 1930354 1930389
And fitting the most general pattern that seems to match, I would use code like
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
# binary => 1 is set here, as the Text::CSV documentation recommends.
my $csv = Text::CSV->new ( { sep_char => "\t", binary => 1 } )
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
my %result;
while ( my $row = $csv->getline( *DATA ) ) {
    my ($key, $index) = $row->[0] =~ m{^(.+)/([12])$}
        or die "Line did not match pattern: @$row";
    if ($index == 1) {
        $result{$key}[0] = $row->[2];
    } elsif ($index == 2) {
        $result{$key}[1] = $row->[2];
    } else {
        die "Index was not 1 or 2: @$row";
    }
}
$csv->eof or $csv->error_diag();
# Output results (only for reads where both /1 and /2 were seen):
for my $key (keys %result) {
    next unless defined $result{$key}[0] && defined $result{$key}[1];
    print "$key:\t$result{$key}[0]\t$result{$key}[1]\n";
}
__DATA__
HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/1 - 1449586 1449619
HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/2 - 1449544 1449577
HWUSI-EAS95L_0025_FC:3:1:6417:1078#0/1 - 4744083 4744113
HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/1 - 4867122 4867157
HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/2 - 4866942 4866977
HWUSI-EAS95L_0025_FC:3:1:10260:1083#0/1 + 1930232 1930266
HWUSI-EAS95L_0025_FC:3:1:10260:1083#0/2 + 1930354 1930389
to get the output (the order of the lines may vary, since Perl hash keys are unordered)
HWUSI-EAS95L_0025_FC:3:1:6539:1083#0: 4867122 4866942
HWUSI-EAS95L_0025_FC:3:1:5232:1082#0: 1449586 1449544
HWUSI-EAS95L_0025_FC:3:1:10260:1083#0: 1930232 1930354
Note that because of how tabs get mangled on this site, you'll need to click the download link in order to get the proper formatting in your clipboard.
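If the tab mangling gets in the way, one alternative is to split each line on any run of whitespace instead of parsing it as tab-separated values. The following is only a minimal sketch of that idea, assuming none of the four columns ever contains embedded whitespace; it uses core Perl only, and the helper name pair_starts is made up for illustration. It also sorts the keys so the output order stays the same between runs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the start coordinate of each /1 and /2 read.
# Takes a list of input lines; returns a hash ref of read name => [start1, start2].
sub pair_starts {
    my @lines = @_;
    my %pairs;
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip blank lines
        # split ' ' breaks on any run of whitespace, so tabs and spaces both work
        my ($id, $strand, $start, $end) = split ' ', $line;
        my ($key, $index) = $id =~ m{^(.+)/([12])$}
            or die "Line did not match pattern: $line";
        $pairs{$key}[$index - 1] = $start;    # /1 goes in slot 0, /2 in slot 1
    }
    return \%pairs;
}

# Sample lines copied from the original post.
my @sample = (
    'HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/1 - 1449586 1449619',
    'HWUSI-EAS95L_0025_FC:3:1:5232:1082#0/2 - 1449544 1449577',
    'HWUSI-EAS95L_0025_FC:3:1:6417:1078#0/1 - 4744083 4744113',
    'HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/1 - 4867122 4867157',
    'HWUSI-EAS95L_0025_FC:3:1:6539:1083#0/2 - 4866942 4866977',
);

my $pairs = pair_starts(@sample);

# Sort the keys so the output order is stable between runs.
for my $key (sort keys %$pairs) {
    my ($first, $second) = @{ $pairs->{$key} };
    next unless defined $first && defined $second;    # complete pairs only
    print "$key:\t$first\t$second\n";
}
```

To run it on a real file, the @sample list could be replaced by lines read from a filehandle. The trade-off is that this version gives up the quoting and escaping support that Text::CSV provides, which should not matter for input shaped like the sample above.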