G'day rnaeye,
I'm a tad late to the party on this one: I've been indisposed and haven't logged in for about a week.
What follows is a series of techniques; some or all may be useful for you.
I've basically dealt with this in two phases: the extraction of the data and the reporting of results.
Both of these take into consideration your desire to deal with the data in chunks
and output as a single file or with multiple files (one per chunk).
I'll show the code first.
Notes follow which explain what I did differently and why.
#!/usr/bin/env perl
use 5.28.0;
use warnings;
use autodie;
use Text::CSV;
use constant {
SEP_CHAR => "\t",
BED_CHR => 0,
BED_START => 1,
BED_END => 2,
BED_STRAND => 5,
BED_LINE => 6,
};
my (%start_plus, %start_minus, %start_dud);
my %dispatch = (
'+' => sub { ++$start_plus{$_[BED_CHR]}{$_[BED_START]} },
'-' => sub { ++$start_minus{$_[BED_CHR]}{$_[BED_END]} },
'0' => sub { $start_dud{$_[BED_CHR]}{$_[BED_LINE]} = $_[BED_STRAND
+] },
);
my $csv = Text::CSV::->new({sep_char => SEP_CHAR});
while (my $row = $csv->getline(\*DATA)) {
next if 0 == index $row->[0], '#';
my $sign = exists $dispatch{$row->[BED_STRAND]} ? $row->[BED_STRAN
+D] : 0;
$dispatch{$sign}->(@$row, $.);
}
# This part for testing and demonstration purposes only
use Data::Dump;
say '=' x 10, ' Raw Extracted Data ', '=' x 20;
say 'Plus Strands';
say '-' x 50;
dd \%start_plus;
say '-' x 50;
say 'Minus Strands';
say '-' x 50;
dd \%start_minus;
say '-' x 50;
say 'Problem Strands';
say '-' x 50;
dd \%start_dud;
say '=' x 50;
# End this testing & demo part
my %filehandle_for;
for my $chr (sort keys %start_plus) {
for my $file_id (ALL => $chr) {
my $fh = _get_results_fh(\%filehandle_for, $file_id);
$fh->say(join SEP_CHAR, CHR => 'Plus Strand BedGraph:');
for my $key (sort {$a <=> $b} keys %{$start_plus{$chr}}) {
$fh->say(join SEP_CHAR, $chr, $key + 1, $start_plus{$chr}{
+$key});
}
}
}
# This part for testing and demonstration purposes only
say "\n", '=' x 10, ' Output Files ', '=' x 26;
for my $outfile (qw{
pm_11135183_bio_bed_modified_chr7_results.bed
pm_11135183_bio_bed_modified_chr8_results.bed
pm_11135183_bio_bed_modified_ALL_results.bed
}) {
say $outfile;
say '-' x 50;
system cat => $outfile;
say '-' x 50;
}
# End this testing & demo part
sub _get_results_fh {
my ($fh_for, $chr) = @_;
return $fh_for->{$chr} || _gen_results_fh($fh_for, $chr);
}
sub _gen_results_fh {
my ($fh_for, $chr) = @_;
my $fmt = 'pm_11135183_bio_bed_modified_%s_results.bed';
my $filename = sprintf $fmt, $chr;
open my $fh, '>', $filename;
$fh_for->{$chr} = $fh;
return $fh;
}
__DATA__
#chromosome start end position col-5 strand
chr7 127471196 127472363 Pos1 0 @
chr7 127471196 127472363 Pos1 0 +
chr7 127471196 127472363 Pos1 0 +
chr7 127471196 127472363 Pos1 0 +
chr7 127471196 127472363 Pos1 0 +
chr7 127471196 127472363 Pos1 0 +
chr7 127472363 127473530 Pos2 0 +
chr7 127473530 127474697 Pos3 0 -
chr7 127474697 127475864 Pos4 0 -
chr7 127474697 127475864 Pos4 0 _
chr8 40000 40500 Pos5 0 +
chr8 40000 40600 Pos5 0 +
chr8 40000 40650 Pos5 0 =
chr8 40000 40650 Pos5 0 +
chr8 41000 41200 Pos6 0 -
Here's the output from a sample run:
$ ./pm_11135183_bio_bed_modified.pl
========== Raw Extracted Data ====================
Plus Strands
--------------------------------------------------
{ chr7 => { 127471196 => 5, 127472363 => 1 }, chr8 => { 40000 => 3 } }
--------------------------------------------------
Minus Strands
--------------------------------------------------
{ chr7 => { 127474697 => 1, 127475864 => 1 }, chr8 => { 41200 => 1 } }
--------------------------------------------------
Problem Strands
--------------------------------------------------
{ chr7 => { 2 => "\@", 11 => "_" }, chr8 => { 14 => "=" } }
==================================================
========== Output Files ==========================
pm_11135183_bio_bed_modified_chr7_results.bed
--------------------------------------------------
CHR Plus Strand BedGraph:
chr7 127471197 5
chr7 127472364 1
--------------------------------------------------
pm_11135183_bio_bed_modified_chr8_results.bed
--------------------------------------------------
CHR Plus Strand BedGraph:
chr8 40001 3
--------------------------------------------------
pm_11135183_bio_bed_modified_ALL_results.bed
--------------------------------------------------
CHR Plus Strand BedGraph:
chr7 127471197 5
chr7 127472364 1
CHR Plus Strand BedGraph:
chr8 40001 3
--------------------------------------------------
Notes:
-
I've used the warnings pragma and strongly recommend you do the same in all Perl programs.
-
The "use 5.28.0;" automatically gives you the strict pragma.
In the absence of a "use VERSION;" (>=5.12) I'd also strongly recommend explicitly including strict;
of course, you may already know this.
-
I've used the autodie pragma.
This takes the tedium out of hand-crafting I/O exception messages; it's also far less error prone
— note that your message, "Cannot open file." doesn't tell the user what file had that problem
(autodie will do that for you).
-
Next, you'll see Text::CSV.
Although the CSV abbreviation stands for "comma-separated values",
this module can be used with other separator characters.
Notice that I used the constant SEP_CHAR;
if the separator ever changes, there's only one thing to change in the code to accommodate such a change.
Text::CSV allows this with "{sep_char => SEP_CHAR}".
One final point, if you also have Text::CSV_XS installed it will run much faster
(but that's not a requirement).
-
I've included a dispatch table (%dispatch) to handle your processing generically.
Notice a little further down that my while loop iterating DATA only needs three statements.
-
I've shown the data extracted. See Data::Dump if you're unfamiliar with that module.
-
To handle the potential output to multiple files, I've used a hash (%filehandle_for) to cache the filehandles needed.
They only need to be opened once; only those that are needed are opened.
The code is fairly straightforward:
_get_results_fh() will return a cached filehandle or, if it hasn't been created yet,
will get a new one via _gen_results_fh().
-
You may have noticed that I've used "$fh->say(...)" instead of "say $fh (...)".
See "perlobj: Indirect Object Syntax" for more about that.
-
You didn't provide any indications about how the output should look when split across multiple files.
I just made some guesses and provided alternatives: I expect you'll want to make changes there.
If you want to give users choices about the output,
take a look at the core module, Getopt::Long.
-
I added a few extra lines to __DATA__ for testing purposes.
Some represent possible typos, e.g. '_' instead of '-';
consider doing this sort of thing in all of your code.
|