How to process a BED file as chunk of lines based on chromosome names

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have written a script to calculate the frequency(=coverage) of start sites on DNA strands. The start position is located in column 2 when strand is (+), and located in column 3 when strand is (-). And coverage/frequency is stored in two different hashes: one for each strand.

 
#!/usr/bin/perl
use 5.28.0;
my $lineCounter = 1;
my %startPoints;
my %startPoints_plusStrand;
my %startPoints_minusStrand;
my ($chr, $start, $end, $strand);

open (my $OUT_FH,  ">", "results.bed" ) or die "Cannot open file.\n $!
+";

# Count frequency of start sited in the column-2.
while(<DATA>){
    next if /^#/; #ignore comments in the input file 
    chomp (my $line = $_);
    ($chr, $start, $end, $strand) = (split /\t/, $line)[0,1,2,5];
    my $startPoint;
    if($strand eq '+'){
            $startPoint = $start;
            $startPoints_plusStrand{$startPoint}++;
    } elsif ($strand eq '-') {
            $startPoint = $end;
            $startPoints_minusStrand{$startPoint}++;
    } else {
            die "Goofy BED Data on line $lineCounter:\n$_";
    }
}

say $OUT_FH "#Plus Strand BedGraph:";
for my $key (sort {$a <=> $b} keys %startPoints_plusStrand) {
        say $OUT_FH $key+1, "\t","+","$startPoints_plusStrand{$key}";
}
__DATA__
#chromosome    start    end    position    col-5    strand
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127471196    127472363    Pos1    0    +
chr7    127472363    127473530    Pos2    0    +
chr7    127473530    127474697    Pos3    0    -
chr7    127474697    127475864    Pos4    0    -
#chr8    40000    40500    Pos5    0    +
#chr8    40000    40600    Pos5    0    +
#chr8    40000    40650    Pos5    0    +
#chr8    41000    41200    Pos6    0    -
[download]

If I were doing this for single chromosome, it would work fine (chromosome 7 in my DATA). However, I want to do this for a .BED file containing all 23 human chromosomes, chr1, chr2, chr3 ...... What I want to do is that to calculate coverages for each chromosome and write the result into a single output file or multiple output files like chr1-covrage.txt, chr2-covrage.txt.

My input file is larger than 1 GB. And, example data could be expanded to multiple chrosomes by uncommenting the last 4 lines in the __DATA__ section, which gives chr and chr8. Like:

chr7    127471196    127472363    Pos1    0    +
chr8    40000    40500    Pos5    0    +
[download]

I would appreciate any suggestions on how I could process the file as chunks (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, ...etc) and write the results a single file or multiple files.

Comment on How to process a BED file as chunk of lines based on chromosome names Select or Download Code

Replies are listed 'Best First'.
Re: How to process a BED file as chunk of lines based on chromosome names by GrandFather (Saint) on Jul 20, 2021 at 01:34 UTC
Rather than using manifest variables for accumulating results you can use a hash keyed by chromosome: use strict; use warnings; my %chroms; # Count frequency of start sited in the column-2. while (<DATA>) { next if /^#/; #ignore comments in the input file chomp(my $line = $_); my ($chr, $start, $end, $strand) = (split /\t/, $line)[0, 1, 2, 5] +; my $startPoint; if ($strand eq '+') { $startPoint = $start; $chroms{$chr}{startPoints_plusStrand}{$startPoint}++; } elsif ($strand eq '-') { $startPoint = $end; $chroms{$chr}{startPoints_minusStrand}{$startPoint}++; } else { die "Goofy BED Data on line $.:\n$_"; } } print "#Plus Strand BedGraph:\n"; for my $chromKey (keys %chroms) { my $chrom = $chroms{$chromKey}; print "--- $chromKey\n"; for my $key (sort {$a <=> $b} keys %{$chrom->{startPoints_plusStra +nd}}) { print $key + 1, "\t", "+", "$chrom->{startPoints_plusStrand}{$ +key}\n"; } } __DATA__ #chromosome start end position col-5 strand chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127472363 127473530 Pos2 0 + chr7 127473530 127474697 Pos3 0 - chr7 127474697 127475864 Pos4 0 - chr8 40000 40500 Pos5 0 + chr8 40000 40600 Pos5 0 + chr8 40000 40650 Pos5 0 + chr8 41000 41200 Pos6 0 - [download] Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond	[reply] [d/l]
Re: How to process a BED file as chunk of lines based on chromosome names by kcott (Archbishop) on Jul 24, 2021 at 16:42 UTC
G'day rnaeye, I'm a tad late to the party on this one: I've been indisposed and haven't logged in for about a week. What follows is a series of techniques; some or all may be useful for you. I've basically dealt with this in two phases: the extraction of the data and the reporting of results. Both of these take into consideration your desire to deal with the data in chunks and output as a single file or with multiple files (one per chunk). I'll show the code first. Notes follow which explain what I did differently and why. #!/usr/bin/env perl use 5.28.0; use warnings; use autodie; use Text::CSV; use constant { SEP_CHAR => "\t", BED_CHR => 0, BED_START => 1, BED_END => 2, BED_STRAND => 5, BED_LINE => 6, }; my (%start_plus, %start_minus, %start_dud); my %dispatch = ( '+' => sub { ++$start_plus{$_[BED_CHR]}{$_[BED_START]} }, '-' => sub { ++$start_minus{$_[BED_CHR]}{$_[BED_END]} }, '0' => sub { $start_dud{$_[BED_CHR]}{$_[BED_LINE]} = $_[BED_STRAND +] }, ); my $csv = Text::CSV::->new({sep_char => SEP_CHAR}); while (my $row = $csv->getline(\DATA)) { next if 0 == index $row->[0], '#'; my $sign = exists $dispatch{$row->[BED_STRAND]} ? $row->[BED_STRAN +D] : 0; $dispatch{$sign}->(@$row, $.); } # This part for testing and demonstration purposes only use Data::Dump; say '=' x 10, ' Raw Extracted Data ', '=' x 20; say 'Plus Strands'; say '-' x 50; dd \%start_plus; say '-' x 50; say 'Minus Strands'; say '-' x 50; dd \%start_minus; say '-' x 50; say 'Problem Strands'; say '-' x 50; dd \%start_dud; say '=' x 50; # End this testing & demo part my %filehandle_for; for my $chr (sort keys %start_plus) { for my $file_id (ALL => $chr) { my $fh = _get_results_fh(\%filehandle_for, $file_id); $fh->say(join SEP_CHAR, CHR => 'Plus Strand BedGraph:'); for my $key (sort {$a <=> $b} keys %{$start_plus{$chr}}) { $fh->say(join SEP_CHAR, $chr, $key + 1, $start_plus{$chr}{ +$key}); } } } # This part for testing and demonstration purposes only say "\n", '=' x 10, ' Output Files ', '=' x 26; for my $outfile (qw{ pm_11135183_bio_bed_modified_chr7_results.bed pm_11135183_bio_bed_modified_chr8_results.bed pm_11135183_bio_bed_modified_ALL_results.bed }) { say $outfile; say '-' x 50; system cat => $outfile; say '-' x 50; } # End this testing & demo part sub _get_results_fh { my ($fh_for, $chr) = @_; return $fh_for->{$chr} \|\| _gen_results_fh($fh_for, $chr); } sub _gen_results_fh { my ($fh_for, $chr) = @_; my $fmt = 'pm_11135183_bio_bed_modified_%s_results.bed'; my $filename = sprintf $fmt, $chr; open my $fh, '>', $filename; $fh_for->{$chr} = $fh; return $fh; } __DATA__ #chromosome start end position col-5 strand chr7 127471196 127472363 Pos1 0 @ chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127471196 127472363 Pos1 0 + chr7 127472363 127473530 Pos2 0 + chr7 127473530 127474697 Pos3 0 - chr7 127474697 127475864 Pos4 0 - chr7 127474697 127475864 Pos4 0 _ chr8 40000 40500 Pos5 0 + chr8 40000 40600 Pos5 0 + chr8 40000 40650 Pos5 0 = chr8 40000 40650 Pos5 0 + chr8 41000 41200 Pos6 0 - [download] Here's the output from a sample run: $ ./pm_11135183_bio_bed_modified.pl ========== Raw Extracted Data ==================== Plus Strands -------------------------------------------------- { chr7 => { 127471196 => 5, 127472363 => 1 }, chr8 => { 40000 => 3 } } -------------------------------------------------- Minus Strands -------------------------------------------------- { chr7 => { 127474697 => 1, 127475864 => 1 }, chr8 => { 41200 => 1 } } -------------------------------------------------- Problem Strands -------------------------------------------------- { chr7 => { 2 => "\@", 11 => "_" }, chr8 => { 14 => "=" } } ================================================== ========== Output Files ========================== pm_11135183_bio_bed_modified_chr7_results.bed -------------------------------------------------- CHR Plus Strand BedGraph: chr7 127471197 5 chr7 127472364 1 -------------------------------------------------- pm_11135183_bio_bed_modified_chr8_results.bed -------------------------------------------------- CHR Plus Strand BedGraph: chr8 40001 3 -------------------------------------------------- pm_11135183_bio_bed_modified_ALL_results.bed -------------------------------------------------- CHR Plus Strand BedGraph: chr7 127471197 5 chr7 127472364 1 CHR Plus Strand BedGraph: chr8 40001 3 -------------------------------------------------- [download] Notes: I've used the warnings pragma and strongly recommend you do the same in all Perl programs. The "`use 5.28.0;`" automatically gives you the strict pragma. In the absence of a "`use VERSION;`" (>=5.12) I'd also strongly recommend explicitly including `strict`; of course, you may already know this. I've used the autodie pragma. This takes the tedium out of hand-crafting I/O exception messages; it's also far less error prone — note that your message, "`Cannot open file.`"* doesn't tell the user what file had that problem (`autodie` will do that for you). Next, you'll see Text::CSV. Although the CSV abbreviation stands for "comma-separated values", this module can be used with other separator characters. Notice that I used the constant `SEP_CHAR`; if the separator ever changes, there's only one thing to change in the code to accommodate such a change. `Text::CSV` allows this with "`{sep_char => SEP_CHAR}`". One final point, if you also have Text::CSV_XS installed it will run much faster (but that's not a requirement). I've included a dispatch table (`%dispatch`) to handle your processing generically. Notice a little further down that my `while` loop iterating `DATA` only needs three statements. I've shown the data extracted. See Data::Dump if you're unfamiliar with that module. To handle the potential output to multiple files, I've used a hash (`%filehandle_for`) to cache the filehandles needed. They only need to be opened once; only those that are needed are opened. The code is fairly straightforward: `_get_results_fh()` will return a cached filehandle or, if it hasn't been created yet, will get a new one via `_gen_results_fh()`. While writing these notes, it occurs to me that this could be improved such that the hashref doesn't need to be passed with every invocation. Possibly something like this which keeps the hash private to the two routines that need it and simplifies all of the calling code: `{ my %filehandle_for; sub _get_results_fh { my ($chr) = @_; ... } sub _gen_results_fh { my ($chr) = @_; ... } }` [download] You may have noticed that I've used "`$fh->say(...)`" instead of "`say $fh (...)`". See "perlobj: Indirect Object Syntax" for more about that. You didn't provide any indications about how the output should look when split across multiple files. I just made some guesses and provided alternatives: I expect you'll want to make changes there. If you want to give users choices about the output, take a look at the core module, Getopt::Long. I added a few extra lines to `__DATA__` for testing purposes. Some represent possible typos, e.g. '`_`' instead of '`-`'; consider doing this sort of thing in all of your code. — Ken	[reply] [d/l] [select]


Keep It Simple, Stupid
	PerlMonks