Re: Sorting Data By Overlapping Intervals

I would propose two modifications. First, when you load the data from the file, extract the fourth column right away and store it alongside each line:

my @SNPs = map { [ (split /\t/)[3], $_ ] } <CG>;

So each element of @SNPs is now an array reference, whose first element is the fourth column and the second element is the full line.
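
To make that structure concrete, here is a small sketch that applies the same map to a couple of made-up tab-separated lines held in memory instead of the CG file handle. The sample rows and their column contents are assumptions for illustration only; your real file will look different.

```perl
use strict;
use warnings;

# Hypothetical sample lines (tab-separated); the fourth column holds the position.
my @lines = (
    "chr1\tA\tG\t150\tsample1\n",
    "chr1\tC\tT\t420\tsample2\n",
);

# Same transformation as above, applied to an in-memory list instead of <CG>.
my @SNPs = map { [ (split /\t/)[3], $_ ] } @lines;

# Each element is a two-element array reference: [ position, full line ].
for my $snp (@SNPs) {
    my ( $position, $line ) = @$snp;
    print "position=$position line=$line";
}
```

Keeping the full line next to the extracted column means the file is split only once, no matter how many intervals are checked later.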

As the second change, in your loop over the intervals, pick all elements that fall in the current interval using grep and then extract the line from the array reference using map:

my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

All you need then is to print these lines into the relevant file.
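
A minimal sketch of the filter-and-print step, assuming made-up [position, line] pairs and made-up interval boundaries. It writes to an in-memory buffer only so that the example is self-contained; in your script you would open a real output file instead.

```perl
use strict;
use warnings;

# Hypothetical [position, line] pairs in the shape described above.
my @SNPs = (
    [ 150, "chr1\tA\tG\t150\tsample1\n" ],
    [ 420, "chr1\tC\tT\t420\tsample2\n" ],
    [ 900, "chr1\tG\tA\t900\tsample3\n" ],
);

# Assumed interval boundaries; both ends are inclusive.
my ( $start, $end ) = ( 100, 500 );

# grep keeps the pairs whose position lies inside the interval,
# map then pulls the full line back out of each surviving pair.
my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

# Write to an in-memory buffer here; a real script would open a file on disk.
my $buffer = '';
open my $OUT, '>', \$buffer or die "can't open in-memory file: $!\n";
print $OUT @inInterval;
close $OUT;

print $buffer;
```

Note that every interval scans all of @SNPs again. That is simple and usually fine for moderate file sizes; for very large inputs, sorting by position first would be worth considering.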

I am not sure whether I have explained this well...

Re^2: Sorting Data By Overlapping Intervals by ccelt09 (Sexton) on Oct 31, 2013 at 10:59 UTC
The logic behind this makes sense, but once I have each element of `@SNPs` stored as an array reference as you explained above, I don't understand how to print those falling within the ranges in my second data set to a relevant file.
Re^3: Sorting Data By Overlapping Intervals by hdb (Monsignor) on Oct 31, 2013 at 11:20 UTC
This is what my second proposal does. If you have the interval boundaries in variables `$start` and `$end`, then

my @inInterval = map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;

will filter all relevant lines for this interval. You would then just `print OUT @inInterval;` where `OUT` is the file handle for the file corresponding to this interval. Something like this:

# $cg_input, $input_interval, $output_directory and $count are assumed to be set earlier in your script
open my $CG, "<", $cg_input or die "can't open $cg_input\n";
my @SNPs = map { [ (split /\t/)[3], $_ ] } <$CG>;
close($CG);

open my $INTERVAL, "<", $input_interval or die "can't open $input_interval\n";
my $interval = <$INTERVAL>; # skip first line
foreach (<$INTERVAL>){
    chomp;
    my( $start, $end ) = split /\t/;
    # one output file per interval
    open my $OUT, ">", $output_directory."temp_file_".$count++.".txt" or die "can't open output file: $!\n";
    print $OUT map { $_->[1] } grep { $start <= $_->[0] and $_->[0] <= $end } @SNPs;
    close $OUT;
}
close($INTERVAL);

