Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?

by pickleswarlz (Initiate)
on Feb 22, 2013 at 22:27 UTC ( [id://1020241]=perlquestion: print w/replies, xml ) Need Help??

pickleswarlz has asked for the wisdom of the Perl Monks concerning the following question:

I am pretty new to Perl, but have what I think is a complex problem that is way beyond my abilities.

I have one .csv file that contains the locations of genes in the genome that looks like this:

Chromosome Start End chr1 220 250 chr1 259 298 chr2 225 263 chr3 310 344 [...]

where the first column indicates the chromosome number that the locations map to, second column is the start position of the gene in the genome, and the third column is the end position of the gene in the genome.

I have a second .csv file that gives a set of other locations in the genome with identification numbers:

ID# position ID1 214 ID2 222 ID3 258 [...]

where the first column is the identifier (alphanumeric), and the second column is the position in the genome.

What I would like to do is find the two locations from the second file that are closest to either side of location given in the first file, and output these to a new file. For example, for this line in .csv file 1:

chr1   220   250

search CSV file 2 for the two closest values on either side of 220 AND the two closest values on either side of 250, given the following output .csv file:

Chromosome Start End Start(-) Start(+) End (-) End (+) chr1 220 250 214 222 222 258

I'm a Windows user, and relatively competent with the command line. I'll have to run this Perl script on multiple files, so I'd love to be able to specify the input and output files in the command line as I go rather than specifying them in the script itself. The files containing the ID information is huge, around 500,000 lines, so I'm wondering if it might be better to transform that file into a hash first.

Thank you!

Replies are listed 'Best First'.
Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?
by choroba (Cardinal) on Feb 23, 2013 at 01:44 UTC
    No need for a hash, an array should be enough. Try the following:
    #!/usr/bin/perl use warnings; use strict; open my $GENES, '<', 'genes' or die $!; open my $LOCATIONS, '<', 'locations' or die $!; chomp(my @locations = map { (split ' ')[1] } <$LOCATIONS>); # If IDs are not already sorted, uncomment the following line: # @locations = sort { $a <=> $b } @locations; for (<$GENES>) { my ($chromosome, $start, $end) = split ' '; print "$chromosome\t$start\t$end"; my $idx = 0; # For $end, start searching where you left for + $start. my $correction = 0; # Needed for Start(-) == Start and End(+) == E +nd. for my $pos ($start, $end) { $idx++ while $locations[$idx] <= $pos - $correction and $idx <= $#locations; die "No numbers around $pos ($idx)\n" if $idx == 0 or $idx > $#locations; print "\t$locations[$idx-1]\t$locations[$idx]"; $correction = 1; } print "\n"; }
    Update: Fixed border cases.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

      Thank you very much! I am using the first suggestion, as I don't really understand the binary search yet, and so far it doesn't seem to have problems with searching the 500,000 rows.

      I am now running into the problem of the results being output on different lines in the resulting CSV file for each print command. Here is my code:

      #!/usr/bin/perl use warnings; use strict; open my $GENES, '<', 'chr1data.csv' or die $!; open my $LOCATIONS, '<', 'chr1snps.csv' or die $!; chomp(my @locations = map { (split ',')[2] } <$LOCATIONS>); # If IDs are not already sorted, uncomment the following line: # @locations = sort { $a <=> $b } @locations; for (<$GENES>) { my ($chromosome, $start, $end) = split ','; print "$chromosome,$start,$end"; my $idx = 0; # For $end, start searching where you left for + $start. my $correction = 0; # Needed for Start(-) == Start and End(+) == E +nd. for my $pos ($start, $end) { $idx++ while $locations[$idx] <= $pos - $correction and $idx <= $#locations; die "No numbers around $pos ($idx) \n" if $idx == 0 or $idx > $#locations; print ",$locations[$idx-1],$locations[$idx]"; $correction = 1; } print "\n"; }

      Printing print ",$locations[$idx-1],$locations[$idx]"; puts this information on a new line. I'd like it to come out on the same line as print "$chromosome,$start,$end"; for each search. Do I have a \n in the wrong place?

        Your $end probably contains newline (I split on ' ', which removed it, you split on a comma). Just
        chomp $end;
        before printing it.
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?
by 7stud (Deacon) on Feb 23, 2013 at 22:22 UTC

    I would think a binary search of the array would be a lot more efficient than a linear pass when n=500,000, which might take on average 250,000 comparisons to locate the range.

    pickleswarlz, If the term 'binary search' scares you, there is an exercise in "Learning Perl" where you write a program that picks a number between 0-100 and then invites a user to guess the number. If the user starts off by picking 50, then the program should respond by saying 'higher' or 'lower'. What is the quickest way to guess the number that the program picked? Well, if the computer says higher, you should pick 75 next. Then if the computer says lower, you should pick 62 or 63 next. Do you see how halving the range each time will arrive at guessing the correct number the quickest? Try it with a friend. It will only take you 4-5 tries to guess the number if each time you guess the middle of the range that you know contains the number. That process is known as a 'binary search'.

    The function below uses a binary search to return the (lowest) numbers in an array that bracket a specified number. Once you have all your ID's in a sorted array, you can search the array for the two numbers that bracket a given number:

    sub binary_search { #Called like this: #binary_search(\@data, $#data, $num); my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my ($middle_index, $middle_value); while ($low_index <= $high_index) { $middle_index = int( ($high_index + $low_index)/2 ); $middle_value = $aref->[$middle_index]; print '.'; #print a dot for each attempt to locate target if ($middle_value < $target) { $low_index = $middle_index + 1; } elsif ($middle_value > $target) { $high_index = $middle_index - 1; } elsif ($middle_value == $target) { print "*"; #Indicates the target is actually in @data return $middle_index == 0 ? ($middle_value, $aref->[1]) : ($aref->[$middle_index-1], $middle_value) ; } } while ($aref->[$low_index] >= $target) { --$low_index; } return $aref->[$low_index], $aref->[$low_index+1]; } say ''; my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = binary_search(\@data, $#data, $num); say "$num: @results"; } --output:-- ....*1: 1 5 ....2: 1 5 ....3: 1 5 ....4: 1 5 ...*5: 1 5 ....*6: 5 6 ..*7: 6 7 ....*8: 7 8 ...*9: 8 9 ....*10: 9 10 .....11: 10 20 .....12: 10 20 .....13: 10 20 .....14: 10 20 .....15: 10 20 .....16: 10 20 .....17: 10 20 .....18: 10 20 .....19: 10 20 .....*20: 10 20 .....21: 20 30 .....22: 20 30 .....23: 20 30 .....24: 20 30 .....25: 20 30 .....26: 20 30 .....27: 20 30 .....28: 20 30 .....29: 20 30 .*30: 20 30 ....31: 30 33 ....32: 30 33 ....*33: 30 33 ...*34: 33 34 ....35: 34 38 ....36: 34 38 ....37: 34 38 ....*38: 34 38 ....39: 38 41 ....40: 38 41 ..*41: 38 41 ....42: 41 100 ....43: 41 100 ....44: 41 100 ....45: 41 100 ....46: 41 100 ....47: 41 100 ....48: 41 100 ....49: 41 100 ....50: 41 100 ....51: 41 100 ....52: 41 100 ....53: 41 100 ....54: 41 100 ....55: 41 100 ....56: 41 100 ....57: 41 100 ....58: 41 100 ....59: 41 100 ....60: 41 100 ....61: 41 100 ....62: 41 100 ....63: 41 100 ....64: 41 100 ....65: 41 100 ....66: 41 100 ....67: 41 100 ....68: 41 100 ....69: 41 100 ....70: 41 100 ....71: 41 100 ....72: 41 100 ....73: 41 100 ....74: 41 100 ....75: 41 100 ....76: 41 100 ....77: 41 100 ....78: 41 100 ....79: 41 100 ....80: 41 100 ....81: 41 100 ....82: 41 100 ....83: 41 100 ....84: 41 100 ....85: 41 100 ....86: 41 100 ....87: 41 100 ....88: 41 100 ....89: 41 100 ....90: 41 100 ....91: 41 100 ....92: 41 100 ....93: 41 100 ....94: 41 100 ....95: 41 100 ....96: 41 100 ....97: 41 100 ....98: 41 100 ....99: 41 100 ....*100: 41 100 ...*101: 100 101 ....*102: 101 102 .....103: 102 110 .....104: 102 110 .....105: 102 110 .....106: 102 110 .....107: 102 110 .....108: 102 110 .....109: 102 110 .....*110: 102 110

    And after reading up on the cpan module List::BinarySearch, whose author appears to really know how to construct efficient search algorithms, I would consider using that module's bsearch_num_pos() function, which returns the optimal insertion index for a given number:

     # The following [function returns] an optimal insert point if no match.
    
        my @num_array =   ( 100, 200, 300, 400 );
    
        $index = bsearch_num_pos 500, @num_array; # Returns 4 - Best insert-at position.
    

    Given the index for the insertion, you can get the range the number falls in with $data[$index-1]and $data[$index]. You just have to adjust for the case when the insertion point is 0, like I did in my code.

    use List::BinarySearch qw(bsearch_num_pos); ... ... sub cpan_binary_search { my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my $index = bsearch_num_pos($target, @$aref); return $index == 0 ? ($aref->[0], $aref->[1]) : ($aref->[$index-1], $aref->[$index]); } my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = cpan_binary_search(\@data, $#data, $num); say "$num: @results"; }

    I'll have to run this Perl script on multiple files, so I'd love to be able to specify the input and output files in the command line as I go rather than specifying them in the script itself.

    Any command line arguments can be found in the @ARGV array:

    use strict; use warnings; use 5.012; say for @ARGV; say "\n"; my ($in, $out) = @ARGV; say $in; say $out; --output:-- $ perl prog1.pl infile_name outfile_name 10 20 infile_name outfile_name 10 20 infile_name outfile_name

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1020241]
Approved by Perlbotics
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chilling in the Monastery: (5)
As of 2024-03-28 20:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found