PerlMonks  

Comparing FILE1 value to FILE2 range and printing matches

by edwardtickle (Initiate)
on Oct 17, 2014 at 10:07 UTC ( [id://1104164]=perlquestion )

edwardtickle has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I'm very new to Perl and am working on a Bioinformatics project at University. I have FILE1 containing a list of positions, in the format:

    99269
    550
    100
    126477
    1700

And FILE2 in the format:

    517 1878 forward
    700 2500 forward
    2156 3289 forward
    99000 100000 forward
    22000 23000 backward

I want to compare every position in FILE1 to every range of values in FILE2, and if a position falls into one of the ranges then I want to print the position, range and direction.

So my expected output would be:

    99269 99000 100000 forward
    550 517 1878 forward
    1700 517 1878 forward

Currently it runs with no errors, but it doesn't output anything, so I am unsure where I am going wrong! When I split the final 'if' rule it runs, but it only works if the position is on exactly the same line as the range.

Any help would be much appreciated.

I have posted the same question on Stackoverflow as I'm after a fairly urgent answer.

My code is as follows:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

    open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt"
        or die "cannot open > CC22: $!";
    open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt"
        or die "cannot open > CDS: $!";
    open (OUTPUTFILE, ">$outputfile")
        or die "Could not open output file: $! \n";

    while (<FILE1>) {
        if (/^(\d+)/) {
            my $CC22 = $1;
            while (<FILE2>) {
                if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                    my $CDS1 = $1;
                    my $CDS2 = $2;
                    my $CDS3 = $3;
                    if ($CC22 > $CDS1 && $CC22 < $CDS2) {
                        print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                    }
                }
            }
        }
    }
    close(FILE1);
    close(FILE2);

Replies are listed 'Best First'.
Re: Comparing FILE1 value to FILE2 range and printing matches
by RichardK (Parson) on Oct 17, 2014 at 12:13 UTC

    Well, one problem is that the first pass of your while loop for FILE2 consumes all the lines in that file and leaves the file handle at the end of the file (i.e. eof returns true). On the next pass there is no more data to read, so no lines will match.

    A simple fix is to move the open FILE2 inside the loop so that you open it each time you need it.

        while (<FILE1>) {
            ...
            open (FILE2, '<', "name");
            while (<FILE2>) {
                ...
            }
            close FILE2;
        }

    It isn't very efficient to keep reopening the same file, and there are lots of better ways, but they are more complex and we would need to know more about your problem, e.g. how big are your files?
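    One alternative to reopening the file, not taken from the reply above, is to rewind the existing handle with seek. The sketch below uses a hypothetical in-memory file standing in for FILE2 and counts the lines seen on each pass, so the exhausted handle and the effect of the rewind are both visible.

```perl
use strict;
use warnings;

# Hypothetical sample data in an in-memory file, standing in for FILE2.
my $ranges = "517 1878 forward\n99000 100000 forward\n";
open my $fh, '<', \$ranges or die "cannot open in-memory file: $!";

# Read the handle twice without rewinding: the second pass finds nothing.
my @counts;
for my $pass (1 .. 2) {
    my $lines = 0;
    while (my $line = <$fh>) {
        $lines++;
    }
    push @counts, $lines;
}

# Rewind to the start of the file, then read again.
seek($fh, 0, 0) or die "cannot seek: $!";
my $after_rewind = 0;
while (my $line = <$fh>) {
    $after_rewind++;
}
close $fh;

print "pass 1: $counts[0], pass 2: $counts[1], after seek: $after_rewind\n";
```

    In the nested loop from the question, the seek call would go just before the inner while loop. It still rereads the whole file for every position, so it avoids the repeated open but not the repeated reading.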

    The Basic debugging checklist has a number of ways you can try to understand why any code isn't doing what you expect.

    Using autodie saves lots of typing for simple programs like this.
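    As a minimal sketch of what that looks like: with autodie loaded, a failing open throws an exception on its own, so the "or die" clauses can be dropped. The path below is hypothetical and chosen so that the open fails; the eval is only there to catch and show the error.

```perl
use strict;
use warnings;
use autodie;    # open, close and similar built-ins now die on failure

# Hypothetical path that should not exist, used only to trigger the failure.
my $missing = "/nonexistent/dir/CC22positions.txt";

my $ok = eval {
    open my $fh, '<', $missing;    # no "or die" needed
    1;
};
my $error = $@;

print "open failed as expected: $error\n" unless $ok;
```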

      That's done the trick, thank you for your help! Autodie does make a lot more sense so I will use that in future.
Re: Comparing FILE1 value to FILE2 range and printing matches
by choroba (Cardinal) on Oct 17, 2014 at 10:21 UTC
    Crossposted at StackOverflow. It's considered polite to inform about crossposting so people not attending both sites don't waste their time hacking a solution to a problem already solved at the other end of the internet.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Apologies, I didn't know that. I have edited both posts to include this.
Re: Comparing FILE1 value to FILE2 range and printing matches
by biohisham (Priest) on Oct 18, 2014 at 04:09 UTC

    You can read the positions into an array by themselves, and then open the second file and iterate over the array to find lines where the positions are enclosed within the ranges. That way you open each file only once.

        use strict;
        use warnings;

        open(my $fh1, "<", "positions.txt") or die("could not open file $!\n");
        my @positions;    # hold the positions to be compared
        while (my $line = <$fh1>) {
            chomp $line;
            push @positions, $line;
        }

        open(my $fh2, "<", "coords_orientation.txt") or die("could not open file $!\n");
        while (my $line = <$fh2>) {
            chomp $line;
            my @record = split(" ", $line);    # split the coords_orientation.txt on white space
            foreach my $pos (@positions) {
                if ($pos > $record[0] && $pos < $record[1]) {
                    print "$pos @record\n";
                }
            }
        }

    A 4 year old monk
Re: Comparing FILE1 value to FILE2 range and printing matches
by CountZero (Bishop) on Oct 18, 2014 at 20:02 UTC
    Using some modules:
        use Modern::Perl '2014';
        use Number::Interval;
        use List::Util qw/first/;

        # FILE1 data emulation
        my @FILE1 = qw/99269 550 100 126477 1700/;

        my @interval_objects;
        while (<DATA>) {
            chomp;
            my ($start, $end, undef) = split;
            push @interval_objects,
                Number::Interval->new(
                    IncMax => 0,
                    IncMin => 0,
                    Min    => $start,
                    Max    => $end,
                );
        }

        for my $datapoint (@FILE1) {
            my $found = first { $_->contains($datapoint) } @interval_objects;
            say "$datapoint is in $found" if $found;
        }

        # FILE2 data emulation
        __DATA__
        517 1878 forward
        700 2500 forward
        2156 3289 forward
        99000 100000 forward
        22000 23000 backward
    Output:
        99269 is in (99000,100000)
        550 is in (517,1878)
        1700 is in (517,1878)
    As said before, this will only work if the list of intervals is not huge.

    It will also only find the first interval that matches. If you want to find all intervals that match, replace the for-loop with:

        for my $datapoint (@FILE1) {
            my @found = grep { $_->contains($datapoint) } @interval_objects;
            say "$datapoint is in @found" if @found;
        }

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Comparing FILE1 value to FILE2 range and printing matches
by Laurent_R (Canon) on Oct 18, 2014 at 11:12 UTC
    The simple fix suggested by RichardK works, but it is not a very efficient way to do this, as RichardK himself pointed out. It is usually better to load at least one of the files into memory (as an array, a hash or some other data structure), but, as Richard already asked, we would need to know how large your two files are in order to provide more guidance on how to do it.
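    A minimal sketch of the load-one-file-into-memory idea, assuming the ranges file is small enough to hold in an array: each range is read once into a [start, end, direction] record, and the positions are then streamed past it. The sample data from the question is held in in-memory files here, and the variable names are illustrative rather than taken from any reply.

```perl
use strict;
use warnings;

# Sample data from the question, held in in-memory files for this sketch.
my $positions_txt = "99269\n550\n100\n126477\n1700\n";
my $ranges_txt    = "517 1878 forward\n700 2500 forward\n2156 3289 forward\n"
                  . "99000 100000 forward\n22000 23000 backward\n";

# Read every range once into an array of [start, end, direction] records.
open my $ranges_fh, '<', \$ranges_txt or die "cannot open ranges: $!";
my @ranges;
while (my $line = <$ranges_fh>) {
    push @ranges, [ $1, $2, $3 ] if $line =~ /^(\d+)\s+(\d+)\s+(\S+)/;
}
close $ranges_fh;

# Stream the positions and test each one against the ranges in memory.
open my $pos_fh, '<', \$positions_txt or die "cannot open positions: $!";
my @matches;
while (my $line = <$pos_fh>) {
    next unless $line =~ /^(\d+)/;
    my $pos = $1;
    for my $range (@ranges) {
        my ($start, $end, $direction) = @$range;
        push @matches, "$pos $start $end $direction"
            if $pos > $start && $pos < $end;
    }
}
close $pos_fh;

print "$_\n" for @matches;
```

    With the sample data this prints matches in the order of the positions file. Note that 1700 falls inside both 517..1878 and 700..2500, so it is reported twice, giving four lines in total.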
