Today I am writing to seek some wisdom on my current task: writing a fairly simple Perl program that 1) opens a large 1000 genome file; 2) opens 75 different files, each with 50 identifiers; 3) in a loop, starting with File 1, identifier 1: if the identifier matches the 2nd column of the 1000 genome file, extracts the whole line from the 1000 genome file and stores it in an output file labeled File1_out, then moves on to the 2nd of the 50 identifiers. Do this for all the identifiers in File 1; after the 50th ID, open File 2 and do the same thing again, and so on for all 75 files. I think I should open each of the files only once and work with arrays.

You have not included samples from these files; that would have significantly helped us visualize your idea. Is the second column in the 1000 genome file a checksum value, or something else? Does this file have another custom format? And what do the other 75 files contain?

My suggestion is to utilize hashes since you're not concerned with the ordering of the values in the 1K genome file but rather whether or not these values have matches in the other files. Hashes facilitate quick searching. So for the 1K file, read the second column into a hash:

    # Untested code, for lack of example input from the OP
    use strict;        # enforces predeclaration of variables, better scoping
    use warnings;      # warns about likely mistakes in your code
    use Data::Dumper;  # visualize your data structures

    my %hash;  # file-scoped hash to hold the desired column

    open(my $fh, "<", "1k_genome_file.bas")
        or die("could not open file: $!\n");
    while (my $line = <$fh>) {
        chomp $line;
        # split around a delimiter (a space, a tab, a comma...etc.)
        my @array = split(/\t/, $line);
        # get the second column
        my $second_column = $array[1];
        $hash{$second_column} = 1;
    }
    close($fh);
    print Dumper(\%hash);  # see if the hash looks like what is expected

In the line $hash{$second_column}=1;, entries in the second column are used as hash keys, each with a value of 1. A repeated entry simply overwrites the earlier one, so each identifier appears in the hash only once. Alternatively, if your second column is made up of unique entries, you can use the line number where the entry occurred in the file as your hash value (that will be helpful later when you want to access the lines to be extracted). Two relevant posts on reading a file by line number are Line number in a file and Best way to read line x from a file.
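The line-number variant could look something like the sketch below. It assumes tab-separated input; the helper name index_by_column and the sample data are made up for illustration, and an in-memory filehandle stands in for the real 1k file. The special variable $. is Perl's built-in line counter for the most recently read filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: map each second-column entry to the line number
# where it first occurs. Assumes tab-separated columns.
sub index_by_column {
    my ($fh) = @_;
    my %line_of;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        next unless defined $fields[1];   # skip lines with no second column
        # $. is the current line number; //= keeps the first occurrence
        $line_of{ $fields[1] } //= $.;
    }
    return \%line_of;
}

# Made-up sample data standing in for the real 1k genome file
my $sample = "chr1\trs100\tA\nchr1\trs200\tG\nchr2\trs300\tT\n";
open(my $fh, '<', \$sample) or die "could not open in-memory file: $!\n";
my $line_of = index_by_column($fh);
close($fh);

print "$_ is on line $line_of->{$_}\n" for sort keys %$line_of;
```

Keeping only the line number keeps the hash small, at the cost of a second pass over the 1k file to fetch the lines themselves.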

Now that you have read the desired column into a hash, you can iterate over the other files in the folder and, for each file, check whether each identifier exists as a hash key; if it does, extract the corresponding line from the 1k file. The module File::Find is a friendly way to iterate over folders.
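Here is one way the whole matching step might be sketched, again untested against real data. It makes several assumptions not stated by the OP: columns are tab-separated, each identifier file holds one ID per line and ends in .ids, all files sit in one flat directory, and the 1k file fits in memory. Because of the flat directory it uses bsd_glob from the core File::Glob module rather than File::Find, and it stores the whole line in the hash instead of 1 so the 1k file is read only once. The helper names and sample files are hypothetical.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Temp qw(tempdir);

# Hypothetical helper: read the genome file once, keeping every line
# under its second-column entry (a list, so repeated IDs are not lost).
sub load_genome {
    my ($path) = @_;
    my %lines_for;
    open(my $fh, '<', $path) or die "could not open $path: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        next unless defined $fields[1];
        push @{ $lines_for{ $fields[1] } }, $line;
    }
    close($fh);
    return \%lines_for;
}

# Hypothetical helper: for one ID file, write every matching genome
# line to "<name>_out", in the order the IDs appear in the ID file.
sub extract_matches {
    my ($id_file, $lines_for) = @_;
    (my $out_file = $id_file) =~ s/\.ids\z/_out/;
    open(my $in,  '<', $id_file)  or die "could not open $id_file: $!\n";
    open(my $out, '>', $out_file) or die "could not open $out_file: $!\n";
    while (my $id = <$in>) {
        chomp $id;
        next unless exists $lines_for->{$id};   # ID not in the genome file
        print {$out} "$_\n" for @{ $lines_for->{$id} };
    }
    close($in);
    close($out) or die "could not close $out_file: $!\n";
    return $out_file;
}

# Made-up sample files in a temporary directory, for demonstration only
my $dir = tempdir(CLEANUP => 1);
my %sample = (
    'genome.tsv' => "chr1\trs100\tA\nchr1\trs200\tG\nchr2\trs300\tT\n",
    'File1.ids'  => "rs300\nrs999\nrs100\n",
    'File2.ids'  => "rs200\n",
);
for my $name (keys %sample) {
    open(my $fh, '>', "$dir/$name") or die "could not write $name: $!\n";
    print {$fh} $sample{$name};
    close($fh) or die "could not close $name: $!\n";
}

my $lines_for = load_genome("$dir/genome.tsv");
my @out_files = map { extract_matches($_, $lines_for) }
                sort( bsd_glob("$dir/*.ids") );
print "wrote $_\n" for @out_files;
```

Each file is opened exactly once, and every identifier lookup is a single hash access rather than a scan of the 1k file.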

As other monks have suggested, turning on strict and warnings is good coding practice.

Something or the other, a monk since 2009

In reply to Re: Dear Venerable Monks by biohisham
in thread Dear Venerable Monks by A1 Transcendence
