PerlMonks  

Using hash keys to separate data

by a217 (Novice)
on Jun 29, 2011 at 04:52 UTC ( [id://911894]=perlquestion )

a217 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perlmonks,

I have a file (hashKey.txt) that I would like to use as a list of hash keys in order to separate data from input files (e.g. testReg.txt).

hashKey.txt

chr10 chr10_random chr11 chr11_gl000202_random chr11_random chr12 chr13 chr13_random chr14 chr15 chr15_random chr16 chr16_random chr17_ctg5_hap1 chr17 chr17_gl000203_random chr17_gl000204_random chr17_gl000205_random chr17_gl000206_random chr17_random chr18 chr18_gl000207_random chr18_random chr19 chr19_gl000208_random chr19_gl000209_random chr19_random chr1 chr1_gl000191_random chr1_gl000192_random chr1_random chr20 chr21 chr21_gl000210_random chr21_random chr22 chr22_h2_hap1 chr22_random chr2 chr2_random chr3 chr3_random chr4_ctg9_hap1 chr4 chr4_gl000193_random chr4_gl000194_random chr4_random chr5 chr5_h2_hap1 chr5_random chr6_apd_hap1 chr6_cox_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap2 chr6_qbl_hap6 chr6_random chr6_ssto_hap7 chr7 chr7_gl000195_random chr7_random chr8 chr8_gl000196_random chr8_gl000197_random chr8_random chr9 chr9_gl000198_random chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random chr9_random chrM chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236 chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248 chrUn_gl000249 chrX chrX_random chrY

hashKey.txt gives a list of all the possible chromosome values there could be in a given input file.

testReg.txt

chr1 100 159 0
chr1 200 260 0
chr1 500 750 0
chr3 450 700 0
chr4 100 300 0
chr7 350 600 0
chr9 100 125 0
chr11 679 687 0
chr22 100 200 0
chr22 300 400 0

testReg.txt is simply a test file I use to test the code. It includes various chromosome values along with 3 other columns of data.

My code so far:

#!/usr/bin/perl
use warnings;
use strict;

my (%Chr, %R);
my (@key_split, @reg_split);
my ($reg_line);

open(KEY, "<hashKey.txt") or die "error reading key list";
open(REG, "<testReg.txt") or die "error reading file";

while (<KEY>) {
    chomp;
    @key_split = split("\n");
    $Chr{"$key_split[0]"} = $key_split[0];
}

while (<REG>) {
    chomp;
    @reg_split = split("\t");
    #$R{"$reg_split[0]"} = ($reg_split[0], $reg_split[1], $reg_split[2], $reg_split[3]);
    $R{"$reg_split[0]"} = $reg_split[0];
}

foreach my $key (keys %Chr) {
    if(exists($R{$key})){
        print ("$R{$key}\n");
    }
}

close(KEY);
close(REG);

So far, my code prints out all of the chr values in common between hashKey.txt and testReg.txt. What I would like it to do is to print each line to a separate file designated by each chromosome. For example:

chr1.out

chr1 100 159 0
chr1 200 260 0
chr1 500 750 0

chr3.out

chr3 450 700 0

chr4.out

chr4 100 300 0

chr7.out

chr7 350 600 0

chr9.out

chr9 100 125 0

chr11.out

chr11 679 687 0

chr22.out

chr22 100 200 0
chr22 300 400 0

From there I can use each separated file to sort what I need to. I suppose my main problem is trying to figure out how to have the hash variable point toward the unique line. Is what I am trying to accomplish even possible with a hash table, given that the same key could be used for multiple lines? My main goal is just to separate each chr from the input file (testReg.txt) into separate files. If you have any suggestions please let me know.

Replies are listed 'Best First'.
Re: Using hash keys to separate data
by wfsp (Abbot) on Jun 29, 2011 at 06:04 UTC
    Nearly there. :-)
    #!/usr/bin/perl
    use warnings;
    use strict;

    open(KEY, "<hashKey.txt") or die "error reading key list";
    open(REG, "<testReg.txt") or die "error reading file";

    my %Chr;
    while (my $key = <KEY>) {
        chomp $key;
        $Chr{$key} = undef;
    }

    my %R;
    while (my $reg = <REG>) {
        chomp $reg;
        my @reg_split = split("\t", $reg);
        push @{$R{$reg_split[0]}}, $reg;
    }

    foreach my $key (sort keys %R) {
        next unless exists $Chr{$key};
        for my $out (@{$R{$key}}){
            print "$out\n";
        }
        print q{-} x 20, qq{\n};
    }

    close(KEY);
    close(REG);

    Output:

    chr1 100 159 0
    chr1 200 260 0
    chr1 500 750 0
    --------------------
    chr11 679 687 0
    --------------------
    chr22 100 200 0
    chr22 300 400 0
    --------------------
    chr3 450 700 0
    --------------------
    chr4 100 300 0
    --------------------
    chr7 350 600 0
    --------------------
    chr9 100 125 0
    --------------------

    The first while loop creates a lookup table (%Chr). The source file only has 1 field per record so there is no need for the split.

    The second while loop creates a hash of arrays (%R) from your input file. The key is the first field (chromosome) and the value is an array of records. That's what the push is doing.

    Finally, we print the records for each chromosome if it exists in the lookup table. In your case you want to print to a file rather than to STDOUT as we do here.

    As an aside, you could rewrite the first while loop with map.
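
    A minimal sketch of that aside, assuming the key list is passed in as a list of lines rather than read from hashKey.txt so that the example runs on its own. The sub name build_lookup is hypothetical and not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a list of raw lines into a lookup hash.
# Each key maps to undef, as in the while-loop version above.
sub build_lookup {
    my @lines = @_;
    chomp @lines;                          # strip the trailing newline from every line
    my %Chr = map { $_ => undef } @lines;  # one key per line
    return \%Chr;
}

# With a real file this would be: build_lookup(<KEY>)
my $Chr = build_lookup("chr1\n", "chr2\n", "chrX\n");
print exists $Chr->{chr2}  ? "chr2 is a known key\n"  : "chr2 is unknown\n";
print exists $Chr->{chr99} ? "chr99 is a known key\n" : "chr99 is unknown\n";
```

    Because only the existence of a key matters, the value stored under each key is irrelevant; undef works as well as 1.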

    Hope that helps.

    Update
    Reading your question again I see

    hashKey.txt gives a list of all the possible chromosome values there could be in a given input file.
    If that is the case why do you need the lookup table? I could see it being useful if there could be values in your input that you weren't interested in.
Re: Using hash keys to separate data
by bart (Canon) on Jun 29, 2011 at 07:09 UTC
    If you split the lines into 2 parts, instead of into as many parts as the line contains, then you'll keep the rest of the row intact. Also, you can assign to a list of scalars, which is easier to handle than an array.

    And for the rest, as wfsp already said: push the data onto the anonymous arrays that make up the values of the hash (they are autovivified, so don't worry about an anonymous array not existing yet).

    while (<REG>) {
        chomp;
        my($key, $data) = split "\t", $_, 2;
        push @{$R{$key}}, $data;
    }

    After that it's just a matter of looping through the keys and printing out the contents of each array.

    foreach my $key (keys %R) {
        open my $fh, '>', "$key.out" or die "Cannot open file $key.out: $!";
        foreach my $row (@{$R{$key}}) {
            print $fh "$key\t$row\n";
        }
    }
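
    The effect of the split limit and of autovivification can be seen in a small self-contained sketch. The sub name group_by_first_field is hypothetical, and the input lines are assumed to be tab-separated, as in the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: group tab-separated lines by their first field.
sub group_by_first_field {
    my @lines = @_;
    my %R;
    for my $line (@lines) {
        chomp $line;
        # Limit of 2: everything after the first tab stays together in $data.
        my ($key, $data) = split /\t/, $line, 2;
        push @{ $R{$key} }, $data;   # the array is autovivified on the first push
    }
    return \%R;
}

my $R = group_by_first_field(
    "chr1\t100\t159\t0\n",
    "chr1\t200\t260\t0\n",
    "chr3\t450\t700\t0\n",
);
for my $key (sort keys %$R) {
    print "$key has ", scalar @{ $R->{$key} }, " row(s)\n";
}
```

    Note that this approach holds the whole input in memory before anything is written, which may matter for very large files.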
Re: Using hash keys to separate data
by Marshall (Canon) on Jun 29, 2011 at 09:07 UTC
    My main goal is to just separate each chr from the input file (testReg.txt) into separate files. If you have any suggestions please let me know.

    You are over-thinking this. You don't even need the file: hashKey.txt. The file testReg.txt is I think the 15GB monster file. If this file is not already sorted, use the system command line sort to do that. The command line sort can sort things way bigger than the size of memory.

    Now all of the lines that have the same chromosome will be grouped together in the file. We just read the file and every time we switch to a new chromosome, we start a new file.

    #!/usr/bin/perl -w
    use strict;

    my $curr_chrom = "";
    while (<DATA>)
    {
        my ($chrom) = split;   # $chrom is the first column
                               # parens on the left side are needed
                               # for list context
        if ($chrom ne $curr_chrom)
        {
            $curr_chrom = $chrom;
            open (OUT, '>', "$curr_chrom.out")
                or die "unable to write $curr_chrom.out $!\n";
        }
        print OUT;
    }
    close OUT;

    __DATA__
    chr1 100 159 0
    chr1 200 260 0
    chr1 500 750 0
    chr3 450 700 0
    chr4 100 300 0
    chr7 350 600 0
    chr9 100 125 0
    chr11 679 687 0
    chr22 100 200 0
    chr22 300 400 0

    A few notes: If a file handle that is open to one file is opened again on another file, the first file is closed automatically, so there is no need to close it explicitly. For your data, you normally want to split on any run of whitespace characters. That is what the "default" split does: a bare split is the same as split(' ', $_), which behaves like split(/\s+/, $_) except that leading whitespace is also skipped, and it is what my ($chrom) = split; uses. Splitting on \t is probably not what you want, and splitting on \n certainly is not.
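
    To make the difference between the separators concrete, here is a small sketch. The sample line is made up for illustration and deliberately mixes spaces and a tab between the columns.

```perl
use strict;
use warnings;

# Made-up line: two spaces, then a tab, then one space between the columns.
my $line = "chr1  100\t159 0\n";

my @on_tab   = split /\t/, $line;   # splits only at the single tab
my @on_space = split ' ', $line;    # splits on any run of whitespace
my ($chrom)  = split ' ', $line;    # list context: first field only

print scalar(@on_tab),   " fields when splitting on tab\n";
print scalar(@on_space), " fields when splitting on whitespace\n";
print "first column: $chrom\n";
```

    Splitting on the tab yields two fields here, while splitting on whitespace yields the four columns and also discards the trailing newline.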

    Update: From the wording of the post, I don't think that you are interested in a subset of the chromosomes in the input file, but if you were, then here's how. Make a hash table with keys being the chromosomes that you want. In the above program, when the chromosome changes (the if statement), test if the chromosome is on the "approved" list (name exists in the hash table) or not. If it does exist, then open OUT to that name like above, if it does not, then open OUT to "/dev/null". /dev/null is a special device that discards all stuff written to it (it is the "bit bucket"). That way you always execute the print OUT; statement. Sometimes it goes somewhere useful and sometimes into the black hole of bits.

    To make the hash, your code:

    while (<KEY>) {
        chomp;
        @key_split = split("\n");
        $Chr{"$key_split[0]"} = $key_split[0];
    }

    ## better written as: ##
    while (<KEY>) {
        my ($chrom) = split;
        $Chr{$chrom}=1;
    }
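
    The filtering idea from the update could be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's code: the sub name split_by_chrom, the %wanted hash, and the temporary output directory are all hypothetical, and File::Spec->devnull is used in place of a hard-coded "/dev/null" so the sketch is not tied to one platform. The input lines are assumed to be already grouped by chromosome.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: write each wanted chromosome's lines to
# "$dir/<chrom>.out"; lines for unwanted chromosomes go to the null device.
sub split_by_chrom {
    my ($dir, $wanted, @lines) = @_;
    my $curr_chrom = "";
    my $out;
    for my $line (@lines) {
        next if $line =~ /^\s*$/;            # skip blank lines
        my ($chrom) = split ' ', $line;      # first column
        if ($chrom ne $curr_chrom) {
            $curr_chrom = $chrom;
            my $path = exists $wanted->{$chrom}
                     ? File::Spec->catfile($dir, "$chrom.out")
                     : File::Spec->devnull;
            # Re-opening the same handle closes the previous file.
            open($out, '>', $path) or die "unable to write $path: $!\n";
        }
        print {$out} $line;                  # always print; it may be discarded
    }
    close $out if $out;
    return;
}

my $dir    = tempdir(CLEANUP => 1);          # scratch directory for the demo
my %wanted = (chr1 => 1, chr3 => 1);         # the "approved" list
split_by_chrom($dir, \%wanted,
    "chr1 100 159 0\n", "chr1 200 260 0\n",
    "chr3 450 700 0\n", "chr4 100 300 0\n");

for my $name (qw(chr1 chr3 chr4)) {
    my $file = File::Spec->catfile($dir, "$name.out");
    print "$name.out ", (-e $file ? "written" : "skipped"), "\n";
}
```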

      Marshall,

      I suppose I was over-thinking it. Your method looks to read in constant time, and for a large input file that I'm working with I think that may be beneficial.

      The only reason I included the key list is because I thought that would be the easiest way to separate the input data into separate files. However, the input data is already well-sorted so your method should work.

      One more question in general: with my code and the suggestions everyone has given, there is still an error message (despite the fact that the output is correct). Is there any way to get rid of this error message or is it just something I am going to have to deal with? The message refers to uninitialized value errors, and I was trying to fix this before. However, I suppose if the output is still correct that is the only thing that matters.

        Yes, once sorted, the algorithm just reads the file once in a linear fashion. So this should be great for your humongous file.

        The warning message should give you a line number in the code, and often you also get the line number of the input file. One common way to get an uninitialized value is a blank line in the file, which causes the split to return no results. An extra carriage return is easy to miss since it is "invisible". I often put: next if /^\s*$/; which skips to the next input line if the current line contains nothing but whitespace.

        I think I already mentioned that you should normally split on whitespace, which is the default (effectively the regex /\s+/). Whitespace (\s) includes the space, of course, as well as \n, \r, \f and \t, and any contiguous run of those is treated as a single separator and removed. Splitting on just tab characters (\t) can cause problems if there are occasional extra space characters that you cannot see in the editor.
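
        A minimal sketch of the blank-line guard, assuming the lines are passed in as a list so the example runs on its own. The sub name first_columns is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first column of every non-blank line.
sub first_columns {
    my @lines = @_;
    my @cols;
    for (@lines) {
        next if /^\s*$/;        # blank or whitespace-only line: skip it
        my ($chrom) = split;    # safe now: at least one field exists
        push @cols, $chrom;
    }
    return @cols;
}

my @cols = first_columns("chr1 100 159 0\n", "\n", "   \n", "chr3 450 700 0\n");
print "@cols\n";    # prints "chr1 chr3"
```

        Without the guard, $chrom would be undef for the two blank lines, and using it in a comparison or a string would trigger the uninitialized value warning.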

        I think you are on the right track - keep at it!
