PerlMonks  

Adding missing values into a hash

by Biopolete (Initiate)
on Jun 18, 2014 at 19:05 UTC ( [id://1090340]=perlquestion )

Biopolete has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm trying to convert a column into a CSV file.

My data looks like:

##INFO=<ID=AA,

##INFO=<ID=AB,

##INFO=<ID=AC,

Num Data

1 AA=1;AB=2;AC=3

2 AA=2;AB=2

3 AA=5;AB=1;AC=1

And I want a csv like this:

AA AB AC

1 2 3

2 2 NA

5 1 1

First of all, I build a hash whose keys come from the metadata lines (##):

open(I1,$ARGV[0]);
my %info;
while (my $line = <I1>) {
    if ($line =~ /##INFO=<ID=/) {
        my ($first,$second) = (split(/\,/, $line));
        my ($firstsecond,$secondsecond) = (split(/ID=/, $first));
        $info{$secondsecond}=();
    }
}

Now I have my hash “info” with my keys (AA, AB, AC)

Then I want to introduce the values.

I start with:

while (my $line = <I1>) {
    if ($line !~ /#/) {
        my ($numbers,$data) = (split(/\t/, $line));
        foreach my $dat ($data){
            my ($string, $int) = (split(/\;/, $dat));

That is because I want to eliminate “\t” and “;”

But I don't know how to introduce the missing values (NA)

I want something like this:

AA => 1,2,5

AB => 2,2,1

AC => 3,NA,1

Does anyone know how to insert the “NA” string in its correct position? (My real file is much bigger, with a lot of “NA” values.)

Thank you very much.

Replies are listed 'Best First'.
Re: Adding missing values into a hash
by Laurent_R (Canon) on Jun 18, 2014 at 20:07 UTC
    First, your output is not exactly a CSV (comma separated value) format.

    Second, you would probably be better off storing your metadata keys in an array rather than a hash, because an array preserves the order of the data (a hash does not).

    Third, a regex might be simpler than a split if you just want to remove the trailing comma:

    while (my $line = <I1>) {
        chomp $line;
        if ($line =~ /##INFO=<ID=/) {
            $line =~ s/,$//;
            # ...
        }
    }

    Fourth, I do not see any \t in your input.

    Fifth, reading the file twice does not seem to be a very good idea. Can't you decide, based on the content, that you have finished reading the metadata and started to read the data?
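The second, third and fifth points can be combined into a single pass over the file. What follows is only a minimal sketch, assuming that metadata lines always start with `##INFO=<ID=` and that every data line contains at least one `=`; `classify_line` is a hypothetical helper name, and the sample lines are copied from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what kind of line we are looking at.
# Returns ('meta', $id) for an ##INFO line, ('data', $line) for a data
# line, and ('skip') for blank lines, other comments and the column header.
sub classify_line {
    my ($line) = @_;
    chomp $line;
    return ('meta', $1) if $line =~ /^##INFO=<ID=([^,>]+)/;
    return ('skip')     if $line =~ /^#/ || $line !~ /\S/;
    return ('skip')     unless $line =~ /=/;    # e.g. the "Num Data" header
    return ('data', $line);
}

my @input = (
    "##INFO=<ID=AA,\n",
    "##INFO=<ID=AB,\n",
    "##INFO=<ID=AC,\n",
    "Num\tData\n",
    "1\tAA=1;AB=2;AC=3\n",
    "2\tAA=2;AB=2\n",
);

my @keys;        # an array keeps the IDs in file order
my $rows = 0;
for my $line (@input) {
    my ($type, $value) = classify_line($line);
    push @keys, $value if $type eq 'meta';
    $rows++            if $type eq 'data';
}

print "keys: @keys\n";
print "data rows: $rows\n";
```

Because each line is classified by its content, the file only has to be read once, and the capture group in the regex replaces both `split` calls.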

      Thank you very much for your answer :)

      You are right about the CSV format; I was thinking of Excel.

      I think that storing my metadata keys in an array rather than a hash would be better; the problem is that I don't know how to associate the metadata in the array with the values without using a hash.

      Your third and fifth points look very interesting, but I am a "noob" and I don't know how to do that :(
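One way to get both is to keep the IDs in an array for the column order and to build a small hash per data row for the lookup. This is a sketch only, assuming the row number and the data field are separated by whitespace and that only data lines are passed in; `row_to_fields` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one data line into a list of values,
# one per ID in @$keys, using 'NA' where the line has no such ID.
sub row_to_fields {
    my ($keys, $line) = @_;
    chomp $line;
    my (undef, $data) = split ' ', $line, 2;    # drop the row number
    my %row = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
              split /;/, $data;
    return map { defined $row{$_} ? $row{$_} : 'NA' } @$keys;
}

my @keys = qw(AA AB AC);    # order as found in the metadata lines
print join("\t", @keys), "\n";
for my $line ("1\tAA=1;AB=2;AC=3", "2\tAA=2;AB=2", "3\tAA=5;AB=1;AC=1") {
    print join("\t", row_to_fields(\@keys, $line)), "\n";
}
```

The array decides where each column goes, and the per-row hash only answers "does this row have a value for this ID?", so the `NA` lands in the right position without any extra bookkeeping.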

Re: Adding missing values into a hash
by McA (Priest) on Jun 18, 2014 at 19:58 UTC

    Hi,

    in your inner loop you have to replace the last line

    my ($string, $int) = (split(/\;/, $dat));

    by the following

    my @elements = split /;/, $data;
    my %rowvalues;
    foreach my $element (@elements) {
        my ($key, $value) = split /=/, $element;
        $rowvalues{$key} = $value;
    }
    foreach my $key (keys %info) {
        if (exists $rowvalues{$key}) {
            push @{$info{$key}}, $rowvalues{$key};
        }
        else {
            push @{$info{$key}}, 'NA';
        }
    }

    I hope that is it. I haven't tested. Please put code tags around your sample data so we can see the structure better.

    Regards
    McA

      Thank you very much for your answer :)

      I was trying something similar, but the problem is that in the end I obtain a hash of hashes, and I don't know why. It's impossible to work with them.

        Hi

        I was wondering about your answer and therefore made this self-contained snippet, which should show the relevant elements.

        #!/bin/env perl
        use strict;
        use warnings;
        use 5.010;
        my %info;
        while (my $line = <DATA>) {
            chomp $line;
            if ($line =~ /##INFO=<ID=/) {
                my ($first, $second) = split /,/, $line;
                my ($firstsecond, $secondsecond) = split /ID=/, $first;
                $info{$secondsecond}=();
            }
            elsif ($line !~ /#/) {
                my ($numbers, $data) = split /\s+/, $line;
                foreach my $dat ($data){
                    my @elements = split /;/, $data;
                    my %rowvalues;
                    foreach my $element (@elements) {
                        my ($key, $value) = split /=/, $element;
                        $rowvalues{$key} = $value;
                    }
                    foreach my $key (keys %info) {
                        if(exists $rowvalues{$key}) {
                            push @{$info{$key}}, $rowvalues{$key};
                        }
                        else {
                            push @{$info{$key}}, 'NA';
                        }
                    }
                }
            }
            else {
                next;
            }
        }
        foreach my $header (sort keys %info) {
            say $header, ' => ', join(',', @{$info{$header}});
        }
        __DATA__
        # First the headers
        ##INFO=<ID=AA,
        ##INFO=<ID=AB,
        ##INFO=<ID=AC,
        # then the data
        1 AA=1;AB=2;AC=3
        2 AA=2;AB=2
        3 AA=5;AB=1;AC=1

        I hope this clarifies what was said before. I changed one split from '\t' to '\s+' because pasting the code here would probably destroy the tab character.

        Regards
        McA

Re: Adding missing values into a hash
by poj (Abbot) on Jun 18, 2014 at 19:57 UTC
    Try
    #!perl
    use strict;
    use Text::CSV;
    my %info=();
    my $line_count=0;
    while (my $line = <DATA>){
      chomp($line);
      if ($line =~ /##INFO=<ID=([^,]+)/){
        $info{$1}=[];
      } else {
        my (undef,%hash) = split /[\t;=]/,$line;
        for (keys %info){
          push @{$info{$_}},$hash{$_} || 'NA';
        }
        ++$line_count;
      }
    }
    my $csv = Text::CSV->new ( {binary=>1, eol=>"\012"} )
      or die "Cannot use CSV: ".Text::CSV->error_diag();
    open my $fh,'>','output.csv'
      or die "Could not open output.csv $!";
    my @col_head = sort keys %info;
    $csv->print($fh, \@col_head);
    for my $i (1..$line_count){
      my @row = map { $info{$_}[$i-1] } @col_head;
      $csv->print($fh, \@row);
    }
    __DATA__
    ##INFO=<ID=AA,
    ##INFO=<ID=AB,
    ##INFO=<ID=AC,
    1 AA=1;AB=2;AC=3
    2 AA=2;AB=2
    3 AA=5;AB=1;AC=1
    poj

      Thank you very much for your answer :)

      Your answer seems quite interesting, but for some reason I obtain too many "NA" values.

      With the first part of the script I obtain, for example,

      AA=> NA, NA,1,2,5,NA,NA,NA,NA,NA

      instead of

      AA=> 1,2,5

      Perhaps it is related to

      my (undef,%hash) = split /[\t;=]/,$line;

      because you are splitting 3 times, I don't know.

      The final csv is

      AA => NA,NA,NA

      AB => NA,NA,NA

      AC => NA,NA,NA

      Perhaps that is because of the problem with the "NA".
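For what it is worth, `split /[\t;=]/` does not split three times: it splits once, on any one of the three separator characters, and returns a flat list. `my (undef, %hash) = ...` then discards the first element (the row number) and pairs up the rest as key/value. The sketch below shows what that yields, and how a line that uses a space instead of a tab after the row number shifts all the pairs; whether that is what happens with the real file is only a guess. `parse_pairs` is a hypothetical helper wrapping the same split.

```perl
use strict;
use warnings;

# Hypothetical helper: the same split as in the script above.
sub parse_pairs {
    my ($line) = @_;
    my @parts = split /[\t;=]/, $line;
    shift @parts;                        # discard the first field (the row number)
    push @parts, undef if @parts % 2;    # keep the list even to avoid a warning
    return @parts;
}

my %tab   = parse_pairs("1\tAA=1;AB=2;AC=3");    # tab after the row number
my %space = parse_pairs("1 AA=1;AB=2;AC=3");     # space after the row number

print join(',', map { "$_=$tab{$_}" } sort keys %tab), "\n";
print join(',', sort keys %space), "\n";
```

With a tab the keys are `AA`, `AB` and `AC`. With a space, `1 AA` becomes the discarded first field, so the values end up as keys and none of the IDs is ever found, which would give `NA` in every column.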

        Do you have other lines in the file apart from those like
        ##INFO=<ID=AA, and 1 AA=1;AB=2;AC=3 ? Blank lines for example.

        poj
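If the file does contain blank lines, a column header such as `Num Data`, or other `#` comments, the `else` branch of the script above counts each of them as a data row and pushes an `NA` for every key, which would explain the extra `NA` entries. Below is a hedged sketch of a stricter loop, assuming data lines start with a row number followed by whitespace. It also uses `//` (defined-or, Perl 5.10+) instead of `||`, so that a legitimate value of `0` is not turned into `NA`. `collect_columns` is a hypothetical name.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: build { ID => [values...] } from a list of lines,
# ignoring anything that is neither an ##INFO line nor a data line.
sub collect_columns {
    my @lines = @_;
    my %info;
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^##INFO=<ID=([^,]+)/) {
            $info{$1} = [];
        }
        elsif ($line =~ /^\d+\s+(\S+)/) {    # row number, then the data field
            my $data = $1;
            my %row = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
                      split /;/, $data;
            push @{ $info{$_} }, $row{$_} // 'NA' for keys %info;
        }
        # anything else (blank lines, comments, headers) is skipped
    }
    return \%info;
}

my $info = collect_columns(
    "##INFO=<ID=AA,", "##INFO=<ID=AB,", "",
    "Num\tData", "1\tAA=0;AB=2", "", "2\tAB=7",
);
say "$_ => ", join(',', @{ $info->{$_} }) for sort keys %$info;
```

Only lines that match one of the two patterns touch `%info`, so stray lines can no longer add rows.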