Biopolete has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I'm trying to convert a column to a CSV file.
My data looks like:
##INFO=<ID=AA,
##INFO=<ID=AB,
##INFO=<ID=AC,
Num Data
1 AA=1;AB=2;AC=3
2 AA=2;AB=2
3 AA=5;AB=1;AC=1
And I want a CSV like this:
AA AB AC
1 2 3
2 2 NA
5 1 1
First of all I build a hash, taking the keys from the metadata lines (##):
open(I1,$ARGV[0]);
my %info;
while (my $line = <I1>) {
if ($line =~ /##INFO=<ID=/) {
my ($first,$second) = (split(/\,/, $line));
my ($firstsecond,$secondsecond) = (split(/ID=/, $first));
$info{$secondsecond}=();
}
}
Now I have my hash %info with my keys (AA, AB, AC).
Then I want to fill in the values.
I start with:
while (my $line = <I1>) {
if ($line !~ /#/) {
my ($numbers,$data) = (split(/\t/, $line));
foreach my $dat ($data){
my ($string, $int) = (split(/\;/, $dat));
I split like this because I want to get rid of the \t and ; separators.
But I don't know how to insert the missing values (NA).
I want something like this:
AA => 1,2,5
AB => 2,2,1
AC => 3,NA,1
Does anyone know how to insert the NA string in its correct position? (My real file is much bigger, with a lot of NAs.)
Thank you very much.
Re: Adding missing values into a hash
by Laurent_R (Canon) on Jun 18, 2014 at 20:07 UTC
First, your output is not exactly in CSV (comma-separated values) format.
Second, you would probably be better off storing your metadata keys in an array rather than a hash, because an array preserves the order of the data (a hash does not).
Third, a regex might be simpler than a split if you just want to remove the trailing comma:
while (my $line = <I1>) {
chomp $line;
if ($line =~ /##INFO=<ID=/) {
$line =~ s/,$//;
# ...
}
Fourth, I do not see any \t in your input.
Fifth, reading the file twice does not seem to be a very good idea. Can't you decide, based on the content, that you have finished reading the metadata and started to read the data?
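A minimal single-pass sketch of points two, three and five, assuming that header lines start with ##INFO=<ID= and that data lines start with a number. The helper name parse_lines is made up for illustration. It captures the ID with a regex, keeps the IDs in an array so their order survives, and decides from the content of each line whether it is metadata or data.

```perl
use strict;
use warnings;

# Hypothetical helper: read header and data lines in one pass.
# Returns (\@ids, \@rows); @ids keeps the order of the ##INFO lines.
sub parse_lines {
    my @lines = @_;                      # work on a copy so chomp is safe
    my (@ids, @rows);
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^##INFO=<ID=([^,>]+)/) {
            push @ids, $1;               # capture the ID, no split needed
        }
        elsif ($line =~ /^\d+\s+(\S+)/) {
            push @rows, $1;              # data line: keep the "AA=1;AB=2" field
        }
        # anything else (e.g. the "Num Data" title line) is ignored
    }
    return (\@ids, \@rows);
}

my ($ids, $rows) = parse_lines(
    "##INFO=<ID=AA,\n", "##INFO=<ID=AB,\n", "##INFO=<ID=AC,\n",
    "Num Data\n",
    "1 AA=1;AB=2;AC=3\n", "2 AA=2;AB=2\n", "3 AA=5;AB=1;AC=1\n",
);
print "IDs in order: @$ids\n";
print "data field:   $_\n" for @$rows;
```

With a real file you would loop over the filehandle once instead of passing a list of strings.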
Thank you very much for your answer :)
You are right about the CSV format; I was thinking of Excel.
I think that storing my metadata keys in an array rather than a hash would be better. The problem is that I don't know how to associate the metadata in the array with the values without using a hash.
Your third and fifth points look very interesting, but I am a "noob" and I don't know how to do it :(
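One way to get both, sketched here under the assumption that the IDs are already known: keep the array only for the order and a hash of arrays for the values. The helper name collect_columns and the sample variables are made up for illustration.

```perl
use strict;
use warnings;

my @ids  = qw(AA AB AC);    # order as read from the ##INFO lines
my @data = ('AA=1;AB=2;AC=3', 'AA=2;AB=2', 'AA=5;AB=1;AC=1');

# Hypothetical helper: one array of values per ID, gaps padded with 'NA'.
sub collect_columns {
    my ($ids, $data) = @_;
    my %column = map { $_ => [] } @$ids;
    for my $field (@$data) {
        # turn "AA=1;AB=2" into (AA => 1, AB => 2)
        my %row = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
                  split /;/, $field;
        for my $id (@$ids) {
            push @{ $column{$id} }, defined $row{$id} ? $row{$id} : 'NA';
        }
    }
    return \%column;
}

my $column = collect_columns(\@ids, \@data);
# iterate over @ids, not keys %$column, so the header order is kept
print "$_ => ", join(',', @{ $column->{$_} }), "\n" for @ids;
```

The hash is only used for lookup; every loop that cares about order walks the array.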
Re: Adding missing values into a hash
by McA (Priest) on Jun 18, 2014 at 19:58 UTC
Replace the line
my ($string, $int) = (split(/\;/, $dat));
with the following:
my @elements = split /;/, $data;
my %rowvalues;
foreach my $element (@elements) {
my ($key, $value) = split /=/, $element;
$rowvalues{$key} = $value;
}
foreach my $key (keys %info) {
if(exists $rowvalues{$key}) {
push @{$info{$key}}, $rowvalues{$key};
}
else {
push @{$info{$key}}, 'NA';
}
}
I hope that is it. I haven't tested. Please put code tags around your sample data so we can see the structure better.
Regards
McA
Thank you very much for your answer :)
I was trying something similar, but the problem is that in the end I get a hash of hashes, and I don't know why. It's impossible to work with them.
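A small sketch for checking which structure you actually ended up with, using ref and the core Data::Dumper module. The helper name describe and the sample keys are made up for illustration, and the second assignment is only an assumed example of how a hash of hashes can arise.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: report what kind of value each key holds.
sub describe {
    my ($info) = @_;
    return map { $_ => (ref($info->{$_}) || 'plain scalar') } sort keys %$info;
}

my %info;
push @{ $info{AA} }, 1, 2, 5;    # push onto @{...} builds a hash of arrays
$info{AB}{row1} = 2;             # a second {...} subscript builds a hash of hashes

my %kind = describe(\%info);
print "$_: $kind{$_}\n" for sort keys %kind;

$Data::Dumper::Sortkeys = 1;     # stable key order in the dump
print Dumper(\%info);
```

If a key reports HASH where you expected ARRAY, look for a place where the value is subscripted with {...} instead of being pushed onto.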
#!/usr/bin/env perl
use strict;
use warnings;
use 5.010;
my %info;
while (my $line = <DATA>) {
chomp $line;
if ($line =~ /##INFO=<ID=/) {
my ($first, $second) = split /,/, $line;
my ($firstsecond, $secondsecond) = split /ID=/, $first;
$info{$secondsecond}=();
}
elsif ($line !~ /#/) {
my ($numbers, $data) = split /\s+/, $line;
foreach my $dat ($data){
my @elements = split /;/, $data;
my %rowvalues;
foreach my $element (@elements) {
my ($key, $value) = split /=/, $element;
$rowvalues{$key} = $value;
}
foreach my $key (keys %info) {
if(exists $rowvalues{$key}) {
push @{$info{$key}}, $rowvalues{$key};
}
else {
push @{$info{$key}}, 'NA';
}
}
}
}
else {
next;
}
}
foreach my $header (sort keys %info) {
say $header, ' => ', join(',', @{$info{$header}});
}
__DATA__
# First the headers
##INFO=<ID=AA,
##INFO=<ID=AB,
##INFO=<ID=AC,
# then the data
1 AA=1;AB=2;AC=3
2 AA=2;AB=2
3 AA=5;AB=1;AC=1
I hope this clarifies what was said before. I changed one split from '\t' to '\s+' because pasting the code here would probably destroy the tab characters.
Regards
McA
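To get from per-ID lists like AA => 1,2,5 to the table asked for in the question, the columns can be turned back into rows. A minimal sketch, assuming every ID holds the same number of values; the helper name columns_to_lines is made up for illustration, and tab is used as the separator.

```perl
use strict;
use warnings;

# Hypothetical helper: turn per-ID columns into tab-separated lines,
# header first, with @$ids giving the column order.
sub columns_to_lines {
    my ($ids, $column) = @_;
    my @lines  = (join "\t", @$ids);
    my $n_rows = @$ids ? scalar @{ $column->{ $ids->[0] } } : 0;
    for my $i (0 .. $n_rows - 1) {
        # row $i is the $i-th element of every column
        push @lines, join "\t", map { $column->{$_}[$i] } @$ids;
    }
    return @lines;
}

my %info = (AA => [1, 2, 5], AB => [2, 2, 1], AC => [3, 'NA', 1]);
print "$_\n" for columns_to_lines([sort keys %info], \%info);
```

For real CSV output with quoting, a module such as Text::CSV is the safer choice.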
Re: Adding missing values into a hash
by poj (Abbot) on Jun 18, 2014 at 19:57 UTC
#!perl
use strict;
use Text::CSV;
my %info=();
my $line_count=0;
while (my $line = <DATA>){
chomp($line);
if ($line =~ /##INFO=<ID=([^,]+)/){
$info{$1}=[];
} else {
my (undef,%hash) = split /[\t;=]/,$line;
for (keys %info){
push @{$info{$_}}, defined $hash{$_} ? $hash{$_} : 'NA'; # defined test keeps a legitimate value of 0
}
++$line_count;
}
}
my $csv = Text::CSV->new ( {binary=>1, eol=>"\012"} )
or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $fh,'>','output.csv'
or die "Could not open output.csv $!";
my @col_head = sort keys %info;
$csv->print($fh, \@col_head);
for my $i (1..$line_count){
my @row = map { $info{$_}[$i-1] } @col_head;
$csv->print($fh, \@row);
}
__DATA__
##INFO=<ID=AA,
##INFO=<ID=AB,
##INFO=<ID=AC,
1 AA=1;AB=2;AC=3
2 AA=2;AB=2
3 AA=5;AB=1;AC=1
poj
Thank you very much for your answer :)
Your answer seems quite interesting, but for some reason I get too many "NA" values.
With the first part of the script I get, for example,
AA=> NA, NA,1,2,5,NA,NA,NA,NA,NA
instead of
AA=> 1,2,5
Perhaps it is related to
my (undef,%hash) = split /[\t;=]/,$line;
because you are splitting 3 times, I don't know.
The final CSV is
AA => NA,NA,NA
AB => NA,NA,NA
AC => NA,NA,NA
Perhaps that is because of the problem with the "NA" values.
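Two things could produce symptoms like these, though this is a guess since the real file is not shown. Every line that is not an ##INFO line is treated as a data row, so title, comment or blank lines each add a row of NA. And if the columns are separated by spaces rather than tabs, splitting on [\t;=] pairs the keys and values up wrongly, so every lookup misses. A minimal sketch of a stricter line parser; the helper name parse_data_line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one data line into a key => value hash ref,
# or return nothing if the line is not a data line.
sub parse_data_line {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^#/ or $line !~ /\S/;        # metadata, comments, blank lines
    my ($num, $field) = split ' ', $line, 2;          # any whitespace, not only tabs
    return unless defined $field && $field =~ /=/;    # e.g. the "Num Data" title line
    $field =~ s/\s+\z//;                              # drop trailing whitespace
    my %row;
    for my $pair (split /;/, $field) {
        my ($key, $value) = split /=/, $pair, 2;
        $row{$key} = $value;
    }
    return \%row;
}

for my $line ("Num Data\n", "1 AA=1;AB=2;AC=3\n", "2   AA=0;AB=2\n", "\n") {
    my $row = parse_data_line($line) or next;         # skip non-data lines
    print join(',', map { defined $row->{$_} ? $row->{$_} : 'NA' } qw(AA AB AC)), "\n";
}
```

Only lines that pass these checks should add a row, so the number of values per ID matches the number of data lines.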