ihperlbeg has asked for the wisdom of the Perl Monks concerning the following question:

My input data file looks something like this (only sample shown)

    #id1
    1 90
    2 80
    3 70
    #id2
    1 70
    2 40
    2 40
    3 20
    4 5
    #id3
    0 0
    0 0
    0 0

I am trying to output this input in the following format:

        id1 id2 id3
    0           0
    0           0
    0           0
    1   90  70
    2   80  40
    2       40
    3   70  20
    4       5

I tried using hashes, but that didn't seem to work because the first column has duplicates. Any help on how I can accomplish this? Thanks!

Replies are listed 'Best First'.
Re: transforming XY data to X and Multiple Y column data?
by Kanji (Parson) on Sep 26, 2010 at 20:24 UTC

    You could try a more complex data structure.

    A something of somethings of arrays should do the trick, such as an array of arrays of arrays:-

    @id = (
        [
            [ ],            # id1-0
            [ ],            # id2-0
            [ 0, 0, 0 ],    # id3-0
        ],
        [
            [ 90, ],        # id1-1
            [ 70, ],        # id2-1
            [ ],            # id3-1
        ],
        # ...
    );

    ...or an array of hashes of arrays:-

    @id = (
        {
            id3 => [ 0, 0, 0 ],
        },
        {
            id1 => [ 90, ],
            id2 => [ 70, ],
        },
        # ...
    );

    ...or even a hash of hashes of arrays:-

    %id = (
        0 => {
            id3 => [ 0, 0, 0 ],
        },
        1 => {
            id1 => [ 90, ],
            id2 => [ 70, ],
        },
        # ...
    );

    Without knowing more about your input data, it's hard to say which of those will best suit your needs (or if something else is more appropriate or possible), but the "of arrays" part will at least do away with the issue of duplicates.
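A minimal sketch of how the third layout (hash of hashes of arrays) could be filled from the posted input. The sub name `parse_xy` and the hard-coded sample lines are assumptions made for illustration, not part of the original reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse "#idN" header lines and "X Y" data lines into
#   $data{X}{id} = [ Y, Y, ... ]
# and return the ids in order of appearance as well.
# (parse_xy is a hypothetical helper name.)
sub parse_xy {
    my (@lines) = @_;
    my (%data, @ids, $id);
    for my $line (@lines) {
        next unless $line =~ /\S/;
        if ($line =~ /^#(\S+)/) {
            $id = $1;
            push @ids, $id;
        }
        elsif (defined $id and $line =~ /^\s*(\d+)\s+(\d+)\s*$/) {
            push @{ $data{$1}{$id} }, $2;   # autovivifies the inner array
        }
    }
    return (\%data, \@ids);
}

my @sample = ('#id1', '1 90', '2 80', '3 70',
              '#id2', '1 70', '2 40', '2 40', '3 20', '4 5',
              '#id3', '0 0', '0 0', '0 0');
my ($data, $ids) = parse_xy(@sample);

# Both "2 40" lines under id2 survive as separate array elements.
print "X=2, id2: @{ $data->{2}{id2} }\n";
print "X=0, id3: @{ $data->{0}{id3} }\n";
```

Because every Y value is pushed onto an array, repeated X values no longer overwrite each other, which is the point made above about the "of arrays" part.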


    Edit: Fixed typos in examples.

Re: transforming XY data to X and Multiple Y column data?
by perlpie (Beadle) on Sep 27, 2010 at 03:09 UTC

    The key is to figure out how you would do this without code. What things did you iterate through in what order when you formulated your desired output? Take careful note and then codify that process. This gives your desired output:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $id;
    my %data;
    my %seen;

    while (<main::DATA>) {
        chomp;
        next unless /\S/;
        if (/\A#(id\d+)\z/) {
            # header line: remember the current id
            $id = $1;
            $seen{'id'}{$id} = 1;
        }
        elsif (/\A(\d+)\s+(\d+)\z/) {
            # data line: store the value under its key and the current id
            die "no #id found before line: '$_'" unless $id;
            $seen{'key'}{$1} = 1;
            $data{$1}{$id} ||= [];
            push(@{$data{$1}{$id}}, $2);
        }
        else {
            die "unrecognized line: '$_'";
        }
    }

    # widest id or key decides the column width
    my ($width) = (sort {$b <=> $a} map { length } map { keys %$_ } values %seen);

    # ids sorted by their numeric suffix
    my @ids = map  { $_->[0] }
              sort { $a->[1] <=> $b->[1] }
              map  { /id(\d+)/; [$_, $1] }
              keys %{$seen{'id'}};

    print join(' ', map { sprintf "%${width}s", $_ } '', @ids), "\n";

    # numeric sort of the keys; one output row per remaining value
    for my $i (sort { $a <=> $b } keys %{$seen{'key'}}) {
        while (keys %{$data{$i}}) {
            # iterate over @ids so the cells line up with the header row
            print join(' ',
                       map { sprintf "%${width}s", $_ }
                       $i,
                       map { exists($data{$i}{$_}) && @{$data{$i}{$_}}
                                 ? shift @{$data{$i}{$_}}
                                 : '' }
                       @ids), "\n";
            for my $id (keys %{$data{$i}}) {
                delete $data{$i}{$id} unless @{$data{$i}{$id}};
            }
        }
    }

    __DATA__
    #id1
    1 90
    2 80
    3 70
    #id2
    1 70
    2 40
    2 40
    3 20
    4 5
    #id3
    0 0
    0 0
    0 0

Re: transforming XY data to X and Multiple Y column data?
by LanX (Sage) on Sep 27, 2010 at 09:31 UTC

    To me it looks like you simply need an "array of hashes" as the data structure, with the IDs as keys.

    The order of these keys should be preserved in an extra array (the "headline").

    Output would simply be:

    # print headline-keys
    # loop over indices
    #     print $index,"\t";
    #     loop over ids in headline
    #         print $data[$index]{$id},"\t";
    #     print newline

    Hope this gives you the basic idea how to do it by yourself.

    If data entries are too long for tab delimiters, you should have a look at format.

    Cheers Rolf
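The pseudocode in this reply could be fleshed out roughly as follows. This is only a sketch: the sub name `render_table` is made up for illustration, and it assumes one value per index and ID, so a repeated line such as the second "2 40" would need extra handling.

```perl
use strict;
use warnings;

# Render an array of hashes as tab-separated rows, one row per index,
# with the columns in the order given by the headline array.
# (render_table is a hypothetical helper name.)
sub render_table {
    my ($headline, $data) = @_;
    my $out = join("\t", '', @$headline) . "\n";
    for my $index (0 .. $#$data) {
        my $row = $data->[$index] or next;     # skip indices with no data
        $out .= join("\t", $index,
                     map { defined($row->{$_}) ? $row->{$_} : '' } @$headline) . "\n";
    }
    return $out;
}

my @headline = qw(id1 id2 id3);
my @data;
$data[0] = { id3 => 0 };
$data[1] = { id1 => 90, id2 => 70 };
$data[2] = { id1 => 80, id2 => 40 };
$data[3] = { id1 => 70, id2 => 20 };
$data[4] = { id2 => 5 };

print render_table(\@headline, \@data);
```

Keeping the headline in its own array is what fixes the column order, since hash keys come back in no particular order.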

Re: transforming XY data to X and Multiple Y column data?
by sundialsvc4 (Abbot) on Sep 27, 2010 at 14:04 UTC

    For a task like this one, everything depends on your data-structure.   It appears that your data consists of “lists of zero-or-more values” indexed by two keys.   The first key is introduced at the beginning of each group, e.g. id1.   The second is introduced with each row.

    The most appropriate Perl structure, it seems to me, is:   “a hashref of hashrefs of arrayrefs.”   (If you happen to know that the second dimension is always contiguous integers, it might be “a hashref of arrayrefs of arrayrefs.”)

    Each time you encounter a line which introduces a new key (such as id3), you will create a new hashref entry for it, and remember what it is.   When you encounter a line of the other kind, the process is similar.   You know that an entry for the major-key (id3) exists.   So, now, check that this hashref contains the minor-key; create a new empty entry (containing an arrayref) if it doesn’t.   push the new value onto this array.

    Output will make heavy use of sort keys %$hashref.

    You may need to use a sort-function in the sort clauses to ensure that comparisons are numeric.

    The program is not quite straightforward, but it is uncomplicated.
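One way the steps described in this reply might look in code, as a sketch rather than a definitive solution. The names `build_table` and `rows_for` are invented for this example, and the choice to repeat the row label once per duplicate value is an assumption based on the output shown in the original question.

```perl
use strict;
use warnings;

# Build $table{$id}{$x} = [ values... ] and remember the ids in input order.
# (build_table and rows_for are hypothetical helper names.)
sub build_table {
    my (@lines) = @_;
    my %table;
    my @ids;
    my $major;
    for my $line (@lines) {
        if ($line =~ /^#(\S+)/) {
            $major = $1;
            push @ids, $major unless exists $table{$major};
            $table{$major} ||= {};
        }
        elsif ($line =~ /^\s*(\d+)\s+(\d+)\s*$/) {
            die "data line before any #id line: '$line'" unless defined $major;
            $table{$major}{$1} ||= [];      # create the minor-key entry if missing
            push @{ $table{$major}{$1} }, $2;
        }
    }
    return (\%table, \@ids);
}

# One output row per repeat of an X value; a cell is empty when that id
# has no (further) value for the X.
sub rows_for {
    my ($table, $ids) = @_;
    my %xs;
    for my $id (@$ids) {
        $xs{$_} = 1 for keys %{ $table->{$id} };
    }
    my @rows;
    for my $x (sort { $a <=> $b } keys %xs) {   # numeric, not string, order
        my $depth = 0;
        for my $id (@$ids) {
            my $n = @{ $table->{$id}{$x} || [] };
            $depth = $n if $n > $depth;
        }
        for my $k (0 .. $depth - 1) {
            push @rows, join("\t", $x, map {
                my $v = $table->{$_}{$x};
                (defined($v) && defined($v->[$k])) ? $v->[$k] : '';
            } @$ids);
        }
    }
    return @rows;
}

my @sample = ('#id1', '1 90', '2 80', '3 70',
              '#id2', '1 70', '2 40', '2 40', '3 20', '4 5',
              '#id3', '0 0', '0 0', '0 0');
my ($table, $ids) = build_table(@sample);
print join("\t", '', @$ids), "\n";
print "$_\n" for rows_for($table, $ids);
```

The `{ $a <=> $b }` block is the sort function mentioned above: without it, a key such as 10 would sort before 2.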

Re: transforming XY data to X and Multiple Y column data?
by Anonymous Monk on Sep 26, 2010 at 19:47 UTC
    1. show your effort

      here is what I have and as I said it didn't work with the data format I am working with:

      #!/usr/bin/perl
      use strict;
      use integer;
      my $input = shift @ARGV || 'Data.txt';
      my $output = shift @ARGV || 'Output.txt';
      print $input, "\n";
      open(DATA,"$input") || die "cannot open $input for reading";
      open(OUT, ">$output") || die "cannot open $output for writing";
      my @newcols=();
      my ($genome, $id, $abd);
      my %phage=();
      my @pg=();
      my @id=();
      while (my $line=<DATA>){
      chomp $line;
      #$line =~ s/"//g;
      if($line =~ m/#/){
      $genome = $line;
      push(@pg, $genome);}
      elsif($line !=~ m/#/){
      my @cols = split(/\t/, $line);
      $id = $cols[0];
      $abd = $cols[1];
      push(@id, $id);
      $phage{$id}{$genome}=$abd;}
      #print OUT "$genome $id, $abd\n";
      }
      my %hash = map { $_ => 1 } @id;
      my @unique = keys %hash;
      my @sorted_id = sort { ( $a <=> $b) } @unique;
      print OUT "\t";
      for my $phagegenome(@pg){
      print OUT $phagegenome, "\t";
      }
      print OUT "\n";
      for my $sorted_id(@sorted_id){
      print OUT $sorted_id, "\t";
      for my $pg(@pg){
      print OUT "$phage{$sorted_id}{$pg}\t";
      }
      print OUT "\n";
      }

        In the OP, you say:

        I tried using hashes, but didn't seem to work as the first column has duplicates.

        But just above that, you show lines that you are "trying to output" and those have duplicates in the first column. Either you actually do want some duplicates (apparently, when the same "column 1" value appears more than once under a single "id" value), or else you made a mistake in showing us what you are "trying to output".

        Apart from that, I have a question about your input data. Your code assumes that the columns in your "Data.txt" file are separated by tabs, but the sample data you posted appears to be separated by spaces.

        On top of that, I have to complain about your strange use of indentation. It runs, but it's hard to read.

        The output I got using the OP data (with spaces instead of tabs) looked like this:

             #id1 #id2 #id3
        0 0
        1 90
        1 70
        2 80
        2 40
        3 70
        3 20
        4 5

        First suggestion: since you are using a data structure, try using Data::Dumper together with the perl debugger perldebug -- e.g. you use the debugger "b" command to set a break-point at a specific line (like, inside a "for" loop), use the "c" command to continue execution (which will then stop at the break-point), and use a command like "p Dumper(\%phage)" to see what your data structure contains at that point.
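A small sketch of the Data::Dumper part of that suggestion. The `%phage` contents below are stand-in values typed in by hand, not the result of running the original script; `$Data::Dumper::Sortkeys` and `$Data::Dumper::Indent` are real Data::Dumper settings.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # stable key order makes dumps easier to compare
$Data::Dumper::Indent   = 1;   # compact, fixed indentation

# A small stand-in for the %phage hash built by the script above.
my %phage = (
    1 => { '#id1' => 90, '#id2' => 70 },
    2 => { '#id1' => 80, '#id2' => 40 },
);

print Dumper(\%phage);
```

Under the debugger (started with `perl -d script.pl`), the same dump can be requested at a break-point with `p Dumper(\%phage)`, provided the script has loaded Data::Dumper.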

        When I did that on (a properly indented version of) your code, I noticed the "tab-vs-space" issue, changed the "split" regex, and got something closer to what you seem to want (updated to convert tabs to spaces):

             #id1 #id2 #id3
        0              0
        1    90   70
        2    80   40
        3    70   20
        4         5

        Of course, this does not have any repeated values in the first column, which makes it different from your OP output sample. So... do you really want some things repeated and not others?
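The tab-versus-space issue described above can be seen in isolation with a few lines. This is just an illustration on a single hard-coded sample line; `split ' '` is Perl's special whitespace-splitting form.

```perl
use strict;
use warnings;

my $line = "1 90";                      # space-separated, like the posted sample

my @by_tab   = split /\t/, $line;       # no tab present: one field, "1 90"
my @by_space = split ' ',  $line;       # any run of whitespace: "1" and "90"

printf "tab split:        %d field(s)\n", scalar @by_tab;
printf "whitespace split: %d field(s)\n", scalar @by_space;
```

With the tab split, the whole line ends up in the first field and the second field is undefined, which matches the first output shown above, where each row label is the full input line.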

        BTW, here's a cleaned-up version of your code, fixing the indentation and the split regex, adding Data::Dumper, moving some uses of "my" into their proper scope, and simplifying a few other things (also removed "use integer" -- it would have no effect on the code as-is, and if you ever do division, it's better to use the "int()" function as needed):

        There's still room for improvement... (especially if you really want it to do something different).
        Here is my version. It reads from __DATA__ if there is no 'Data.txt', and the output seems to be what you want (it also includes debugging output via DDS Dump).
Re: transforming XY data to X and Multiple Y column data?
by zappepcs (Acolyte) on Sep 29, 2010 at 02:23 UTC
    Not sure how you ended up doing it, but you could simply create an array for every row, then scroll through the hashes and assign the right columnar data to the right row array. The row arrays can be elements of a hash to make them easier to deal with: @{$matrix{"1"}} = @row1; ${$matrix{"1"}}[0] = 90; ${$matrix{"1"}}[1] = 70; and so forth, where {"1"} is the row number from your original data and the key of the hash, and the [1] is the #id'x' number, i.e. the column (element) number within the row array. You might put totals in column zero or whatever...
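
A rough sketch of that hash-of-row-arrays idea. It assumes, following the two assignments in the reply, that column 0 holds the first id's value and column 1 the second's; the values are typed in by hand from the sample data.

```perl
use strict;
use warnings;

my %matrix;                       # row number => array ref of column values

# Assumed layout: column 0 is id1's value, column 1 is id2's, and so on.
$matrix{1}[0] = 90;
$matrix{1}[1] = 70;
$matrix{4}[1] = 5;                # id1 has no value for row 4, so column 0 stays undef

for my $row (sort { $a <=> $b } keys %matrix) {
    # print undefined cells as empty strings so the columns stay aligned
    my @cells = map { defined($_) ? $_ : '' } @{ $matrix{$row} };
    print join("\t", $row, @cells), "\n";
}
```

As with a plain hash, this layout holds one value per row and column, so repeated rows would still need an extra level (for example an array per cell).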