http://qs321.pair.com?node_id=894701

astrobio has asked for the wisdom of the Perl Monks concerning the following question:

I have to fill in missing values in a sparse matrix of 50 columns and millions of rows. The missing values are filled importantly from the next good value in a column. The physical problem is from a signal generator where each column has a different sample rate. Any suggestions on how to make the following fragile code more robust to backfill missing values?
while (<DATA>){ s/^\s+|\s+//g; if(length($_)<1){ next;} @f=split(/,/,$_); push(@col1,{TIME=>$f[0],GOOD=>$f[1]}); push(@col2,{TIME=>$f[0],GOOD=>$f[2]}); if($unique{$f[0]}!=1){ push(@times,$f[0]); $unique{$f[0]}=1; } } # reverse sort by col1 values to backfill # forward sort by col1 time to catch last good foreach(sort {$$b{GOOD}<=>$$a{GOOD} || $$a{TIME}<=>$$b{TIME}} @col1){ if( ($$_{TIME}<$last_time) && ($$_{GOOD} == -999.9) ) { $out1{$$_{TIME}}=$last_good; }elsif($$_{GOOD} == -999.9){ $out1{$$_{TIME}}=$last_good; }else{ $last_time=$$_{TIME}; $last_good=$$_{GOOD}; $out1{$$_{TIME}}=$$_{GOOD}; } } foreach(sort {$$b{GOOD}<=>$$a{GOOD} && $$a{TIME}<=>$$b{TIME}} @col2){ if( ($$_{TIME}<$last_time) && ($$_{GOOD} == -999.9) ) { $out2{$$_{TIME}}=$last_good; }elsif($$_{GOOD} == -999.9){ $out2{$$_{TIME}}=$last_good; }else{ $last_time=$$_{TIME}; $last_good=$$_{GOOD}; $out2{$$_{TIME}}=$$_{GOOD}; } } foreach $time (@times){ print "$time,$out1{$time},$out2{$time}\n"; } # time,col1,col2 # (-999.9 out-of-range to indicate missing # and distinguish from true zeroes) __DATA__ 0.01,-999.9,1 0.02,-999.9,-999.9 0.03,-999.9,3 0.04,2,-999.9 0.05,-999.9,-999.9 0.06,5,4
which currently gives output
# time,col1,col2 0.01,2,1 0.02,2,1 0.03,2,3 0.04,2,1 0.05,2,1 0.06,5,4

It seems possible to do with just one data pass if the missing values are filled from the last good value only, and the first row was guaranteed filled to start.

Replies are listed 'Best First'.
Re: How best to fill missing values in a sparse matrix?
by BrowserUk (Patriarch) on Mar 22, 2011 at 05:34 UTC
    The missing values are filled importantly from the next good value in a column.

    Following that criteria, your output for the sample input should be:

    0.01,2,1 0.02,2,3 0.03,2,3 0.04,2,4 0.05,5,4 0.06,5,4

    Which is quite different from the sample output you've provided?

    Ie:

    next next good good valu valu 0.01, v -999.9, * 1 0.02, v -999.9, v -999.9 0.03, v -999.9, * 3 0.04, * 2, v -999.9 0.05, v -999.9, v -999.9 0.06, * 5, * 4

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How best to fill missing values in a sparse matrix?
by ELISHEVA (Prior) on Mar 22, 2011 at 07:18 UTC

    I'm a bit confused about your logic. You indicate that you want to look ahead for a good value, but you also seem to be relying on that first good row for missing values as well. (as indicated by your sample output and your mention of the fact that the first line is guaranteed to be good ). Also you say that the first line is guaranteed to be filled, but your sample input has out of range values. Does filled mean has a value (even out of range) or does filled mean "have an in-range value"? What determines when you look ahead and when you look behind?

    Also how many lines do you typically need to look ahead before you find a complete set of good values? And do you even need a complete set, or do you just need in-range values for the columns explicitly specified in the current line? Put another way, will your output file have 50 values for each row? Or will it only have 2 values if the input row also had two values?

    Assuming you always want to look forward for the next best value, why not use the *nix command tac to put the file in reverse order (last line first)?

    This would allow a much simpler look-behind After you read enough lines to fill up your default array with good values for each column, your script would never need to hold more than two lines in memory at a time (the composite good value array and the current line), no matter how sparse the data nor how far away the next good value is.

    Of course, if you are processing a real-time feed or do not have access to tac or an equivalent, then maybe you have no choice but to do a look ahead since you never really know what the "next" good value is until it happens, but if you don't really need to do it, there is no sin in taking the easy way out and using the tools at hand.

    Update: realized that there are lots of unanswered questions here and added some. Also withdrew suggestion of using tac - either reading forward or backwards there are missing values, one will still need a multi-line look ahead to construct a full set of good values.

Re: How best to fill missing values in a sparse matrix?
by wind (Priest) on Mar 22, 2011 at 09:56 UTC

    It looks like you're trying to do too much in your code. If you're goal is to just fill in NULL values, then focus on that first.

    The below code removes the uniqueness time check that you were doing since it appears superfluous, and relies on the fact that the times are sequential instead of having to do a sort. In the end everything is the same as your code though, the data structures are just simplified and all the validation is done in a single loop

    use strict; use constant NULL_VAL => -999.9; my @data = (); while (<DATA>) { chomp; s/^\s+|\s+//g; next if $_ eq ''; push @data, [split ',']; } my %last_good = map {$_ => NULL_VAL} (1..$#{$data[0]}); # Go in reverse to fill in null data for (my $i=$#data; $i>=0; $i--) { # Validate each column for my $j (1..$#{$data[$i]}) { if ($data[$i][$j] == NULL_VAL) { $data[$i][$j] = $last_good{$j}; } else { $last_good{$j} = $data[$i][$j]; } } } foreach (@data) { print join(',', @$_) . "\n"; } # time,col1,col2 # (-999.9 out-of-range to indicate missing # and distinguish from true zeroes) __DATA__ 0.01,-999.9,1 0.02,-999.9,-999.9 0.03,-999.9,3 0.04,2,-999.9 0.05,-999.9,-999.9 0.06,5,4
    Note, this is still potentially fragile in a way though, since it requires loading everything in memory to do the update. But you don't say whether you're doing anything special with this data, like outputting it to a new file.