http://qs321.pair.com?node_id=894701

astrobio has asked for the wisdom of the Perl Monks concerning the following question:

I have to fill in missing values in a sparse matrix of 50 columns and millions of rows. The missing values are filled importantly from the next good value in a column. The physical problem is from a signal generator where each column has a different sample rate. Any suggestions on how to make the following fragile code more robust to backfill missing values?
while (<DATA>){ s/^\s+|\s+//g; if(length($_)<1){ next;} @f=split(/,/,$_); push(@col1,{TIME=>$f[0],GOOD=>$f[1]}); push(@col2,{TIME=>$f[0],GOOD=>$f[2]}); if($unique{$f[0]}!=1){ push(@times,$f[0]); $unique{$f[0]}=1; } } # reverse sort by col1 values to backfill # forward sort by col1 time to catch last good foreach(sort {$$b{GOOD}<=>$$a{GOOD} || $$a{TIME}<=>$$b{TIME}} @col1){ if( ($$_{TIME}<$last_time) && ($$_{GOOD} == -999.9) ) { $out1{$$_{TIME}}=$last_good; }elsif($$_{GOOD} == -999.9){ $out1{$$_{TIME}}=$last_good; }else{ $last_time=$$_{TIME}; $last_good=$$_{GOOD}; $out1{$$_{TIME}}=$$_{GOOD}; } } foreach(sort {$$b{GOOD}<=>$$a{GOOD} && $$a{TIME}<=>$$b{TIME}} @col2){ if( ($$_{TIME}<$last_time) && ($$_{GOOD} == -999.9) ) { $out2{$$_{TIME}}=$last_good; }elsif($$_{GOOD} == -999.9){ $out2{$$_{TIME}}=$last_good; }else{ $last_time=$$_{TIME}; $last_good=$$_{GOOD}; $out2{$$_{TIME}}=$$_{GOOD}; } } foreach $time (@times){ print "$time,$out1{$time},$out2{$time}\n"; } # time,col1,col2 # (-999.9 out-of-range to indicate missing # and distinguish from true zeroes) __DATA__ 0.01,-999.9,1 0.02,-999.9,-999.9 0.03,-999.9,3 0.04,2,-999.9 0.05,-999.9,-999.9 0.06,5,4
which currently gives output
# time,col1,col2 0.01,2,1 0.02,2,1 0.03,2,3 0.04,2,1 0.05,2,1 0.06,5,4

It seems possible to do with just one data pass if the missing values are filled from the last good value only, and the first row was guaranteed filled to start.