Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Re: retrive data from another file by comparing the values

by eighty-one (Curate)
on Oct 06, 2008 at 13:25 UTC ( #715567=note: print w/replies, xml ) Need Help??


in reply to retrive data from another file by comparing the values

It would be easier to offer advice if you provided more information, like what the problem you are trying to solve is, and what you've tried so far. The snippet you posted would make more sense with some context. Although it's possible to read it and understand what it's doing, it's easier to do so if we know WHY it's doing what it's doing.

Also with a description of the problem, you might be offered a better solution than what you're pursuing; quite often I'll see a writeup and think, "Why, that's easy, you just do X!". I'll go into the node to reply as such, and someone else will have said "Why, that's easy! You just do Y!" - and it turns out Y is much easier and makes more sense than X.

A sample of your data and a description of what it is and what it means would probably be helpful, too.
  • Comment on Re: retrive data from another file by comparing the values

Replies are listed 'Best First'.
Re^2: retrive data from another file by comparing the values
by amshi (Initiate) on Oct 06, 2008 at 17:10 UTC
    Hi, In this program, this is the code:
    open(IN, "/path/outModified.pl") or die "$!"; while (my $line = <IN>) { chomp($line); my @array = (split (/\s+/, $line))[6, 7, 8]; # print "@array\n"; push @points, [ @array ]; } close(IN); print '@points : ', Dumper \@points; open my $out_file, '>', "/path/dist_modified.pl" or die "cannot open: +$!\n"; for my $i1 ( 0 .. $#points -1 ){ my ( $x1, $y1, $z1 ) = @{ $points[$i1] }; for my $i2 ( 1 .. $#points){ my ( $x2, $y2, $z2 ) = @{ $points[$i2] }; my $dist = sqrt(($x2 - $x1)**2 + ($y2 - $y1)**2 + ($z2 - $z1)**2); print $out_file "153L.pdb: the distance between CA $i1 and CA $i2 = $d +ist\n"; } } close $out_file or die "cannot close file: $!\n";
    Here, I am opening a file outModified.pl which is in the form
    ATOM 2 CA ARG A 1 6.324 32.707 50.379 ATOM 13 CA THR A 2 5.197 32.618 46.826 ATOM 20 CA ASP A 3 4.020 36.132 46.259 ATOM 28 CA CYS A 4 7.131 38.210 45.919 ATOM 34 CA TYR A 5 6.719 38.935 42.270 ATOM 46 CA GLY A 6 2.986 39.221 41.892 ATOM 50 CA ASN A 7 -0.269 37.184 41.565 ATOM 58 CA VAL A 8 -1.140 35.549 38.341 ATOM 65 CA ASN A 9 -4.817 35.710 39.211 ATOM 73 CA ARG A 10 -4.704 39.489 39.013 ATOM 84 CA ILE A 11 -3.243 39.663 35.555
    I am extracting the last three columns(this is the format of the .pdb file from protein data bank).the last three columns are X , Y and Z coordinates. I am extracting these three columns, putting them in an array and calculating the distance using this formula. ($dist = sqrt(($x2 - $x1)**2 + ($y2 - $y1)**2 + ($z2 - $z1)**2))

    Now, the distance is being calculated for over 300 values/300 cordinate values and the output is printed to an output file, but before printing it to the output file, I want to compare the distance($dist) to another file which has the data like this:

    Index Distance 1 2.69 2 2.97032 3 3.25064 4 3.53096 5 3.81128 6 4.0916 7 4.37192 8 4.65224 9 4.93256 10 5.21288 11 5.4932 12 5.77352 13 6.05384 14 6.33416 15 6.61448 16 6.8948 17 7.17512 18 7.45544 19 7.73576 20 8.01608 21 8.2964 22 8.57672 23 8.85704 24 9.13736 25 9.41768 26 9.698 27 9.97832 28 10.25864 29 10.53896 30 10.81928 31 11.0996 32 11.37992 33 11.66024 34 11.94056 35 12.22088 36 12.5012 37 12.78152 38 13.06184 39 13.34216 40 13.62248 41 13.9028 42 14.18312 43 14.46344 44 14.74376 45 15.02408 46 15.3044 47 15.58472 48 15.86504 49 16.14536 50 16.42568 51 16.706 52 16.98632 53 17.26664 54 17.54696 55 17.82728 56 18.1076 57 18.38792 58 18.66824 59 18.94856 60 19.22888 61 19.5092 62 19.78952 63 20.06984 64 20.35016 65 20.63048 66 20.9108 67 21.19112 68 21.47144 69 21.75176 70 22.03208 71 22.3124 72 22.59272 73 22.87304 74 23.15336 75 23.43368 76 23.714 77 23.99432 78 24.27464 79 24.55496 80 24.83528 81 25.1156 82 25.39592 83 25.67624 84 25.95656 85 26.23688 86 26.5172 87 26.79752 88 27.07784 89 27.35816 90 27.63848 91 27.9188 92 28.19912 93 28.47944 94 28.75976 95 29.04008 96 29.3204 97 29.60072 98 29.88104 99 30.16136 100 30.44168 101 30.722 102 31.00232 103 31.28264 104 31.56296 105 31.84328 106 32.1236 107 32.40392 108 32.68424 109 32.96456 110 33.24488 111 33.5252 112 33.80552 113 34.08584 114 34.36616 115 34.64648 116 34.9268 117 35.20712 118 35.48744 119 35.76776 120 36.04808 121 36.3284 122 36.60872 123 36.88904 124 37.16936 125 37.44968 126 37.73 127 38.01032 128 38.29064 129 38.57096 130 38.85128 131 39.1316 132 39.41192 133 39.69224 134 39.97256 135 40.25288 136 40.5332 137 40.81352 138 41.09384 139 41.37416 140 41.65448 141 41.9348 142 42.21512 143 42.49544 144 42.77576 145 43.05608 146 43.3364 147 43.61672 148 43.89704 149 44.17736 150 44.45768 151 44.738 152 45.01832 153 45.29864 154 45.57896 155 45.85928 156 46.1396 157 46.41992 158 46.70024 159 46.98056 160 47.26088 161 47.5412 162 47.82152 163 48.10184 164 48.38216 165 48.66248 166 48.9428 167 49.22312 168 49.50344 169 49.78376 170 50.06408 171 50.3444 172 50.62472 173 50.90504 174 51.18536 175 51.46568 176 51.746 177 52.02632 178 52.30664 179 52.58696 180 52.86728 181 53.1476 182 53.42792 183 53.70824
    I have to compare the $dist with the 2nd column(distance value/floating point value), get the nearest high value, and print the corresponding first column value. I have to make sure, that if the $dist value is >= 50.000, then it should print the first column index of 50.0000.its the threshold value/set the variable value to 50.000. the second column values are the distance values and the first column values are the index values. I have to open this file, compare or match the (already calculated) $dist value with the floating point values in the second column and print the correspoinding first column value in the output file. this is the problem.

      OK, looking at the full table, using List::Util::first may get tedious if you're doing a lot of this. But the binary chop I suggested may be too general purpose.

      Observing that the "distance" values are in a small range (0..54) it seemed that the complexity of the binary chop could be saved, by using int(distance) to look up a short list of distances. It further occured to me that using int(distance * 2**n) could reduce this to almost a direct lookup.

      The code below shows the variants attempted.

      I was curious how much real difference this made, so I got out Benchmark to have a look. The variants are:

        table_a   -- linear search of list of distances, using List::Util::first
        table_A   -- same as table_a, but hand cranking (a foreach loop)
        table_b   -- binary chop
        table_h   -- look up by int(v) and scan short list
        table_hh  -- look up by int(v * 2**n) and choose
      
      the results are:
        Running benchmark over test set of 10,000 values, count=100.
                   Rate table_a  table_A  table_b  table_h  table_hh
        table_a  2.37/s       --     -18%     -69%     -88%     -93%
        table_A  2.88/s      22%       --     -62%     -85%     -91%
        table_b  7.59/s     221%     164%       --     -60%     -77%
        table_h  19.0/s     702%     559%     150%       --     -42%
        table_hh 32.9/s    1291%    1043%     334%      73%       --
      
      So what does this tell us ?
      1. the linear search is a trivial amount of code, and will crunch through 23,700 values in a second -- so you may never care that there are faster ways !
      2. the binary chop is high tone stuff, but three times faster than the linear search is frankly disappointing, given the expenditure of brain-power -- of course, if the number of distances were to double, this would cope !
      3. the look ups are rather faster, and the direct look up is 13 times faster -- if you had millions of these values to process, you'd be glad of that. Constructing the necessary tables is a bit of code, but not too taxing.
      so in general terms, we are reminded that when optimising:
      1. don't even start until you have a problem !
      2. general purpose algorithms are fine, but it's worth looking at the data to see if that simplifies things.

      I was expecting to also remark that looking for an appropriate module will help. In this case, however, List::Util::first is slower :-(


      use strict ; use warnings ; use List::Util qw(first) ; use Benchmark () ; # Suck table into @data, from which will construct various tables my @data = () ; while (<DATA>) { my ($r, $v) = m/^\s*(\S+)\s+(\S+)\s*$/ or die "invalid data" ; push @data, [$r, $v] ; } ; @data = sort { $a->[1] <=> $b->[1] ; } @data ; # make sure sorted # @table_a is for straight linear search my @table_a = reverse @data ; push @table_a, [0, 0] ; # @table_b is for binary chop my @table_b = @data ; # @table_h is for lookup by int($v), followed by linear search # # Each entry is [$r0] -> $r0 is the result # or [$r0, $v1, $r1] -> $r0 is the result, # unless $v >= $v1, wh +en $r1 is the result # or [$r0, $v1, $r1, $v2, $r2] -> $r0 is the result, # unless $v >= $v1, whe +n $r1 is the result # unless $v >= $v2, whe +n $r2 is the result # and so on my @table_h = () ; my $ip = -1 ; my $rp = 0 ; foreach my $d (@data) { my ($r, $v) = @$d ; die if $v < 0 ; # Not ready for -ve values in + table my $i = int($v) ; while ($ip < $i) { $table_h[++$ip] = [$rp] ; } ; push @{$table_h[$i]}, $v, $r ; $rp = $r ; } ; $table_h[++$ip] = [$rp] ; # Dummy entry for larger $v # @table_hh is for lookup by int($v * $spread_hh) followed by single t +est # # Each entry is [$r0, $v -> $r0 is the result # or [$r0, $v1, $r1] -> $r0 is the result, # unless $v >= $v1, wh +en $r1 is the result # # We increase $spread_hh until the above is achieved. # # NB: $spread_hh is a power of 2, which guarantees that $v * $spread_h +h is exact. my @table_hh ; my $spread_hh = 1 ; FILL: while (1) { @table_hh = () ; $spread_hh *= 2 ; my $ip = -1 ; my $rp = 0 ; foreach my $d (@data) { my ($r, $v) = @$d ; die if $v < 0 ; # Not ready for -ve values +in table my $i = int($v * $spread_hh) ; next FILL if ($i == $ip) ; while ($ip < $i) { $table_hh[++$ip] = [$rp, $v] ; } ; push @{$table_hh[$i]}, $r ; $rp = $r ; } ; $table_hh[++$ip] = [$rp] ; # Dummy entry for larger $v last ; } ; # Quick test to ensure all methods work my @test = (2, 2.69, 2.7, 3.9, 10.8, 17.9, 38.01033, 50, 51, 53.5, 2 +00) ; my @ok = (0, 1, 1, 5, 29, 55, 127, 169, 173, 182, 1 +83) ; foreach my $t (@test) { my $r_A = whack_A($t) ; my $r_a = whack_a($t) ; my $r_b = whack_b($t) ; my $r_h = whack_h($t) ; my $r_hh = whack_hh($t) ; my $o = shift(@ok) ; my $s = "" ; if ($r_A != $o) { $s .= " BUT \$r_A=$r_A" ; } ; if ($r_a != $o) { $s .= " BUT \$r_a=$r_a" ; } ; if ($r_b != $o) { $s .= " BUT \$r_b=$r_b" ; } ; if ($r_h != $o) { $s .= " BUT \$r_h=$r_h" ; } ; if ($r_hh != $o) { $s .= " BUT \$r_hh=$r_hh" ; } ; printf "%7.3f -> %4d %s\n", $t, $o, $s || "OK" ; } ; # Larger test set, testing methods against each other ! srand 6567531764 ; # same test set each time ! my @full_test = () ; my $full_test = 10000 ; for (1..$full_test) { push @full_test, rand(56) ; } ; print "Running consistency test over test set of $full_test values" ; my $fail = 0 ; foreach my $t (@full_test) { my $r_A = whack_A($t) ; my $r_a = whack_a($t) ; my $r_b = whack_b($t) ; my $r_h = whack_h($t) ; my $r_hh = whack_hh($t) ; if (($r_a != $r_A) || ($r_b != $r_A) || ($r_h != $r_A) || ($r_hh != + $r_A)) { printf "\n%7.3f \$r_A=%3d, \$r_a=%3d, \$r_b=%3d, \$r_h=%3d, \$r_hh +=%3d", $t, $r_A, $r_a, $r_b, $r_h, $r_hh + ; $fail++ ; } ; } ; if ($fail) { die " -- $fail failures\n" ; } ; printf " -- OK\n" ; # Benchmark the methods my $bench = 100 ; print "Running benchmark over test set of $full_test values, count=$be +nch.\n" ; Benchmark::cmpthese($bench, { 'table_A ' => sub { whack_A($_) for (@full_test) }, 'table_a ' => sub { whack_a($_) for (@full_test) }, 'table_b ' => sub { whack_b($_) for (@full_test) }, 'table_h ' => sub { whack_h($_) for (@full_test) }, 'table_hh' => sub { whack_hh($_) for (@full_test) }, }); #--------------------------------------------------------------------- +-------------------- # whack_a: linear search in @table_a, using List::Util::first. sub whack_a { my ($v) = @_ ; my $r = first { $_->[1] <= $v } @table_a ; return $r->[0] ; # return defined($r) ? $r->[0] : 0 ; } ; #--------------------------------------------------------------------- +-------------------- # whack_A: linear search in @table_a, hand cranked. sub whack_A { my ($v) = @_ ; foreach (@table_a) { return $_->[0] if ($_->[1] <= $v) ; } ; return 0 ; } ; #--------------------------------------------------------------------- +-------------------- # whack_hh: lookup $table_hh[int($v * $spread_hh)] and choose result. sub whack_hh { my ($v) = @_ ; my $i = int($v * $spread_hh) ; if ($i <= 0) { return $table_hh[ 0]->[0] ; } ; if ($i >= $#table_hh) { return $table_hh[-1]->[0] ; } ; my $rt = $table_hh[$i] ; return $v < $rt->[1] ? $rt->[0] : $rt->[2] ; } ; #--------------------------------------------------------------------- +-------------------- # whack_h: lookup $table_h[int($v)] and linear scan for result sub whack_h { my ($v) = @_ ; my $i = int($v) ; if ($i <= 0) { return $table_h[ 0]->[0] ; } ; if ($i >= $#table_h) { return $table_h[-1]->[0] ; } ; my $rt = $table_h[$i] ; $i = 0 ; while ($i < $#$rt) { last if $v < $rt->[$i+1] ; $i += 2 } ; return $rt->[$i] ; } ; #--------------------------------------------------------------------- +-------------------- # whack_b: binary chop @table_b sub whack_b { my ($v) = @_ ; my $rt = $table_b[0] ; if ($rt->[1] > $v) { return 0 ; } # quit if not in table my $h = $#table_b ; my $l = 0 ; my $i ; while ($l < $h) { $rt = $table_b[$i = ($l + $h + 1) >> 1] ; # new mid index and val +ue if ($rt->[1] <= $v) { last if $rt->[1] == $v ; # quit if exact match $l = $i ; # entry[$i] < $v -- move search u +p } else { $h = --$i ; # entry[$i] > $v -- move search + down past it } ; } ; return $table_b[$i]->[0] ; } ; #--------------------------------------------------------------------- +-------------------- # The data ! __DATA__ 1 2.69 2 2.97032 3 3.25064 4 3.53096 5 3.81128 6 4.0916 7 4.37192 8 4.65224 9 4.93256 10 5.21288 11 5.4932 12 5.77352 13 6.05384 14 6.33416 15 6.61448 16 6.8948 17 7.17512 18 7.45544 19 7.73576 20 8.01608 21 8.2964 22 8.57672 23 8.85704 24 9.13736 25 9.41768 26 9.698 27 9.97832 28 10.25864 29 10.53896 30 10.81928 31 11.0996 32 11.37992 33 11.66024 34 11.94056 35 12.22088 36 12.5012 37 12.78152 38 13.06184 39 13.34216 40 13.62248 41 13.9028 42 14.18312 43 14.46344 44 14.74376 45 15.02408 46 15.3044 47 15.58472 48 15.86504 49 16.14536 50 16.42568 51 16.706 52 16.98632 53 17.26664 54 17.54696 55 17.82728 56 18.1076 57 18.38792 58 18.66824 59 18.94856 60 19.22888 61 19.5092 62 19.78952 63 20.06984 64 20.35016 65 20.63048 66 20.9108 67 21.19112 68 21.47144 69 21.75176 70 22.03208 71 22.3124 72 22.59272 73 22.87304 74 23.15336 75 23.43368 76 23.714 77 23.99432 78 24.27464 79 24.55496 80 24.83528 81 25.1156 82 25.39592 83 25.67624 84 25.95656 85 26.23688 86 26.5172 87 26.79752 88 27.07784 89 27.35816 90 27.63848 91 27.9188 92 28.19912 93 28.47944 94 28.75976 95 29.04008 96 29.3204 97 29.60072 98 29.88104 99 30.16136 100 30.44168 101 30.722 102 31.00232 103 31.28264 104 31.56296 105 31.84328 106 32.1236 107 32.40392 108 32.68424 109 32.96456 110 33.24488 111 33.5252 112 33.80552 113 34.08584 114 34.36616 115 34.64648 116 34.9268 117 35.20712 118 35.48744 119 35.76776 120 36.04808 121 36.3284 122 36.60872 123 36.88904 124 37.16936 125 37.44968 126 37.73 127 38.01032 128 38.29064 129 38.57096 130 38.85128 131 39.1316 132 39.41192 133 39.69224 134 39.97256 135 40.25288 136 40.5332 137 40.81352 138 41.09384 139 41.37416 140 41.65448 141 41.9348 142 42.21512 143 42.49544 144 42.77576 145 43.05608 146 43.3364 147 43.61672 148 43.89704 149 44.17736 150 44.45768 151 44.738 152 45.01832 153 45.29864 154 45.57896 155 45.85928 156 46.1396 157 46.41992 158 46.70024 159 46.98056 160 47.26088 161 47.5412 162 47.82152 163 48.10184 164 48.38216 165 48.66248 166 48.9428 167 49.22312 168 49.50344 169 49.78376 170 50.06408 171 50.3444 172 50.62472 173 50.90504 174 51.18536 175 51.46568 176 51.746 177 52.02632 178 52.30664 179 52.58696 180 52.86728 181 53.1476 182 53.42792 183 53.70824

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://715567]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (2)
As of 2021-10-21 20:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My first memorable Perl project was:







    Results (84 votes). Check out past polls.

    Notices?