Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re^3: Merging partially duplicate lines

by duelafn (Parson)
on Jan 31, 2016 at 08:03 UTC ( [id://1154111]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Merging partially duplicate lines
in thread Merging partially duplicate lines

I'm not sure I agree with the others about using a database. Generating a string key is generally easy enough and if your input are already sorted, you can process huge files without consuming unreasonable memory. You would need to be certain that they are in fact sorted and that their sorting matches the sorting you create in the parse_line function. A merge which keeps all keys in memory is a bit safer in that respect, but can blow up your RAM if the files are large.

#!/usr/bin/perl use strict; use warnings; use 5.014; open my $A, "<", "A" or die; open my $B, "<", "B" or die; sorted_merge($A, $B); # memory_merge($A, $B); sub sorted_merge { my @handle = @_; my @info; for my $fh (@handle) { my %h; @h{qw/key avg n/} = parse_line(scalar readline($fh)); push @info, \%h; } while (1) { # smallest key my ($next) = sort(grep defined($_), map $$_{key}, @info); last unless $next; my $sum = 0; my $n = 0; for my $i (0..$#handle) { next unless $info[$i]{key} and $info[$i]{key} eq $next; $sum += $info[$i]{avg} * $info[$i]{n}; $n += $info[$i]{n}; @{$info[$i]}{qw/key avg n/} = parse_line(scalar readline($ +handle[$i])); } next unless $n; print_line($next, $sum/$n, $n); } } sub memory_merge { my @handle = @_; my %data; for my $fh (@handle) { while (defined(my $line = <$fh>)) { my ($key, $avg, $n) = parse_line($line); if ($data{$key}) { $data{$key}{sum} += $avg * $n; $data{$key}{n} += $n; } else { $data{$key} = { sum => $avg * $n, n => $n, }; } } } for my $key (sort keys(%data)) { print_line($key, $data{$key}{sum}/$data{$key}{n}, $data{$key}{ +n}); } } sub print_line { my ($key, $avg, $n) = @_; my @cols = split /\s+/, $key; push @cols, $avg, $n; say join "\t", @cols; } sub parse_line { my $line = shift; return unless $line; my @col = split /\s+/, $line; # Format the key so that they sort correctly as strings. # Choose padding sizes carefully. my $key = sprintf "%-5s %4d %-10s %-10s", @col[0..3]; my $avg = $col[4]; my $n = $col[5]; return ($key, $avg, $n); }

Good Day,
    Dean

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1154111]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (2)
As of 2024-04-26 05:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found