PerlMonks
Re: Merging partially duplicate lines

by duelafn (Parson)
on Jan 30, 2016 at 19:54 UTC ( [id://1154079] )


in reply to Merging partially duplicate lines

Your samples appear to be sorted. If this is true of your actual data, you wouldn't need to keep much state and could simply use a merging algorithm, comparing only the first line of each file.

Comment: This looks like you might be collating statistical responses (column 5 is an average, column 6 a response count?). If so, would you want the weighted average rather than the simple average? (e.g., column 5 from the first lines would be (6 * 0.25 + 8 * 0.30) / (6 + 8) = 0.27857). Of course, I have no idea what you are actually trying to do, so feel free to ignore this if I'm misinterpreting.
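To make the arithmetic concrete, the weighted average of several (average, count) pairs can be computed like this (the 0.25/6 and 0.30/8 pairs are the ones from the example figures above):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two (average, response-count) pairs, one per input file.
my @pairs = ( [0.25, 6], [0.30, 8] );

my ($sum, $n) = (0, 0);
for my $p (@pairs) {
    $sum += $p->[0] * $p->[1];   # reconstruct the underlying total
    $n   += $p->[1];             # total response count
}
printf "%.5f\n", $sum / $n;      # prints 0.27857
```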

Good Day,
    Dean

Replies are listed 'Best First'.
Re^2: Merging partially duplicate lines
by K_Edw (Beadle) on Jan 30, 2016 at 21:36 UTC
    That is remarkably similar to the actual purpose of this - I would indeed require a weighted average, but I thought it best to figure out the basics first. My data is sorted, but some lines will simply be missing from some files.

      Here's a database solution; adding the weighted average to the query shows the flexibility of the approach.

      #!perl
      use strict;
      use DBI;

      # create table
      my $dbh = create_db('database.sqlite');

      # load data
      my @files = qw(fileA.txt fileB.txt);
      for my $file (@files){
          load_db($dbh,$file);
      }

      # report
      my $query = 'SELECT A,B,C,D,AVG(E),SUM(F),
                   MIN(E),MAX(E),COUNT(*),SUM(E*F)/SUM(F)
                   FROM test
                   GROUP BY A,B,C,D
                   ORDER BY A,B,C,D';
      report($dbh,$query);
      poj
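poj's snippet leaves create_db, load_db, and report undefined. A minimal sketch of what they might look like, assuming DBD::SQLite is available and that the six whitespace-separated columns are named A..F (the table layout and types here are my assumption, not poj's):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Assumed helpers for the snippet above; the A..F column layout
# and types are guesses based on the sample data in the thread.
sub create_db {
    my ($file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", "", "",
                           { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS test
              (A TEXT, B INTEGER, C TEXT, D TEXT, E REAL, F INTEGER)');
    return $dbh;
}

sub load_db {
    my ($dbh, $file) = @_;
    my $sth = $dbh->prepare('INSERT INTO test VALUES (?,?,?,?,?,?)');
    open my $fh, '<', $file or die "$file: $!";
    while (my $line = <$fh>) {
        my @col = split ' ', $line;
        $sth->execute(@col[0 .. 5]);
    }
    close $fh;
}

sub report {
    my ($dbh, $query) = @_;
    for my $row (@{ $dbh->selectall_arrayref($query) }) {
        print join("\t", @$row), "\n";
    }
}
```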

      I'm not sure I agree with the others about using a database. Generating a string key is generally easy enough, and if your inputs are already sorted, you can process huge files without consuming unreasonable memory. You would need to be certain that the files are in fact sorted and that their ordering matches the ordering produced by the parse_line function. A merge which keeps all keys in memory is a bit safer in that respect, but can blow up your RAM if the files are large.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use 5.014;

      open my $A, "<", "A" or die;
      open my $B, "<", "B" or die;

      sorted_merge($A, $B);
      # memory_merge($A, $B);

      sub sorted_merge {
          my @handle = @_;
          my @info;
          for my $fh (@handle) {
              my %h;
              @h{qw/key avg n/} = parse_line(scalar readline($fh));
              push @info, \%h;
          }

          while (1) {
              # smallest key
              my ($next) = sort(grep defined($_), map $$_{key}, @info);
              last unless $next;

              my $sum = 0;
              my $n   = 0;
              for my $i (0..$#handle) {
                  next unless $info[$i]{key} and $info[$i]{key} eq $next;
                  $sum += $info[$i]{avg} * $info[$i]{n};
                  $n   += $info[$i]{n};
                  @{$info[$i]}{qw/key avg n/} = parse_line(scalar readline($handle[$i]));
              }
              next unless $n;
              print_line($next, $sum/$n, $n);
          }
      }

      sub memory_merge {
          my @handle = @_;
          my %data;
          for my $fh (@handle) {
              while (defined(my $line = <$fh>)) {
                  my ($key, $avg, $n) = parse_line($line);
                  if ($data{$key}) {
                      $data{$key}{sum} += $avg * $n;
                      $data{$key}{n}   += $n;
                  } else {
                      $data{$key} = { sum => $avg * $n, n => $n };
                  }
              }
          }
          for my $key (sort keys(%data)) {
              print_line($key, $data{$key}{sum}/$data{$key}{n}, $data{$key}{n});
          }
      }

      sub print_line {
          my ($key, $avg, $n) = @_;
          my @cols = split /\s+/, $key;
          push @cols, $avg, $n;
          say join "\t", @cols;
      }

      sub parse_line {
          my $line = shift;
          return unless $line;
          my @col = split /\s+/, $line;
          # Format the key so that keys sort correctly as strings.
          # Choose padding sizes carefully.
          my $key = sprintf "%-5s %4d %-10s %-10s", @col[0..3];
          my $avg = $col[4];
          my $n   = $col[5];
          return ($key, $avg, $n);
      }

      Good Day,
          Dean
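One detail in parse_line worth calling out: the fixed-width sprintf padding is what makes plain string comparison agree with numeric order for the numeric column. A tiny demonstration (the %4d width is an arbitrary choice here; pick one wide enough for your largest value):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.014;

my @raw = ("2", "10");
say join "|", sort @raw;      # prints 10|2 -- lexicographic order is wrong

my @padded = map { sprintf "%4d", $_ } @raw;
say join "|", sort @padded;   # prints "   2|  10" -- numeric order restored
```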
