Re^5: statistics of a large text

Don't worry too much about micro-optimization. The key is to take advantage of the fact that all the lines for a given n-gram are bunched together in the intermediate file, so you only need to keep track of one n-gram's line numbers at a time. I would do that something like this (untested):

#! /usr/bin/perl -w
use strict;

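my $last_n_gram;
my @line_numbers;
while (<>) {
    # Each input line looks like "n-gram: line_number"; skip anything else.
    my ($n_gram, $line_number) = /(.*): (\d+)$/ or next;
    # A new n-gram means the previous group is complete, so flush it.
    if (defined $last_n_gram and $n_gram ne $last_n_gram) {
        @line_numbers = sort {$a <=> $b} @line_numbers;
        print "$last_n_gram: @line_numbers\n";
        @line_numbers = ();
    }
    $last_n_gram = $n_gram;
    push @line_numbers, $line_number;
}
# Flush the final group (if there was any input at all).
if (defined $last_n_gram) {
    @line_numbers = sort {$a <=> $b} @line_numbers;
    print "$last_n_gram: @line_numbers\n";
}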

This assumes that you're going to run it as reduce_step.pl intermediate_file > final_file.
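
To make the expected input format concrete, here is a small self-contained sketch of the same grouping idea run on in-memory data instead of a file. The sub name group_sorted_pairs and the sample n-grams are made up for illustration, and it assumes, as above, that equal n-grams are already adjacent in the input.

```perl
#! /usr/bin/perl -w
use strict;

# Hypothetical helper: collapse adjacent "n-gram: line_number" records
# into one "n-gram: n1 n2 ..." record per n-gram.
# Assumes equal n-grams are already adjacent in the input.
sub group_sorted_pairs {
    my @records = @_;
    my @out;
    my ($last, @nums);
    for my $record (@records) {
        # Skip records that are not in the expected format.
        my ($n_gram, $num) = $record =~ /(.*): (\d+)$/ or next;
        if (defined $last and $n_gram ne $last) {
            push @out, "$last: " . join(" ", sort { $a <=> $b } @nums);
            @nums = ();
        }
        $last = $n_gram;
        push @nums, $num;
    }
    # Flush the final group, if any record matched.
    push @out, "$last: " . join(" ", sort { $a <=> $b } @nums)
        if defined $last;
    return @out;
}

# Made-up sample data in the "n-gram: line_number" format.
my @sample = ("of the: 12", "of the: 3", "the cat: 7");
print "$_\n" for group_sorted_pairs(@sample);
# prints:
#   of the: 3 12
#   the cat: 7
```

Only one n-gram's line numbers are held at any moment, which is the point of relying on the bunching rather than building a hash of everything.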

In Section Seekers of Perl Wisdom