http://qs321.pair.com?node_id=884570


in reply to Re^4: statistics of a large text
in thread statistics of a large text

Don't worry too much about micro-optimization. The key is to take advantage of the fact that an n-gram is all bunched together so you don't have to track too much information. I would do that something like this (untested):
#! /usr/bin/perl -w use strict; my $last_n_gram = ""; my @line_numbers; while (<>) { my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/); if ($n_gram ne $last_n_gram and @line_numbers) { @line_numbers = sort {$a <=> $b} @line_numbers; print "$last_n_gram: @line_numbers\n"; $last_n_gram = $n_gram; @line_numbers = (); } push @line_numbers, $line_number; } @line_numbers = sort {$a <=> $b} @line_numbers; print "$last_n_gram: @line_numbers\n";
This assumes that you're going to reduce_step.pl intermediate_file > final_file.