in reply to statistics of a large text
- Take text, emit key/value pairs.
- Sort the key/value pairs by key then value.
- For each key, process the sorted values.
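To make those three steps concrete, here is a toy, all-in-memory sketch. The sub names (map_pairs, sort_pairs, reduce_pairs) and the choice of "word, line number" as the key/value pair are my own for illustration, not part of any framework.

```perl
use strict;
use warnings;

# Map: take text, emit key/value pairs (here: lowercased word, line number).
sub map_pairs {
    my @lines = @_;
    my @pairs;
    for my $i (0 .. $#lines) {
        push @pairs, [lc($_), $i + 1] for $lines[$i] =~ /\w+/g;
    }
    return @pairs;
}

# Sort: order the pairs by key, then by value.
sub sort_pairs {
    return sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @_;
}

# Reduce: one pass over the sorted pairs, gathering each key's values together.
sub reduce_pairs {
    my @sorted = @_;
    my @out;
    for my $pair (@sorted) {
        my ($key, $value) = @$pair;
        if (@out and $out[-1][0] eq $key) {
            push @{ $out[-1][1] }, $value;    # same key as the previous pair
        }
        else {
            push @out, [$key, [$value]];      # a new key starts here
        }
    }
    return @out;
}

my @result = reduce_pairs(sort_pairs(map_pairs("the cat", "the dog")));
print "$_->[0]: @{ $_->[1] }\n" for @result;
```

The reduce step only ever looks at the previous key, which is why the same shape still works when the sorted pairs live in a file that is far too big for memory.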
In your example you can benefit from the same approach, even using just one machine, even without a framework.
Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory. (Particularly not with how wasteful Perl is.) So don't even try. You need to plan on using the disk. And map-reduce uses disk in a way that is very friendly to how disks like to be used. (Stream data to and from, don't seek.)
What you should do is read your original file, and print out all of your n-grams to a second file. It will have lines of the form $n_gram: $line_number. Then call the unix sort utility on the second file to get a third file that will have the exact same lines, only sorted by $n_gram, then line number. (Line numbers will be sorted asciibetically, not numerically.)
Now take one pass through the third file to collapse it to a file with lines of the form $n_gram: @line_numbers. (This file will be trivially sorted. If you care, you can sort your line numbers correctly before printing this file.) And now you can use the built-in module Search::Dict to quickly look up any n-gram of interest in that file.
(But if you have to do any significant further processing of this data, I would recommend trying to think about that processing using a similar MapReduce idea. Doing lots of lookups means that you'll be seeking to disk a lot, and disk seeks are slow.)
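A rough sketch of that pipeline follows. The sub names, the file arguments, and the tokenizing rule (lowercased \w+ words joined by single spaces) are all assumptions of mine, so adjust them to however you really build your n-grams. It also assumes a unix sort on your PATH, run with LC_ALL=C so that its byte ordering agrees with the string comparison Search::Dict does by default.

```perl
use strict;
use warnings;
use Search::Dict;

# Step 1: stream the text, printing one "$n_gram: $line_number" line per n-gram.
sub emit_ngrams {
    my ($text_file, $pairs_file, $n) = @_;
    open my $in,  '<', $text_file  or die "Can't read $text_file: $!";
    open my $out, '>', $pairs_file or die "Can't write $pairs_file: $!";
    while (my $line = <$in>) {
        my @words = map { lc($_) } $line =~ /\w+/g;   # assumed tokenizer
        for my $i (0 .. $#words - $n + 1) {
            print {$out} join(' ', @words[$i .. $i + $n - 1]), ": $.\n";
        }
    }
    close $out or die "Can't close $pairs_file: $!";
}

# Step 2: let the unix sort utility do the heavy lifting on disk.
sub sort_file {
    my ($pairs_file, $sorted_file) = @_;
    local $ENV{LC_ALL} = 'C';    # plain byte order
    system('sort', '-o', $sorted_file, $pairs_file) == 0
        or die "sort failed: $?";
}

# Step 3: one pass over the sorted file, collapsing to "$n_gram: @line_numbers".
sub collapse {
    my ($sorted_file, $index_file) = @_;
    open my $in,  '<', $sorted_file or die "Can't read $sorted_file: $!";
    open my $out, '>', $index_file  or die "Can't write $index_file: $!";
    my ($current, @numbers);
    my $flush = sub {
        return unless defined $current;
        # Line numbers arrive in string order, so sort them numerically here.
        print {$out} "$current: ", join(' ', sort { $a <=> $b } @numbers), "\n";
    };
    while (my $line = <$in>) {
        chomp $line;
        my ($ngram, $number) = $line =~ /^(.*): (\d+)$/ or next;
        if (!defined $current or $ngram ne $current) {
            $flush->();
            ($current, @numbers) = ($ngram);
        }
        push @numbers, $number;
    }
    $flush->();
    close $out or die "Can't close $index_file: $!";
}

# Lookup: binary search the collapsed file for one n-gram.
sub lookup {
    my ($index_file, $ngram) = @_;
    open my $fh, '<', $index_file or die "Can't read $index_file: $!";
    my $key = "$ngram: ";
    look($fh, $key) >= 0 or die "Search failed on $index_file";
    my $line = <$fh>;            # first line that sorts at or after $key
    return unless defined $line and index($line, $key) == 0;
    chomp $line;
    return split ' ', substr($line, length $key);
}
```

You would call emit_ngrams, sort_file and collapse once, in that order, and then lookup as needed. It returns the list of line numbers, or an empty list if the n-gram is not in the file. Only one line's worth of data plus the current n-gram's line numbers is ever held in memory.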