PerlMonks  

The standard way to do this type of text analysis on large bodies of text is to use MapReduce. The workflow for that looks like this:
  1. Take text, emit key/value pairs.
  2. Sort the key/value pairs by key then value.
  3. For each key, process the sorted values.
With a typical framework like Hadoop you only have to write the first and third steps, which are called Map and Reduce respectively. All three steps can also be distributed across multiple machines, allowing you to scale the work across a cluster.
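
To make the three steps concrete, here is a minimal in-memory sketch in Perl. It is a toy, not how Hadoop does it: the names map_line and reduce_key, the sample lines, and the choice of single words as keys are all made up for illustration. The point is only that Map and Reduce are the parts you write, and the sort in the middle is generic.

```perl
use strict;
use warnings;

# Map (hypothetical helper): turn one line of text into key/value pairs,
# here word => line number.
sub map_line {
    my ($line_number, $text) = @_;
    return map { [ lc($_), $line_number ] } $text =~ /(\w+)/g;
}

# Reduce (hypothetical helper): called once per key with all of that
# key's values, already sorted.
sub reduce_key {
    my ($key, @values) = @_;
    return "$key: @values";
}

my @lines = ("the cat sat", "the dog sat", "a cat ran");   # sample data

# Step 1: map every line to key/value pairs.
my @pairs;
for my $i (0 .. $#lines) {
    push @pairs, map_line($i + 1, $lines[$i]);
}

# Step 2: sort by key, then by value.
@pairs = sort { $a->[0] cmp $b->[0] or $a->[1] <=> $b->[1] } @pairs;

# Step 3: walk the sorted pairs and hand each run of equal keys to reduce.
my @output;
my ($current, @values);
for my $pair (@pairs) {
    my ($key, $value) = @$pair;
    if (defined $current and $key ne $current) {
        push @output, reduce_key($current, @values);
        @values = ();
    }
    $current = $key;
    push @values, $value;
}
push @output, reduce_key($current, @values) if defined $current;

print "$_\n" for @output;
```

Because step 3 only ever looks at the current key and its values, it never needs more than one key's worth of data in memory, which is what lets the same shape work on files far larger than RAM.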

You can benefit from the same approach in your case, even on a single machine and without a framework.

Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory, particularly not with how much memory Perl's data structures use. So don't even try; plan on using the disk. MapReduce uses disk in a way that is very friendly to how disks like to be used: stream data to and from it, don't seek.

What you should do is read your original file and print all of your n-grams to a second file, with lines of the form $n_gram: $line_number. Then call the Unix sort utility on the second file to get a third file that has exactly the same lines, only sorted by $n_gram and then by line number. (The line numbers will be sorted asciibetically, not numerically.) Now take one pass through the third file to collapse it into a file with lines of the form $n_gram: @line_numbers. (This file is already sorted by n-gram, because its input was. If you care, you can sort each line's numbers numerically before printing it.)

Now you can use the core module Search::Dict to quickly look up any n-gram of interest in that file. (But if you have to do any significant further processing of this data, I would recommend thinking about that processing in the same MapReduce style. Doing lots of lookups means that you'll be seeking on disk a lot, and disk seeks are slow.)
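
Here is a rough sketch of that pipeline, under some assumptions of mine rather than anything from the original question. The function names build_index and lines_for and the .pairs/.sorted temporary file names are hypothetical. Words are found with a crude \w+ match and lowercased. It separates the n-gram from the line numbers with a tab instead of ": ", so that sort can order the n-gram bytewise and the line number numerically. It runs sort with LC_ALL=C so that the file order matches the plain string comparison Search::Dict does by default, and it assumes a Unix sort is on your PATH.

```perl
use strict;
use warnings;
use Search::Dict;

# Build $index from $input: one line per n-gram, "n-gram<TAB>line numbers".
sub build_index {
    my ($input, $n, $index) = @_;
    my ($pairs, $sorted) = ("$index.pairs", "$index.sorted");

    # Map: stream the text, emit one "n-gram<TAB>line number" record each.
    open my $in,  '<', $input or die "Cannot read $input: $!";
    open my $out, '>', $pairs or die "Cannot write $pairs: $!";
    while (my $line = <$in>) {
        my @words = map { lc($_) } $line =~ /(\w+)/g;   # crude tokenizer
        for my $i (0 .. $#words - $n + 1) {
            print {$out} join(' ', @words[$i .. $i + $n - 1]), "\t", $., "\n";
        }
    }
    close $in;
    close $out or die "Cannot close $pairs: $!";

    # Sort: byte order on the n-gram, numeric order on the line number.
    {
        local $ENV{LC_ALL} = 'C';
        system('sort', '-t', "\t", '-k1,1', '-k2,2n', '-o', $sorted, $pairs) == 0
            or die "sort failed with status $?";
    }

    # Reduce: collapse each run of identical n-grams into a single line.
    open my $sorted_fh, '<', $sorted or die "Cannot read $sorted: $!";
    open my $index_fh,  '>', $index  or die "Cannot write $index: $!";
    my ($current, @numbers);
    while (my $record = <$sorted_fh>) {
        chomp $record;
        my ($ngram, $number) = split /\t/, $record;
        if (defined $current and $ngram ne $current) {
            print {$index_fh} "$current\t@numbers\n";
            @numbers = ();
        }
        $current = $ngram;
        # Skip repeats when an n-gram occurs twice on the same line.
        push @numbers, $number unless @numbers and $numbers[-1] == $number;
    }
    print {$index_fh} "$current\t@numbers\n" if defined $current;
    close $sorted_fh;
    close $index_fh or die "Cannot close $index: $!";
    unlink $pairs, $sorted;
    return;
}

# Look up one (already lowercased) n-gram; returns its line numbers.
sub lines_for {
    my ($index, $ngram) = @_;
    open my $fh, '<', $index or die "Cannot read $index: $!";
    look($fh, "$ngram\t");    # leaves $fh at the first line ge the key
    my $hit = <$fh>;
    return () unless defined $hit and index($hit, "$ngram\t") == 0;
    chomp $hit;
    my (undef, $numbers) = split /\t/, $hit;
    return split ' ', $numbers;
}
```

With that in place, build_index('corpus.txt', 2, 'bigrams.idx') would build a bigram index once, and lines_for('bigrams.idx', 'the cat') would return the line numbers for one bigram. Only one n-gram's line numbers are held in memory at a time during the collapse pass, so the 1 GB input never has to fit in RAM.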


In reply to Re: statistics of a large text by tilly
in thread statistics of a large text by perl_lover_always
