in reply to Hash versus substitutation efficiency?
Option 1, especially if you can eliminate some words. A simple ranking of words by frequency will show you which ones to drop (such as "a", "an", "the", etc.). Both methods are going to use a significant amount of memory and processor time to construct the hash and array, however, so this sort of thing should be done as a batch process rather than on every page-by-page run of your script. Some more details on exactly what you're trying to do, and why, would be helpful.
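
To illustrate the frequency ranking mentioned above, here is a minimal sketch in Python (not your script's language necessarily, just a convenient one for the example). The function names `rank_words` and `candidate_stop_words` are made up for this post, and the tokenizing regex is an assumption about what counts as a word in your data:

```python
import re
from collections import Counter

def rank_words(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()

def candidate_stop_words(text, top_n):
    """Return the top_n most frequent words as candidates to eliminate."""
    return [word for word, _count in rank_words(text)[:top_n]]

if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for word, count in rank_words(sample):
        print(count, word)
```

You would still want to look over the top of the list by hand before dropping anything, since a word that is frequent in your text may also be one you care about.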