For previous discussion of this problem, see "statistics of a large text"
and "Reaped: a large text file into hash".
As I pointed out to you in the previous discussions, this is likely to be slow. The next step that I suggested is to parallelize the work with Hadoop. Have you tried that yet?
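
In case it helps you get started, here is a rough sketch of the map and reduce steps in the tab-separated "key, value" line style that Hadoop Streaming uses. I am assuming the statistics you want are per-token counts, which may not match your real problem. The function names map_lines and reduce_records are made up for illustration, and this is written in Python rather than whatever language your current script uses.

```python
from itertools import groupby


def map_lines(lines):
    """Map step: emit one 'token<TAB>1' record per whitespace-separated token.

    Assumption: a "token" is anything separated by whitespace.
    """
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def reduce_records(records):
    """Reduce step: sum the counts for each token.

    The records must already be sorted by token, which is what the
    shuffle/sort phase between map and reduce provides.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for token, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{token}\t{total}"


# Local dry run on a tiny sample: sorted() stands in for the shuffle/sort phase.
sample = ["a b a", "b a"]
for record in reduce_records(sorted(map_lines(sample))):
    print(record)
```

Trying the two steps on a small sample like this, with a plain sort in between, is a cheap way to check the logic before running it on a cluster. No single process ever has to hold the whole hash in memory, because each reducer only sees the records for its own share of the keys.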