Using Word Tokens as Features

by Anonymous Monk
on Apr 11, 2013 at 22:24 UTC ( [id://1028242]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I'm fairly new to Perl and I have a question about how to construct features from word tokens. Essentially, I have messages like "I like you." and I'd like to split each one into its word tokens: "I", "like", and "you". Each distinct word would then be a feature column, and a message's value in that column would be the number of times the word occurs in it. In this case, the message would have 1 1 1. I'd like to do this for a bunch of messages (they would all share the same features, so a lot of the values would be 0). In a different programming language, I would iterate over each message, constructing columns as I go along and counting how many of each word there is. However, with a lot of messages, I'm looking for a faster way to do this. Is there an easy way in Perl, or is there already code out there to do this? Thanks!

Replies are listed 'Best First'.
Re: Using Word Tokens as Features
by bioinformatics (Friar) on Apr 11, 2013 at 22:56 UTC
    A faster way to do this is to store each word as a key in a hash, with the value being a reference to an array (a hash of arrays) that you append a count to as you go along. This saves you the time of building a full 2D matrix up front. If you split each line on a delimiter, you get a list of tokens (strings) that you can easily add to the hash of arrays. The catch is that you have to keep a counter of the messages processed so far, so that each new array is initialized (padded with zeros) to the appropriate length.
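A minimal sketch of the hash-of-arrays idea described above, assuming Perl 5.10 or later (for `//`). The sample messages, the variable names (`%column`, `$seen`) and the choice to lower-case and split on non-word characters are illustrative assumptions, not part of the original reply.

```perl
use strict;
use warnings;

# Hypothetical sample messages; in practice these would come from a file.
my @messages = ('I like you.', 'You like Perl.', 'I like Perl, I do.');

my %column;      # word => array ref holding one count per message
my $seen = 0;    # counter: number of messages processed so far

for my $msg (@messages) {
    # Count the tokens of this message only.
    my %count;
    $count{ lc($_) }++ for grep { length } split /\W+/, $msg;

    # A word seen for the first time gets zeros for all earlier messages.
    for my $word (keys %count) {
        $column{$word} //= [ (0) x $seen ];
    }

    # Every known column gets exactly one new entry for this message.
    push @{ $column{$_} }, $count{$_} // 0 for keys %column;
    $seen++;
}

for my $word (sort keys %column) {
    printf "%-5s %s\n", $word, join ' ', @{ $column{$word} };
}
```

Each array ends up with one entry per message, so reading the same index across all arrays gives the feature vector for that message.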
Re: Using Word Tokens as Features
by igelkott (Priest) on Apr 11, 2013 at 22:56 UTC

    Rather than building up the "columns" (almost certainly a "hash" in Perl) one new word at a time, I'd suggest that you start out with a reasonably complete corpus for the language/subject you wish to cover. This will make it a bit easier to keep counts consistent across messages, and the number of zero columns for a particular message won't depend on the order in which it was processed (otherwise, your first message would have no zero columns).

    As for the word counting task itself, look into split and hash. The basic procedure is rather simple but you'll need to decide how to handle case, hyphenation and maybe even stemming.
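One way the two suggestions above might fit together, as a sketch only: a first pass fixes the vocabulary from the whole corpus, and a second pass counts words per message with a hash. The `tokenize` helper, its regex (lower-casing, keeping internal hyphens and apostrophes, no stemming) and the sample messages are assumptions made for illustration; Perl 5.10 or later is assumed for `//`.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: lower-cases the text and keeps internal
# hyphens and apostrophes inside a word. No stemming is attempted.
sub tokenize {
    my ($text) = @_;
    return lc($text) =~ /[a-z0-9]+(?:['-][a-z0-9]+)*/g;
}

# Hypothetical sample corpus.
my @messages = ('I like you.', 'You like well-known Perl modules.');

# Pass 1: fix the vocabulary (the columns) from the whole corpus.
my %vocab;
for my $msg (@messages) {
    $vocab{$_} = 1 for tokenize($msg);
}
my @columns = sort keys %vocab;

# Pass 2: one count vector per message, always in the same column order.
my @rows;
for my $msg (@messages) {
    my %count;
    $count{$_}++ for tokenize($msg);
    push @rows, [ map { $count{$_} // 0 } @columns ];
}

print join(' ', @columns), "\n";
print join(' ', @$_), "\n" for @rows;
```

Because the columns are fixed before any counting happens, every message gets a vector of the same length regardless of processing order.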

Re: Using Word Tokens as Features
by educated_foo (Vicar) on Apr 11, 2013 at 23:14 UTC
    First normalize your input, e.g. s/\W+/ /g.

    Then count up words for each message using a hash table, and add those together for your whole corpus. From there, you should be able to calculate TF/IDF scores, which sounds like a homework problem.

      ... and perhaps converting upper to lower...
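A rough sketch of the steps mentioned above: normalize with `s/\W+/ /g`, convert to lower case, count words per message in a hash, and accumulate corpus-wide totals. The weighting shown is the plain tf * log(N/df) variant, which is only one of several TF-IDF definitions; the sample messages and variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sample messages.
my @messages = ('I like you.', 'You like Perl.', 'Perl is fun!');

my (@tf, %df);
for my $msg (@messages) {
    # Normalize: lower-case, then collapse runs of non-word characters.
    (my $clean = lc $msg) =~ s/\W+/ /g;

    # Term frequency: word counts for this message.
    my %count;
    $count{$_}++ for split ' ', $clean;
    push @tf, \%count;

    # Document frequency: number of messages that contain the word.
    $df{$_}++ for keys %count;
}

# Assumed weighting: tf * log(N / df), using the natural logarithm.
my $n = @messages;
my @tfidf;
for my $i (0 .. $#tf) {
    for my $word (keys %{ $tf[$i] }) {
        $tfidf[$i]{$word} = $tf[$i]{$word} * log($n / $df{$word});
    }
}

for my $i (0 .. $#tfidf) {
    for my $word (sort keys %{ $tfidf[$i] }) {
        printf "message %d  %-5s %.3f\n", $i, $word, $tfidf[$i]{$word};
    }
}
```

Words that occur in every message score 0 under this weighting, which is the usual reason for preferring it over raw counts.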


Re: Using Word Tokens as Features
by hdb (Monsignor) on Apr 12, 2013 at 06:46 UTC
Re: Using Word Tokens as Features
by Anonymous Monk on Apr 12, 2013 at 03:33 UTC
Re: Using Word Tokens as Features
by Anonymous Monk on Apr 14, 2013 at 07:13 UTC
    Thank you so much for your guys' help!
