G'day cinnamond,

Welcome to the Monastery.

[Aside: I see that it's already been pointed out that you've missed providing us with some details; the guidelines in "How do I post a question effectively?" have some more information about that. Otherwise, a well-presented first post: thank you. Also, although it's not often directly related to the problem at hand, telling us your O/S and Perl version can result in a better answer from us (e.g. we might suggest a better, more recent Perl feature if we know you have a version that supports it).]

Your code for getting the input file is fine. You might consider adding a second question to get an output filename. I put together what I considered a fairly challenging input file (lots of edge/corner cases) and hard-coded the filename.
I hadn't used Lingua::StopWords previously so I read the documentation. To be honest, I found it lacking in a number of respects: you can't add new stopwords; you can't remove existing stopwords that you don't want; you either specify a UTF-8 encoding or take whatever they give you. Take a look at the various language plugins in the Lingua-StopWords distribution if you haven't already done so. I added a _mod_stops() routine (in the code below) to address some of those issues; you can modify/extend that if you have other requirements.

Working with CSV files has many gotchas: how to handle a field containing the separator character; how to quote a field containing a quote character; and so on. Writing your own code for this, unless as an academic exercise, is ill-advised. The Text::CSV module is robust, thoroughly tested, and addresses these issues: I strongly recommend you use it. It runs faster if you also have Text::CSV_XS installed, but that's optional. If you make a mistake like trying to use a string, instead of a single character, as a separator (as you did in your posted code) it will tell you about it. In the script below, I've included code to use Text::CSV: as you can see, it's very straightforward.

Your example code shows using 'en' (English); I don't know if you have requirements for other languages. I hard-coded $lang but created a lookup table for language regexes, %word_re_for. That shows you some options; adapt according to your needs.

I split the I/O parts of the code into two anonymous blocks. This means that filehandles are only open for the time they're needed. Perl automatically closes them at the end of those blocks: no need for close() statements. Perl also does the I/O exception handling for you via the autodie pragma: no need for '... or die "Can't whatever: $!";' all over the place.

I'll also just mention that fc() is preferred over uc() and lc() when canonicalising strings for comparison.
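As a rough illustration of the stopword adjustments mentioned above, here's a minimal sketch. It assumes the stopword list is a hashref whose keys are the stopwords, which is what getStopWords() in Lingua::StopWords returns; a hand-made hashref stands in for it so the example runs without the module. The mod_stops() helper and the word lists are invented for this example and are not the _mod_stops() routine from the script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for the hashref you'd get from
# Lingua::StopWords::getStopWords('en'): keys are stopwords, values are true.
my $stopwords = { map { $_ => 1 } qw(the and of a) };

# Hypothetical helper: add and remove stopwords in place.
sub mod_stops {
    my ($stops, $add, $remove) = @_;
    $stops->{$_} = 1 for @$add;        # add new stopwords
    delete @{$stops}{@$remove};        # drop unwanted ones (hash slice)
    return $stops;
}

mod_stops($stopwords, [qw(etc via)], [qw(a)]);

my @words = qw(the price of a loaf etc);
my @kept  = grep { !$stopwords->{$_} } @words;
print "@kept\n";    # prints "price a loaf"
```

Modifying the hashref in place keeps the rest of the code unchanged: anything that already does a lookup in the stopword hash picks up the additions and removals.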
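On the Text::CSV recommendation, a minimal sketch of reading and writing fields that contain the separator or a quote character. It assumes Text::CSV is installed; the sample data is made up, and an in-memory filehandle is used in place of a real file so the example is self-contained.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 copes with embedded newlines and non-ASCII data;
# auto_diag => 1 makes the module report parse errors itself.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# Made-up data: one field with an embedded comma, one with embedded quotes.
my $data = qq{word,note\n"one, two","she said ""hi"""\n};

open my $in, '<', \$data or die "Can't open in-memory file: $!";
my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;               # each $row is an arrayref of fields
}
close $in;

# combine() quotes and escapes fields as needed when writing.
$csv->combine(@{ $rows[1] }) or die "combine() failed";
print $csv->string, "\n";
```

The parsed fields come back with the quoting removed, and combine() puts it back on output, so none of the escaping has to be done by hand.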
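The scoped-filehandle idea can be sketched as follows. This is an illustration only, not the script itself: the filenames are hypothetical, a temporary directory is used so the example can run anywhere, and the word matching is a bare \w+ with no stopword handling.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;                        # open() failures now throw exceptions
use File::Temp qw(tempdir);

# Hypothetical filenames in a throwaway directory.
my $dir      = tempdir(CLEANUP => 1);
my $in_file  = "$dir/words_in.txt";
my $out_file = "$dir/words_out.txt";

# Create a small input file so the sketch runs on its own.
{
    open my $fh, '>', $in_file;
    print $fh "The cat and the hat\n";
}   # $fh goes out of scope here, so Perl closes the file

my %count;
{
    open my $in_fh, '<', $in_file;
    while (my $line = <$in_fh>) {
        $count{lc $1}++ while $line =~ /(\w+)/g;
    }
}   # input handle closed here

{
    open my $out_fh, '>', $out_file;
    print $out_fh "$_\t$count{$_}\n" for sort keys %count;
}   # output handle closed (and flushed) here
```

Because each handle is a lexical variable declared inside its block, it can't be used by accident after the block ends, and there's no close() to forget.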
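To show why fc() matters for comparisons, here's a small sketch, assuming Perl v5.16 or later. The two strings are my own example: the German sharp s case-folds to "ss", which lc() doesn't handle.

```perl
#!/usr/bin/env perl
use v5.16;          # enables strict, plus the fc and unicode_strings features
use warnings;

my $x = "Stra\x{DF}e";      # "Strasse" spelled with a sharp s (U+00DF)
my $y = "STRASSE";

# lc() leaves the sharp s alone, so the lowercased strings differ.
say "lc: ", (lc($x) eq lc($y) ? "same" : "different");

# fc() applies full Unicode case folding, so both become "strasse".
say "fc: ", (fc($x) eq fc($y) ? "same" : "different");
```

For plain ASCII text lc() and fc() give the same result; the difference only shows up with characters whose case folding isn't a simple one-to-one mapping.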
fc() requires Perl v5.16; not knowing your Perl version, I didn't use it. (Refer back to the "Aside" at the top.)

Here's the code.
Here's the output.
— Ken In reply to Re: Counting word frequency after StopWords removal
by kcott