PerlMonks  

Re: How Could I Remove Stopwords using stopwords.txt?

by BillKSmith (Prior)
on Dec 05, 2020 at 20:19 UTC (#11124727)


in reply to How Could I Remove Stopwords using stopwords.txt?

It is usually bad practice to parse HTML with a regex. It appears, however, that you are extracting pure text (no markup) from one tag everywhere it occurs, so your regex approach is fine as long as that text is always on a single line. I do not see any problem with the way you expand contractions.

Now you ask about removing 'stop words'. I have never heard this term, but I assume that you have a list of words (stored in the file stopwords.txt) which you want to filter out of your text. There are several ways you might do this, but you have not given us enough information to make a recommendation. How long is your quotes.txt file? What exactly is a 'word'? How many words are in stopwords.txt? How is that file organized? One word per line? What have you tried, and how did it fail?
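
To make the questions concrete, here is a minimal sketch of one possible approach, not a recommendation. It assumes that stopwords.txt holds one word per line, that a 'word' is a whitespace-delimited token compared case-insensitively after stripping anything other than letters and apostrophes, and that the list is small enough to hold in a hash. The sub names load_stopwords and remove_stopwords are hypothetical, and the demo reads the list from an in-memory string rather than the real file.

```perl
use strict;
use warnings;

# Build a lookup hash from a filehandle.
# Assumption: one stop word per line, as stopwords.txt might be laid out.
sub load_stopwords {
    my ($fh) = @_;
    my %stop;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;     # trim surrounding whitespace and newline
        next unless length $line;    # skip blank lines
        $stop{ lc $line } = 1;
    }
    return \%stop;
}

# Return $text with every stop word dropped.
# Assumption: a 'word' is a whitespace-delimited token; punctuation is
# ignored for the lookup but preserved on the words that are kept.
sub remove_stopwords {
    my ( $text, $stop ) = @_;
    my @kept;
    for my $token ( split ' ', $text ) {
        ( my $w = lc $token ) =~ s/[^a-z']//g;
        push @kept, $token unless $stop->{$w};
    }
    return join ' ', @kept;
}

# Demo with an in-memory list. With the real file this would be:
#   open my $fh, '<', 'stopwords.txt' or die "Cannot open stopwords.txt: $!";
my $list = "the\na\nof\n";
open my $fh, '<', \$list or die "Cannot open in-memory list: $!";
my $stop = load_stopwords($fh);
close $fh;

print remove_stopwords( 'The quality of mercy is not strained', $stop ), "\n";
# prints: quality mercy is not strained
```

A hash lookup keeps the cost per word constant however long the stop list is. If your definition of a 'word' differs (hyphenated words or non-ASCII letters, for example), the substitution in remove_stopwords is the line that would need to change.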
Bill

