How Could I Remove Stopwords using stopwords.txt?

mynameisbob has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone! I have successfully extracted all the sentences from the HTML pages into Quotes.txt. My next step is to remove all stopwords (listed in stopwords.txt) from Quotes.txt, then count how many times each remaining word occurs and write that list to a text file with the most frequent words at the top. I am confused about how to remove the stopwords.

 my $quotes="Quotes.txt";
open(OUTPUT, ">$quotes")||die("Could not open $quotes!");
my $page_num=1;
while ($page_num<=10){
    my $htmlpages="Page$page_num.html";
    open (INPUT,"$htmlpages")||die("Could not open $htmlpages");
    my $line="";
    while ($line=<INPUT>){
    if($line=~m/<span class="text" itemprop="text">(.+?)<\/span/ig){
        my $quotes=$1;
         $quotes =~ s/I'm/I am/ig;
         $quotes =~ s/(\w+?)'re/$1 are/ig;
         $quotes =~ s/(\w+?)'s/$1 is/ig;
         $quotes =~ s/(\w+?)n't/$1 not/ig;
         $quotes =~ s/it's/it is/ig;
         $quotes =~ s/(\w+?)'ll/$1 will/ig;
         $quotes =~ s/I've/I have/ig;
         $quotes =~ s/won't/will not/ig;
         $quotes =~ s/can't/cannot/ig;
         $quotes =~ s/\&\#34;/'/ig;
         $quotes =~ s/\&\#39;/'/ig;
         $quotes =~ s/let's/let us/ig;
         $quotes =~ s/lady's/lady is/ig;
        print OUTPUT "$quotes\n";
        }
    }
    $page_num=$page_num+1;
    close(INPUT);
}
close(OUTPUT);

my $quotes_0="WordCount.txt";
my $quotes_1="Quotes.txt";
open(QUOTES,">$quotes_0")||die("Could not open $quotes_0");
my $stopwords="stopwords.txt";
open(WORDS,"$stopwords")||die("Could not open $stopwords");
open(OLD,"$quotes_1")||die("Could not open $quotes_1");
my $line1="";
while(my $stop=<WORDS>){
    if($stop=~m/(.+?)/ig){
        my $stopwords=$1;
        if ($line1=<OLD>){
        if($line1=~s/\b($stopwords)\b//ig){
        print QUOTES "$line1\n";
         }
    }
    }
}
 close(WORDS);
 close(OLD);
 close(QUOTES);

Replies are listed 'Best First'.
Re: How Could I Remove Stopwords using stopwords.txt? by AnomalousMonk (Archbishop) on Dec 06, 2020 at 01:59 UTC
Some other recent discussions of removing stopwords: Improving regular expression to remove stopwords, Filtering out stop words. Give a man a fish: `<%-{-{-{-<`
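For readers who want to see the regex route in code, here is a minimal sketch. It assumes stopwords.txt holds one word per line and that all stopwords can be joined into a single alternation. The sub names `build_stopword_regex` and `strip_stopwords` and the sample data are made up for illustration; they do not come from the linked discussions.

```perl
use strict;
use warnings;

# Build one compiled regex that matches any stopword as a whole word.
# Longer words are placed first so the alternation prefers them.
sub build_stopword_regex {
    my @stopwords = @_;
    my $alternation = join '|',
        map { quotemeta($_) }
        sort { length($b) <=> length($a) } @stopwords;
    return qr/\b(?:$alternation)\b/i;
}

# Remove every stopword from a line, then tidy the leftover whitespace.
sub strip_stopwords {
    my ($line, $regex) = @_;
    $line =~ s/$regex//g;    # delete the stopwords
    $line =~ s/\s+/ /g;      # collapse runs of whitespace
    $line =~ s/^ | $//g;     # trim one leading/trailing space
    return $line;
}

# Hypothetical sample data. In the original setting the list would be
# read from stopwords.txt and the lines from Quotes.txt.
my $regex = build_stopword_regex(qw(the is a of));
print strip_stopwords('The world is a book of wonders', $regex), "\n";
# prints "world book wonders"
```

The point of the design is that the stopword file is read once, up front, and the resulting regex is applied to every line of quotes. Reading one quote line per stopword, as in the posted loop, cannot do that.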
Re: How Could I Remove Stopwords using stopwords.txt? by BillKSmith (Monsignor) on Dec 05, 2020 at 20:19 UTC
It is usually bad practice to parse HTML with a regex. It appears that you are extracting pure text (no markup) from one tag everywhere it occurs. Your regex approach is fine as long as that text is always all on one line. I do not see any problem with the way you expand contractions. Now you ask about removing 'stop words'. I have never heard this term, but I assume that you have a list of words (stored in the file stopwords.txt) which you want to filter out of your text. There are several ways you might do this. You have not given us enough information to make a recommendation. How long is your quotes.txt file? What exactly is a 'word'? How many words are in stopwords.txt? How is that file organized? One word per line? What have you tried and how did it fail? Bill
Re^2: How Could I Remove Stopwords using stopwords.txt? by marto (Cardinal) on Dec 05, 2020 at 20:23 UTC
Stop_word
Re: How Could I Remove Stopwords using stopwords.txt? by Anonymous Monk on Dec 05, 2020 at 08:46 UTC
Sample input data? Expected output?
Re: How Could I Remove Stopwords using stopwords.txt? by perlfan (Vicar) on Dec 07, 2020 at 20:45 UTC
The Lingua::EN namespace (and those for other languages) has quite a few modules for stop words. I did a quick search and didn't see one that removes them, but there are modules that necessarily tokenize the words and will let you easily filter them out.
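The tokenize-then-filter idea can also be sketched in core Perl without any module. This is a rough illustration, not a recommendation from any poster above. It assumes a 'word' is a run of ASCII letters (contractions having been expanded already) and that the stopwords are available as a plain list. The sub names `count_words` and `by_frequency` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Count every word in @$lines that is not in the stopword list.
sub count_words {
    my ($lines, $stopwords) = @_;

    # A hash gives a constant-time "is this a stopword?" lookup.
    my %is_stop;
    $is_stop{ lc $_ } = 1 for @$stopwords;

    my %count;
    for my $line (@$lines) {
        # Assumption: a word is a run of ASCII letters, compared in lowercase.
        for my $word (lc($line) =~ /[a-z]+/g) {
            next if $is_stop{$word};
            $count{$word}++;
        }
    }
    return \%count;
}

# Words ordered by descending count; ties are broken alphabetically.
sub by_frequency {
    my ($count) = @_;
    return sort { $count->{$b} <=> $count->{$a} or $a cmp $b } keys %$count;
}

# Hypothetical sample data standing in for stopwords.txt and Quotes.txt.
my @stopwords = qw(the is a of to);
my @quotes    = ('The world is a book', 'A book is a dream');

my $count = count_words(\@quotes, \@stopwords);
print "$_\t$count->{$_}\n" for by_frequency($count);
```

To get the file the original question asks for, the final `print` could go to a filehandle opened on WordCount.txt instead of STDOUT.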

