PerlMonks  

How Could I Remove Stopwords using stopwords.txt?

by mynameisbob (Initiate)
on Dec 05, 2020 at 08:33 UTC

mynameisbob has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone! I have successfully extracted all the sentences from the HTML pages into Quotes.txt. My next step is to remove all stopwords (listed in stopwords.txt) from Quotes.txt, then count how many times each remaining word occurs and write that list to a text file with the most frequent words at the top. I am stuck on how to remove the stopwords.

my $quotes="Quotes.txt";
open(OUTPUT, ">$quotes")||die("Could not open $quotes!");
my $page_num=1;
while ($page_num<=10){
    my $htmlpages="Page$page_num.html";
    open (INPUT,"$htmlpages")||die("Could not open $htmlpages");
    my $line="";
    while ($line=<INPUT>){
        if($line=~m/<span class="text" itemprop="text">(.+?)<\/span/ig){
            my $quotes=$1;
            $quotes =~ s/I'm/I am/ig;
            $quotes =~ s/(\w+?)'re/$1 are/ig;
            $quotes =~ s/(\w+?)'s/$1 is/ig;
            $quotes =~ s/(\w+?)n't/$1 not/ig;
            $quotes =~ s/it's/it is/ig;
            $quotes =~ s/(\w+?)'ll/$1 will/ig;
            $quotes =~ s/I've/I have/ig;
            $quotes =~ s/won't/will not/ig;
            $quotes =~ s/can't/cannot/ig;
            $quotes =~ s/\&\#34;/'/ig;
            $quotes =~ s/\&\#39;/'/ig;
            $quotes =~ s/let's/let us/ig;
            $quotes =~ s/lady's/lady is/ig;
            print OUTPUT "$quotes\n";
        }
    }
    $page_num=$page_num+1;
    close(INPUT);
}
close(OUTPUT);

my $quotes_0="WordCount.txt";
my $quotes_1="Quotes.txt";
open(QUOTES,">$quotes_0")||die("Could not open $quotes_0");
my $stopwords="stopwords.txt";
open(WORDS,"$stopwords")||die("Could not open $stopwords");
open(OLD,"$quotes_1")||die("Could not open $quotes_1");
my $line1="";
while(my $stop=<WORDS>){
    if($stop=~m/(.+?)/ig){
        my $stopwords=$1;
        if ($line1=<OLD>){
            if($line1=~s/\b($stopwords)\b//ig){
                print QUOTES "$line1\n";
            }
        }
    }
}
close(WORDS);
close(OLD);
close(QUOTES);

Replies are listed 'Best First'.
Re: How Could I Remove Stopwords using stopwords.txt?
by AnomalousMonk (Archbishop) on Dec 06, 2020 at 01:59 UTC
Re: How Could I Remove Stopwords using stopwords.txt?
by BillKSmith (Monsignor) on Dec 05, 2020 at 20:19 UTC
    It is usually bad practice to parse HTML with a regex. It appears that you are extracting pure text (no markup) from one tag everywhere it occurs. Your regex approach is fine as long as that text is always on a single line. I do not see any problem with the way you expand contractions. Now you ask about removing 'stop words'. I have never heard this term, but I assume that you have a list of words (stored in the file stopwords.txt) which you want to filter out of your text. There are several ways you might do this, but you have not given us enough information to make a recommendation. How long is your quotes.txt file? What exactly is a 'word'? How many words are in stopwords.txt? How is that file organized? One word per line? What have you tried and how did it fail?
    Bill
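    The single-line caveat can be illustrated with a small sketch. It assumes the whole page has been read into one string (for example with local $/ before reading the handle), so the /s modifier can let the match cross line breaks. The sub name extract_quotes and the sample markup are invented for this illustration; a real HTML parser is still the safer choice for anything beyond this simple case.

```perl
use strict;
use warnings;

# Return every quote found in a chunk of HTML, even when the <span>
# content is broken across several lines. (Hypothetical helper name.)
sub extract_quotes {
    my ($html) = @_;
    my @quotes;
    # /s lets . match newlines; /g in a while loop walks through every match.
    while ($html =~ m{<span class="text" itemprop="text">(.+?)</span}gis) {
        my $quote = $1;
        $quote =~ s/\s+/ /g;    # collapse line breaks and runs of spaces
        push @quotes, $quote;
    }
    return @quotes;
}

# Sample markup invented for this illustration.
my $html = <<'HTML';
<span class="text" itemprop="text">A quote that
wraps onto a second line.</span>
<span class="text" itemprop="text">A short one.</span>
HTML

print "$_\n" for extract_quotes($html);
```

    Reading line by line, as the original loop does, would miss the first quote here because its opening and closing tags sit on different lines.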
Re: How Could I Remove Stopwords using stopwords.txt?
by Anonymous Monk on Dec 05, 2020 at 08:46 UTC
    Sample input data? Expected output?
Re: How Could I Remove Stopwords using stopwords.txt?
by perlfan (Vicar) on Dec 07, 2020 at 20:45 UTC
    The Lingua::EN namespace (and its counterparts for other languages) has quite a few modules for stop words. I did a quick search and didn't see one that removes them, but there are modules that necessarily tokenize the words and will let you easily filter them out.
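
    The tokenize-and-filter idea can also be sketched in plain Perl without any CPAN module. Two things stand out in the posted loop: the pattern (.+?) captures only the first character of each stopword, and only one line of Quotes.txt is read per stopword. The sketch below instead loads all stopwords into a hash first, then scans every line. It assumes stopwords.txt holds one word per line and that a 'word' is a run of \w characters; the sub names and the in-memory sample data are invented for illustration.

```perl
use strict;
use warnings;

# Read one stopword per line into a hash for fast lookup.
sub load_stopwords {
    my ($fh) = @_;
    my %stop;
    while (my $word = <$fh>) {
        $word =~ s/^\s+|\s+$//g;              # trim whitespace and newline
        $stop{lc $word} = 1 if length $word;
    }
    return \%stop;
}

# Count every word in the input that is not a stopword.
# Assumption: a "word" is a run of \w characters, compared case-insensitively.
sub count_words {
    my ($fh, $stop) = @_;
    my %count;
    while (my $line = <$fh>) {
        for my $token ($line =~ /\w+/g) {
            my $word = lc $token;
            $count{$word}++ unless $stop->{$word};
        }
    }
    return \%count;
}

# In-memory sample data so the sketch runs without any files.
my $stop_text  = "the\na\nis\n";
my $quote_text = "The world is a book.\nA book is a dream.\n";
open my $stop_fh,  '<', \$stop_text  or die "Could not open stopword data: $!";
open my $quote_fh, '<', \$quote_text or die "Could not open quote data: $!";

my $stop  = load_stopwords($stop_fh);
my $count = count_words($quote_fh, $stop);

# Most frequent first; ties are broken alphabetically.
for my $word (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    print "$word\t$count->{$word}\n";
}
```

    To use real files, open stopwords.txt and Quotes.txt for reading in place of the in-memory handles, and print to a handle opened on WordCount.txt.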
