
It is usually bad practice to parse HTML with a regex. It appears that you are extracting pure text (no markup) from one tag everywhere it occurs, so your regex approach is fine as long as that text is always on a single line. I do not see any problem with the way you expand contractions.

Now you ask about removing 'stop words'. I have never heard this term, but I assume that you have a list of words (stored in the file stopwords.txt) which you want to filter out of your text. There are several ways you might do this, but you have not given us enough information to make a recommendation. How long is your quotes.txt file? What exactly is a 'word'? How many words are in stopwords.txt? How is that file organized? One word per line? What have you tried and how did it fail?
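
To make the discussion concrete, here is a minimal sketch of one common way to filter words against a list: load the list into a hash and keep only the words that are not keys in it. It rests on guesses, not on anything you have told us: that stopwords.txt holds one word per line, that a 'word' is a run of letters and apostrophes, and that matching should ignore case. The sub names load_stopwords and remove_stopwords are made up for this example.

```perl
use strict;
use warnings;

# Build a lookup hash from a filehandle holding one stop word per line
# (assumed layout). Words are lowercased for case-insensitive matching.
sub load_stopwords {
    my ($fh) = @_;
    my %stop;
    while ( my $line = <$fh> ) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace, including the newline
        next unless length $line;   # skip blank lines
        $stop{ lc $line } = 1;
    }
    return \%stop;
}

# Return the text with every stop word dropped. A 'word' is assumed to be
# a run of letters and apostrophes; punctuation is discarded.
sub remove_stopwords {
    my ( $text, $stop ) = @_;
    my @kept = grep { !$stop->{ lc $_ } } $text =~ /[A-Za-z']+/g;
    return join ' ', @kept;
}

# Demo using an in-memory filehandle in place of stopwords.txt.
my $list = "the\na\nis\n";
open my $fh, '<', \$list or die "Cannot open in-memory list: $!";
my $stop = load_stopwords($fh);
close $fh;

print remove_stopwords( 'The quick fox is a hunter', $stop ), "\n";
```

With a real file you would open stopwords.txt for reading in place of the in-memory string. A hash lookup keeps the cost per word constant, which matters if quotes.txt is long. If your definition of a 'word' differs, the character class in remove_stopwords is the part to change.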

Bill

In reply to Re: How Could I Remove Stopwords using stopwords.txt? by BillKSmith
in thread How Could I Remove Stopwords using stopwords.txt? by mynameisbob
