It is usually bad practice to parse HTML with a regex. It appears that you are extracting pure text (no markup) from one tag everywhere it occurs. Your regex approach is fine as long as that text is always all on one line. I do not see any problem with the way you expand contractions.

Now you ask about removing 'stop words'. I have never heard this term, but I assume that you have a list of words (stored in the file stopwords.txt) which you want to filter out of your text. There are several ways you might do this, but you have not given us enough information to make a recommendation. How long is your quotes.txt file? What exactly is a 'word'? How many words are in stopwords.txt? How is that file organized? One word per line? What have you tried and how did it fail?
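Pending answers to those questions, here is a minimal sketch of one common way to do the filtering: load the stop words into a hash and keep only the words that are not in it. It rests on several assumptions that are not from your post: that stopwords.txt holds one word per line, that matching should be case-insensitive, and that a 'word' is a run of ASCII letters and apostrophes. The sub names (load_stop_words, make_stop_lookup, remove_stop_words) and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Read a stop word list from a file, ASSUMING one word per line.
sub load_stop_words {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @words = <$fh>);
    close $fh;
    return grep { length } @words;    # skip blank lines
}

# Build a lookup hash so each membership test is a single hash access.
sub make_stop_lookup {
    my @words = @_;
    my %stop;
    $stop{ lc($_) } = 1 for @words;
    return \%stop;
}

# Return the text with stop words removed, joined by single spaces.
# ASSUMPTION: a "word" is a run of ASCII letters and apostrophes;
# punctuation and digits are treated as separators and dropped.
sub remove_stop_words {
    my ($text, $stop) = @_;
    my @kept = grep { !$stop->{ lc($_) } } $text =~ /[A-Za-z']+/g;
    return join ' ', @kept;
}

# Hypothetical sample data; with a real file you might instead write
# make_stop_lookup(load_stop_words('stopwords.txt')).
my $stop = make_stop_lookup(qw(the a of is));
print remove_stop_words('The quality of mercy is not strained', $stop), "\n";
# prints "quality mercy not strained"
```

A hash lookup keeps the cost per word constant regardless of how long the stop list is, which matters if quotes.txt is large. If you need to preserve the original punctuation and spacing, a substitution over the text would be a better fit than splitting and rejoining.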