
As I commented over in Building Regex Alternations Dynamically, it's possible to generate a regex from the entirety of /usr/share/dict/words, which on my system currently has over 100,000 entries. The resulting regex is about 1MB long as a string, and matching against it is still relatively performant. So building a regex the way you showed is feasible; whether it's the best solution in your case probably depends on how many matches you'll be doing with that regex, and you'll have to measure the performance in your use case. I would recommend that loadCommonWords return a regex precompiled with qr// instead of a string, and that you sort @commonwords by length, as I showed in the aforementioned thread.
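To make that recommendation concrete, here is a minimal sketch, not the code from either thread. The helper name build_words_regex and the tiny word list are hypothetical stand-ins, and I'm assuming "sort by length" means longest first, so that in an unanchored alternation "then" gets tried before "the".

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real word list, which would be read from a file.
my @commonwords = qw( a an and the then of );

# Hypothetical helper: build one precompiled alternation from a list of words.
sub build_words_regex {
    my @words = @_;
    my $alternation = join '|',
        map  { quotemeta }                              # escape regex metacharacters
        sort { length($b) <=> length($a) or $a cmp $b } # longest first, then alphabetical
        @words;
    return qr/(?:$alternation)/;                        # compiled once, reused for every match
}

my $re = build_words_regex(@commonwords);

my $text = "the cat and then an owl";
(my $filtered = $text) =~ s/\b$re\b\s*//g;              # strip whole-word matches
print "$filtered\n";
```

With the \b anchors in place the ordering no longer changes which words match, but it costs little and keeps the regex safe to reuse without them.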

Update: Eily is right, I overlooked the anchors: for exact string matches, definitely use a hash instead.
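For the exact-match case, a hash lookup could look roughly like this. Again a sketch only: the word list and the names %is_common and @kept are made up for illustration, and lowercasing each word before the lookup is my assumption about what the filter should do.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real word list.
my @commonwords = qw( a an and the then of );

# Build the lookup table once; each check afterwards is a single hash access.
my %is_common = map { $_ => 1 } @commonwords;

my @words = split ' ', "The cat and then an owl";
my @kept  = grep { !$is_common{ lc $_ } } @words;       # case-insensitive exact match
print "@kept\n";
```

No regex is involved here at all, so there is nothing to escape and nothing to sort.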

In reply to Re: Filtering out stop words (updated) by haukex
in thread Filtering out stop words by IB2017


