
I guess many of us have asked "silly" questions (I certainly have, and still do). The documentation is often intimidatingly vast. The upshot here is that regex character classes, anchors and the like may well overlap. You unwittingly used a class that was more inclusive than you assumed ("word characters": well, who can blame you? I have ignored the 'numeric' part in 'alphanumeric' myself more than once because of the 'word'), and were therefore also unaware that \d is not complementary to \w, but a subclass of it. (This is simply not what we intuitively expect.) And, as tybalt has already demonstrated, negated character classes ([^ ... whatever ...]) are often massively helpful, especially if you follow Athanasius' suggestion and define the character classes yourself (which makes your own code more transparent to you).
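To make the subclass relationship concrete, here is a minimal sketch. The sample string and variable names are made up for illustration, and the negated class [^\W\d] ("a word character that is not a digit") is just one possible way of writing such a class yourself, not necessarily the one tybalt or Athanasius used:

```perl
use strict;
use warnings;

# Every ASCII digit matches \d AND \w, so \d is a subset of \w.
my @digits_also_word = grep { /^\d$/ && /^\w$/ } 0 .. 9;
print scalar(@digits_also_word), " of 10 digits also match \\w\n";

# A negated class: "not a non-word character and not a digit",
# i.e. letters and underscore only.
my $sample = 'abc a1b 42 x_y';               # hypothetical test string
my @no_digit_words = $sample =~ /\b[^\W\d]+\b/g;
print "@no_digit_words\n";                   # abc x_y
```

Here the \b anchors do real work: without them, the "a" and "b" of "a1b" would be picked up as separate matches.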

Guessing from how you've written the regex, it seems to me that the following advice may also be helpful:

Perl does not "understand what you mean", but, like any other programming language, slavishly follows the rules you have given it, and, since rules are rules, does not need to be told things twice. With that in mind, you learn a lot if you try to design regexes (like other code) as "thinly" as possible:

my @names = $text =~ /\w*\d\w*/g;

does exactly the same as your original regex: matching all "words" (in the above definition) which contain at least one digit, somewhere. (Not what you wanted, I know, but it's still instructive.) Why?

Although the \b anchors do match what you mean them to match, they are redundant here: their meaning is "match a \w\W or \W\w boundary" (perldoc perlre). But your \w* already matches everything that falls under the definition of \w (* is "greedy", as you may know), and it necessarily does so until it hits a character that does not, which is precisely the definition of \W. In other words, until it hits "a \w\W boundary". (As \d is a subclass of \w, it will never match anything that matches \W either; in other words, the match will stop at a \d\W border if no further \w follows the digit.) The same holds at the front: the engine tries the leftmost possible starting point, which is the beginning of the word, and with /g each new attempt resumes where the previous match ended.
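A quick way to check this claim is to run both variants on the same input. This is only a sketch: the original regex is not quoted in this post, so the \b-anchored form below is an assumption about what it looked like, and the sample text is invented.

```perl
use strict;
use warnings;

my $text = 'foo a1 b2c 123 x_9, bar-7z end';   # hypothetical sample input

# Assumed shape of the original regex, with the \b anchors.
my @with_b    = $text =~ /\b\w*\d\w*\b/g;

# The "thin" version without them.
my @without_b = $text =~ /\w*\d\w*/g;

print "with \\b:    @with_b\n";
print "without \\b: @without_b\n";
print "identical\n" if "@with_b" eq "@without_b";
# Both lists: a1 b2c 123 x_9 7z
```

One sample string is of course not a proof, but it is an easy sanity test before dropping the anchors from your own code.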

(Or am I somehow mistaken? Why has everybody else kept the \b?)

In reply to Re^2: regex search for words with one digit by Bruder Savigny
in thread regex search for words with one digit by Anonymous Monk
