http://qs321.pair.com?node_id=11122010


in reply to regex search for words with one digit

Thank you, and forgive me my rather silly question.

Re^2: regex search for words with one digit
by Bruder Savigny (Initiate) on Sep 21, 2020 at 20:40 UTC

    I guess many of us have asked "silly" questions (I certainly have, and still do). The documentation is often intimidatingly vast. The upshot here is that regex character classes, anchors and the like may well overlap. You unwittingly used a class that was more inclusive than you assumed ("word characters": well, who can blame you? I have ignored the 'numeric' part in 'alphanumeric' myself more than once because of the 'word'), and were therefore also unaware that \d is not complementary to \w, but a subclass of it. (This is simply not what we intuitively expect.) And, as tybalt has already demonstrated, negated character classes ([^ ... whatever ...]) are often massively helpful, especially if you follow Athanasius' suggestion and define the character classes yourself (which makes your own code more transparent to you).
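
    A minimal sketch of the subclass relationship, using a few sample characters picked purely for illustration (they are not from the original question). It records which of \w, \d and the negated class [^\W\d] each character matches:

```perl
use strict;
use warnings;

# Hypothetical sample characters, chosen only for illustration.
my @chars = ('a', '5', '_', '-');

my %class_of;
for my $c (@chars) {
    my @hits;
    push @hits, '\w'      if $c =~ /\w/;        # word character
    push @hits, '\d'      if $c =~ /\d/;        # digit
    push @hits, '[^\W\d]' if $c =~ /[^\W\d]/;   # word character that is not a digit
    $class_of{$c} = join ' ', @hits;
    printf "%-3s matches: %s\n", "'$c'", $class_of{$c} || '(none)';
}
```

    The digit matches both \w and \d, the letter and the underscore match \w and [^\W\d], and the hyphen matches none of the three.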

    Guessing from how you've written the regex, it seems to me that the following advice may also be helpful:

    Perl does not "understand what you mean", but, like any other programming language, slavishly follows the rules you have given it, and, since rules are rules, does not need to be told things twice. With that in mind, you learn a lot if you try to design regexes (like other code) as "thinly" as possible:

    my @names = $text =~ /\w*\d\w*/g;

    does exactly the same as your original regex: matching all "words" (in the above definition) which contain at least one digit, somewhere. (Not what you wanted, I know, but it's still instructive.) Why?

    Although the \b anchors do match what you mean them to match, they are redundant here: their meaning is "Match a \w\W or \W\w boundary" (perldoc perlre). But your \w* already matches everything that falls under the definition of \w (* is "greedy", as you may know), and it necessarily keeps going until it hits a character that does not, which is precisely the definition of \W. In other words, until it hits "a \w\W boundary". (As \d is a subclass of \w, it will never match anything that matches \W either; in other words, it will stop at a \d\W border if no other \w character comes in between.)
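
    One way to check this claim is to run both patterns over the same string and compare the results. The sketch below assumes the sample text used elsewhere in this thread; the variable names are only illustrative:

```perl
use strict;
use warnings;

# Sample text as used elsewhere in this thread.
my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# "Word containing at least one digit", with and without \b anchors.
my @with_b    = $text =~ /\b\w*\d\w*\b/g;
my @without_b = $text =~ /\w*\d\w*/g;

print "with \\b:    @with_b\n";
print "without \\b: @without_b\n";
```

    Both lists come out as P5ete Richard58 Nic4k Le7on5, so for this particular pattern the anchors change nothing.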

    (Or am I somehow mistaken? Why has everybody else kept the \b?)

      Why has everybody else kept the \b?

      I can't answer for others, but for me, throwing in boundary assertions like this is a reflexive, defensive (and possibly cargo-cultish) move I tend to use when I'm dealing with "words".

      The string the Anonymous Monk gives as an example is fairly straightforward: it's delimited by whitespace and the beginning and end of the string. As you say, \b will not help here (update: No! See tybalt89's reply), though it does no harm.

      Unfortunately, "words" be tricky. Is "word's" one word or two? If it's supposed to be one word, then Anonymous Monk's /\b\w*\d\w*\b/ or /\w*\d\w*/ or, I think, any of the other solutions I've seen so far will fail to match it with or without \bs. Words like "t'other", "wouldn't've", "words'" or "left-handed" can be difficult to deal with. I'm sure one could give many other examples, and that's just in English!
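
      A small sketch of how such words get split, assuming plain ASCII input (the O'Ne1l token is made up for illustration). Apostrophes and hyphens are not \w characters, so each one ends a \w run:

```perl
use strict;
use warnings;

# Hypothetical sample words; for ASCII input \w covers letters, digits
# and "_" only, so ' and - split a "word" into several \w+ runs.
my $text   = "word's wouldn't've left-handed";
my @pieces = $text =~ /\w+/g;
print join('|', @pieces), "\n";

# The same effect with the digit pattern: only part of the token matches.
my ($part) = "O'Ne1l" =~ /\b\w*\d\w*\b/;
print "$part\n";
```

      This prints word|s|wouldn|t|ve|left|handed and then Ne1l, with or without the \b anchors in the second pattern.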

      In general, I think (?<! \S) and (?! \S) would serve better than \b as word boundaries (update: in the OPed case). But once again, it is unfortunately true that there are few generalities in human language.
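
      To make the difference concrete, here is a hedged sketch comparing \b with the whitespace lookarounds, written without the spaces (so no /x flag is needed). It uses the "exactly one digit" pattern from elsewhere in this thread; the input string, including the hyphenated R2-D2 token, is an assumption added for illustration:

```perl
use strict;
use warnings;

# Hypothetical input: sample names from the thread plus one hyphenated token.
my $text = "John P5ete R2-D2 Richard58 Nic4k";

# "Exactly one digit" pattern delimited by \b ...
my @by_b = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;

# ... and the same pattern delimited by "not preceded / not followed
# by a non-whitespace character".
my @by_space = $text =~ /(?<!\S)[^\W\d]*\d[^\W\d]*(?!\S)/g;

print "with \\b:           @by_b\n";
print "with (?<!\\S)/(?!\\S): @by_space\n";
```

      With \b the hyphen counts as a boundary, so R2 and D2 are returned as separate matches (P5ete R2 D2 Nic4k); with the lookarounds the whole whitespace-delimited token is judged at once and only P5ete Nic4k remain. Which behaviour is wanted depends on what counts as a "word" in the data at hand.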


      Give a man a fish:  <%-{-{-{-<

        Once you add in the restriction of "only one digit", the \b is required.
        My code

        my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
        my @names = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;
        print "@names\n";

        outputs

        P5ete Nic4k

        but without the \b's

        my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
        my @names = $text =~ /[^\W\d]*\d[^\W\d]*/g;
        print "@names\n";

        it outputs

        P5ete Richard5 8 Nic4k Le7on 5

        It's pulling patterns out of the middle of "words".