regex search for words with one digit

by Anonymous Monk
on Sep 21, 2020 at 15:28 UTC ( #11122003=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a string with names in it, and some of these names have digits in them. I want to find all the names with exactly 1 digit, e.g. Bill3 but not Bill33, so I tried to use a regex for this. Here is my code:

use strict;
use warnings;

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
my @names = $text =~ /\b\w*\d\w*\b/g;
print "@names\n";

It outputs: P5ete Richard58 Nic4k Le7on5
This should be: P5ete Nic4k
Maybe someone can tell me why this is? Ty.

Replies are listed 'Best First'.
Re: regex search for words with one digit
by Athanasius (Bishop) on Sep 21, 2020 at 16:03 UTC

    The character class \w matches an alphanumeric character, so it matches a digit as well as a letter (or underscore). You need a character class which excludes digits. But \D includes anything not a digit, so it matches whitespace. A negated character class [^\d\s] will match a character that is neither a digit nor a space:

    my @names = $text =~ /\b[^\d\s]*\d[^\d\s]*\b/g;

    Or, more simply, specify the letters you want to match explicitly (note the /i modifier to make the regex case-insensitive):

    my @names = $text =~ /\b[A-Z]*\d[A-Z]*\b/gi;

    See the section “Character Classes and other Special Escapes” in perlre#Regular-Expressions.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      /\b[^\d\s]*\d[^\d\s]*\b/
      /\b[A-Z]*\d[A-Z]*\b/i

      Note that these match the name '1'.

      Update: Note also that [^\d\s] matches stuff like % & = - / .
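The two caveats above can be checked with a small sketch. The `matches` helper and the sample string (with the extra tokens `1` and `Ab5%z`) are made up for illustration and are not from the original posts.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re in $text (list context).
sub matches {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

# Made-up sample: a bare digit and a token containing symbols.
my $text = "John P5ete 1 Richard58 Ab5%z";

# Negated class: anything that is neither a digit nor whitespace.
my @negated = matches(qr/\b[^\d\s]*\d[^\d\s]*\b/, $text);

# Explicit letters, case-insensitive.
my @explicit = matches(qr/\b[A-Z]*\d[A-Z]*\b/i, $text);

print "negated:  @negated\n";
print "explicit: @explicit\n";
```

Both patterns accept the bare `1`. The negated class additionally swallows the symbols in `Ab5%z`, while the explicit class stops at the `%` and returns only the `Ab5` part of that token.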


      Give a man a fish:  <%-{-{-{-<

        Hello AnomalousMonk,

        You make excellent points. Having read over this thread, I think I would now approach this task in a more long-winded — but hopefully safer — way by addressing the requirements separately:

        use strict;
        use warnings;

        my $text = "John P5ete 1 Andrew Richard58 Nic4k Le7on5 Ab5%&=-/zz.";
        my @words = split /\s+/, $text;
        my @names;

        for my $word (@words) {
            my @chars   = $word =~ /[A-Z]/gi;
            my @digits  = $word =~ /\d/g;
            my @symbols = $word =~ /\W/g;
            push @names, $word if @chars && @digits == 1 && !@symbols;
        }

        print "@names\n";

        Output:

        19:27 >perl 2057_SoPW.pl
        P5ete Nic4k
        19:27 >

        This may or may not be exactly what the OP intended, but breaking down the code into separate parts like this at least makes it easier to tweak as and when the requirements are clarified.

        To the OP:

        • \W matches any non-word character; but, as the original string was split on whitespace, there are no whitespace characters in any $word and so within the for loop \W matches the sort of non-alphanumeric symbols identified by AnomalousMonk.
        • if @chars is Perlish shorthand for if scalar(@chars) != 0; similarly, if ... !@symbols is a shorter way of saying if ... scalar(@symbols) == 0.
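As a sketch of the two points above, the three checks can be wrapped in a predicate. The name `is_one_digit_name` and the test words are hypothetical, chosen only to show how the arrays behave in boolean and numeric context.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word has at least one ASCII letter,
# exactly one digit, and no non-word characters.
sub is_one_digit_name {
    my ($word) = @_;
    my @chars   = $word =~ /[A-Z]/gi;   # every letter
    my @digits  = $word =~ /\d/g;       # every digit
    my @symbols = $word =~ /\W/g;       # every non-word character

    # An array in boolean or numeric context yields its element count.
    return ( @chars && @digits == 1 && !@symbols ) ? 1 : 0;
}

for my $word (qw( P5ete 1 Richard58 Ab5%z )) {
    printf "%-10s %s\n", $word, is_one_digit_name($word) ? "keep" : "skip";
}
```

Note that the underscore is a word character, so a token such as `A_1` would pass the `\W` check; whether that is wanted depends on the requirements.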

        Cheers,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: regex search for words with one digit
by haj (Curate) on Sep 21, 2020 at 16:08 UTC

    Digits are, in Perl regular expressions, word characters.

    If you want to exclude digits, you can use character classes: Either those defined by POSIX (only if you don't have Unicode characters), or using Unicode properties in a recent Perl.

    Here's a Unicode-aware example:

    use strict;
    use warnings;

    my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
    my @names = $text =~ /\b\p{Alphabetic}*\d\p{Alphabetic}*\b/g;
    print "@names\n";
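To illustrate the difference between the two options, here is a rough comparison on a made-up string containing a non-ASCII letter. It assumes Perl 5.14 or later, where the `/a` modifier restricts `\w`, `\d` and the POSIX classes to ASCII; the sub names are hypothetical.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Made-up sample: "\x{141}" is the letter L with stroke (U+0141).
my $text = "P5ete \x{141}ukasz7 Richard58";

# Unicode-aware: \p{Alphabetic} covers letters outside ASCII.
sub unicode_names {
    my ($t) = @_;
    return $t =~ /\b\p{Alphabetic}*\d\p{Alphabetic}*\b/g;
}

# ASCII-only: with /a, [[:alpha:]] and \b ignore non-ASCII letters.
sub ascii_names {
    my ($t) = @_;
    return $t =~ /\b[[:alpha:]]*\d[[:alpha:]]*\b/ag;
}

print "unicode: @{[ unicode_names($text) ]}\n";
print "ascii:   @{[ ascii_names($text) ]}\n";
```

The ASCII-only version does not simply skip the second name: it treats the first letter as a non-word character and returns the fragment `ukasz7`, which is the kind of surprise the Unicode property avoids.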
      /\b\p{Alphabetic}*\d\p{Alphabetic}*\b/

      Note that this matches the name '1'.


      Give a man a fish:  <%-{-{-{-<

Re: regex search for words with one digit
by tybalt89 (Prior) on Sep 21, 2020 at 17:22 UTC

    The exclusion trick: \w and [^\W] match exactly the same thing, so to match a \w but not a \d, just use [^\W\d]

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11122003
    use warnings;

    my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";
    my @names = $text =~ /\b[^\W\d]*\d[^\W\d]*\b/g;
    print "@names\n";

    Outputs:

    P5ete Nic4k
      /\b[^\W\d]*\d[^\W\d]*\b/

      Note that this matches the name '1'.


      Give a man a fish:  <%-{-{-{-<

        Rereading the spec, the name '1' is valid. :)

        If the spec is changed, I'll change my regex, but it will cost extra...
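Should the spec change so that a bare `1` is no longer wanted, one possible variation (a sketch, not tybalt89's regex) keeps the `[^\W\d]` trick and requires at least one non-digit word character on one side of the digit. The sub name and the extra sample tokens are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant: exactly one digit plus at least one non-digit
# word character, either before the digit or after it.
sub names_with_one_digit {
    my ($text) = @_;
    return $text =~ /\b(?:[^\W\d]+\d[^\W\d]*|\d[^\W\d]+)\b/g;
}

my $text = "John P5ete 1 Andrew Richard58 Nic4k Le7on5 7up Bill3";
my @names = names_with_one_digit($text);
print "@names\n";
```

Because `[^\W\d]` still includes the underscore, a token like `_1` would be accepted; use an explicit letter class instead if that matters.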

Re: regex search for words with one digit
by Anonymous Monk on Sep 21, 2020 at 16:45 UTC

    Thank you, and forgive me my rather silly question.

      I guess many of us have asked "silly" questions (I certainly have, and still do). The documentation is often intimidatingly vast. The upshot here is that regex character classes, anchors and the like may well overlap, and you have unwittingly not only used a class that was more inclusive than you assumed ("word characters" -- well, who can blame you? I have ignored the 'numeric' part in 'alphanumeric' myself more than once because of the 'word'.), but were therefore also unaware that \d is not complementary to \w, but a subclass of it. (This is simply not what we intuitively expect.) And, as tybalt89 has already demonstrated, negated character classes ([^ ... whatever ...]) are often massively helpful, especially if you follow Athanasius' suggestion and define the character classes yourself (which makes your own code more transparent to you).

      Guessing from how you've written the regex, it seems to me that the following advice may also be helpful:

      Perl does not "understand what you mean", but, like any other programming language, slavishly follows the rules you have given it, and, since rules are rules, does not need to be told things twice. With that in mind, you learn a lot if you try to design regexes (like other code) as "thinly" as possible:

      my @names = $text =~ /\w*\d\w*/g;

      does exactly the same as your original regex: matching all "words" (in the above definition) which contain at least one digit, somewhere. (Not what you wanted, I know, but it's still instructive.) Why?

      Although the \b anchors do match what you mean them to match, they are redundant: their meaning is "Match a \w\W or \W\w boundary" (man perlre). But your \w* already matches everything that falls under the definition of \w (* is "greedy", as you may know), and it necessarily does so until it hits a character that does not, which is precisely the definition of \W. In other words, until it hits "a \w\W boundary". (As \d is a subclass of \w, it will never match anything that matches \W either; in other words, it will stop at a \d\W border, if there isn't a \w in between.)

      (Or am I somehow mistaken? Why has everybody else kept the \b?)

        Why has everybody else kept the \b?

        I can't answer for others, but for me, throwing in boundary assertions like this is a reflexive, defensive (and possibly cargo-cultish) move I tend to use when I'm dealing with "words".

        The string the Anonymous Monk gives as an example is fairly straightforward: it's delimited by whitespace and the beginning and end of the string. As you say, \b will not help here (update: No! See tybalt89's reply), though it does no harm.
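One way to see where \b does and does not matter is to run the patterns with and without it. This is only a sketch; the `matches` helper is hypothetical, and the sample string is the one from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re in $text (list context).
sub matches {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

my $text = "John P5ete Andrew Richard58 Nic4k Le7on5";

# With \w on both sides, the anchors change nothing.
print "w,  with \\b:    @{[ matches(qr/\b\w*\d\w*\b/, $text) ]}\n";
print "w,  without \\b: @{[ matches(qr/\w*\d\w*/,     $text) ]}\n";

# Once digits are excluded from the class, \b is what rejects
# partial matches inside a word that has more than one digit.
print "nd, with \\b:    @{[ matches(qr/\b[^\W\d]*\d[^\W\d]*\b/, $text) ]}\n";
print "nd, without \\b: @{[ matches(qr/[^\W\d]*\d[^\W\d]*/,     $text) ]}\n";
```

Without the anchors, the digit-excluding pattern splits `Richard58` into `Richard5` and `8`, and `Le7on5` into `Le7on` and `5`, so the \b is redundant only for the original \w version.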

        Unfortunately, "words" be tricky. Is "word's" one word or two? If it's supposed to be one word, then Anonymous Monk's /\b\w*\d\w*\b/ or /\w*\d\w*/ or, I think, any of the other solutions I've seen so far will fail to match it with or without \bs. Words like "t'other", "wouldn't've", "words'" or "left-handed" can be difficult to deal with. I'm sure one could give many other examples, and that's just in English!

        In general, I think (?<! \S) and (?! \S) would serve better than \b as word boundaries (update: in the OPed case). But once again, it is unfortunately true that there are few generalities in human language.
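A minimal sketch of that idea, combining the whitespace lookarounds with the `[^\W\d]` class used elsewhere in this thread. The `matches` helper and the extra tokens `Ab5%z` and `x-4y` are made up for illustration; the `/x` modifier allows the spaces inside the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return every match of $re in $text (list context).
sub matches {
    my ($re, $text) = @_;
    return $text =~ /$re/g;
}

my $text = "John P5ete Richard58 Nic4k Le7on5 Ab5%z x-4y";

# \b sees a boundary next to any non-word character, such as % or -.
my @by_b = matches(qr/\b[^\W\d]*\d[^\W\d]*\b/, $text);

# The lookarounds only accept whitespace or the string edges as delimiters.
my @by_space = matches(qr/(?<! \S) [^\W\d]* \d [^\W\d]* (?! \S)/x, $text);

print "with \\b:         @by_b\n";
print "with lookarounds: @by_space\n";
```

Here the \b version also returns the fragments `Ab5` and `4y`, whereas the lookaround version only returns tokens that are delimited by whitespace on both sides.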


        Give a man a fish:  <%-{-{-{-<
