
Regex (counting) confusion :(

by snax (Hermit)
on Sep 18, 2003 at 13:53 UTC ( [id://292389]=perlquestion )

snax has asked for the wisdom of the Perl Monks concerning the following question:

So I'm working on a quick script that requires me to do some counting. The long and short of it is that I have some strings (generated elsewhere in the script) and I need to know how many newlines (for example) they contain. The canonical response (I've discovered) is to use tr//:
$count = ($string =~ tr/\n/\n/);
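A minimal sketch of the tr counting idiom above, assuming a sample string invented for illustration. When the replacement list is empty and no /d modifier is given, tr leaves the string untouched and only returns the count, so tr/\n// behaves the same as tr/\n/\n/ here.

```perl
use strict;
use warnings;

# Sample string invented for illustration; it contains three newlines.
my $string = "one\ntwo\nthree\n";

# With an empty replacement list and no /d modifier, tr/// changes nothing
# and returns how many characters matched the search list.
my $count = ($string =~ tr/\n//);

print "$count\n";
```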
Well, because there were other things I wanted to count I ended up using a similar construction with s///g (did you know that \s matches \n for m// and s/// but not for tr///? I didn't) which is where I get confused. What do you expect the output from these to be:
# Greedy
$string = q(xxxx);
$count = ($string =~ s/x*/#/g);
print qq($string $count), $/;
# Not greedy
$string = q(xxxx);
$count = ($string =~ s/x*?/#/g);
print qq($string $count), $/;
I expected # 1 and #### 4. I got ## 2 and ######### 9. Excuse me? How's that again? 2 and 9? Color me confused.

Can anyone shed some light on this and maybe help me figure out how I can see what perl's doing to arrive at these numbers?

Note: I did figure out that my confusion stems from using * rather than + as modifier. Using + does give 1 and 4 (greedy vs. not-greedy) above. Whoops :) Even so, I'm still confused about the 2 and 9 from using star....
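The four cases mentioned above (star and plus, greedy and not greedy) can be compared side by side. This is only a sketch: sub_count is a hypothetical helper introduced for illustration, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: run s///g on a copy of the string and
# return the modified string together with the substitution count.
sub sub_count {
    my ($string, $regex) = @_;
    my $count = ($string =~ s/$regex/#/g);
    return ($string, $count || 0);   # s///g returns false when nothing matched
}

my @star_greedy = sub_count('xxxx', qr/x*/);
my @star_lazy   = sub_count('xxxx', qr/x*?/);
my @plus_greedy = sub_count('xxxx', qr/x+/);
my @plus_lazy   = sub_count('xxxx', qr/x+?/);

print "x*  : @star_greedy\n";
print "x*? : @star_lazy\n";
print "x+  : @plus_greedy\n";
print "x+? : @plus_lazy\n";
```

Because + requires at least one character, it can never match the empty string, which is why the counts for + come out as 1 and 4.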

Replies are listed 'Best First'.
Re: Regex (counting) confusion :(
by thinker (Parson) on Sep 18, 2003 at 14:05 UTC

    Hi snax

    I suspect the 9 is coming from substituting the four letters, plus the beginning, the end, and the 3 boundaries between the letters, with a #.

    thinker

Re: Regex (counting) confusion :(
by fletcher_the_dog (Friar) on Sep 18, 2003 at 14:25 UTC
    The 2 comes from the fact that "x*" can match nothing. For "xxxx", "x*" matches the null width "beginning" then matches all the "x"s to the end. The 9 is because of the non-greediness, so you get a match at the null-width beginning, at each character, between each character, and the end.
    UPDATE: I ran your original code using "use re 'debug'" and it looks like when you get 2 it is actually matching the "xxxx" and then the end not the beginning.
      For "xxxx", "x*" matches the null width "beginning" then matches all the "x"s to the end.

      Well, that would mean the first result would be 1, not 2. What happens is that /x*/ first matches all the x-es, starting at the beginning of the string, but not the zero-length string at the end. After the first substitution, /g kicks in and the regex tries again from where the previous match ended. All that's left is the zero-length string at the end; this is now matched (where it wasn't before), and hence we get a second substitution, giving a count of 2 and a final string of "##".

      Frankly, I find this behaviour unexpected and unwanted. I'd call it a bug, but I bet someone once had a use for this, and now that's the way it goes.

      Abigail
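One way to see where each match lands, as a lighter alternative to "use re 'debug'", is to loop over m//g and record the match offsets. This is a sketch only: match_spans is a hypothetical helper, and it uses m//g rather than s///g on the assumption that both apply the same rule for zero-length matches.

```perl
use strict;
use warnings;

# Hypothetical helper: return "start-end" offsets for every /g match.
sub match_spans {
    my ($string, $regex) = @_;
    my @spans;
    while ($string =~ /$regex/g) {
        # $-[0] and $+[0] hold the start and end offsets of the last match.
        push @spans, "$-[0]-$+[0]";
    }
    return @spans;
}

my @greedy = match_spans('xxxx', qr/x*/);
my @lazy   = match_spans('xxxx', qr/x*?/);

print "greedy: @greedy\n";
print "lazy:   @lazy\n";
```

A span such as 4-4 is a zero-length match at the end of the string. The greedy pattern yields two spans and the non-greedy one yields nine, which lines up with the 2 and 9 reported in the question.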

        That's what I was looking for. Danke schoen!
Re: Regex (counting) confusion :(
by tcf22 (Priest) on Sep 18, 2003 at 14:31 UTC
    Take a look at what this outputs
    # Not greedy
    pos() = 0;
    $string = q(xxxx);
    $count = ($string =~ s/[^x]*?/#/g);
    print qq($string $count), $/;
    __OUTPUT__
    #x#x#x#x# 5
    It matches the boundaries between the letters, plus the beginning and the end. In your version those positions are matched as well, because you are searching for 0 or more x's.

    - Tom

Re: Regex (counting) confusion :( (link)
by tye (Sage) on Sep 18, 2003 at 15:21 UTC
Re: Regex (counting) confusion :(
by Abigail-II (Bishop) on Sep 18, 2003 at 14:33 UTC
    (did you know that \s matches \n for m// and s/// but not for tr///? I didn't)

    Yes, I did. tr/// is not a regexp. \s isn't a newline in qq//, q//, qx// or qw// either.

    Abigail
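Since tr/// takes a list of characters rather than a pattern, counting whitespace with it means spelling the characters out. A minimal sketch, assuming a sample string invented for illustration:

```perl
use strict;
use warnings;

# Sample text invented for illustration: one space, one tab, one newline.
my $text = "a b\tc\nd";

# tr/// works on a list of characters, so each whitespace character
# of interest is listed explicitly.
my $by_tr = ($text =~ tr/ \t\n\r\f//);

# A regex can use the \s character class instead; assigning through
# an empty list counts the matches.
my $by_regex = () = $text =~ /\s/g;

print "$by_tr $by_regex\n";
```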

Re: Regex (counting) confusion :(
by Anonymous Monk on Sep 18, 2003 at 14:24 UTC
    # Greedy
    $string = q(xxxx);
    $count = ($string =~ s/x*/#/g);
    print qq($string $count), $/;
    If you use "$count = ($string =~ s/x*/#/);" instead of "$count = ($string =~ s/x*/#/g);" you will get what you want :)
      greedy:
      "*" matches "xxxx" and "" (count 2)
      but "+" only matches "xxxx" (count 1)

      not greedy:
      "*" matches "" "x" "" "x" "" "x" "" "x" "" (count 9)
      but "+" only matches "x" "x" "x" "x" (count 4)

      Don't ask me why:(

Node Type: perlquestion [id://292389]
Approved by cchampion
Front-paged by broquaint