PerlMonks  

Re^2: RegExp substitution

by Keystone (Initiate)
on Apr 10, 2014 at 20:17 UTC ( [id://1081863]=note )


in reply to Re: RegExp substitution
in thread RegExp substitution

Thank you for your reply. I'm sorry to say I think I am still a bit too inexperienced in Perl to follow the code in your reply fully, but I have had a go and tried to answer your questions as fully as I can:

What is my RegExp trying to match? I am trying to match the number words in the string $_ and substitute them with the correct values, which I ask the user to type in. Originally I tried to match and substitute in each 'if' decision after the user input; however, doing it this way I could not see a way to match any word other than the first one available without using a string literal.

i.e. with $_ = "three, four" I could not see a way to match 'four' without using the literal, whereas, as I understood it, the power of a RegExp comes from its being able to find something in a string without a literal constant.

In essence I suppose what I am trying to do is (pseudo-code):

1st substitute/([A-Z][a-z][\W][\b])/<userinput>/;
then 2nd substitute/(NOT THIS ONE[A-Z][a-z][\W][\b])(THIS ONE[A-Z][a-z][\W][\b])/<userinput>/;
then 3rd substitute/(NOT THIS ONE[A-Z][a-z][\W][\b])(AND NOT THIS ONE[A-Z][a-z][\W][\b])(BUT THIS ONE[A-Z][a-z][\W][\b])/<userinput>/;

Does that make any kind of sense?

Your first RegEx ([A-Z][a-z].\b) matches nothing because no words in the string are 2 characters in size. Adding a plus to the lower-case set, [a-z]+, would match the first word, but as I understand it . is capable of matching nothing as well as anything; therefore I believe it would match nothing, and the next character in the match would be a comma when the match is actually looking for a break.

Why does the second regex match something, but only once when you want it to match several times? I'm unsure about this part, so I can't answer this question easily. In (?: , | \b)), ? allows the preceding character to be optional (but there is no preceding character?) and I can't see the use of a colon in this context. I understand, however, that the comma is a literal constant to look for, OR a break. '/x' I have not yet come across. Has it only matched once because it is not part of a loop to tell it to match as many times as I want? I don't want the comma, so perhaps I should look only for [A-Z][a-z], but then how do I ignore these the second time I want to match? If I must match only once (as I originally had tried to do), then why does Perl not find anything for $2, $3 and $4?

As for Updates 1 & 2, I'm afraid they're far beyond my capabilities at this moment in time. I realise they're more than likely a cleaner way to write the code; I was simply trying to write a program for myself to show I understood RegExp (but clearly that is not the case!). I'm afraid the code in the two updates is far too advanced for me at this moment ;/

Replies are listed 'Best First'.
Re^3: RegExp substitution
by AnomalousMonk (Archbishop) on Apr 10, 2014 at 22:40 UTC
    ... (?: , | \b)). ? allows the preceding character to be optional (but there is no preceding character?) and I can't see the use of a colon in this context.

    The  (?:pattern) construct defines a non-capturing group. See Extended Patterns in perlre. This and other statements in your reply lead me to suggest that you take a big step backwards and read up on basic regex docs. Please see perlre. In particular, see perlretut for a very good tutorial. (I'm not familiar with the material in the Cozens book.) See also perlrequick for a quick reference. See also the material in the "Pattern Matching, Regular Expressions, and Parsing" area of the Tutorials section of this site.
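For readers following along, here is a minimal sketch of the difference between a capturing and a non-capturing group, using the sample string from the thread. The variable names ($text, $word, @captures) are made up for illustration and the pattern is only assumed to resemble the one under discussion.

```perl
use strict;
use warnings;

my $text = "Three, Four, One, Two";

# ( ... )   captures the word into $1.
# (?: ... ) only groups the alternation; it does not create $2.
# Under /x, literal whitespace inside the pattern is ignored.
my ($word) = $text =~ / ([A-Z][a-z]+) (?: , | \b ) /x;
print "word: $word\n";

# In list context a match returns one value per capturing group.
my @captures = $text =~ / ([A-Z][a-z]+) (?: , | \b ) /x;
print "number of captures: ", scalar(@captures), "\n";
```

The ?: here is part of the opening delimiter of the group, so it is not the "optional" quantifier and needs no preceding character.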

      Thanks, I came across (?:pat) just this morning when I woke up to have another go at the problem and thought I'd finish the chapter of the Cozens book first. I'm also taking a look at the perldoc and tutorials (actually, after only a few moments of reading I realised where I was going wrong and why $2, $3 and $4 don't exist!). Because Perl matches as soon as it can, as much as it can, and ends as soon as it can, it was never going to find anything beyond 'Three' with the match I was making (a very blunt and undetailed rewording of your earlier post).

      However, overall I think the exercise has been a success: I made the program to check my understanding of RegEx and it has led to the discovery of new material to help me with this and future problems. I'll look at resolving this problem again once I have covered the material.

      Thank you for your patience,

      Regards,

      Keystone.
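A small sketch of the point about $2, $3 and $4, assuming the same sample string: the numbered capture variables correspond to pairs of parentheses in the pattern, not to successive matches. The names $first, $second_is_defined and @four are illustrative only.

```perl
use strict;
use warnings;

my $text = "Three, Four, One, Two";

# One pair of capturing parentheses: a successful match sets only $1.
my ($first, $second_is_defined);
if ($text =~ /([A-Z][a-z]+)/) {
    $first             = $1;
    $second_is_defined = defined($2) ? 1 : 0;
}
print "first: $first, \$2 defined: $second_is_defined\n";

# Four pairs of parentheses: a single match sets $1 through $4,
# which are also returned as a list in list context.
my @four = $text =~ /([A-Z][a-z]+), ([A-Z][a-z]+), ([A-Z][a-z]+), ([A-Z][a-z]+)/;
print "four groups: @four\n";
```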

Re^3: RegExp substitution
by HereandThere (Initiate) on Apr 10, 2014 at 20:55 UTC
    Keystone,

    I am relatively new here at perlmonks, but perhaps I can help a little bit.

    You asked why the regexp matched only once, instead of multiple times. This is by design: a substitution replaces only the first match unless something like the "global" switch, /g, is added *at the end of the regex in play*.

    If you invoke the global switch, then all matches will be replaced with the substitution string.

    So, if you had some code:

    $_="FourThreeTwoOne, Three, Four, One, Two";
    $1="Three";
    $second="&&&&";
    s/$2/$second/g;
    print;
    
    it would print this result:
    
    Four&&&&TwoOne, &&&&, Four, One, Two

    Hope this helps.

    -HaT

      Hello HereandThere, welcome to the Monastery. Hmm, did you try this code that you posted?
          $_="FourThreeTwoOne, Three, Four, One, Two";
          $1="Three";
          $second="&&&&";
          s/$2/$second/g;
          print;
      I am pretty sure it cannot work. For a start, $1 is a read-only value that can be set only by a regex.
      $1="Three";
      I do not think that the perl compiler accepts that, but even if it did, it should not be done. In my view, $1 is a special variable that should be kept for just one single purpose: the first capture in a regex.
      s/$2/$second/g;
      This makes even less sense, since $2 has not been set anywhere, it is undefined, there is just no way this is gonna work. In addition, I would suggest that if you set $_ to something, you first localize it within some lexical block:
          {
              local $_="FourThreeTwoOne, Three, Four, One, Two";
              # ...
          }
      Furthermore, you should probably have these pragmas at the top of your script:
      use strict; use warnings;
      and they would force you to rewrite the third line of your code as:
      my $second="&&&&";
      Finally, I don't even understand what your code is supposed to demonstrate to the OP. Well, to tell the truth, if you were not so new on this forum, I would probably down vote your post (although I almost never down vote posts for other reasons than spamming, insults, completely off-topic posts or other clear netiquette violations). I'll refrain from doing it here in consideration of the fact that you are new here.
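As a hedged illustration of the points above: on the perls I am familiar with, an assignment to $1 does compile but dies when the statement runs, with a "Modification of a read-only value attempted" error. The second half is a guess at what the posted example was aiming for, with an ordinary lexical ($target, a made-up name) in place of the capture variables.

```perl
use strict;
use warnings;

# Assigning to $1 is trapped with eval so the script can carry on.
my $ok    = eval { $1 = "Three"; 1 };
my $error = $ok ? '' : $@;
print "Assignment to \$1 failed: $error" unless $ok;

# What the example may have intended: replace every occurrence of a
# known word. $target is an ordinary variable, not a capture variable.
my $text   = "FourThreeTwoOne, Three, Four, One, Two";
my $target = "Three";
my $second = "&&&&";

# \Q...\E quotes any regex metacharacters in the interpolated text.
(my $result = $text) =~ s/\Q$target\E/$second/g;
print "$result\n";
```

Copying the string before substituting leaves $text untouched, which also avoids having to assign to the global $_.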
      Hi there, thanks for your reply. I had tried something using the /g global earlier, but I discounted it because it changed every number every time. Here is what I had:

          #!/usr/bin/perl
          #subs4.plx
          use warnings;
          use strict;
          #try using /g global to remember where I'm up to in a match
          my $pattern;
          $_ = "Three, Four, One, Two";
          print ("\t\tCounting Program\n\n", $_, "\n\n");
          my $correct;
          print "Is this sequence correct?(yes/no)\n";
          $correct = <STDIN>;
          chomp ($correct);
          while ($correct ne "yes"){
              print "Is the first number correct?\n";
              my $first = <STDIN>;
              chomp ($first);
              if ($first ne "yes"){
                  print"What should it be?\n";
                  $first = <STDIN>;
                  chomp ($first);
                  /([A-Z][a-z]+)/g;
                  s/$1/$first/g;
              }
              print "Is the second number correct?\n";
              my $second = <STDIN>;
              chomp ($second);
              if ($second ne "yes"){
                  print"What should it be?\n";
                  $second = <STDIN>;
                  chomp ($second);
                  /([A-Z][a-z]+)/g;
                  s/$2/$second/g;
              }
              print "Is the third number correct?\n";
              my $third = <STDIN>;
              chomp ($third);
              if ($third ne "yes"){
                  print"What should it be?\n";
                  $third = <STDIN>;
                  chomp ($third);
                  /([A-Z][a-z]+)/g;
                  s/$3/$third/g;
              }
              print "Is the fourth number correct?\n";
              my $fourth = <STDIN>;
              chomp ($fourth);
              if ($fourth ne "yes"){
                  print"What should it be?\n";
                  $fourth = <STDIN>;
                  chomp ($fourth);
                  /([A-Z][a-z]+)/g;
                  s/$4/$fourth/g;
              }
              #Final print
              print ($_, "\n\n");
              print "Is this sequence correct now?(yes/no)\n";
              $correct = <STDIN>;
              chomp ($correct);
          }

      After running through each of the <STDIN>'s the final print is "Four, Four, Four, Four". In retrospect this is possibly the closest I got to my actual solution! This is what prompted me to ask the question, in which I was looking for a way to ignore the first match of a RegEx the second time it's run. Again, thank you for your help, I feel I am close to a solution.

      -- Just had a thought pre-posting: it is possible (but perhaps not elegant) to run the RegEx /[A-Z][a-z]+/, save the result to a variable and substitute the match with whitespace, then call the variable later. But thinking about it, this is just a cheat/hack and not really using the substitute function of a RegEx.

      Regards
      Keystone
        ...
        s/$1/$first/g;
        ...
        s/$2/$second/g;
        ...
        s/$3/$third/g;
        ...
        s/$4/$fourth/g;
        ...

        The critical thing to realize about this code is that the capture variables  $2 $3 $4 have never been set to any meaningful value. I.e., they have the undefined value undef. When the undefined value is interpolated into a string or a regex, it interpolates as  '' (the empty string), or, in the case of a regex,  // (the empty regex).
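A minimal sketch of the string half of that statement. The lexical $unset is a made-up stand-in for a capture variable that was never set, and the warning handler is only there so the warning can be counted instead of printed.

```perl
use strict;
use warnings;

# Collect warnings rather than letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $unset;                  # undefined, like a capture group that never matched
my $string = "[$unset]";    # interpolates as the empty string, with a warning

print "string: $string\n";
print "warnings seen: ", scalar(@warnings), "\n";
```

With "use warnings" in effect, the interpolation is flagged as a use of an uninitialized value, which is the warning message mentioned later in this reply.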

        ...
        /([A-Z][a-z]+)/g;
        s/$2/$second/g;
        ...

        This pair of statements and corresponding succeeding statement pairs is very interesting. I strongly recommend you insert the statement
            print qq{=== '$_' \n};  # FOR DEBUG
        or its equivalent after each and every of the  s/// substitution statements to monitor what's going on with the progressive 'correction' of the initial string.

        Here's a narrative. As you can see from the newly-added debug print statement, the first
            /([A-Z][a-z]+)/g;
            s/$1/$first/g;
        statement pair actually does something expected and useful: it replaces the first number with 'One'. The output from the debug print statement is
            === 'One, Four, One, Two'

        The second
            /([A-Z][a-z]+)/g;
            s/$2/$second/g;
        statement pair replaces all numbers with 'Two'! The output from the debug print statement is
            === 'Two, Two, Two, Two'

        The reason for this odd behavior is that when  $2 with an undefined value interpolates into  s/$2/$second/g; it produces the  // empty regex match pattern. This pattern is special: it uses the last successful regex match pattern for matching. The last successful match pattern was in the  /([A-Z][a-z]+)/g; statement immediately before the  s/// substitution statement. Therefore,
            s/$2/$second/g;
        interpolates (ignoring, as you do, the warning message) as if it were
            s//$second/g;
        which matches as if it were
            s/([A-Z][a-z]+)/$second/g;
        which replaces each and every match (because of the  /g modifier) against the  ([A-Z][a-z]+) pattern (i.e., something that looks like a number) with, in this case, 'Two'. Whew!

        And similarly for each subsequent  //; s///; statement pair.
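The narrative above can be reproduced without any prompts. This is a sketch under the assumption that the string is already 'One, Four, One, Two' after the first correction; $empty is a made-up variable holding the empty string that an undefined $2 would interpolate as.

```perl
use strict;
use warnings;

my $text = "One, Four, One, Two";

# A successful match: its pattern becomes the last successful pattern.
$text =~ /([A-Z][a-z]+)/ or die "no match";

# A pattern that evaluates to the empty string reuses the last
# successful pattern, so this behaves like s/([A-Z][a-z]+)/Two/g.
my $empty = '';
(my $result = $text) =~ s/$empty/Two/g;
print "$result\n";
```

If the thread's explanation holds, every number word is replaced, which matches the 'Two, Two, Two, Two' debug output quoted above.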

        That ought to give you something to think about while you're reviewing the regex documentation.

        (BTW: The  /g modifier in the  /([A-Z][a-z]+)/g; statement is at best useless and at worst confusing and corrupting. You cannot use the  /g modifier in this way to "keep track" of match positions in successive matches. (The  /c modifier in conjunction with the  /g modifier does something like this in certain cases, but I don't really see how it could be adapted to serve here.) You will have to think of some other way to query the user about successive numbers in the original string so that they may be 'corrected' one by one.)
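One possible "other way", offered as a sketch and not necessarily what was intended here: count the matches inside a single s///ge and replace only the Nth one. The helper name replace_nth is hypothetical, and the hard-coded corrections stand in for answers the user would type.

```perl
use strict;
use warnings;

# Replace only the $n-th number word (1-based) in $string.
sub replace_nth {
    my ($string, $n, $replacement) = @_;
    my $count = 0;
    # /e evaluates the replacement as code for every match; all matches
    # except the $n-th are put back unchanged via $1.
    $string =~ s/([A-Z][a-z]+)/ ++$count == $n ? $replacement : $1 /ge;
    return $string;
}

my $text = "Three, Four, One, Two";
my $step = replace_nth($text, 2, "Two");
print "$step\n";

# Correcting each position in turn.
my $fixed = $text;
my @answers = ("One", "Two", "Three", "Four");
for my $i (0 .. $#answers) {
    $fixed = replace_nth($fixed, $i + 1, $answers[$i]);
}
print "$fixed\n";
```

Because the position is counted rather than matched, earlier corrections do not need to be "ignored" on later passes.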

      Laurent_R has already posted a reply that covers important points I had wanted to make.

      Let me say this in addition. You're trying to be helpful and that's very good, but it doesn't help to offer wrong advice. The reason I almost always post code as cut/pastes from command-line executions is that, in addition to providing a complete context for execution, the code is actually executed and so is known to be at least syntactically correct — and maybe even has a chance of being semantically correct.

Re^3: RegExp substitution
by AnomalousMonk (Archbishop) on Apr 11, 2014 at 00:40 UTC
    1st substitute/([A-Z][a-z][\W][\b])/<userinput>/;

    The critical thing to remember about  [...] character classes is that most regex metacharacters are not meta-special inside them. Thus,  [.] (which you have used elsewhere) matches a single  '.' (period) character and  [\b] matches a single backspace control-character. So the pattern above might be described as:

    •  [A-Z] A single upper-case character; followed by
    •  [a-z] A single lower-case character; followed by
    •  [\W] A single character that is anything not matching a  \w 'word' character (\W is a class unto itself, so no enclosing square brackets are needed); followed by
    •  [\b] A single backspace control character.
    Is any of that what you really wanted?
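A short sketch of the character-class behaviour described in the list above, with made-up variable names. The last line shows \b used outside a class, which is only an assumption about what the original pattern was reaching for.

```perl
use strict;
use warnings;

# Inside [...] a period is just a period.
my $dot_literal = "a.b" =~ /a[.]b/ ? 1 : 0;
my $dot_other   = "axb" =~ /a[.]b/ ? 1 : 0;

# Inside [...] \b means the backspace character (\x08), not a word boundary.
my $backspace   = "\x08" =~ /[\b]/ ? 1 : 0;
my $no_boundary = "Three, Four" =~ /[A-Z][a-z]+[\b]/ ? 1 : 0;

# Outside a class, \b is the zero-width word-boundary assertion.
my ($word) = "Three, Four" =~ /([A-Z][a-z]+)\b/;

print "[.] matches '.':        $dot_literal\n";
print "[.] matches 'x':        $dot_other\n";
print "[\\b] matches backspace: $backspace\n";
print "[\\b] as a boundary:     $no_boundary\n";
print "word before \\b:         $word\n";
```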

      Looking back, I'm pretty sure that it is not, and having read your other posting directing me to the Perl tutorials (less than 5 minutes into reading them) I've discovered why my RegEx wasn't working as intended. I've posted my findings in a direct reply to that post.
