RegExp substitution

by Keystone (Initiate)
on Apr 10, 2014 at 16:52 UTC ( [id://1081838]=perlquestion )

Keystone has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm new to Perl and working through Simon Cozens' free Beginning Perl book. I tried to write myself a little program to test whether I understood RegExp, and it's not giving me the expected answers. Could anyone offer any guidance as to why the final print of $_ gives "FourThreeTwoOne, Three, Four, One, Two", please? As I said, I'm only a novice, so please be gentle! :)

#!/usr/bin/perl
#subs.plx
use warnings;
use strict;

#An incorrectly ordered list to have the user organise
$_ = "Three, Four, One, Two";
print ("\t\tCounting Program\n\n", $_, "\n\n");

my $correct;
print "Is this sequence correct?(yes/no)\n";
$correct = <STDIN>;
chomp ($correct);

while ($correct ne "yes"){
    print "Is the first number correct?\n";
    my $first = <STDIN>;
    chomp ($first);
    if ($first ne "yes"){
        print"What should it be?\n";
        $first = <STDIN>;
        chomp ($first);
    }

    print "Is the second number correct?\n";
    my $second = <STDIN>;
    chomp ($second);
    if ($second ne "yes"){
        print"What should it be?\n";
        $second = <STDIN>;
        chomp ($second);
    }

    print "Is the third number correct?\n";
    my $third = <STDIN>;
    chomp ($third);
    if ($third ne "yes"){
        print"What should it be?\n";
        $third = <STDIN>;
        chomp ($third);
    }

    print "Is the fourth number correct?\n";
    my $fourth = <STDIN>;
    chomp ($fourth);
    if ($fourth ne "yes"){
        print"What should it be?\n";
        $fourth = <STDIN>;
        chomp ($fourth);
    }

    #My RegExp
    /([A-Z][a-z][.][\b])/;

    #The substitutions based on my RegExp
    s/$1/$first/;
    s/$2/$second/;
    s/$3/$third/;
    s/$4/$fourth/;

    #Final print reads:FourThreeTwoOne, Three, Four, One, Two
    print ($_, "\n\n");

    print "Is this sequence correct now?(yes/no)\n";
    $correct = <STDIN>;
    chomp ($correct);
}

Any guidance would be appreciated. Cheers, Keystone.

Replies are listed 'Best First'.
Re: RegExp substitution
by AnomalousMonk (Archbishop) on Apr 10, 2014 at 17:29 UTC

    But what does your RegExp actually match? (Also a 'fixed' version based on what I think you think you want.)

    c:\@Work\Perl\monks>perl -wMstrict -le
    "$_ = 'Three, Four, One, Two';
     ;;
     /([A-Z][a-z][.][\b])/;
     print qq{'$1' '$2' '$3' '$4'};
     ;;
     /([A-Z] [a-z]+ (?: , | \b))/x;
     print qq{'$1' '$2' '$3' '$4'};
    "
    Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $2 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $3 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $4 in concatenation (.) or string at -e line 1.
    '' '' '' ''
    Use of uninitialized value $2 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $3 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $4 in concatenation (.) or string at -e line 1.
    'Three,' '' '' ''

    Why does the first regex match nothing at all? (That should be fairly easy to answer: take a careful look at it.) Why does the second regex match something, but only once when you want it to match several times? Why does 'Three,' have a comma at the end? Do you really want to capture this character?

    Update 1: Another thing to remember is that each successful regex match that is executed "wipes out" all capture variables $1 $2 $3 $n and re-assigns only those corresponding to an actual capture group in the latest successful match. A s/// substitution must do a match too, and you're doing four s/// in a row.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'foo bar baz';
     $s =~ m{ (foo) \s* (bar) \s* (baz) }xms;
     print qq{A: '$1' '$2' '$3'};
     ;;
     $s =~ m{ (xyzzy) }xms;
     print qq{B: '$1' '$2' '$3'};
     ;;
     $s =~ m{ (b \w*) }xms;
     print qq{C: '$1' '$2' '$3'};
    "
    A: 'foo' 'bar' 'baz'
    B: 'foo' 'bar' 'baz'
    Use of uninitialized value $2 in concatenation (.) or string at -e line 1.
    Use of uninitialized value $3 in concatenation (.) or string at -e line 1.
    C: 'bar' '' ''

    Update 2: Here's an approach (one of many) to the problem, but without the annoying <STDIN> stuff:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $s = 'Three, Four, One, Two, xFive9';
     print qq{'$s'};
     ;;
     my @numbers = $s =~ m{ \b [[:upper:]] [[:lower:]]+ \b }xmsg;
     printf qq{'$_' } for @numbers;
     print '';
     ;;
     my %correct;
     @correct{ @numbers } = qw(one two three four);
     ;;
     my ($rx_search) =
         map qr{ \b (?: $_) \b }xms,
         join '|',
         map quotemeta,
         keys %correct
         ;
     print $rx_search;
     ;;
     $s =~ s{ ($rx_search) }{$correct{$1}}xmsg;
     print qq{'$s'};
    "
    'Three, Four, One, Two, xFive9'
    'Three' 'Four' 'One' 'Two'
    (?^msx: \b (?: Four|Three|Two|One) \b )
    'one, two, three, four, xFive9'

      Thank you for a reply. I'm sorry to say I think I am still a bit too inexperienced in the Perl language to follow the code in your reply fully but I have had a go and tried to answer your questions as fully as I can;

      What is my RegExp trying to match? I am trying to match and substitute the words in the string $_ by asking the user to input the correct string of number values. Originally I tried to match and substitute in each 'if' decision after the user input, however doing it this way I could not see a way to match to any string other than the first available without using a string literal.

      i.e. $_ = "three, four" I could not see a way to match to 'four' without using the literal, whereas, as I understood it the power of a RegExp came from it being able to find something in a string without a literal constant.

      In essence I suppose what I am trying to do is (pseudo-code):

      1st substitute/([A-Z][a-z][\W][\b])/<userinput>/;
      then 2nd substitute/(NOT THIS ONE[A-Z][a-z][\W][\b])(THIS ONE[A-Z][a-z][\W][\b])/<userinput>/;
      then 3rd substitute/(NOT THIS ONE[A-Z][a-z][\W][\b])(AND NOT THIS ONE[A-Z][a-z][\W][\b])(BUT THIS ONE[A-Z][a-z][\W][\b])/<userinput>/;

      Does that make any kind of sense?

      Your first RegEx ([A-Z][a-z][.][\b]) matches nothing because no words in the string are 2 characters in size. Adding a plus to the lower-case set, [a-z]+, would match the first word, but as I understand it . is capable of matching nothing as well as anything, therefore I believe it would match nothing, and the next character in the match would be a comma when the match is actually looking for a break.

      Why does the second regex match something, but only once when you want it to match several times? I'm unsure about this part, so I can't answer this question easily: (?: , | \b)). ? allows the preceding character to be optional (but there is no preceding character?) and I can't see the use of a colon in this context. I understand, however, that the comma is a literal constant to look for, OR a break. '/x' I have not yet come across. Has it only matched once because it is not part of a loop to tell it to match as many times as I want? I don't want the comma, so perhaps look only for [A-Z][a-z], but how then do I ignore these the second time I want to match? If I must match only once (as I originally had tried to do), then why does Perl not find anything for $2 $3 and $4?

      As for Updates 1 & 2, I'm afraid they're far beyond my capabilities at this moment in time. I realise they're more than likely a cleaner way to write the code; I was simply trying to write a program for myself to show I understood RegExp (but clearly that is not the case!). I'm afraid the code in the two updates is far too advanced for me at this moment ;/

        ... (?: , | \b)). ? allows the preceding character to be optional (but there is no preceding character?) and I can't see the use of a colon in this context.

        The  (?:pattern) construct defines a non-capturing group. See Extended Patterns in perlre. This and other statements in your reply lead me to suggest that you take a big step backwards and read up on basic regex docs. Please see perlre. In particular, see perlretut for a very good tutorial. (I'm not familiar with the material in the Cozens book.) See also perlrequick for a quick reference. See also the material in the "Pattern Matching, Regular Expressions, and Parsing" area of the Tutorials section of this site.
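
        As a minimal sketch of the difference (the variable names are only illustrative and not from the posts above), the same word can be matched once with an ordinary capturing group around the comma-or-boundary alternative and once with a non-capturing (?: ... ) group:

```perl
use strict;
use warnings;

my $word = 'Three,';

# Capturing group: the comma-or-boundary alternative becomes $2,
# so the match returns two values in list context.
my @capturing = $word =~ /([A-Z][a-z]+)(,|\b)/;

# Non-capturing group: (?: ... ) groups the alternatives without
# creating a capture, so only the word itself is returned.
my @non_capturing = $word =~ /([A-Z][a-z]+)(?:,|\b)/;

print scalar(@capturing),     " values with (...):   '@capturing'\n";
print scalar(@non_capturing), " value  with (?:...): '@non_capturing'\n";
```

        The ?: here is not the "optional" quantifier; it is part of the (?: opening of the group.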

        Keystone,

        I am relatively new here at perlmonks, but perhaps I can help a little bit.

        You asked why the regexp matched only once, instead of multiple times. That is by design: a match or substitution stops after the first match unless the "global" switch (/g) is added *at the end of the regex in play*.

        If you invoke the global switch, then all matches will be replaced with the substitution string.

        So, if you had some code:

        $_ = "FourThreeTwoOne, Three, Four, One, Two";
        # $1 is read-only, so the search text goes in an ordinary variable
        my $search = "Three";
        my $second = "&&&&";
        s/$search/$second/g;
        print;
        
        it would print this result:
        
        Four&&&&TwoOne, &&&&, Four, One, Two

        Hope this helps.

        -HaT
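
        The effect of the global switch can be sketched side by side. This is only an illustration; the pattern and variable names are assumptions chosen for the example string from the original question:

```perl
use strict;
use warnings;

my $text = 'Three, Four, One, Two';

# Without /g, s/// replaces only the first match.
(my $first_only = $text) =~ s/[A-Z][a-z]+/X/;

# With /g, every match is replaced.
(my $all = $text) =~ s/[A-Z][a-z]+/X/g;

# In list context, m//g returns every match at once.
my @words = $text =~ /[A-Z][a-z]+/g;

print "$first_only\n";
print "$all\n";
print join('|', @words), "\n";
```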

        1st substitute/([A-Z][a-z][\W][\b])/<userinput>/;

        The critical thing to remember about [...] character classes is that most regex metacharacters are not meta-special inside them. Thus, [.] (which you have used elsewhere) matches a single '.' (period) character and [\b] matches a single backspace control-character. So the pattern above might be described as:

        •  [A-Z] A single upper-case character; followed by
        •  [a-z] A single lower-case character; followed by
        •  [\W] A single character that is anything not matching a \w 'word' character (\W is a class unto itself, so no enclosing square brackets are needed); followed by
        •  [\b] A single backspace control character.
        Is any of that what you really wanted?
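
        A small sketch of the points in the list above, with each test reduced to 1 or 0 (the variable names are illustrative only):

```perl
use strict;
use warnings;

# Inside [...] a dot is a literal period, not "any character".
my $class_dot_matches_letter = ('a' =~ /^[.]$/) ? 1 : 0;
my $class_dot_matches_period = ('.' =~ /^[.]$/) ? 1 : 0;
my $bare_dot_matches_letter  = ('a' =~ /^.$/)   ? 1 : 0;

# Inside [...] \b is a backspace character (0x08);
# outside a class it is a zero-width word boundary.
my $class_b_matches_backspace = ("\x08" =~ /^[\b]$/)              ? 1 : 0;
my $boundary_after_word       = ('Three,' =~ /^[A-Z][a-z]+\b/)    ? 1 : 0;
my $class_b_after_word        = ('Three,' =~ /^[A-Z][a-z]+[\b]/)  ? 1 : 0;

print "[.] vs 'a': $class_dot_matches_letter\n";
print "[.] vs '.': $class_dot_matches_period\n";
print ".   vs 'a': $bare_dot_matches_letter\n";
print "[\\b] vs backspace: $class_b_matches_backspace\n";
print "\\b after 'Three':   $boundary_after_word\n";
print "[\\b] after 'Three': $class_b_after_word\n";
```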

Re: RegExp substitution
by AnomalousMonk (Archbishop) on Apr 11, 2014 at 05:04 UTC

    Here's an approach. File ask_to_replace_1.pl:

    use warnings;
    use strict;

    my $string = 'Three, Four, One, Two, xFive9, six, Seven';

    my $number = qr{ \b [[:upper:]] [[:lower:]]+ \b }xms;

    print qq{string is now '$string' \n};

    my $ordinal = 0;
    $string =~ s{ ($number) }
                { ask_replace(++$ordinal, $-[1], $1) }xmsge;
    print qq{new string is '$string' \n};

    print qq{done! \n};

    sub ask_replace {
        my ($ordinal, $offset, $string, ) = @_;

        my $yes    = qr{ (?i) y (?: e (?: s)? )? }xmso;
        my $ok     = qr{ (?i) o (?: k)? }xmso;
        my $accept = qr{ \A (?: $yes | $ok) \Z }xmso;

        print qq{sub-string $ordinal at offset $offset is '$string' \n};
        print qq{is this correct? };
        my $answer = <stdin>;
        return $string if $answer =~ $accept;

        print qq{no: enter new string: };
        chomp(my $replace = <stdin>);
        return $replace;
    }

    Output:

    c:\@Work\Perl\monks\Keystone>perl ask_to_replace_1.pl
    string is now 'Three, Four, One, Two, xFive9, six, Seven'
    sub-string 1 at offset 0 is 'Three'
    is this correct? n
    no: enter new string: Uno
    sub-string 2 at offset 7 is 'Four'
    is this correct? No
    no: enter new string: Dos
    sub-string 3 at offset 13 is 'One'
    is this correct? y
    sub-string 4 at offset 18 is 'Two'
    is this correct? x
    no: enter new string: Tres
    sub-string 5 at offset 36 is 'Seven'
    is this correct? n
    no: enter new string: se7en
    new string is 'Uno, Dos, One, Tres, xFive9, six, se7en'
    done!
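
    For readers who want to try the same s///e idea without typing answers, here is a reduced, non-interactive sketch. It is not the script above: the %replacement_for hash and the replace_nth helper are hypothetical stand-ins for the user's input, and the pattern is a simplified version of $number.

```perl
use strict;
use warnings;

my $string = 'Three, Four, One, Two';

# Hypothetical corrections keyed by 1-based word position;
# a position with no entry keeps its original word.
my %replacement_for = (1 => 'One', 2 => 'Two', 3 => 'Three', 4 => 'Four');

# Return the replacement for the nth word, or the word itself.
sub replace_nth {
    my ($n, $word) = @_;
    return exists $replacement_for{$n} ? $replacement_for{$n} : $word;
}

my $position = 0;

# /e evaluates the replacement as code for each match, and /g repeats
# it for every word, so the counter identifies "the nth word".
(my $fixed = $string) =~ s{ \b ([A-Z][a-z]+) \b }
                           { replace_nth(++$position, $1) }xge;

print "$fixed\n";
```

    Counting matches inside the replacement avoids having to write a separate regex for "the second word", "the third word", and so on.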
Re: RegExp substitution
by MidLifeXis (Monsignor) on Apr 10, 2014 at 18:01 UTC

    You could always try out Regexp::Debugger.

    Update: Corrected module name

    --MidLifeXis

      I love the idea MidLifeXis :-) ... but, sadly, the link doesn't quite work :-(

      Have you managed to get [doc://...] and [mod://...] intermixed ?

      A user level that continues to overstate my experience :-))

        No, s/RegExp/Regexp/. .oO( Note to self - check links after posting )

        --MidLifeXis
