Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

RegEx to match at least one non-adjacent term

by Cefu (Beadle)
on Dec 07, 2007 at 15:44 UTC ( [id://655675]=perlquestion: print w/replies, xml ) Need Help??

Cefu has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I appologize in advance for this being a RegEx rather than Perl specific question.

I'm trying to clean up a list of data entered in a free-text field (who needs validation anyway). For the most part the data consists of one or more numbers (which I want to keep) and sometimes color "words" which might appear before, after or between the number. I want to discard some specific colors but not other colors or other text. I'm trying to craft a regex to match the following and remove it:

  • The word Red and variations thereof (R, r, rd etc.); multiple occurrences; any case
  • Optionally surrounded by various spaces and parenthesis; not important if they are matched
  • So my current regex substitution looks something like:

       s/\s*\(?\s*re?d?\s*\)?\s*//gi

    This seemed to be working flawlessly until my spot checks revealed the following humorous example:

       12345 Gray 6789 Red => 12345 Gay 6789

    To avoid workplace embarrassment I thought it best to make sure that the bit I was removing occurred either just before or just after a number rather than in the middle of other text. So my thought is to modify the regex to somthing like

       s/(\d?)\s*\(?\s*re?d?\s*\)?\s*(\d?)//gi

    The problem is that I can't leave both digits optional (as shown) or I'm still in the same boat. I also can't make either one mandatory or I'm dictating a before-number-only or after-number-only match. What I really want is one or the other (or both) but not neither.

    As you might guess from the parens around the digits, I also considered checking what matched in the second part and substituting back in the original if I didn't see a digit. However, I ran in to various problems ($1 being undefined, the A?B:C syntax not working inside the regex, etc.)

    So, is there some nice way to do this in a single regex? Can I somehow ask that the regex match one or more of two disjointed parts?

    Thanks,
    Cefu

    Update: Found a solution that almost matches my requirements and is actually better for what I needed. I'd downvote myself for not thinking this through first if I could. :)

       s/(^|\d)\s*\(?\s*re?d?\s*\)?\s*($|\d)/$1$2/gi

    While messing around with getting a conditional to work on the right side of the substitution I noticed that it happily substituted no characters when $1 or $2 are undefined. It does the same, without complaint, when they match the anchors ^ and $. So, rather than shoot for some weird hybrid of optional and mandatory I decided to make them both mandatory but with palatable alternatives (the anchors).

    Where this differes from my requirements is that it will not match a Red-like word if there is a digit on one side and more text (rather than the beginning or end of the string) on the other. So, for example, with my new regex:

       (Red) 123 Reddish-Orange 456 Orange 789 => 123 Reddish-Orange 456 Orange 789

    Whereas if someone had come up with a way to get the behavior I asked for above it would have done this:

       (Red) 123 Reddish-Orange 456 Orange 789 => 123 dish-Orange 456 Orange 789

    The results of my number-or-edge-of-string before and after suit me better.

    Requirements translate into code or code translates into requirements..... I never can seem to remember how that's supposed to go. :)

    • Comment on RegEx to match at least one non-adjacent term

    Replies are listed 'Best First'.
    Re: RegEx to match at least one non-adjacent term
    by ikegami (Patriarch) on Dec 07, 2007 at 16:12 UTC

      Normally, you'd be able to use \b.

      my $whitespace = qr/[\s()]+/; my $badwords = qr/.../i; my $wordchar = qr/[a-zA-Z]/; s/ $whitespace? \b $badwords \b $whitespace? / /xg;

      But since you want to allow "12345Red6789", you'll have to implement your own version of \b.

      my $whitespace = qr/[\s()]+/; my $badwords = qr/.../i; my $wordchar = qr/[a-zA-Z]/; s/ $whitespace? (?<! $wordchar ) # At start of word. $badwords # Words to erase. (?! $wordchar ) # At end of word. $whitespace? / /xg; # Avoid joining two numbers.

      By the way, Regexp::List can build an efficient $badwords.

      use Regexp::List qw( ); my @badwords = qw( r rd red ); my $badwords = Regexp::List->new(modifiers=>'i')->list2re(@badwords); # qr/r(?:e?d)?/i

      Update: Added Regexp::List bit.

        Thanks for the info on Regexp::List. I need to get more familiar with the modules that are out there.

        I think your initial solution would end up always leaving a space where the word was? That's fine if the word was between two number but not if it was at the beginning or end. I'm happy with the cheap trim I get from removing spaces along with the bad words.

        If I'm wrong about it always leaving a space, I appologize. I am committing the sin of commenting without executing the example as I don't have access to Perl on my internet connected machine.

          Yes it does, in order to avoid "1234 Red 5678" becoming "12345678". Feel free to remove extra whitespace afterwards. Doing it in the regex would needlessly complicate it.

          s/.../ /xg; s/^\s+//; s/\s+$//;
    Re: RegEx to match at least one non-adjacent term
    by tuxz0r (Pilgrim) on Dec 07, 2007 at 16:08 UTC
      Not sure what your input looks like, but just testing with your string above this seems to work for me:
      use strict; while (<STDIN>) { s/[\s\(\)]+//g; # strip optional chars my @matches = m/(\d+)|([^\d]+)/g; print join ":", grep { defined $_ and $_ !~ m/^(r|rd|red)$/i } + @matches; }
      Which gives the following on various runs:
      $ echo "12345 (Gray) 6789 (Red)" | ./red.pl 12345:Gray:6789 $ echo "12345 (Gray) 6789 Red" | ./red.pl 12345:Gray:6789 $ echo "12345Gray6789Red" | ./red.pl 12345:Gray:6789
      Update: Ikegami pointed out that I missed the requirement that spaces/parens were optional, so I changed this around slightly (no longer just a one liner). Now we can just get the numbers and words out seperately and print them in order skipping any words (like variations of red) that you need.

      ---
      s;;:<).>|\;\;_>?\\^0<|=!]=,|{\$/.'>|<?.|/"&?=#!>%\$|#/\$%{};;y;,'} -/:-@[-`{-};,'}`-{/" -;;s;;$_;see;
      Warning: Any code posted by tuxz0r is untested, unless otherwise stated, and is used at your own risk.

        Doesn't work?
        $ echo "12345Red6789" | perl -ape 's/[\s\(]+(r|rd|red)[\s\)]+//gi;' 12345Red6789

        The OP said the spaces were optional.

    Re: RegEx to match at least one non-adjacent term
    by toolic (Bishop) on Dec 07, 2007 at 16:34 UTC
      I know you are looking for a regex solution, but just in case you change your mind, maybe something like this would suit your needs...

      Updated: to handle parentheses.

      #!/usr/bin/env perl use warnings; use strict; while (<DATA>) { #my @no_reds = grep { !/^r/i } split; my @no_reds = grep { !/^r/i } split /[\s\(\)]/; print "@no_reds\n"; } __DATA__ 12345 Gray 6789 Red rd 555 blue 888 red 777 green 999 foo bar r baz ruby goo(r) ruby doobie 888red999

      Prints out:

      12345 Gray 6789 555 blue 888 777 green 999 foo bar baz goo doobie 888red999

      Unlike ikegami's solution, this requires the input to be whitespace-separated.

        Thanks for the idea. It's always good to break out of my assumption that RegEx is the right way and consider something else like Grep.

        Unfortunately, this was a free-text field filled by users, so I can't assume they'll give me whitespacea, any other convention or even the results of normal human brain function.

        Also, this method is either a little lacking or it's way TOO smart if it knows that rubies are actually red. :)

    Re: RegEx to match at least one non-adjacent term
    by CountZero (Bishop) on Dec 07, 2007 at 21:38 UTC
      Actually, your references to numbers has been leading us on a wild goose chase. It seems to me that all you want to do is delete any form of "bad" color words from the user input. The following regex will do that nicely: s/\b(re?d?)\b/ /gi. It replaces your r, re, red, rd by a space (to avoid running numbers together). Collapsing multiple spaces is easy in a separate regex and left as an exercise for the readers. The whole trick is in the \b which is a zero width assertion matching the boundary between a word and a non-word character.

      Update: Forget what I said, it won't work on something like "1234red5678". If I change the regex to work on this, I arrive at ikegami's solution.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    Re: RegEx to match at least one non-adjacent term
    by Not_a_Number (Prior) on Dec 07, 2007 at 19:43 UTC

      I'm not sure what you want to achieve here. For the following input:

      12 Fred 34 Freda 56 redecorating 78 reader 90

      what output would you require?

    Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Domain Nodelet?
    Node Status?
    node history
    Node Type: perlquestion [id://655675]
    Approved by ikegami
    help
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this?Last hourOther CB clients
    Other Users?
    Others drinking their drinks and smoking their pipes about the Monastery: (3)
    As of 2024-04-18 04:12 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      No recent polls found