RegEx to match at least one non-adjacent term

Cefu has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I appologize in advance for this being a RegEx rather than Perl specific question.

I'm trying to clean up a list of data entered in a free-text field (who needs validation anyway). For the most part the data consists of one or more numbers (which I want to keep) and sometimes color "words" which might appear before, after or between the number. I want to discard some specific colors but not other colors or other text. I'm trying to craft a regex to match the following and remove it:

The word Red and variations thereof (R, r, rd etc.); multiple occurrences; any case

Optionally surrounded by various spaces and parenthesis; not important if they are matched

So my current regex substitution looks something like:

s/\s*$?\s*re?d?\s*$?\s*//gi

This seemed to be working flawlessly until my spot checks revealed the following humorous example:

12345 Gray 6789 Red => 12345 Gay 6789

To avoid workplace embarrassment I thought it best to make sure that the bit I was removing occurred either just before or just after a number rather than in the middle of other text. So my thought is to modify the regex to somthing like

s/(\d?)\s*$?\s*re?d?\s*$?\s*(\d?)//gi

The problem is that I can't leave both digits optional (as shown) or I'm still in the same boat. I also can't make either one mandatory or I'm dictating a before-number-only or after-number-only match. What I really want is one or the other (or both) but not neither.

As you might guess from the parens around the digits, I also considered checking what matched in the second part and substituting back in the original if I didn't see a digit. However, I ran in to various problems ($1 being undefined, the A?B:C syntax not working inside the regex, etc.)

So, is there some nice way to do this in a single regex? Can I somehow ask that the regex match one or more of two disjointed parts?

Thanks,
Cefu

Update: Found a solution that almost matches my requirements and is actually better for what I needed. I'd downvote myself for not thinking this through first if I could. :)

s/(^|\d)\s*$?\s*re?d?\s*$?\s*($|\d)/$1$2/gi

While messing around with getting a conditional to work on the right side of the substitution I noticed that it happily substituted no characters when $1 or $2 are undefined. It does the same, without complaint, when they match the anchors ^ and $. So, rather than shoot for some weird hybrid of optional and mandatory I decided to make them both mandatory but with palatable alternatives (the anchors).

Where this differes from my requirements is that it will not match a Red-like word if there is a digit on one side and more text (rather than the beginning or end of the string) on the other. So, for example, with my new regex:

(Red) 123 Reddish-Orange 456 Orange 789 => 123 Reddish-Orange 456 Orange 789

Whereas if someone had come up with a way to get the behavior I asked for above it would have done this:

(Red) 123 Reddish-Orange 456 Orange 789 => 123 dish-Orange 456 Orange 789

The results of my number-or-edge-of-string before and after suit me better.

Requirements translate into code or code translates into requirements..... I never can seem to remember how that's supposed to go. :)

Comment on RegEx to match at least one non-adjacent term

Replies are listed 'Best First'.
Re: RegEx to match at least one non-adjacent term by ikegami (Patriarch) on Dec 07, 2007 at 16:12 UTC
Normally, you'd be able to use `\b`. `my $whitespace = qr/[\s()]+/; my $badwords = qr/.../i; my $wordchar = qr/[a-zA-Z]/; s/ $whitespace? \b $badwords \b $whitespace? / /xg;` [download] But since you want to allow `"12345Red6789"`, you'll have to implement your own version of `\b`. `my $whitespace = qr/[\s()]+/; my $badwords = qr/.../i; my $wordchar = qr/[a-zA-Z]/; s/ $whitespace? (?<! $wordchar ) # At start of word. $badwords # Words to erase. (?! $wordchar ) # At end of word. $whitespace? / /xg; # Avoid joining two numbers.` [download] By the way, Regexp::List can build an efficient $badwords. `use Regexp::List qw( ); my @badwords = qw( r rd red ); my $badwords = Regexp::List->new(modifiers=>'i')->list2re(@badwords); # qr/r(?:e?d)?/i` [download] Update: Added Regexp::List bit.	[reply] [d/l] [select]
Re^2: RegEx to match at least one non-adjacent term by Cefu (Beadle) on Dec 07, 2007 at 17:00 UTC
Thanks for the info on Regexp::List. I need to get more familiar with the modules that are out there. I think your initial solution would end up always leaving a space where the word was? That's fine if the word was between two number but not if it was at the beginning or end. I'm happy with the cheap trim I get from removing spaces along with the bad words. If I'm wrong about it always leaving a space, I appologize. I am committing the sin of commenting without executing the example as I don't have access to Perl on my internet connected machine.	[reply]
Re^3: RegEx to match at least one non-adjacent term by ikegami (Patriarch) on Dec 07, 2007 at 17:09 UTC
Yes it does, in order to avoid "1234 Red 5678" becoming "12345678". Feel free to remove extra whitespace afterwards. Doing it in the regex would needlessly complicate it. `s/.../ /xg; s/^\s+//; s/\s+$//;` [download]	[reply] [d/l]
Re: RegEx to match at least one non-adjacent term by tuxz0r (Pilgrim) on Dec 07, 2007 at 16:08 UTC
Not sure what your input looks like, but just testing with your string above this seems to work for me: `use strict; while (<STDIN>) { s/[\s]+//g; # strip optional chars my @matches = m/(\d+)\|([^\d]+)/g; print join ":", grep { defined $_ and $_ !~ m/^(r\|rd\|red)$/i } + @matches; }` [download] Which gives the following on various runs: `$ echo "12345 (Gray) 6789 (Red)" \| ./red.pl 12345:Gray:6789 $ echo "12345 (Gray) 6789 Red" \| ./red.pl 12345:Gray:6789 $ echo "12345Gray6789Red" \| ./red.pl 12345:Gray:6789` [download] Update: Ikegami pointed out that I missed the requirement that spaces/parens were optional, so I changed this around slightly (no longer just a one liner). Now we can just get the numbers and words out seperately and print them in order skipping any words (like variations of red) that you need. --- s;;:<).>\|\;\;_>?\\^0<\|=!]=,\|{\$/.'>\|<?.\|/"&?=#!>%\$\|#/\$%{};;y;,'} -/:-@[-`{-};,'}`-{/" -;;s;;$_;see; Warning: Any code posted by tuxz0r is untested, unless otherwise stated, and is used at your own risk.	[reply] [d/l] [select]
Re^2: RegEx to match at least one non-adjacent term by ikegami (Patriarch) on Dec 07, 2007 at 16:26 UTC
Doesn't work? `$ echo "12345Red6789" \| perl -ape 's/[\s$]+(r\|rd\|red)[\s$]+//gi;' 12345Red6789` [download] The OP said the spaces were optional.	[reply] [d/l]
Re: RegEx to match at least one non-adjacent term by toolic (Bishop) on Dec 07, 2007 at 16:34 UTC
I know you are looking for a regex solution, but just in case you change your mind, maybe something like this would suit your needs... Updated: to handle parentheses. `#!/usr/bin/env perl use warnings; use strict; while (<DATA>) { #my @no_reds = grep { !/^r/i } split; my @no_reds = grep { !/^r/i } split /[\s]/; print "@no_reds\n"; } __DATA__ 12345 Gray 6789 Red rd 555 blue 888 red 777 green 999 foo bar r baz ruby goo(r) ruby doobie 888red999` [download] Prints out: `12345 Gray 6789 555 blue 888 777 green 999 foo bar baz goo doobie 888red999` [download] Unlike ikegami's solution, this requires the input to be whitespace-separated.	[reply] [d/l] [select]
Re^2: RegEx to match at least one non-adjacent term by Cefu (Beadle) on Dec 07, 2007 at 17:12 UTC
Thanks for the idea. It's always good to break out of my assumption that RegEx is the right way and consider something else like Grep. Unfortunately, this was a free-text field filled by users, so I can't assume they'll give me whitespacea, any other convention or even the results of normal human brain function. Also, this method is either a little lacking or it's way TOO smart if it knows that rubies are actually red. :)	[reply]
Re: RegEx to match at least one non-adjacent term by CountZero (Bishop) on Dec 07, 2007 at 21:38 UTC
Actually, your references to numbers has been leading us on a wild goose chase. It seems to me that all you want to do is delete any form of "bad" color words from the user input. The following regex will do that nicely: `s/\b(re?d?)\b/ /gi`. It replaces your r, re, red, rd by a space (to avoid running numbers together). Collapsing multiple spaces is easy in a separate regex and left as an exercise for the readers. The whole trick is in the `\b` which is a zero width assertion matching the boundary between a word and a non-word character. Update: Forget what I said, it won't work on something like "1234red5678". If I change the regex to work on this, I arrive at ikegami's solution. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James	[reply] [d/l] [select]
Re: RegEx to match at least one non-adjacent term by Not_a_Number (Prior) on Dec 07, 2007 at 19:43 UTC
I'm not sure what you want to achieve here. For the following input: `12 Fred 34 Freda 56 redecorating 78 reader 90` what output would you require?	[reply] [d/l]


more useful options
	PerlMonks