PerlMonks  

Regex Question

by shemp (Deacon)
on Nov 15, 2005 at 19:31 UTC

shemp has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I've been trying to construct a regex to help clean up names that can arrive in some pretty nasty formats; it's raw data from government agencies. One thing I do is look for the various 'care of' variants and standardize them to the % symbol. That is done with this regex:
$name =~ s/(^|\s) C (\/|\\|%) O (\s|$) / % /xgi;
(Yes, I am also looking for 'C%O', which I see sometimes.)
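
To see what the care-of substitution does to a few inputs, here is a minimal sketch that wraps it in a sub. The sub name `normalize_care_of` and the sample names are made up for illustration; only the regex itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the care-of regex from the post.
sub normalize_care_of {
    my ($name) = @_;
    $name =~ s/(^|\s) C (\/|\\|%) O (\s|$) / % /xgi;
    return $name;
}

print normalize_care_of('JOHN SMITH C/O JANE DOE'), "\n";   # JOHN SMITH % JANE DOE
print normalize_care_of('JOHN SMITH c\o JANE DOE'), "\n";   # JOHN SMITH % JANE DOE
print normalize_care_of('JOHN SMITH C%O JANE DOE'), "\n";   # JOHN SMITH % JANE DOE
print normalize_care_of('INC/ORPORATED'), "\n";             # unchanged: not a standalone C/O
```

Because the pattern consumes the surrounding whitespace (or the string boundary) and always writes " % ", a care-of at the very start or end of the name leaves a leading or trailing space behind.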

Then I want to replace forward slashes or backslashes with &, as long as they do not have digits on both sides (that would be a fraction, which does sometimes appear in the names I process). That is done with this regex:

$field =~ s/(?<!\d)     # not preceded by digits
            (?:\\|\/)   # back or forward slash
            (?!\d)      # not followed by digits
           / & /xg;     # replace with '&' (globally)
The problem is that sometimes I want to perform only the second transformation without having done the first one, and I could not come up with a decent way to do the second part while leaving alone the slashes that belong to a care-of, as defined by the first part.
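
A small sketch of the problem described above: run on its own, the slash regex also rewrites an untouched care-of. The sub name `slashes_to_ampersand` and the inputs are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the slash regex from the post.
sub slashes_to_ampersand {
    my ($field) = @_;
    $field =~ s/
        (?<!\d)      # not preceded by a digit
        (?:\\|\/)    # back or forward slash
        (?!\d)       # not followed by a digit
    / & /xg;
    return $field;
}

print slashes_to_ampersand('SMITH/JONES'), "\n";     # SMITH & JONES
print slashes_to_ampersand('LOT 1/2 ACRE'), "\n";    # unchanged: looks like a fraction
print slashes_to_ampersand('SMITH C/O DOE'), "\n";   # SMITH C & O DOE (the unwanted case)
```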

Any thoughts?


Update: I got this working, but then realized that my spec gets even worse: if I see something like "D/B/A", I want to change it to "DBA" (Doing Business As), which adds a completely new twist on when and when not to replace the slashes. So I added this regex after the care-of regex:
$field =~ s/(?:^|[^A-Z]) ([A-Z])\/ ([A-Z])\/ ([A-Z]) (?:[^A-Z]|$) / $1$2$3 /xig;
Also, I think using the separate regexes will work fine; I have worked around the problem of wanting to perform only the final stage without disturbing the earlier stages. Thanks for all the suggestions!
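
For reference, a sketch of the three separate substitutions chained in the order the update describes (care-of, then D/B/A, then the remaining slashes). The sub name `clean_name` and the sample input are hypothetical; the regexes are the ones quoted in the post.

```perl
use strict;
use warnings;

# Hypothetical helper applying the three stages in sequence.
sub clean_name {
    my ($field) = @_;

    # Stage 1: care-of variants become %
    $field =~ s/(^|\s) C (\/|\\|%) O (\s|$) / % /xgi;

    # Stage 2: D/B/A style abbreviations lose their slashes
    $field =~ s/(?:^|[^A-Z]) ([A-Z])\/ ([A-Z])\/ ([A-Z]) (?:[^A-Z]|$) / $1$2$3 /xig;

    # Stage 3: remaining slashes become &, unless next to a digit
    $field =~ s/(?<!\d) (?:\\|\/) (?!\d) / & /xg;

    return $field;
}

print clean_name('SMITH/JONES C/O ACME D/B/A WIDGETS 1/2'), "\n";
# SMITH & JONES % ACME DBA WIDGETS 1/2
```

The order matters here: stages 1 and 2 remove the slashes that stage 3 would otherwise turn into ampersands.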

I use the most powerful debugger available: print!

Replies are listed 'Best First'.
Re: Regex Question
by Aristotle (Chancellor) on Nov 15, 2005 at 20:08 UTC

    Roy Johnson’s technique, used only for the skipping part, produces the desired result:

    $name =~ s{
        (?<!C(?=.O))   # doesn't match C-something-O starting at previous character
        (?<!\d)        # not preceded by digits
        (?:\\|\/)      # back or forward slash
        (?!\d)         # not followed by digits
    }{ & }xg;          # replace with '&' (globally)

    Makeshifts last the longest.
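
A quick sketch of how the combined pattern above behaves on a few inputs. The sub name and the sample strings are assumptions for illustration. Note that the lookbehind is not anchored on whitespace, so any C-slash-O sequence is skipped, even inside a longer word.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the combined pattern from this reply.
sub slashes_skipping_care_of {
    my ($name) = @_;
    $name =~ s{
        (?<!C(?=.O))   # previous character is not a C that begins C-something-O
        (?<!\d)        # not preceded by a digit
        (?:\\|\/)      # back or forward slash
        (?!\d)         # not followed by a digit
    }{ & }xg;
    return $name;
}

print slashes_skipping_care_of('SMITH/JONES C/O ACME'), "\n";   # SMITH & JONES C/O ACME
print slashes_skipping_care_of('LOT 1/2'), "\n";                # unchanged
print slashes_skipping_care_of('MAC/TOSH'), "\n";               # MAC & TOSH
print slashes_skipping_care_of('MAC/OS'), "\n";                 # unchanged: C/O inside a word
```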

Re: Regex Question
by Aristotle (Chancellor) on Nov 15, 2005 at 19:39 UTC

    This should work:

    $field =~ s/
        (?<!C)       # not preceded by "C"
        (?:\\|\/)    # back or forward slash
        (?!O)        # not followed by "O"
      |
        (?<!\d)      # not preceded by digits
        (?:\\|\/)    # back or forward slash
        (?!\d)       # not followed by digits
    / & /xg;         # replace with '&' (globally)

    Nope; wrong. I realised my blunder right after posting.

    Makeshifts last the longest.

      You were bitten by negative-over-or-logic. The left side of the | will match what the right side is intended to skip. It has to be something like
      $field =~ s/
          (?<!C(?=.O))     # Look back a character and make sure it's not C-something-O from there
          (?<!\d(?=.\d))   # Make sure it's also not a digit-something-digit
          (?:\\|\/)        # The something is a slash or backslash
      / & /xgi;

      Caution: Contents may have been coded under pressure.

        That won’t match the exact same things. Your regex will replace the slash in "/2" or in "1/"; the OP’s won’t. Not sure if this is part of the spec, but it’s something to be aware of.

        Meh. That’s it, I’m going back to bed.

        Makeshifts last the longest.
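
Edge cases like these are easy to get backwards, so here is a minimal sketch that runs both patterns over the same inputs. The sub names `op_version` and `roy_version` are made up for illustration; the patterns are the OP's slash regex and the lookbehind version from the reply above.

```perl
use strict;
use warnings;

# The OP's pattern: leave a slash alone if a digit sits on either side.
sub op_version {
    my ($field) = @_;
    $field =~ s/(?<!\d) (?:\\|\/) (?!\d) / & /xg;
    return $field;
}

# The lookbehind pattern: leave alone only C-slash-O and digit-slash-digit.
sub roy_version {
    my ($field) = @_;
    $field =~ s/(?<!C(?=.O)) (?<!\d(?=.\d)) (?:\\|\/) / & /xgi;
    return $field;
}

for my $input ('1/2', '/2', '1/', 'A/B', 'C/O') {
    printf "%-4s op=[%s] roy=[%s]\n",
        $input, op_version($input), roy_version($input);
}
```

The two agree on "1/2" and "A/B". They differ on a slash with a digit on only one side, which the lookbehind version rewrites and the OP's pattern leaves alone, and on "C/O", which only the lookbehind version protects.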

      Nope. It replaces the slash in "9\\9" even though it shouldn't.
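
The failure discussed in this sub-thread can be reproduced with a short sketch. A slash survives the alternation only if both branches reject it, so a fraction (rejected by the digit branch only) and a care-of (rejected by the C/O branch only) are both rewritten. The sub name `alternation_version` is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two-branch pattern from this sub-thread.
sub alternation_version {
    my ($field) = @_;
    $field =~ s/
        (?<!C) (?:\\|\/) (?!O)      # branch 1: not between "C" and "O"
      |
        (?<!\d) (?:\\|\/) (?!\d)    # branch 2: not next to a digit
    / & /xg;
    return $field;
}

print alternation_version('9\9'), "\n";   # 9 & 9   (branch 1 matches)
print alternation_version('1/2'), "\n";   # 1 & 2   (branch 1 matches)
print alternation_version('C/O'), "\n";   # C & O   (branch 2 matches)
```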

Re: Regex Question
by ikegami (Patriarch) on Nov 15, 2005 at 19:52 UTC

    This might be easiest:

    $field =~ s/((.)(?:\\|\/)(.))/
        if ($2 eq 'C' && $3 eq 'O') {
            $1
        } elsif (ord($2) >= ord('0') && ord($2) <= ord('9')) {
            $1
        } elsif (ord($3) >= ord('0') && ord($3) <= ord('9')) {
            $1
        } else {
            "$2 & $3"
        }
    /xeg;

    Update: I had a "==" where I needed a "eq". Also fixed the problem Roy found.

      That will only match slashes surrounded by a character on both sides. (?!\d) is not the same as \D.

      Makeshifts last the longest.

        And it won't work when there are multiple slashes in a row. Only some of them will get converted. Sorry, I meant to mention the caveats. I knew about them, but I thought they might be acceptable to the OP.
      Update: if ($1 eq 'C' && $2 eq 'O') should be if ($2 eq 'C' && $3 eq 'O'), because $1 is the entire matched expression.

      Caution: Contents may have been coded under pressure.
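
The caveats raised in this sub-thread can be seen in a short sketch. It is a paraphrase rather than the exact code above: the decision logic is moved into a helper sub (`rewrite_slash`, a made-up name) and the digit test uses a character class, but the pattern still consumes one character on each side of the slash.

```perl
use strict;
use warnings;

# Hypothetical helper holding the decision logic of the /e replacement.
sub rewrite_slash {
    my ($whole, $before, $after) = @_;
    return $whole if $before eq 'C' && $after eq 'O';             # care-of
    return $whole if $before =~ /[0-9]/ || $after =~ /[0-9]/;     # next to a digit
    return "$before & $after";
}

sub consuming_version {
    my ($field) = @_;
    $field =~ s/((.)(?:\\|\/)(.))/rewrite_slash($1, $2, $3)/eg;
    return $field;
}

print consuming_version('A/B'), "\n";     # A & B
print consuming_version('/B'), "\n";      # unchanged: no character before the slash
print consuming_version('A//B'), "\n";    # A & /B   (second slash was consumed as $3)
print consuming_version('A/B/C'), "\n";   # A & B/C  (the B was consumed by the first match)
```

Lookaround assertions avoid both problems because they test the neighbouring characters without consuming them.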

Node Type: perlquestion [id://508753]
Approved by rozallin