Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Curious Regex

by HamNRye (Monk)
on Feb 11, 2009 at 17:59 UTC ( [id://743119]=perlquestion: print w/replies, xml ) Need Help??

HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

I'm certain this is old hat to many of the regs here, but I've been bustin' a synapse here trying to get this done.

So, here's the idea... Translating info coming out of a legacy system to XHTML. There are control codes "\x90" "\x8F" that tell me useful things elsewhere in the text, but need to be excluded within certain tags. There could be any number of them within the tags.

Here is some sample text:

</Mil,p12.10,3,15,d>But he won<cm EQ>t say whether he would veto bill, which is short of what he wanted

Human readable instead of funky boxes...

</Mil,p12.10,3,15,d>But he won<cm EQ>t say whether \x90 he would veto bill, which \x90 is short of what he wanted \x90 \x9D

My current regex looks like $text =~ s/(<\/Mil[^>]*>.*?)[\x90\x8F](.*?\x9D)/$1$2/ig; And this only removes the very first control character as expected.

How can I create the regex so that between the open tag </Mil.*> and the close char \x9D I can remove any number of \x8F and \x90 characters?

This all takes place in the middle of ~100 different transforms that need to be done to each file, so having this as a one-liner would be nice.

Edit: Current regex updated to use [^>]* per Moritz's suggestion.

Replies are listed 'Best First'.
Re: Curious Regex
by ikegami (Patriarch) on Feb 11, 2009 at 18:15 UTC

    My current regex looks like $text =~ s/(<\/Mil.*>.*?)[\x90\x8F](.*?\x9D)/$1$2/ig; And this only removes the very first control character as expected.

    Not even.

    • It doesn't necessarily remove the first \x90 no \x8F because of the greedy /.*/.
    • It can remove a \x90 or \x8F after the \x9D if there is neither \x90 nor \x8F in one of the blocks. /.*?/ is rarely the right tool.
    • That would presume the newlines weren't there since /./ only matches newlines when the s modifier is in effect.

    Anyway, here's a two-state parser that should do the trick:

    my $out = ''; PROCESS: for ($text) { pos() = 0; for (;;) { # Search for Mil element for (;;) { /\G ( [^<]+ ) /xgc && $out .= $1; /\G ( <\/Mil(?:,[^>]*])?> ) /xgc && do { $out .= $1; last }; /\G ( < ) /xgc && $out .= $1; /\G \z /xgc && last PROCESS; } # Search for end of Mil element, # removing \x8F and \x90 as we go along. for (;;) { /\G ( [^\x8F\x90\x9D]+ ) /xgc && $out .= $1; /\G [\x8F\x90]+ /xgc; /\G \x9D /xgc && last; /\G \z /xgc && last PROCESS; } } }
Re: Curious Regex
by moritz (Cardinal) on Feb 11, 2009 at 18:12 UTC
    The first part of your regex currently matches </Mil,p12.10,3,15,d>But he won<cm EQ>, not only </Mil,p12.10,3,15,d>.

    If that's not what you want, you should change it to (</Mil[^>]*>).

    Then you can go on with

    m{ (</Mil[^>]*>[^\x90\x8F]*) [\x90\x8F] # your target characters... [^\x9D]* # anything but the terminator, # including \x90 and \x8F \x9D # the terminator }xs

    (I hope I understood your question correctly).

      Moritz, thanks for the reply.

      The question is how do I substitute out any number of the control characters?

      Reading your match as written, It would match up to the first control character and then match everything after it. Including more control characters.

Re: Curious Regex
by almut (Canon) on Feb 11, 2009 at 18:35 UTC

    You could use a two-stage approach: in the first stage you extract the entire part from <\/Mil to \x9D. In the second stage, you remove any control characters from the extracted substring:

    sub remove_ctrls { my $s = shift; $s =~ tr/\x90\x8F//d; return $s; } $text =~ s/(<\/Mil.*?>.*?\x9D)/remove_ctrls($1)/esg;

    The /e option makes the substitution part be treated as Perl code, i.e. it calls remove_ctrls() with the extracted substring.

    Personally, I find this easier to read than overcomplex regexes which would do it all in one go... YMMV, of course.

      Thanks... That helps with readability and is easy enough to understand. I don't use the "subroutines in regex" very often and just didn't think of it.

      I've got the code in place and it's working like a champ. the </Mil> tags are actually font declarations (Miller) so this will provide me some reusability of the subroutine if files start popping up with other fonts used.

      Thanks for the help Monks! It is very much appreciated.

Re: Curious Regex
by ELISHEVA (Prior) on Feb 11, 2009 at 18:34 UTC
    Another option, easier to read but slower than ikegami's solution:
    sub stripCtrls { my $text = shift; #changed capture of $1 as per moritz $text =~ /(<\/Mil[^>]*>)([^\x9D]*)\x9D/; my $sStartTag = $1; my $sBetweenTags = $2; $sBetweenTags =~ s/[\x90\x8F]//g; return "$sStartTag$sBetweenTags\x9D"; }

    which you can call like this stripCtrls($sTaggedText).

    And if you don't need to preserve \x90 and \x8F in the start tag, maybe you might try the even simpler $text =~ s/[\x90\x8F]//g?

    Best, beth

    Update: put in sub so can be used as one liner

      Thanks... Yeah, unfortunately the "simpler" method doesn't work because I need to preserve the control chars outside of the tags.

      Here's what I have as a "two stage" deal... I was just hoping someone could help me figure out how to make this work as a one liner.

      if ($text =~ /<\/Mil[^>]*>(.*)\x9D/i) { my $string = $1; $string =~ s/[\x90\x8F]//g; $text =~ /(<\/Mil[^>]*>).*(\x9D)/$1$string$2/i; }
        Well, basically what you have done is the same idea as what I did except that you have three regex evaluations when only two are needed - you can extract the start tag at the same time as the middle stuff and then put it back together using simple interpolation - which is what I did. Its a bit faster that way, but you should stick with whatever version you will understand three weeks from now.

        As for the one liner - just put it all in a sub - either your version or mine or ikegami's which is even faster - and presto - a one liner. That is what subs are for.

        Best, beth

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://743119]
Approved by moritz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2024-04-19 14:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found