PerlMonks  

RegEx filter \s ! between labels, part 2

by gryphon (Abbot)
on Oct 02, 2002 at 17:39 UTC ( [id://202333]=perlquestion )

gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks,

A while back, I sought wisdom regarding a RegEx filter that would filter \s+ from a scalar but skip filtering between two marker labels. I got several really great replies, but I've wandered into a new problem: I need the same filter, but it has to allow for multiple marker labels. Here are the details:

I have a large string scalar and I'd like to filter out multiple white spaces, converting them into just a single space per instance. Here's an example of the scalar:

A Bridge Too Far Hosted by Rod Stuart Friendly Skys 42 STARTPRESERVE1 Life, the universe... and Everything STOPPRESERVE1 Bob HotWheels are cool More movies on Fox File server A Bridge Too Far Hosted by Rod Stuart Friendly Skys 42 STARTPRESERVE2 Life, the universe... and Everything STOPPRESERVE2 Bob HotWheels are cool More movies on Fox File server

Sounds like a job for $text =~ s/\s+/ /g;, except that I want to avoid the filtering between multiple START and STOP markers. Here's the original code that uses a single pair of markers (provided by IO):

$text =~ s/\s+|(STARTPRESERVE.*?STOPPRESERVE)/${[$1,' ']}[!$1]/gs;
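For readers who find the `${[$1,' ']}[!$1]` lookup hard to parse, here is a minimal sketch of the same single-pair idea written with the /e modifier. The sub name `squeeze_outside_markers` and the sample string are illustrative assumptions, not part of the original post.

```perl
use strict;
use warnings;

# Collapse each whitespace run to one space, but copy anything between
# STARTPRESERVE and STOPPRESERVE through unchanged.
# (squeeze_outside_markers is a hypothetical name used for illustration.)
sub squeeze_outside_markers {
    my ($text) = @_;
    # At any position only one alternative can match: either a whole
    # preserved block (captured in $1) or a run of whitespace.
    $text =~ s/(STARTPRESERVE.*?STOPPRESERVE)|\s+/defined($1) ? $1 : ' '/gse;
    return $text;
}

print squeeze_outside_markers("a   b STARTPRESERVE x   y STOPPRESERVE c \n d"), "\n";
# prints: a b STARTPRESERVE x   y STOPPRESERVE c d
```

The array-index trick in the one-liner does the same selection: `!$1` is false (index 0, the block) when $1 is set and 1 (the single space) otherwise.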

This works great except that I now need multiple markers like STARTPRESERVE1 .. 3 and so forth. My first thought was to loop through an array of these markers, running one regex per marker, but of course that won't work. So I think it needs to be a single regex. (Right?) So here's my feeble attempt: (Please try not to laugh.)

$$text_ref =~ s%
    (STARTPRESERVE1.*?STOPPRESERVE1) |
    (STARTPRESERVE2.*?STOPPRESERVE2) |
    (PRESERVESTART.*?PRESERVESTOP) |
    \s+
%
    ((defined $1 and ($1 =~ /^\s+$/)) ? ' ' : $1) .
    ((defined $2 and ($2 =~ /^\s+$/)) ? ' ' : $2) .
    ((defined $3 and ($3 =~ /^\s+$/)) ? ' ' : $3)
%eigsx;

Am I headed in the right direction, or am I missing an easier solution? Thanks all.

gryphon
code('Perl') || die;

Replies are listed 'Best First'.
Re: RegEx filter \s ! between labels, part 2
by fglock (Vicar) on Oct 02, 2002 at 19:55 UTC

    Previous code was eliminating newlines. This one is matching a literal "space" instead of "\s".

    It is better documented, too :)

    $text =~ s/
        [ ]+ |                  # spaces, or:
        (                       # begin $1
          STARTPRESERVE(\d?)    # STARTPRESERVE + $2
          .*?                   # anything
          STOPPRESERVE\2        # STOPPRESERVE + $2
        )                       # end $1
    /${ [$1,' '] }[ ! $1 ]/sgx;

    # [$1,' '] is an unnamed array.
    # ! $1 is:
    #   0 if $1 exists;
    #   1 if spaces were found.

      Er... OK, but this looks and functions like what I had before. How do I incorporate multiple PRESERVE tags? For example, I need to s/\s+/ /g everywhere except between MARKA and MARKB, between STARTPRESERVE and STOPPRESERVE, and between YETANOTHERMARKER and YETANOTHERMARKEREND.

      I can't loop through an array of regexes because they'd just cancel each other out and do the s/\s+/ /g everywhere. I think I need a single regex, or perhaps to pull the string apart into an array split by the various markers and then analyse each element. What do you think is the best approach?

      ## START PSEUDO-CODE
      my $string = $big_string_from_my_original_example;
      my @array_of_stuff = split /\b(?:STARTPRESERVE)|(?:STOPPRESERVE)\b|\b(?:MARKA)|(?:MARKB)\b/, $string;
      foreach (@array_of_stuff) {
          if (&check_for_marker) {
              push @new_array, $_;
              next;
          }
          s/\s+/ /g;
          push @new_array, $_;
      }
      $string = join '', @new_array;
      ## END PSEUDO-CODE

      Oh, and I actually do need to remove line breaks but keep the markers in place. And although it's theoretically possible, I highly doubt any markers will ever be nested. (If they are, it's due to user error.)
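The split idea above can be made to work if the split pattern captures each whole marker-to-marker block, because split returns captured delimiters along with the fields. The following is only a sketch under the stated assumption that markers are never nested; the `@pairs` table and the sub name `squeeze_by_split` are hypothetical, with marker names taken from the post.

```perl
use strict;
use warnings;

# Hypothetical table of start/stop marker pairs (names from the post).
my @pairs = (
    [ 'STARTPRESERVE',    'STOPPRESERVE' ],
    [ 'MARKA',            'MARKB' ],
    [ 'YETANOTHERMARKER', 'YETANOTHERMARKEREND' ],
);

sub squeeze_by_split {
    my ($string) = @_;
    # One alternative per pair: start marker, shortest possible body, stop marker.
    my $block = join '|', map { "\Q$_->[0]\E.*?\Q$_->[1]\E" } @pairs;
    # The capturing group makes split return the matched blocks too, so the
    # list alternates: outside text, preserved block, outside text, ...
    my @pieces = split /($block)/s, $string;
    for my $i (0 .. $#pieces) {
        next if $i % 2;              # odd indexes hold preserved blocks
        $pieces[$i] =~ s/\s+/ /g;    # squeeze whitespace, line breaks included
    }
    return join '', @pieces;
}

print squeeze_by_split("a  \n b MARKA  x \n y MARKB   c"), "\n";
```

Since \s+ also matches newlines, line breaks outside the markers are removed while the markers and everything between them stay in place.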

      gryphon
      code('Perl') || die;

Re: RegEx filter \s ! between labels, part 2
by fglock (Vicar) on Oct 02, 2002 at 18:00 UTC

    This is a bit weird, but it seems to work:

    $text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE$1)/${[$1,' ']}[!$1]/gs;

    update: see merlin below. Now it works with nested PRESERVEs:

    $text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE\2)/${[$1,' ']}[!$1]/gs;
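To see what the \2 backreference buys, here is a small sketch of the numbered-marker version with the /e modifier in place of the array lookup. The sub name `squeeze_numbered` and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# The optional digit after STARTPRESERVE is captured in $2, and \2 requires
# the same digit after STOPPRESERVE, so a block opened by STARTPRESERVE1
# runs through to STOPPRESERVE1 even if STOPPRESERVE2 appears inside it.
# (squeeze_numbered is a hypothetical name used for illustration.)
sub squeeze_numbered {
    my ($text) = @_;
    $text =~ s/(STARTPRESERVE(\d?).*?STOPPRESERVE\2)|\s+/defined($1) ? $1 : ' '/gse;
    return $text;
}

print squeeze_numbered("x  STARTPRESERVE1 a  STOPPRESERVE2  b STOPPRESERVE1  y"), "\n";
# prints: x STARTPRESERVE1 a  STOPPRESERVE2  b STOPPRESERVE1 y
```

One caveat: because the digit is optional, a STARTPRESERVE1 with no matching STOPPRESERVE1 can still pair with the first STOPPRESERVE of any number once the regex backtracks to an empty $2.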
Re: RegEx filter \s ! between labels, part 2
by fglock (Vicar) on Oct 03, 2002 at 18:07 UTC

    Ok, here it goes with these specifications:

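    use re 'eval';   # permits the (??{ }) block alongside the interpolated $MarkOptions
    %Marks = ( STARTPRESERVE => 'STOPPRESERVE', BEGIN => 'END' );
    $MarkOptions = join("|", keys %Marks);
    $text =~ s/
        \s+ |                              # spaces, or:
        (                                  # begin $1
          ($MarkOptions)(\d?)              # $2=START + $3=optional_number
          .*?                              # anything
          (??{ quotemeta $Marks{$2} })\3   # $Marks{START} + optional_number
        )                                  # end $1
    /${ [$1,' '] }[ ! $1 ]/sgx;

    # [$1,' '] is an unnamed array.
    # ! $1 is:
    #   0 if $1 exists;
    #   1 if spaces were found.
    # (??{ ... }) is evaluated during the match, so the stop marker can be
    # looked up from whichever start marker was captured in $2.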

    Note: you may add \b to the markers if you feel that you need it.
    I kept the optional number after the marker, you may remove it if you don't need it.
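An alternative sketch that avoids looking up the hash during the match: build one complete start-to-stop alternative per pair ahead of time, then run a single substitution. This assumes Perl 5.10 or later for the relative backreference \g{-1}; the names `%marks`, `$blocks`, and `squeeze_with_table` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical marker table, mirroring the pairs used in the reply above.
my %marks = ( STARTPRESERVE => 'STOPPRESERVE', BEGIN => 'END' );

# One alternative per pair: START, optional digit, shortest body, STOP, and
# the same digit again. \g{-1} means "the capture group just before this
# point", so each alternative refers to its own (\d?) group.
my $blocks = join '|',
    map { quotemeta($_) . '(\d?).*?' . quotemeta($marks{$_}) . '\g{-1}' }
    sort keys %marks;

sub squeeze_with_table {
    my ($text) = @_;
    # $1 holds a whole preserved block; otherwise a whitespace run matched.
    $text =~ s/($blocks)|\s+/defined($1) ? $1 : ' '/gse;
    return $text;
}

print squeeze_with_table("BEGIN2  a  END1  b  END2   c"), "\n";
```

Adding a new pair is then a one-line change to the table. Short markers such as BEGIN and END can also match inside longer words, which is where the \b suggestion applies.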

Node Type: perlquestion [id://202333]
Approved by charnos