RegEx filter \s ! between labels, part 2

gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks,

A while back, I sought wisdom regarding a RegEx filter that would filter \s+ from a scalar but skip filtering between two marker labels. I got several really great replies, but I've wandered into a new problem: I need the same filter, but it has to allow for multiple marker labels. Here are the details:

I have a large string scalar and I'd like to filter out multiple white spaces, converting them into just a single space per instance. Here's an example of the scalar:

A Bridge Too Far     Hosted by Rod Stuart     Friendly Skys
42    STARTPRESERVE1 Life, the universe...   and Everything STOPPRESERVE1   Bob
HotWheels are cool     More movies on Fox           File server
A Bridge Too Far     Hosted by Rod Stuart     Friendly Skys
42    STARTPRESERVE2 Life, the universe...   and Everything STOPPRESERVE2   Bob
HotWheels are cool     More movies on Fox           File server

Sounds like a job for $text =~ s/\s+/ /g;, except that I want to avoid the filtering between multiple START and STOP markers. Here's the original code that uses a single pair of markers (provided by IO):

$text =~ s/\s+|(STARTPRESERVE.*?STOPPRESERVE)/${[$1,' ']}[!$1]/gs;

This works great except that I now need multiple markers like STARTPRESERVE1 .. 3 and so forth. My first thought was a loop through an array of these markers doing a regex, but of course that won't work. So I think it needs to be a single regex. (Right?) So here's my feeble attempt: (Please try not to laugh.)

$$text_ref =~ s%
  (STARTPRESERVE1.*?STOPPRESERVE1)
|
  (STARTPRESERVE2.*?STOPPRESERVE2)
|
  (PRESERVESTART.*?PRESERVESTOP)
|
  \s+
%
  ((defined $1 and ($1 =~ /^\s+$/)) ? ' ' : $1) .
  ((defined $2 and ($2 =~ /^\s+$/)) ? ' ' : $2) .
  ((defined $3 and ($3 =~ /^\s+$/)) ? ' ' : $3)
%eigsx;

Am I headed in the right direction, or am I missing an easier solution? Thanks all.

gryphon
code('Perl') || die;


Replies are listed 'Best First'.
Re: RegEx filter \s ! between labels, part 2 by fglock (Vicar) on Oct 02, 2002 at 19:55 UTC
Previous code was eliminating newlines. This one is matching a literal "space" instead of "\s". It is better documented, too :)

$text =~ s/
    [ ]+                    # spaces, or:
  |
    (                       # begin $1
      STARTPRESERVE(\d?)    # STARTPRESERVE + $2
      .*?                   # anything
      STOPPRESERVE\2        # STOPPRESERVE + $2
    )                       # end $1
/${ [$1,' '] }[ ! $1 ]/sgx;
# [$1,' '] is an unnamed array.
# ! $1 is:
#   0 if $1 exists;
#   1 if spaces were found.
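The `${ [$1,' '] }[ ! $1 ]` replacement can be hard to read at first. As a minimal sketch, the same choice can be written as an `/e` expression; `collapse_spaces` is a hypothetical name, and `defined $1` is assumed to be equivalent here because a preserved block is never empty or "0".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): the same substitution, with the
# array-index trick spelled out as an /e expression.
sub collapse_spaces {
    my ($text) = @_;
    $text =~ s{
        [ ]+                        # a run of literal spaces, or:
      |
        (                           # begin $1
          STARTPRESERVE(\d?)        # start marker + optional digit
          .*?                       # anything, as little as possible
          STOPPRESERVE\2            # stop marker + the same digit
        )                           # end $1
    }{ defined $1 ? $1 : ' ' }sgex;
    return $text;
}

print collapse_spaces("a   b STARTPRESERVE1 x   y STOPPRESERVE1   c"), "\n";
```

When the preserved block matched, `$1` is defined and is put back unchanged; when only spaces matched, `$1` is undefined and a single space is substituted.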
Re: Re: RegEx filter \s ! between labels, part 2 by gryphon (Abbot) on Oct 03, 2002 at 16:36 UTC
Er... OK, but this looks and functions like what I had before. How do I incorporate multiple PRESERVE tags? For example, I need to s/\s+/ /g everywhere except between MARKA and MARKB, between STARTPRESERVE and STOPPRESERVE, and between YETANOTHERMARKER and YETANOTHERMARKEREND. I can't loop through an array of regexes because they'd just cancel each other out and do the s/\s+/ /g everywhere. I think I need a single regex, or perhaps a pulling apart of the string into an array split by the various markers, with each element then analysed. What do you think is the best approach?

## START PSEUDO-CODE
my $string = $big_string_from_my_original_example;
my @array_of_stuff = split
  /\b(?:STARTPRESERVE)|(?:STOPPRESERVE)\b|\b(?:MARKA)|(?:MARKB)\b/,
  $string;
foreach (@array_of_stuff) {
  if (&check_for_marker) {
    push @new_array, $_;
    next;
  }
  s/\s+/ /g;
  push @new_array, $_;
}
$string = join '', @new_array;
## END PSEUDO-CODE

Oh, and I actually do need to remove line breaks but keep the markers in place. And although it's theoretically possible, I highly doubt any markers will ever be nested. (If they are, it's due to user error.)

gryphon
code('Perl') || die;
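One way the split idea above might be made runnable is sketched here. It assumes markers are never nested, and the names `%stop_for` and `collapse_outside_markers` are hypothetical. It relies on the documented behaviour that `split` also returns the separators when the pattern has capturing parentheses, so a whole start-to-stop block can serve as the "separator".

```perl
use strict;
use warnings;

# Hypothetical marker table (start => stop), using the pairs named in the post.
my %stop_for = (
    STARTPRESERVE    => 'STOPPRESERVE',
    MARKA            => 'MARKB',
    YETANOTHERMARKER => 'YETANOTHERMARKEREND',
);

sub collapse_outside_markers {
    my ($string) = @_;

    # One alternative per pair: start marker, anything (lazily), stop marker.
    my $blocks = join '|',
        map { quotemeta($_) . '.*?' . quotemeta($stop_for{$_}) }
        keys %stop_for;

    # The capturing parentheses make split return the preserved blocks too,
    # so the list alternates: outside, block, outside, block, ...
    my @pieces = split /($blocks)/s, $string;

    my $i = 0;
    for my $piece (@pieces) {
        # Even indexes are text outside any block; squeeze whitespace there.
        $piece =~ s/\s+/ /g if $i++ % 2 == 0;
    }
    return join '', @pieces;
}

print collapse_outside_markers("a   b MARKA x   y MARKB   c"), "\n";
```

Because `\s+` also matches newlines, line breaks outside the blocks are removed, while everything from a start marker through its stop marker is passed through untouched.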
Re: RegEx filter \s ! between labels, part 2 by fglock (Vicar) on Oct 02, 2002 at 18:00 UTC
This is a bit weird, but it seems to work: ~~`$text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE$1)/${[$1,' ']}[!$1]/gs;`~~ update: see merlyn below. Now it works with nested PRESERVEs: `$text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE\2)/${[$1,' ']}[!$1]/gs;`
•Re: Re: RegEx filter \s ! between labels, part 2 by merlyn (Sage) on Oct 02, 2002 at 18:02 UTC
`$text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE$1)/${[$1,' ']}[!$1]/gs;`

No... that first $1 is freezing at the wrong time. Perhaps you mean `\2`, not $1.

-- Randal L. Schwartz, Perl hacker
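The difference can be seen in a small side-by-side sketch. Both subs are hypothetical demos, and the first one deliberately sets `$1` to the empty string beforehand so that the result is predictable: a `$1` inside the pattern is interpolated when the pattern is built, whereas `\2` is resolved while the match is in progress.

```perl
use strict;
use warnings;

# Hypothetical demo subs, not from the thread.

# $1 inside the pattern holds whatever an EARLIER match left behind,
# not the digit captured by this match.
sub squeeze_with_dollar1 {
    my ($text) = @_;
    "" =~ /()/;   # assumption for the demo: leave $1 as the empty string
    $text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE$1)/defined $1 ? $1 : ' '/gse;
    return $text;
}

# \2 is a backreference: it must match what (\d?) captured in this match.
sub squeeze_with_backref {
    my ($text) = @_;
    $text =~ s/\s+|(STARTPRESERVE(\d?).*?STOPPRESERVE\2)/defined $1 ? $1 : ' '/gse;
    return $text;
}

my $sample = "x   STARTPRESERVE1 a   STARTPRESERVE2 b   STOPPRESERVE2 c   STOPPRESERVE1   y";
print squeeze_with_dollar1($sample), "\n";
print squeeze_with_backref($sample), "\n";
```

In the first sub the block ends at the first `STOPPRESERVE` it meets, so the spaces after `c` are squeezed; in the second the block runs through to `STOPPRESERVE1` and those spaces survive.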
Re: RegEx filter \s ! between labels, part 2 by fglock (Vicar) on Oct 03, 2002 at 18:07 UTC
Ok, here it goes with these specifications:

%Marks = ( STARTPRESERVE => 'STOPPRESERVE', BEGIN => 'END' );
$MarkOptions = join("|", keys %Marks);
$text =~ s/
    \s+                               # spaces, or:
  |
    (                                 # begin $1
      ($MarkOptions)(\d?)             # $2=START + $3=optional_number
      .*?                             # anything
      (??{ quotemeta $Marks{$2} })\3  # $Marks{START} + optional_number
    )                                 # end $1
/${ [$1,' '] }[ ! $1 ]/sgx;
# [$1,' '] is an unnamed array.
# ! $1 is:
#   0 if $1 exists;
#   1 if spaces were found.

Note: you may add \b to the markers if you feel that you need it. I kept the optional number after the marker; you may remove it if you don't need it.
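The same table-driven idea can also be written with each stop marker spliced into the pattern when it is built, so nothing has to be looked up while matching. This is only a sketch under that assumption; `squeeze_marked`, `@alternatives`, and `$block_re` are hypothetical names.

```perl
use strict;
use warnings;

# Marker table from the reply above.
my %Marks = ( STARTPRESERVE => 'STOPPRESERVE', BEGIN => 'END' );

# Group 1 will wrap the whole block, so each pair's optional digit lands in
# group 2, 3, ... and each stop marker refers back to its own group.
my @alternatives;
my $group = 1;
for my $start (sort keys %Marks) {
    $group++;
    push @alternatives,
        quotemeta($start) . '(\d?).*?' . quotemeta($Marks{$start}) . "\\$group";
}
my $block_re = join '|', @alternatives;

sub squeeze_marked {
    my ($text) = @_;
    # Keep a matched block as it is; turn any other whitespace run into a space.
    $text =~ s/\s+|($block_re)/defined $1 ? $1 : ' '/gse;
    return $text;
}

print squeeze_marked("A   BEGIN1 keep   this END1   B"), "\n";
```

Short markers such as BEGIN and END can also occur inside ordinary words, which is where the `\b` suggested above would help.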