Supress similar chars in the string

Lana has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Supress similar chars in the string by kennethk (Abbot) on Jan 18, 2011 at 17:22 UTC
You can accomplish your desired result using Backreferences. With your sample input, I would use the code: `#!/usr/bin/perl use strict; use warnings; $_ = 'testing 1234567 ........ aaaaaaaasssssss __________ ++++++++++ - +-------- testing testing '; s/((.)\2\2)\2/$1/g; print;` [download] to output `testing 1234567 ... aaasss ___ +++ --- testing testing` YAPE::Regex::Explain describes the regex as: ( group and capture to \1: ---------------------------------------------------------------------- ( group and capture to \2: ---------------------------------------------------------------------- . any character except \n ---------------------------------------------------------------------- ) end of \2 ---------------------------------------------------------------------- \2 what was matched by capture \2 ---------------------------------------------------------------------- \2 what was matched by capture \2 ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- \2 what was matched by capture \2 (0 or more times (matching the most amount possible)) [download] Note I have kept the third 'testing' in the string, which seems to follow spec but not your example. Update: As AnomalousMonk points out below, swapping `\2*` to `\2+` is more rigorously correct and avoids some unnecessary no-ops. However, I didn't think this particular optimization was likely to have sufficient impact to warrant changing the post content. For a cleaner version, you can also use `s/(.)\1{3,}/$1 x 3/eg;` (or `s/(.)\1{$n,}/$1 x $n/eg;`).	[reply] [d/l] [select]
Re^2: Supress similar chars in the string by toolic (Bishop) on Jan 18, 2011 at 18:10 UTC
And if the OP ever had the need to scale it up to more than 3 characters: `use strict; use warnings; my $n = 3; # number of adjacent characters $n--; $_ = 'testing 1234567 ........ aaaaaaaasssssss __________ ++++++++++ - +-------- testing testing '; s/((.)\2{$n})\2*/$1/g; print;` [download]	[reply] [d/l]
Re^3: Supress similar chars in the string by Jim (Curate) on Jan 18, 2011 at 18:50 UTC
Is this a case where adding the `/o` modifier to the `s///` operator might be helpful? It's an efficiency measure intended to ensure (or help ensure, I think) the regex is only compiled once. `s/((.)\2{$n})\2*/$1/go;` [download]	[reply] [d/l] [select]
Re^4: Supress similar chars in the string by Anonymous Monk on Jan 24, 2011 at 02:38 UTC
Re^3: Supress similar chars in the string by AnomalousMonk (Archbishop) on Jan 18, 2011 at 22:26 UTC
`s/((.)\2{$n})\2/$1/g;` [download] Shouldn't the `\2` (zero or more of...) term in the regex above be `\2+` (one or more of...) instead? I.e.: `s/((.)\2{$n})\2+/$1/g;` Won't the `\2` version lead to useless replacement of sub-strings of n identical characters with the same n-character sub-string (Update:* where n is the maximum number of contiguous identical characters originally defined)? Update: This really should have been posted as a reply to Re: Supress similar chars in the string.	[reply] [d/l] [select]
Re^2: Supress similar chars in the string by Anonyrnous Monk (Hermit) on Jan 18, 2011 at 17:44 UTC
Newer Perls (>=5.10) also allow you to say `s/(.)\1\1\K\1*//g;` [download] i.e., through the use of `\K`, you don't need to copy the part of the match you want to keep (`$1` in your regex) into the substitution.	[reply] [d/l] [select]
Re^3: Supress similar chars in the string by AnomalousMonk (Archbishop) on Jan 18, 2011 at 22:36 UTC
`s/(.)\1\1\K\1//g;` [download] Shouldn't `\1` in regex above be `\1+` for reason similar to that discussed in Re^3: Supress similar chars in the string: useless replacement of empty string with empty string? (Or is regex engine smart enough to avoid this null operation?)	[reply] [d/l] [select]
Re^2: Supress similar chars in the string by AnomalousMonk (Archbishop) on Jan 19, 2011 at 18:44 UTC
... more rigorously correct ... optimization ... For me, the issue is not rigor or optimization, but correctness. Insofar as I understand it, the OP requires an overall match on more-than-n contiguous identical characters (although an n-character match may be a component of the overall match). The regexes to which I object match equal-to-or-more-than-n such characters overall, and so are incorrect. The incorrectness of the matches is obscured, but not ameliorated, by the subsequent substitutions, which may replace a string S (which may be empty) with an identical string S. The point can be exemplified by looking at what is matched by the `qr{ (.)\1\1 \K \1* }xms` regex (the most elegant, IMHO, in this thread) in a context that does not involve substitution. (Similar examples could be constructed for the other objectionable regexes.) All the empty strings are, in my view, unneeded (and improper) matches. `>perl -wMstrict -le "my $s = 'aaa...bbbb...ccccc...dddddd.x..yy'; print qq{'$2'} while $s =~ m{ (.)\1\1 \K (\1*) }xmsg; " '' '' 'b' '' 'cc' '' 'ddd'` [download] Anyway, that's my story and I'm sticking to it.	[reply] [d/l] [select]
Re^3: Supress similar chars in the string by kennethk (Abbot) on Jan 19, 2011 at 20:51 UTC
I would politely disagree with your interpretation, as the two approaches are functionally equivalent. There was no mention of side effects/counting in the OP, and so judging the proposed solution on those grounds is inappropriate. It is an inferior algorithm and likely a bad habit, but the code does the job and does it in a straight-forward fashion. If I'd spent more brain power considering the question, I would have substituted `+` for `*`. The modification actually occurred to me after I posted but before I'd read your comments. However, I try to avoid correcting posts for stylistic reasons alone to avoid versioning confusion and race conditions. I agree that the `\K` construct is the most elegant and appropriate solution, but not everyone runs 5.10 - my external server runs 5.8. The alternate choice option I list seems the most transparent for a neophyte to my eye, but I expect to be the the most expensive given the `e` modifier.	[reply] [d/l] [select]
Re: Supress similar chars in the string by suhailck (Friar) on Jan 19, 2011 at 02:37 UTC
`s/(.)\1{3,}/$1 x 3/ge` [download]	[reply] [d/l]
Re: Supress similar chars in the string by llancet (Friar) on Jan 19, 2011 at 01:34 UTC
Without using the regex, you may traverse through each character one by one, remember the previous character pattern and the pattern count. # suppose this is the input string my $str_in; # and this is the output my $str_out; my $char; my $char_count; my $len = length $str_in; for (my $i=0; $i<$len; $i++) { my $curr = substr $str_in,$i,1; if (!defined $char) { # initialize state $char = $curr; $char_count = 1; } else { if ($curr eq $char) { # current char is same with previous ones if ($char_count>=3) { $char_count++; } else { $str_out .= $curr; $char_count++; } } else { # we found a different char $char = $curr; $char_count = 1; $str_out .= $curr; } } } [download]	[reply] [d/l]
Re^2: Supress similar chars in the string by Jim (Curate) on Jan 19, 2011 at 02:13 UTC
Without using the regex, you may traverse through each character one by one, remember the previous character pattern and the pattern count. You may, but why would you write such a custom, special-case state machine in Perl? This is precisely what regular expressions in scripting languages like Perl are for. $ echo '........ aaaaaaaasssssss __________ ++++++++++ ---------' \| > perl -pe 's{(.)\1\1\K\1+}{}g' ... aaasss ___ +++ --- $ [download] That's unbeatably simple and elegant, don't you think?	[reply] [d/l]


Keep It Simple, Stupid
	PerlMonks