Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Supress similar chars in the string

by Lana (Beadle)
on Jan 18, 2011 at 17:13 UTC ( [id://882930]=perlquestion: print w/replies, xml ) Need Help??

Lana has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks! I am stuck with this, as I thought, simple task. Let's say I have some input string like
testing 1234567 ........ aaaaaaaasssssss __________ ++++++++++ ------- +-- testing testing
but I don't want to allow more than three similar chars one by one and I need the string to be modified to:
testing 1234567 ... aaasss ___ +++ --- testing
I played with regex but have no idea how to identify similar chars. What is the best way to make this filter? It will be used on big strings, say a few megabytes.

Replies are listed 'Best First'.
Re: Supress similar chars in the string
by kennethk (Abbot) on Jan 18, 2011 at 17:22 UTC
    You can accomplish your desired result using Backreferences. With your sample input, I would use the code:

    #!/usr/bin/perl use strict; use warnings; $_ = 'testing 1234567 ........ aaaaaaaasssssss __________ ++++++++++ - +-------- testing testing '; s/((.)\2\2)\2*/$1/g; print;

    to output

    testing 1234567 ... aaasss ___ +++ --- testing testing

    YAPE::Regex::Explain describes the regex as:

    ( group and capture to \1: ---------------------------------------------------------------------- ( group and capture to \2: ---------------------------------------------------------------------- . any character except \n ---------------------------------------------------------------------- ) end of \2 ---------------------------------------------------------------------- \2 what was matched by capture \2 ---------------------------------------------------------------------- \2 what was matched by capture \2 ---------------------------------------------------------------------- ) end of \1 ---------------------------------------------------------------------- \2* what was matched by capture \2 (0 or more times (matching the most amount possible))

    Note I have kept the third 'testing' in the string, which seems to follow spec but not your example.

    Update: As AnomalousMonk points out below, swapping \2* to \2+ is more rigorously correct and avoids some unnecessary no-ops. However, I didn't think this particular optimization was likely to have sufficient impact to warrant changing the post content. For a cleaner version, you can also use s/(.)\1{3,}/$1 x 3/eg; (or s/(.)\1{$n,}/$1 x $n/eg;).

      And if the OP ever had the need to scale it up to more than 3 characters:
      use strict; use warnings; my $n = 3; # number of adjacent characters $n--; $_ = 'testing 1234567 ........ aaaaaaaasssssss __________ ++++++++++ - +-------- testing testing '; s/((.)\2{$n})\2*/$1/g; print;

        Is this a case where adding the /o modifier to the s/// operator might be helpful? It's an efficiency measure intended to ensure (or help ensure, I think) the regex is only compiled once.

        s/((.)\2{$n})\2*/$1/go;
        s/((.)\2{$n})\2*/$1/g;

        Shouldn't the  \2* (zero or more of...) term in the regex above be  \2+ (one or more of...) instead? I.e.:
            s/((.)\2{$n})\2+/$1/g;
        Won't the  \2* version lead to useless replacement of sub-strings of n identical characters with the same n-character sub-string (Update: where n is the maximum number of contiguous identical characters originally defined)?

        Update: This really should have been posted as a reply to Re: Supress similar chars in the string.

      Newer Perls (>=5.10) also allow you to say

      s/(.)\1\1\K\1*//g;

      i.e., through the use of \K, you don't need to copy the part of the match you want to keep ($1 in your regex) into the substitution.

        s/(.)\1\1\K\1*//g;

        Shouldn't  \1* in regex above be  \1+ for reason similar to that discussed in Re^3: Supress similar chars in the string: useless replacement of empty string with empty string? (Or is regex engine smart enough to avoid this null operation?)

      ... more rigorously correct ... optimization ...

      For me, the issue is not rigor or optimization, but correctness. Insofar as I understand it, the OP requires an overall match on more-than-n contiguous identical characters (although an n-character match may be a component of the overall match). The regexes to which I object match equal-to-or-more-than-n such characters overall, and so are incorrect. The incorrectness of the matches is obscured, but not ameliorated, by the subsequent substitutions, which may replace a string S (which may be empty) with an identical string S.

      The point can be exemplified by looking at what is matched by the  qr{ (.)\1\1 \K \1* }xms regex (the most elegant, IMHO, in this thread) in a context that does not involve substitution. (Similar examples could be constructed for the other objectionable regexes.) All the empty strings are, in my view, unneeded (and improper) matches.

      >perl -wMstrict -le "my $s = 'aaa...bbbb...ccccc...dddddd.x..yy'; print qq{'$2'} while $s =~ m{ (.)\1\1 \K (\1*) }xmsg; " '' '' 'b' '' 'cc' '' 'ddd'

      Anyway, that's my story and I'm sticking to it.

        I would politely disagree with your interpretation, as the two approaches are functionally equivalent. There was no mention of side effects/counting in the OP, and so judging the proposed solution on those grounds is inappropriate. It is an inferior algorithm and likely a bad habit, but the code does the job and does it in a straight-forward fashion. If I'd spent more brain power considering the question, I would have substituted + for *. The modification actually occurred to me after I posted but before I'd read your comments. However, I try to avoid correcting posts for stylistic reasons alone to avoid versioning confusion and race conditions.

        I agree that the \K construct is the most elegant and appropriate solution, but not everyone runs 5.10 - my external server runs 5.8. The alternate choice option I list seems the most transparent for a neophyte to my eye, but I expect to be the the most expensive given the e modifier.

Re: Supress similar chars in the string
by suhailck (Friar) on Jan 19, 2011 at 02:37 UTC
Re: Supress similar chars in the string
by llancet (Friar) on Jan 19, 2011 at 01:34 UTC
    Without using the regex, you may traverse through each character one by one, remember the previous character pattern and the pattern count.
    # suppose this is the input string my $str_in; # and this is the output my $str_out; my $char; my $char_count; my $len = length $str_in; for (my $i=0; $i<$len; $i++) { my $curr = substr $str_in,$i,1; if (!defined $char) { # initialize state $char = $curr; $char_count = 1; } else { if ($curr eq $char) { # current char is same with previous ones if ($char_count>=3) { $char_count++; } else { $str_out .= $curr; $char_count++; } } else { # we found a different char $char = $curr; $char_count = 1; $str_out .= $curr; } } }
      Without using the regex, you may traverse through each character one by one, remember the previous character pattern and the pattern count.

      You may, but why would you write such a custom, special-case state machine in Perl? This is precisely what regular expressions in scripting languages like Perl are for.

      $ echo '........ aaaaaaaasssssss __________ ++++++++++ ---------' | > perl -pe 's{(.)\1\1\K\1+}{}g' ... aaasss ___ +++ --- $

      That's unbeatably simple and elegant, don't you think?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://882930]
Approved by kennethk
Front-paged by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (3)
As of 2024-04-20 02:52 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found