http://qs321.pair.com?node_id=191980

moxliukas has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I have been writing a regexp that would transform this:

$s = 'aaaabababbbbaaaccccbbbbbbaadddd';

into

$s = '4ababa4b3a4c6b2a4d';

Basically it is something similar to a mathematical series test (ummm... not sure if this is the correct translation from Lithuanian), where consecutive occurrences of the same character are counted (except that no number is inserted if the character occurs only once).

I have been trying to come up with a regexp that would do this transformation and I got to the point where everything works:

$s = 'aaaabababbbbaaaccccbbbbbbaadddd';
$s =~ s"($_{2,})"length($1).$_"ge for ('a'..'d');
print $s;

However I am not very happy with the for loop. I wonder if the same can be achieved in one regexp, without the need to scan the line for each character. Can character classes be somehow involved in the regexp to avoid looping?

Thanks for any help in advance.

Replies are listed 'Best First'.
Re: Regexp: can I do it in one go?
by jmcnamara (Monsignor) on Aug 22, 2002 at 11:34 UTC

    You can use a backreference to obtain a single regex:
    #!/usr/bin/perl -wl
    use strict;
    my $s = 'aaaabababbbbaaaccccbbbbbbaadddd';
    print $s;
    $s =~ s/((.)\2+)/length($1) . $2/eg;
    print $s;
    __END__
    Prints:
    aaaabababbbbaaaccccbbbbbbaadddd
    4ababa4b3a4c6b2a4d
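A minimal sketch of how the backreference does the work, wrapped in a sub so it can be tried on other strings. The name compress_runs is made up for illustration and is not part of the reply above; the regex itself is unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration only).
sub compress_runs {
    my ($s) = @_;
    # (.)  captures a single character as $2
    # \2+  demands at least one more copy of that character,
    #      so a character standing alone never matches
    # $1   holds the whole run; its length goes in front of the character
    $s =~ s/((.)\2+)/length($1) . $2/eg;
    return $s;
}

print compress_runs('aaaabababbbbaaaccccbbbbbbaadddd'), "\n";  # 4ababa4b3a4c6b2a4d
print compress_runs('abc'), "\n";                              # abc (no runs, unchanged)
```

Because the pattern uses a plain dot, a run is only recognised within a single line; newlines are left alone unless the /s modifier is added.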

    --
    John.

      Thanks a lot. I can't believe that I didn't think about it this way ;)

      Thank you again

Re: Regexp: can I do it in one go?
by Arien (Pilgrim) on Aug 22, 2002 at 11:31 UTC

    What you want to do is globally match a character together with any repetitions of it, and replace what you've found with that character followed by the length of your match:

    $s =~ s/((.)\2*)/$2 . length $1/eg;

    — Arien

    Edit: It seems I misread the output you want. To only have sequences of two or more repeated letters replaced, change the star to a plus sign. (And after some sleep...) Also, you'd want to swap length $1 and $2 to have the length precede the letter.
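
To make the effect of the edit concrete, here is a small sketch comparing the substitution as first posted with the corrected one, on a short sample string chosen for illustration. Note the parentheses in the corrected version: written as length $1 . $2, Perl would parse it as length($1 . $2), since the concatenation binds tighter than a named unary operator.

```perl
use strict;
use warnings;

my $input = 'aaab';   # sample string, chosen for illustration

# As first posted: star quantifier, character followed by count.
# Every character matches, including ones that occur only once.
(my $star = $input) =~ s/((.)\2*)/$2 . length $1/eg;

# After the edit: plus quantifier, count followed by character.
# Only runs of two or more are replaced.
(my $plus = $input) =~ s/((.)\2+)/length($1) . $2/eg;

print "$star\n";   # a3b1
print "$plus\n";   # 3ab
```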