Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Puzzlement in Splitsville

by tlm (Prior)
on Jun 19, 2005 at 12:43 UTC ( [id://468106]=perlmeditation: print w/replies, xml ) Need Help??

This is a question disguised as a meditation disguised as a puzzle. (I think Meditations is the best section for it, but, most esteemed janitors, please feel free to move it to SoPW, or wherever else is more appropriate.)

NB: If you are a split-meister, you may want to cut to the chase, and go straight to The Question below.

The Puzzle

OK, as Perl puzzles go this is not a very hard one, but I think it will still be interesting to those who haven't already seen something like it. Find a simple way to split a string into substrings of length 3, say (the last chunk may be shorter, if the length of the string is not divisible by 3). (For simplicity, assume the string contains no newlines...or trailing 0s [thanks, Smylers].) For example, if the input string is

atgactaatagcagtgg
the output should be the list
0 'atg' 1 'act' 2 'aat' 3 'agc' 4 'agt' 5 'gg'

What trips one in such a puzzle (or at least tripped me) is the word "split" in the posing of it, which leads one immediately to think of Perl's split builtin function. It is possible to use split for this, but the only solution I know of requires a filtering through grep:

@codons = grep $_, split /(.{3})/, 'atgactaatagcagtgg'; print "@codons\n"; __END__ atg act aat agc agt gg
Note that the parens are required in the regex. (If it's not clear why, see split, in particular the role of capture in the regex argument.)

But a simpler solution requires only m//g, without any filtering:

@codons = 'atgactaatagcagtgg' =~ /.{1,3}/g; print "@codons\n"; __END__ atg act aat agc agt gg
Note that parens are not needed in this case, but it is necessary to use the range quantifier {1,3} instead of the "exact" quantifier {3}.

The Question

OK, that was all preamble to my real question, which is, is there a simple regex-based solution to split a string into "runs" of the same character? For example, if the input is 'aaabbcddddaee', then the output should be the list

0 'aaa' 1 'bb' 2 'c' 3 'dddd' 4 'a' 5 'ee'
The best I can come up with is the gangly:
@runs = do { my $i; grep ++$i%2, 'aaabbcddddaee' =~ /((.)\2*)/g }; print "@runs\n"; __END__ aaa bb c dddd a ee
I'd be interested in learning of more elegant solutions.

Update: In response to BrowserUk's question, yes order matters.

Update2: Fixed puzzle's statement, in response to Smylers' observation.

The Other Question

Incidentally, what makes my last solution so awkward is the extraneous machinery to get rid of every other item in the list returned by m//g. Is there a better idiom for selecting (or filtering out) every n-th item from a list (not an array!) of unknown length? (Of course, if an idiom requires hauling in a module, it is automatically somewhat lame, particularly if it's a non-core module.)

the lowliest monk

Replies are listed 'Best First'.
Re: Puzzlement in Splitsville
by thundergnat (Deacon) on Jun 19, 2005 at 13:33 UTC

    I don't know if it is any more elegant....

    push @runs, $1 while 'aaabbcddddaee' =~ /((.)\2*)/g; print "@runs\n";
      Update: I misread the question.

      thundergnat's pattern doesn't work for me with my 5.6.1, it puts every character into an element.

      I had: push @ans, $1 while $str =~ /(.{3}|.{1,2}$)/g;

      Be well,
      rir

Re: Puzzlement in Splitsville
by BrowserUk (Patriarch) on Jun 19, 2005 at 13:24 UTC

    Does order matter?

    print keys %{{ 'aaabbcddddaee' =~ m[((.)\2*)]g }}

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Puzzlement in Splitsville
by ysth (Canon) on Jun 19, 2005 at 18:41 UTC
    If you have a brief alphabet (intimated by your earlier example using just actg), this is easy:
    print join " ", "aaabbcddddaee" =~ /(a+|b+|c+|d+|e+)/g
Re: Puzzlement in Splitsville
by eyepopslikeamosquito (Archbishop) on Jun 19, 2005 at 14:30 UTC

    thundergnat's solution is the best I can see (I came up with it independently) because it avoids the ugly removal of every 2nd item. However, because variety is the spice of life I present the ganglier:

    my @runs = do { my $i; grep ++$i%2, split /(?<=(.))(?!\1)/, 'aaabbcddd +daee' };
    which suffers from the same annoyance as your original solution: the parens, which are needed for the back reference, have the annoying side effect here of changing the semantics of what split returns.

Re: Puzzlement in Splitsville (/g modifier)
by demerphq (Chancellor) on Jun 20, 2005 at 11:53 UTC

    Id say you have got yourself into a situation of functional fixidity. Instead of using the list form of regexes, use the /g modifier with while. Its intended for exactly this type of scenario. Also, for your first question about splitting into triplets or what have you its probably more efficient to use unpack if you can. Im not sure but its possible 5.6 doesnt support the pattern grouping in pack formats.

    D:\>perl -le "print $1 while 'aaabbcddddaee'=~/((.)\2*)/gs" aaa bb c dddd a ee
    D:\>perl -le "print for unpack '(a3)*','atgactaatagcagtgg'"; atg act aat agc agt gg

    Note the difference in approaches between using for and while. For expects the list to be preconstructed, while expects the list to be constructed as we go.

    UPDATE: Apologies to thundergnat, I didn't notice his comment before i posted.

    UPDATE2: To answer your question about eliminating every $N th element id say you have the right approach. I would do it via:

    my @filtered=do { my $i=1; grep $i++ % $N,@unfiltered};

    ---
    $world=~s/war/peace/g

Re: Puzzlement in Splitsville
by Anonymous Monk on Jun 20, 2005 at 09:20 UTC
    In Perl6, this can be done using a single, 5 Unicode-character long, operator.
Re: Puzzlement in Splitsville
by BrowserUk (Patriarch) on Jun 20, 2005 at 09:47 UTC
    In response to BrowserUk's question, yes order matters.

    I guess you could do the FP thing:

    sub oddEls; sub oddEls{ @_ ? (( shift, shift )[ 0 ], oddEls @_) : () } print oddEls 'aaabbcddddaee' =~ /((.)\2*)/g;; aaa bb c dddd a ee

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Your solution works only because oddEls always gets an even length list. Replace @_ ? with @_ > 1 ?, and oddEls also works with odd length lists.

        Are you sure?

        print oddEls 1 .. 9;; 1 3 5 7 9 print oddEls 1 .. 10;; 1 3 5 7 9

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Puzzlement in Splitsville
by Smylers (Pilgrim) on Jun 20, 2005 at 14:03 UTC

    It is possible to use split for this, but the only solution I know of requires a filtering through grep:

    @codons = grep $_, split /(.{3})/, 'atgactaatagcagtgg';

    As demerphq pointed out, unpack is a better way to achieve this. But if you're going to use split, note that that grep condition is wrong. In particular if the last character is a zero and it should be in an an element of its own then it will get omitted:

    my @codons = grep $_, split /(.{3})/, 'atgactaatagcagt0';

    Explicitly testing for the empty string is required:

    my @codons = grep { $_ ne '' } split /(.{3})/, 'atgactaatagcagt0';

    Smylers

      In particular if the last character is a zero and it should be in an an element of its own then it will get omitted:

      if you say grep length($_), ... you avoid this problem...

      ---
      $world=~s/war/peace/g

A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://468106]
Approved by BrowserUk
Front-paged by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2024-03-28 22:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found