PerlMonks  

regex, pos, \G, and substr

by ff (Hermit)
on Jun 03, 2007 at 01:53 UTC ( [id://618946]=perlquestion )

ff has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
My users create strings which contain text like "update 8923 mark complete". I'd like to allow them to create strings like "update 8435 and 9323 mark complete" and convert those into multiple strings that look like the old pattern, i.e. "update 8435 mark complete update 9323 mark complete". The following snippet does just what I want, but the Camel Book, in describing the \G assertion, says:

Whenever you start thinking in terms of the pos function, it's tempting to start carving your string up with substr, but this is rarely the right thing to do.

So, should I consider doing something else? Thanks.

#!/usr/bin/perl -w
use strict;

my $data_stg = 'junk text update 8923 mark complete update 8324 mark '
             . 'complete more junk update 5438 and 5843 and 1522 mark '
             . 'complete update 8435 and 9323 mark complete true junk';

pos( $data_stg ) = 0;
my %mult_updates;
my $pass = 1;
while ( $data_stg =~ /(update \d+( and \d+)+ mark complete)/ig ) {
    my $mult_update_pos = pos( $data_stg );
    print "$pass pos: '$mult_update_pos'\n";
    my $mult_update = $1;
    print "$pass orig_mult_update: '$mult_update'\n";
    my $mult_update_length = length $mult_update;
    print "$pass length: '$mult_update_length'\n";
    $mult_update =~ s/and (\d+)/mark complete update $1/gi;
    print "$pass new_mult_update: '$mult_update'\n";
    $mult_updates{ $mult_update_pos - $mult_update_length }
        = [ ($mult_update_length, $mult_update) ];
}
continue {
    $pass++;
}

# Work backwards from the end of the string, doing substr
# on positions which have been identified as having code to
# replace. Let the key define a starting position and the
# key's value contain an array ref describing the length
# of the target and the desired replacement text.
foreach ( sort {$b <=> $a} keys %mult_updates ) {
    substr( $data_stg, $_, $mult_updates{$_}->[0], $mult_updates{$_}->[1] );
}

print "\n$data_stg\n";
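For readers wondering what the Camel Book has in mind, here is a minimal sketch (not from the thread) of walking the string with \G and the /gc flags, building the result as it goes instead of recording positions for substr. The shortened sample string and the names $out and $expanded are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Shortened sample input (an assumption, not the original data).
my $data_stg = 'junk update 5438 and 5843 mark complete more junk';
my $out      = '';

pos($data_stg) = 0;
while ( pos($data_stg) < length $data_stg ) {
    # \G anchors the match at pos(); /c keeps pos() in place if the match fails.
    if ( $data_stg =~ /\G(update \d+(?: and \d+)+ mark complete)/gci ) {
        ( my $expanded = $1 ) =~ s/ and (\d+)/ mark complete update $1/gi;
        $out .= $expanded;
    }
    # Otherwise copy text through to the next candidate "update" or the end.
    elsif ( $data_stg =~ /\G(.+?)(?=update|\z)/gcis ) {
        $out .= $1;
    }
}

print "$out\n";
# prints: junk update 5438 mark complete update 5843 mark complete more junk
```

This only illustrates the \G mechanics; a single s///ge does the same job with less code.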

Replies are listed 'Best First'.
Re: regex, pos, \G, and substr
by BrowserUk (Patriarch) on Jun 03, 2007 at 02:29 UTC

    This seems somewhat simpler, though you might want to strengthen the regex to validate the input more.

    #! perl -slw
    use strict;

    my $data_stg =
        'junk text update 8923 mark complete update 8324 mark ' .
        'complete more junk update 5438 and 5843 and 1522 mark ' .
        'complete update 8435 and 9323 mark complete true junk'
    ;

    $data_stg =~ s[update (.+?) mark complete]{
        join ' ', map{ "update $_ mark complete"} split '\s+and\s+', $1
    }ge;

    print $data_stg;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I think it's perceptive to split the guts of the phrases on [ ]and[ ], but it's really important in my case that the leftovers are only digits. While I could throw grep { /^\d+$/ } in front of the split, I'd lose visibility into any non-digit stuff that was (mistakenly) there while following through with the replacement side of the (s)ubstitute operator. In other words, I'd rather leave everything alone if there's anything "non-digit" besides the "and" splitters in there. BTW, I like the single-quotes for delimiting the split regex.

        That's what I meant by strengthening the regex. Note that the non-conformant additional third line is left untouched:

        #! perl -slw
        use strict;

        my $data_stg =
            'junk text update 8923 mark complete update 8324 mark ' .
            'complete more junk update 5438 and 5843 and 1522 mark ' .
            'complete more junk update junk and 5843 and 1522 mark ' .
            'complete update 8435 and 9323 mark complete true junk'
        ;

        $data_stg =~ s[update ((?:\d+|\s|and)+) mark complete]{
            join ' ', map{ "update $_ mark complete"} split '\s+and\s+', $1
        }ge;

        print $data_stg;

        __END__
        ## Output wrapped to match input for easier verification.
        junk text update 8923 mark complete update 8324 mark
        complete more junk update 5438 mark complete update 5843 mark complete update 1522 mark
        complete more junk update junk and 5843 and 1522 mark
        complete update 8435 mark complete update 9323 mark complete true junk

        Alternatively, verify that the split values are numeric, produce a warning and put the original back if not:

        #! perl -slw
        use strict;

        my $data_stg =
            'junk text update 8923 mark complete update 8324 mark ' .
            'complete more junk update 5438 and 5843 and 1522 mark ' .
            'complete more junk update junk and 5843 and 1522 mark ' .
            'complete update 8435 and 9323 mark complete true junk'
        ;

        $data_stg =~ s[(update (.+?) mark complete)]{
            my @numbers = split '\s+and\s+', $2;
            if( grep{ !/^\d+$/ } @numbers ) {
                warn "Malformed request: '$1'\n";
                $1;
            }
            else{
                join ' ', map{ "update $_ mark complete"} @numbers;
            }
        }ge;

        print $data_stg;

        __END__
        ## Output wrapped to match input for easier verification.
        Malformed request: 'update junk and 5843 and 1522 mark complete'
        junk text update 8923 mark complete update 8324 mark
        complete more junk update 5438 mark complete update 5843 mark complete update 1522 mark
        complete more junk update junk and 5843 and 1522 mark
        complete update 8435 mark complete update 9323 mark complete true junk
        BTW, I like the single-quotes for delimiting the split regex.

        Most don't. They consider it a bad habit of mine.
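For anyone puzzled by the single-quoted form: when split's first argument is a string other than a single space, Perl treats it as a regex pattern, so both spellings below should behave the same. This is a small sketch with made-up sample data ($list, @quoted and @slashed are illustrative names only).

```perl
use strict;
use warnings;

# Sample data with irregular spacing (an assumption for illustration).
my $list = '5438 and 5843  and  1522';

# Single-quoted string: still interpreted as a pattern by split.
my @quoted  = split '\s+and\s+', $list;

# Conventional slash-delimited regex.
my @slashed = split /\s+and\s+/, $list;

print "@quoted\n";     # prints: 5438 5843 1522
print "@slashed\n";    # prints: 5438 5843 1522
```

The usual objection is readability: the slashes signal "regex" at a glance, while the quotes look like a literal string.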


        I'd rather leave everything alone if there's anything "non-digit" besides the and splitters in there.
        Then leave that part the same as in your original looping regex:
        s[update (\d+(?: and \d+)+) mark complete]{...}ge;
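Putting the two pieces together, a sketch of the strict pattern above combined with the join/map/split replacement from the earlier reply might look like this. The shortened sample string is an assumption, and the /i flag is carried over from the original looping regex.

```perl
use strict;
use warnings;

# Shortened sample input (an assumption): one plain request, one malformed
# request, and one multi-number request.
my $data_stg = 'junk text update 8923 mark complete '
             . 'update junk and 5843 mark complete '
             . 'update 8435 and 9323 mark complete true junk';

# Only digit lists joined by " and " match, so anything else is left alone.
$data_stg =~ s[update (\d+(?: and \d+)+) mark complete]{
    join ' ', map { "update $_ mark complete" } split /\s+and\s+/, $1
}gie;

print "$data_stg\n";
```

Because the pattern requires at least one " and ", single-number requests are not matched at all, which is harmless since they are already in the target form.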
Re: regex, pos, \G, and substr
by moritz (Cardinal) on Jun 03, 2007 at 09:20 UTC
    If you want to be ultra lazy and your data is not read by other programs that you have no control over, you might use a common serialization format like YAML, XML or JSON.

    Then you could read and write them with the appropriate CPAN modules and be pretty sure that it works as expected.
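As a rough illustration of that idea, here is a sketch using JSON::PP (in the Perl core since 5.14). The record layout, with its "action" and "ids" keys, is purely hypothetical; nothing in the thread specifies a format.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts hash keys so the encoded text is stable.
my $json = JSON::PP->new->canonical;

# Hypothetical record layout: one action applied to a list of ids.
my $request = $json->encode( { action => 'mark complete', ids => [ 8435, 9323 ] } );
print "$request\n";    # prints: {"action":"mark complete","ids":[8435,9323]}

# Reading it back gives a plain data structure; no regex needed.
my $decoded = $json->decode($request);
my @updates = map { "update $_ $decoded->{action}" } @{ $decoded->{ids} };
print "$_\n" for @updates;
```

Whether this suits the original poster depends on whether the users can be asked to type JSON rather than free text.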

      Hey. XSLT lets you write whole computer programs in XML, so maybe Perl6 should be written in YAML or JSON.

      It would do away with all that complicated syntax and the need to use horrible, nasty, complicated things like regexes.

      We could just load up a cpan module and Perl6 would be ready by next weekend. And we could be sure it worked properly.

        The point of this reply is completely obscure to me.

        I have been doing a lot of XSLT this last year at $work, and the ability to use regular expressions in XSLT/XPath is much sought after.

        XSLT also isn't the last word in wisdom with regard to high-level programming, IMHO, though it certainly has its niche where it might be considered useful, e.g. to avoid a "media break".

Node Type: perlquestion [id://618946]
Approved by liverpole