Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Regex: matching character which happens exactly once

by LanX (Saint)
on Oct 21, 2017 at 13:46 UTC ( [id://1201795]=perlquestion: print w/replies, xml ) Need Help??

LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

( DISCLAIMER this is a theoretical question about pure regexes, I know how to solve this in Perl, it's a follow up to this thread)

I'm banging my head at this problem, how do I match strings where at least one character happens exactly once, with a pure regex?

(i.e. without additional Perl code, especially embedded one, and without manipulating the input)

Finding all characters which aren't repeated afterwards is easy with a lookahead assertion

DB<200> p bab DB<200> x / (.) (?! .* \1 ) /gx 0 'a' 1 'b'

but combining with a lookbehind fails, b/c variable length is not permitted

DB<211> x / (.) (?<! \1 .* ) /gx Variable length lookbehind not implemented in regex m/ (.) (?<! \1 .* +) / at (eval 261)[C:/Perl_64/lib/perl5db.pl:646] l ine 2.

(actually already using the backreference \1 fails, since the placeholder has variable length)

So it boils down to the question:

  • How can I match all characters which appear for the first time?

All workarounds I found so far only work with ugly cheats, like hardcoding all cases for a fixed length string only.

I think it might be possible with recursive regexes and relative backreferences, but still ...

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re: Regex: matching character which happens exactly once
by haukex (Archbishop) on Oct 21, 2017 at 17:34 UTC

    Just to make sure I understand the question, you're looking for a regex that does the equivalent of the following regex, but without (?{ }) or (??{ })? (Anything else that's not allowed?) Personally, I don't find this particular (??{ }) construct too terrible...

    I played around a bit with recursion but then I was banging my head against the string 'abacbc', now I have to stop for today.

    use warnings; use strict; use Algorithm::Combinatorics 'variations_with_repetition'; use Test::More; my $re = qr{ \A ( .* ) ( (??{ length $1 ? "[^".quotemeta($1)."]" : "." }) ) ( (?: (?! \2 ) . )* ) \z }msx; for my $len (1..7) { my $iter = variations_with_repetition([qw/a b c/], $len); while ( my $c = $iter->next ) { my $str = join '', @$c; my $exp = do { my %h; $h{$_}++ for @$c; my %r = reverse %h; exists $r{1} }; is $str=~$re, $exp, $str.($exp?' =~':' !~').' re' or diag explain [ \%-, map [ $-[$_], eval "\$$_", $+[$_] ], 1..@+ ]; } } done_testing;

    Update: A bit of research on variable-width lookbehind: Why is variable length lookahead implemented while lookbehind is not?, Re: perl regexp question excluding strings, Not-really variable length lookbehind

      > Just to make sure I understand the question, you're looking for a regex that does the equivalent of the following regex,

      Yes, even without testing this looks right, your embedded Perl excludes all former characters in a character class.*

      > Personally, I don't find this particular (??{ }) construct too terrible...

      It's a theoretical question, I wouldn't use a regex anyway. It's about exploring the limits.

      > but without (?{ }) or (??{ })? (Anything else that's not allowed?) 

      Basically yes.

      The less cheating the better, ie a portable solution would be perfect. (Bigly doubt)

      Using implementation details or experimental features less so. (Sad )

      > Update: A bit of research on variable-width lookbehind:

      Regarding the older discussions:

      I don't see how \K could be applied here, the perldocs are IMHO wrong to claim it was generally allowing variable length look behinds.

      IMHO does \K only apply for s/// .

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      *) that's the approach I would try to implement with recursive patterns

        I spent some more time playing with this, trying out different things, like recursion, or backtracking control verbs, similar to what you did here. I also tried inverting the condition, that is, match strings where each character is repeated at least once. Sadly, test cases like "abacbc" and "abaxcbc" leave me doubting whether it is possible at all with pure regexes, although now that I've said that some regex wizard will probably prove me wrong ;-)

        So despite not having achieved success yet, I thought I'd share one or two of the avenues I explored. One thing I tried was thinking about it as a prefix + single character + suffix. Defining the latter two is fairly easy, but the prefix is not, since for the test case "abaxcbc" I haven't yet found a way for either the prefix to "mark" the second b as "seen", or the suffix to look back and notice that b has been seen before. This is what I was meant regarding variable-width lookbehind, the (*SKIP)(*FAIL) trick shown by AnomalousMonk looked interesting, but does not seem applicable here.

        Another, simpler problem is that in a string like "abccacb", the regex will easily misidentify the second a as the "single" character, since I couldn't find a way to look back - while I did manage to cover some of those cases with recursion, by doing stuff like / (?<r1>.)\g{r1}* (?&subpat) \g{r1}++ /, I still couldn't cover all the cases. For example, I couldn't come up with a way to match "abaxcbc" against /a.*a/, /b.*b/, and /c.*c/ simultaneously and in a way that advanced the match position in a useful manner. For example, in "abaxcbc"=~/b(.*)b/, $1 eq 'axc', and while cbc could be matched with a lookahead, how does one determine that the a has been seen before? Alternatively, one could match "abaxcbc"=~/ab(?=.*b)a/, but when you get to cbc, how do you determine that b has been seen before? (This is basically the same problem as mentioned above, no variable-width lookbehind.)

        Of course, I may have painted myself into a corner with all my (over)thinking, so now I'm really curious if there is a solution :-) (Update: There is!)

        I also spent some time on generating test cases, which was an interesting little problem in itself. Although I am out of time for now, maybe something in the following is useful:

        #!/usr/bin/env perl use warnings; use strict; use feature 'say'; use Test::More; my $re = qr{ \A ## Reference Implementation ## (?<prefix> .* ) (?<single> (??{ length $+{prefix} ? '[^'.quotemeta($+{prefix}).']' : '.' }) ) (?<suffix> (?: (?! \g{single} ) . )* ) \z }msx; my $re_nope = qr{ \A ## one of my early failed attempts # the prefix contains only characters that repeat later on (?<prefix> (?: (?<rep> . ) (?= .* \g{rep} ) )* ) # the single character obviously can't be part of the prefix (?<single> (?! \g{rep} ) . ) # the suffix must not contain the "single" character (?<suffix> (?: (?! \g{single} ) . )* ) \z }msx; # run the regex against the test cases for my $len (1..7) { my $odo = deduped_odo( map ['a'..chr(ord('a')+$_-1)], 1..$len ); while (my @c = $odo->()) { my $str = join '', @c; my $expect = do { my %h; $h{$_}++ for @c; my %r = reverse %h; exists $r{1} }; is $str=~$re, $expect, $str.($expect?' =~':' !~').' re' # or, for regexes that match strings with no single chars: #is $str!~$re2, $expect, $str.($expect?' !~':' =~').' re2' or diag explain [ \%-, map [ $-[$_], eval "\$$_", $+[$_] ], 1..@+ ]; } } done_testing; say "FAIL: $_->{name}" # a shorter summary of failures for grep { !$_->{ok} } Test::More->builder->details; exit; # test case generation stuff follows sub odometer { # http://www.perlmonks.org/?node_id=1197785 my @w = map { [ 1, ref eq 'ARRAY' ? @$_ : $_ ] } @_; my $done; return sub { if ($done) { $done=0; return } my @cur = map {$$_[$$_[0]]} @w; for (my $i=$#w; $i>=0; $i--) { last if ++$w[$i][0]<@{$w[$i]}; $w[$i][0]=1; $done=1 unless $i; } return wantarray ? @cur : join '', @cur; } } sub deduped_odo { my $odo = odometer(@_); # this sequence can probably be generated directly... my %seen; return sub { while (1) { my @next = $odo->() or return; my $pat = do { my ($i,%h)=('A'); join '', map {$h{$_}//=$i++} @next }; if ( not $seen{$pat}++ ) { return wantarray ? @next : join '', @next } } } }

        Update 2019-10-05: I've just released Algorithm::Odometer::Tiny!

Re: Regex: matching character which happens exactly once
by AnomalousMonk (Archbishop) on Oct 22, 2017 at 04:50 UTC

    Interesting problem. This is the best I can do so far. It extracts all singleton characters from a string. It needs Perl 5.10 regex extensions, but I think those are kosher. The  (?(condition)yes-pattern) is used with  (?{ code }) for the (condition) and I'm not sure if the stink of  (?{ code }) is dispelled by its use in a conditional regex expression. Of course, the most damning thing is the use of a hash to keep track of characters already seen, but I can't get around this (update: yet). (I'm running under 5.10 so I have to use a  local our %seen hash, but I understand that 5.18+ supports my variables at last.)

    File singleton_chars_1.pl:

    Output:


    Give a man a fish:  <%-{-{-{-<

      Nice, but you knew that the goal was not to embed any Perl. ;)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        True, but sometimes we must advance by baby steps. I haven't given up yet!


        Give a man a fish:  <%-{-{-{-<

Re: Regex: matching character which happens exactly once
by QM (Parson) on Oct 23, 2017 at 14:08 UTC
    For the sake of completeness, I submit one of the solutions I don't think you want, namely, creating a regex for each character of the possible alphabet.

    This wasn't as easy as I thought it would be, because capturing the actual match requires filtering out all of the unmatch (undef) alternatives. Perhaps someone can improve on this idea?

    #!/usr/bin/env perl # # Match a char only occuring once in a string. # See https://perlmonks.pairsite.com/?node_id=1201795 use strict; use warnings; my @alpha = ('a'..'b'); my $alpha = join(',', @alpha); my @input = glob "{$alpha}" x 12; my @regex; for my $c (@alpha) { my $d = quotemeta($c); push @regex, "(?:[^$d]*)([$d])(?:[^$d]*)"; } my $once = join('|', @regex); my $matches; my $attempts; for my $i (@input) { my(@catch) = $i =~ /^(?:$once)$/; if (@catch) { my $catch = join('', grep {defined($_)} @catch); # printf "(%s) => (%s)\n", $catch, $i; $matches++; } $attempts++; } printf "matches: %s/%s\n", $matches, $attempts; exit; # Output: # > pm1201795.pl # matches: 24/4096

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Regex: matching character which happens exactly once
by haukex (Archbishop) on Oct 24, 2017 at 10:46 UTC
      Hauke

      I kept your replies on my growing to-do list but I'm realising now that I'm won't be able to "dive" into the matter again.

      Thanks for your help, I'm closing the case now.

      > Rolf, I blame you for the sleep I lost over this!

      And thanks for taking care of my insomnia. ;-p

      Honestly, while the use case is very exotic I learned a lot about the various corners of the regex features. I hope for you too!

      Der Weg war das Ziel! :)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

Re: Regex: matching character which happens exactly once
by QM (Parson) on Oct 23, 2017 at 14:15 UTC
    Also, it seems you want a forward reference, in the same spirit as a backref -- something the regex engine will go check after "filling" the ref:
    my ($match) = /^(?:[^\1]*)(.)(?:[^\1]*)$/;

    Here rendered as \1, just like the backref.

    As the first capturing paren hasn't been encountered before the backref mention triggers different behavior in the regex path. This will make backtracking more painful, and possibly more likely to be pathological, but one has to assume some risk...

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      I understand that the  [^\1] regex expression presented here is intended as pseudocode, but in addition to the radical changes to backrferencing it implies, there's another problem: the syntax of character classes would have to change radically to support it. Something like  \1 in a character class is compiled as an octal character representation:

      c:\@Work\Perl\monks>perl -wMstrict -le "my $rx = qr{ [\1] }xms; print $rx; print 'match' if qq{\cA} =~ $rx; " (?msx-i: [\1] ) match


      Give a man a fish:  <%-{-{-{-<

        Yes. Like LanX, you're getting bogged down in implementation.

        What do you want it to do with that syntax? How would you redefine Perl regexes to do this?

        I think another symbol for backref would help. (I don't know what we'd use, but that's a different problem.) Then use the same thing for forward references. And for fun, we'll call those ferkcabs.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

      This can't work, because the first \1 will be always set to the last match ahead (and undef or "" at first encounter)

      This is because $1 is a global var will keep match instead of erasing when backtracking.

      FWIW I tried something similar by capturing the following character in $2 for the next run:

      m/ ^ (?:(?!\2).)*? (.) (?=(.|$)) (?!.*\1) /x

      But couldn't get it to work, probably because the regex engine is not considering another defined \2 while backtracking. (or probably b/c I was too tired last night)

      DB<310> @inp = glob '{a,b}'x3 DB<311> ;m/ ^ (?:(?!\2).)*? (?{say "<$_ $2>"}) (.) (?=(.|$)) (?!.* +\1) /x and say ("found $1 in $_") for @inp <aaa > <aab > <aba > <aba b> found b in aba <abb > found a in abb <baa > found b in baa <bab > <bab a> found a in bab <bba > <bbb > DB<312>

      probably I'm having a bug in my logic, experts to the rescue! ;-)

      Didn't have the time yet for proper debugging.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        I didn't expect my forward ref to work, if that's what you thought. The regex engine would have to include a flag to indicate when the first capturing parens were seen, and do the right thing.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: Regex: matching character which happens exactly once
by Anonymous Monk on Oct 21, 2017 at 14:12 UTC
    example inputs + expected outputs ?
      > example inputs + expected outputs ?

      here a generic answer (for the general question of a character appears exactly once)

      DB<259> @input = glob '{a,b}' x 5 DB<260> test($_) and print "OK $_\n" for @input OK aaaab OK aaaba OK aabaa OK abaaa OK abbbb OK baaaa OK babbb OK bbabb OK bbbab OK bbbba DB<261> l test 2: sub test { my $str = shift; my %h; $h{$_}++ for split //,$str; + my %r =reverse %h; return 1 if exists $r{1} }; DB<262>

      YMMV concerning input size, but keep in mind that its $size_alphabet**$word_length .

      The regex should work for all potential possibilities, neither size nor alphabet can be known.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

Re: Regex: matching character which happens exactly once
by QM (Parson) on Oct 23, 2017 at 17:00 UTC
    Just to further the theoretical aspect along...

    What would a TM program look like that recognizes such a string?

    For instance, would it have to search backward and forward from each char position and count occurrences? Is there a better way?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1201795]
Approved by davies
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (4)
As of 2024-03-29 13:10 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found