Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

regexp golf - homework

by Boots111 (Hermit)
on Feb 02, 2002 at 19:14 UTC ( [id://142950]=perlquestion: print w/replies, xml ) Need Help??

Boots111 has asked for the wisdom of the Perl Monks concerning the following question:

Monks~

Let me first start by admitting that this was a homework problem I had. All the same, I am curious to see what kind of solutions can be had here because I have yet to come up with any.

The problem:
"Write a regular expression that matches all strings of digits with no repeated digit"
Thus 01534 would pass, and 01034 would fail.

I figured that the easiest way to do this would be to start with a restricted alphabet {0,1,2,3} for example. I also decided to require that all four numbers show up (since simply adding ?'s everywhere would allow fewer numbers)

/
(0(1(23|32)|2(13|31)|3(12|21))
|1(0(23|32)|2(03|30)|3(02|20))
|2(0(13|31)|1(03|30)|3(01|10))
|3(0(12|21)|1(02|20)|2(01|10))
)/

Solves this problem, but only slightly more easily than brute force.

As you might guess, this solution does not scale well to the 10 digit case... (actually it is about 17 megs...)

My next thought was to combine some of these cases:

/
((01|10)(23|32)
|(02|20)(13|31)
|(03|30)(12|21)
|(12|21)(03|30)
|(13|31)(02|20)
|(23|32)(01|10)
)/

The first solution has 121 characters and the second 91; I think that the second might scale to the 10 digit case more easily too.

Oh yeah! I forgot about the cruelest part of all. You are restricted to the "basic" regexp operators (),?,|,* and no others.

My impulse here is to think that there is some trade off between allowing the |'s to expand and just brute forcing all cases. My intuitition as a math major is to say that 3 is the sacred number afterwhich the brute force is no longer the better option. But I have not bothered with any sort of justification of that.

Good luck,
Matt

Replies are listed 'Best First'.
Re (tilly) 1: regexp golf - homework
by tilly (Archbishop) on Feb 02, 2002 at 22:50 UTC
    particle's answering of the exact reverse of your question notwithstanding, this problem is impossible to solve without at least adding in anchors.

    With anchors you are able to solve it, and your second try is on the right path. What you want to do is split the problem into 252 divisions of the digits into subsets of 5 and 5, take each subset and split it into subsets of 2 and 3, then list combinations of those to get something that will match all digits in any order. Add ?s to take into account missing digits.

    The main point of an exercise like this is to demonstrate that there are simple problems which REs are a very bad fit for. This is important to understand because far too often people get into the frame of mind that, "Oh, I will just use an RE for this!" and then they have two problems instead of one.

    Incidentally this is very easy to solve with negative lookaheads, but that fact merely obscures the underlying point. Adding features to an RE engine extends what you can do with it, but doesn't address the fact that many tasks are just a bad fit for doing with REs.

      Ah Hah! Thanks tilly ++. Because it was described as a `homework' problem I thought I'd be able to come up with something without looking `in the back of the book'.

      However after a few scribbles (and my Camel and Cookbook are at work, and I'm not) I came to a grinding halt.

      Looking at particle's solution made me realise I should have been able to think it through to that type of solution and I didn't actually pick it as `breaking the rules'.

      However your points are very well made... when you've got the RE `hammer' every string problem looks like a nail! Once I understood this, I looked for a different approach.

      This approach would involve using a hash, splitting the string into digits, counting the unique keys and checking for any with a count (value) more than one. Now that's a homework problem I can handle 8-).

      Thanks for insight. hagen
Re: regexp golf - homework
by ehdonhon (Curate) on Feb 02, 2002 at 19:35 UTC

    Interesting problem, and ++ for admiting it was homework!

    I'm a little stumped by this one myself. If the problem were inverted ("Write a regexp that matches strings containing matching digits"), it would be a lot easier:

    /(0.*0)|(1.*1)|(2.*2)|(3.*3) ...  (you get the idea)
    

    I'm curious, do regular expressions have an equivalent of Demorgan's law in boolean expressions that allows you to invert an expression?

    Best of luck to you, please let us know what you find out.

      Yes, it is always possible to invert a (mathematical) regular expression. I don't know of a simple method for doing it, though. You have to convert the regex to an NFA (nondeterministic finite automaton), remove the nondeterminism to produce a DFA, switch the accepting and nonaccepting states, and then convert the DFA back to a regex. It's nasty, and it's not guaranteed to give you the "best" possible regex. Also, the DFA for this problem has 2**N+1 states, where N is the number of digits.

      In the final step, converting the DFA to a regex, you have a choice as to which states you're going to prune off first. Boots111's first solution corresponds to pruning from left to right, while the second solution corresponds to working from the outside in. Extending the second solution to 10 digits would result in 252 branches in the top-level parenthesis group, and each of those branches would contain the solutions to two different 5-digit problems. Ick.

      particle has a good idea of using backreferences. Those don't exist in "mathematical" regular expressions. The person who posed the problem might think of them as extended operators, in which case they wouldn't be allowed here. Not for me to say.

Re: (GOLF) - multiple digit finder regex - 17 chars
by particle (Vicar) on Feb 02, 2002 at 21:38 UTC
    it's not as difficult as you may think. the regex looks like this: /.*?([0-9]).*?\1/. including the brackets, it's 17 characters long! here's the full program:

    #!/usr/local/bin/perl -w use strict; $|++; my @numlist; push @numlist, int(rand 100000) for 1..10; for(@numlist) { next if /^$/; /.*?([0-9]).*?\1/ ? print "$_\tmultiple $1\n" : print "$_\tsingle\ +n"; }
    i generated a list of random five digit numbers to feed the regex. here's a breakdown of the regex:
    .*? i match zero or more characters, non-greedily
    ([0..9]) match numbers 0..9, save in \1
    .*? again match zero or more characters, non-greedily
    \1 find the next instance of the character in \1 *

    *what's most interesting about this, is that $1 does not work in place of \1. instead, on my perl 5.6.1 install, it warns with:

    Use of uninitialized value in concatenation (.) or string at test_matc +h1.pl line 10. Nested quantifiers before HERE mark in regex m/.*?([0-9]).*?+ << HERE +/ at test_match1.pl line 10.
    by the way, there's nothing to stop this from working with letters, as well. just change 0..9 and you're on your set.

    enjoy!

    Update: i see the error of my ways. tilly's right (below). i solved the exact opposite of the problem, which was pretty easy! whoops!

    ~Particle

      It is nice, but it broke the rules :^( The post says no characters other then (),?,|,* this rules out the character class [] I believe. Some body please correct me if I am wrong. My own version that breaks the rules by using \d and not using the regex itself for success, but still a fun brain exercise is this:
      S:while(<DATA>){%d=();while(/(\d)/g){if($d{$1}) {print "DUP: $1 in $_";next S}$d{$1}=1}} __DATA__ 12345 012345 894376 3885437 98374 30924 03284098325 094385098432 74362 876543 4456789
      /.*?([0-9]).*?\1/ *what's most interesting about this, is that $1 does not work in place of \1.
      Well, of course! $1 and \1 mean different things. When $1 is used in a regex, its value is interpolated as the regex is compiled, just like any other variable. For example:
      'b' =~ /(.)/; print "Yup!\n" if 'ab' =~ /(.)$1/;
      prints Yup! because $1 holds 'b' from the first match and the second regex is compiled as /(.).*c/.

      \1, on the other hand, is special regex syntax, and matches the same thing that was matched by the first capturing group in the current regex.

      'b' =~ /(.)/; print "Yup!\n" if 'ab' =~ /(.)\1/;
      does not print Yup!, because \1 wants to match an 'a'.

      (\1 on the right hand side of a substitution, with the same meaning as $1, has been deprecated for quite some time.)

      Update: Fixed the explanation; an earlier revision of the example used 'c' instead of 'b'.

        (\1 on the right hand side of a substitution, with the same meaning as $1, has been deprecated for quite some time.)

        yeah. that's how i was used to seeing \1. from people learning perl who are used to vi, ex, etc. on the right hand side, it's a no-no.

        although, there's a problem with the explanation of your first example. $1 can't hold 'c', it's nowhere to be found. but i understand the difference now. docs are good.

        ~Particle

      I think that solving the opposite porblem means solving the original problem. Just filter the input using /\d+/ to insure that they are composed by only digits. Then, solve the problem in the opposite way.
Re: regexp golf - homework
by Ido (Hermit) on Feb 02, 2002 at 21:28 UTC
    It doesn't really answer the question, but if you could use (??{}), this would do it:
    /^((??{'[^\D'.join('',@c,push @c,$1).']'}))*$/
Re: regexp golf - homework
by LanX (Saint) on Dec 11, 2012 at 06:45 UTC
    Ehm ... what am I missing here? like IMHO just searching for a repeated digit like with /(\d).*\1/ and inverting the result should do.

    DB<111> print $_ !~ /(\d).*\1/ ? "$_ ok\n" :"$_ has <$1>\n" for qw/0 +12 0102 01233/ 012 ok 0102 has <0> 01233 has <3>

    Cheers Rolf

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://142950]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (7)
As of 2024-04-24 01:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found