Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

contest problem about shortest Regular expression

by rsFalse (Chaplain)
on Nov 28, 2017 at 21:11 UTC ( [id://1204472]=perlmeditation: print w/replies, xml ) Need Help??

Hello.

A week ago I met an interesting problem about regular expressions. It was from the OpenCup contest held in 2017 november 19. Here I will post a statement, and I ask to discuss how to solve this problem, what strategies and tools to use. I recommend to try to solve this problem by yourselves, and if you would like, you can copy pdf statements with this link, problem number 7 (You can download statement, while the link works, but you can't access and upload your solution for a testing system if you are not participating with your team from your high school). I used perlbrew switch v5.14.4 to match a version of testing system, and before I was failing with v5.18 because of some newer features which I used. At the bottom I will write my approaches and code.

Statement:
Regular Expressions Time limit 10 seconds Memory limit 256Mb Input standard input or input.txt Output standard output or output.txt Pavel doesn’t have a college education. He’s a natural programmer. He +loves brief, laconic bits of code more than anything else in the worl +d. One-liners are his favorite! Striving for shorter code he often ma +kes use of regular expressions. There is nothing more creative and en +joyable than working on yet another regular expression in the middle +of the night. But bummer. Pavel’s colleague, Simon, is often unhappy with Pavel’s co +de. Simon never uses regular expressions in his code. He’s just jealo +us! If he knew a thing about regular expressions, he wouldn’t complai +n. Once Pavel was careless to remark that Simon is not creative enough to + build complex regular expressions. Simon replied, quite sacrilegious +ly, that there was nothing creative about regular expressions. Their +argument evolved into a competition: they decided to find out who wou +ld be the fastest to write a regular expression matching a defined se +t of strings. Simon wants to win by creating a program which would so +lve this problem automatically, thus proving the lack of creative com +ponent in the process. Their friend Peter, a bioinformatician, will b +e the judge. Simon glanced at the syntaxis of Perl regular expressions and was horr +ified: is this what programmers have done to a simple and elegant con +cept? Simon graduated from the mathematics department of the universi +ty and has always thought that a regular expression is the following: 0 (zero character) — a regular expression that matches no string. a, g, t, c (letter) — a regular expression matching precisely one +string: a single-letter string with the same letter as that in the ex +pression. If R and P are regular expressions, then (R|P) is also a regular e +xpression. It matches all strings which match at least one of the exp +ressions R and P. If R and P are regular expressions, then (RP) is also a regular ex +pression. It matches all strings which can be split in two in such a +way that the first part matches the expression R, and the second part + matches the expression P. If R is a regular expression, then (R*) is also a regular expressi +on. It matches all strings which can be split into k >= 0 parts so th +at each part matches the expression R. For instance, the regular expression (a*) matches any string containin +g only letters a, including an empty string. The regular expression ( +0*) only matches an empty string (which can be split into zero parts +matching the expression 0). The regular expression (a|(g(tc))) matche +s two strings: a and gtc. Note that it is forbidden to omit or add ex +tra brackets to regular expressions: it must contain strictly those p +airs of brackets which appear during its construction according to th +e rules described above. Simon wants a flawless victory by building the shortest matching regul +ar expression. Help Simon. Write a program which finds the shortest r +egular expression for a set of strings, such that all these strings m +atch the expression.
Input format
The first line contains an integer T — the number of tests. It is foll +owed by test descriptions. The first line of a test description contains a single positive intege +r N — the number of strings in the set. Each of the following N lines + begins with a nonnegative integer L — the length of the string from +the set, followed by the string itself, separated by a space. The str +ing only contains lowercase latin letters a, g, t, c. Note that a str +ing in the set can be empty. In this case the line in the file will o +nly contain the digit 0 and a space symbol. The total number of strings in all sets is not greater than 2000. The +sum of lengths of all strings in all sets is not greater than 2000.
Output format
Print precisely T regular expressions, one per line. Each regular expr +ession must be an answer to a corresponding test. If for any test the +re is no regular expression matched by all strings from the set, prin +t the word Impossible instead of the answer.
Sample
Input
3 2 1 a 3 gtc 1 0 3 1 g 2 gg 3 ggg
Output
(a|(g(tc))) (0*) (g*)
Notes
The sixth line of the input data in the example zero is followed by a +space symbol.
Here I will write ideas about solving.
Firstly, I was thinking about two approaches: 1) to generate regex from an input data, e.g. to make a regex from the first input line, and later modify (expand) it when analysing next lines; 2) to generate all possible regexes, then sort them by length, then try to match all input lines against shortest regex, if fail then match against longer regexes.
The first approach seems more sophisticated and I haven't found any ideas how to solve by that way. Can you suggest something? Second approach seems easier, and I successfully used it.
Secondly, I realized that there are a regular expression which can match any possible line composed by only 4 distinct letters. So, I need to generate all regular expressions, which are shorter than all-matching-regex. Thirdly, I was thinking how to generate possible regexes. There were one idea at first. Later I got another. First idea was to generate all possible permutations of letters, and later add all possible permutations of other symbols into permutations of letters. This approach take much time for me, and it wasn't successful. Of course I tried to find some logics and avoid generating regular expressions which are longer than their shorter equivalents. Second idea was to generate regular expressions by joining smallest ones - only letters (say "atoms") - to the bigger and bigger ones. That expanding was performed by binary joining or by enveloping regex with Kleene star.
Further are more ideas about solution, and the code.

It become obvious that regex [acgt]* is the shortest one, which matches any line, including empty. Later, one can find that number of parentheses of Simon's regex depends on the number of letters + stars. So shortest all-matching regex, e.g. ((((a|c)|g)|t)*), has length of 16, or ((letters+stars)-1)*3 +1 +sticks.
Later I found out that:
* any consecutive characters can be written with star reaching minimal length of 4: (aa) [4], ((aa)a) [7], ... -> (a*) [4].
* any star inside an OR construction, can be brought outside an OR: ((a*)|c) [8] -> ((a|c)*) [8], (g|((ac)*)) [11] -> ((g|(ac))*) [11].
* one can never use 0 symbol in regexes, because any other symbol (a letter) with star matches an empty line. And if there are an empty line in the input, then correct regex must contain a star.
* regex can have upto 5 characters from the set [*acgt] at most, because 6 characters make regex longer than the all-matching regex. That means also, that if an input line contains more than 5 letters, the star must be included into regex.
* the set of letters of a regex can not differ by more than 1 from the number of letters+stars. It is because one can generate a shorter all-set matching regex [<set>]*, e.g. (((ac)(ga))c) [13] is longer than simply (((a|c)|g)*) [12].

These and other ideas helped me to generate less regexes, which should cover all possible inputs of lines. I used to generate all possible regexes by length less than 16, excluding regexes with more than 1 star, and excluding regexes with stars which are inside an OR alternatives.
I used two similar approaches. To use generated regexes against an input data. First is to grep all matching regexes while examining first input line. So at the second input line it will be less regexes to try to match. Second approach was to take a shortest regex and try to match all input lines, and take next regex if process fails, and repeat.
I got Time-limit. It seams that my approach is not efficient.
Later I remembered that there are an operator 'qr', which compiles regex once, so it saved program running time much. But still Time-Limit.
Next I tried to minimize a set of regexes which I've generated. For example, regexes ((ac)g) and (a(cg)) have a same meaning, so I used to delete such "duplicates". I tried to find more regexes which are "duplicates" and then deleted them. It saved a bit of program running time.

The next idea was to use "abstract" regexes. I mean regexes which contains a "regex dot" (any character) instead of actual letters. So I used to generate all possible abstract regexes, and got only some hundreds of them. Then I used these regexes against input data, and many of regexes matched. But the number decreased by about 3-10 times. So next I matched compiled (qr) non-abstract regexes against an input data, and a set of non-abstract regexes was decreased by about 3-10 times, from few dozen thousand to some thousand. In my code I used a data structure of Hash_of_Arrays. Keys of hash are abstract regexes. Arrays contain compiled regexes corresponding to abstract regex. (Earlier I tried to use Hash_of_Hashes because I used non-abstract regexes as strings in keys of deeper hashes). My code got Accepted with 2.5 sec time, but test cases are not available, and not plenty of them (only 20), so I am not sure if my solution is correct for all possible data.

... I was thinking that the type of "abstract" regexes which have backreferences (<named> for a comfort) can be useful. But I am not sure. I tried unsuccessfully.
Maybe this problem can be solved with more advanced techniques: recursive regexes, eval-groups, and other?

Related topic: Perl in programming contests and problem solving
#regexes #qr #golf #problem #efficiency
CODE:
#!/usr/bin/perl use warnings; use strict; use re 'eval'; use Time::HiRes qw( gettimeofday tv_interval ); my $t0 = [ gettimeofday ]; $\ = $/; my $debug = 0; my $debug2 = 0; my $time = 0; my @acgt = qw( a c g t ); my @combinations = map { [ combinations( $_, $_ - 1, @acgt ) ] } 1 + .. 5; $debug and print ~~ @{ $_ } for @combinations; my %regexes = ( '_' => 1 ); while( 1 ){ my @old_regexes = sort { length $a <=> length $b || $a cmp $b +} keys %regexes; $debug and print " " . @old_regexes; for my $i ( 0 .. @old_regexes - 1 ){ my $copy_of_index_i = $i; $i = $old_regexes[ $i ]; if( $i !~ /\*/ ){ $regexes{ "($i*)" } ++ if ( length $i ) + 3 < 16; for my $j ( $copy_of_index_i + 0 .. @old_regexes - 1 ) +{ $j = $old_regexes[ $j ]; last if ( length $i ) + ( length $j ) + 3 > 15; next if $j =~ /\*/; $regexes{ "($i|$j)" } ++; # $regexes{ "($j|$i)" } ++; } } for my $j ( $copy_of_index_i + 0 .. @old_regexes - 1 ){ $j = $old_regexes[ $j ]; last if ( length $i ) + ( length $j ) + 2 > 15; next if $i =~ /\*/ && $j =~ /\*/; $regexes{ "($i$j)" } ++; $regexes{ "($j$i)" } ++; } } last if @old_regexes == keys %regexes; } $debug2 and print "regexes(30): ", join ' ', ( sort { length $a <= +> length $b } keys %regexes )[ 0 .. 30 ]; $debug and print "number of abstract regexes: " . keys %regexes; my %uniq; map $uniq{ $_ } ++, keys %regexes; %regexes = map { $_ => 1 } keys %uniq; $debug and print ~~ keys %regexes; for my $re ( sort keys %regexes ){ $re =~ / (\w) [(] (\w)(\w) [)] /x or next; delete $regexes{ $re }; $re =~ s/ (\w) [(] (\w)(\w) [)] /($1$2)$3/x; $regexes{ $re } ++; } $debug and print ~~ keys %regexes; for my $re ( sort keys %regexes ){ $re =~ / (\w) [(] [(] (\w)(\w) [)] (\w) [)] /x or next; delete $regexes{ $re }; $re =~ s/ (\w) [(] [(] (\w)(\w) [)] (\w) [)] /(($1$2)$3)$4/x; $regexes{ $re } ++; } $debug and print ~~ keys %regexes; for my $re ( sort keys %regexes ){ $re =~ / [(] (\w) (\w) [)] [(] (\w) (\w) [)] /x or next; delete $regexes{ $re }; $re =~ s/ [(] (\w) (\w) [)] [(] (\w) (\w) [)] /(($1$2)$3)$4/x; $regexes{ $re } ++; } $debug and print ~~ keys %regexes; $regexes{ "((((a|c)|g)|t)*)" } ++; $debug and print ~~ keys %regexes; my @regexes = sort { length $a <=> length $b } keys %regexes; y/_/./ for @regexes; $debug2 and print "regexes(30): @regexes[ 0 .. 30 ]"; %regexes = map { $_ => 1 } @regexes; my %HoR; for my $re ( keys %regexes ){ my $cnt = () = $re =~ /\./g; for my $comb ( @{ $combinations[ $cnt - 1 ] } ){ $_ = $re; for my $letter ( split //, $comb ){ s/\./$letter/; } next if /\*/ and 4 == $cnt and /(\w).*\1/; next if / [(] (\w) \| (\w) [)] /x and $1 ge $2; next if / [(] [(] (\w) \| (\w) [)] \| (\w) [)] /x and $1 g +e $2 || $1 ge $3 || $2 ge $3; push @{ $HoR{ $re } }, qr/^$_$/; } } if( $debug ){ my @c = ( 0 ) x 5; my $c = 0; for my $re ( keys %HoR ){ my $cnt = () = $re =~ /\./g; $c[ $cnt - 1 ] += @{ $HoR{ $re } }; $c += @{ $HoR{ $re } }; } print "all regexes: " . $c; print " $_" for @c; } $time and print ' ', tv_interval( $t0 ); ################ MAIN: <>; while( <> ){ @_ = map ~~<>, 1 .. $_; chomp @_; s/\S+ // for @_; $debug2 and print "\@_:@_"; s/(.)\1{2,}/ $1 x 3 /ge for @_; s/(.{2,4})\1{2,}/ $1 x 2 /ge for @_; $time and print ' ', tv_interval( $t0 ); my %uniq; map $uniq{ $_ } ++, @_; @_ = sort keys %uniq; my @regexes = sort { length $a <=> length $b } keys %HoR; for my $str ( @_ ){ $debug and print " candidate abstract regexes: " . @regexes; $debug2 and print " str:[$str]"; my @ok; for my $re ( @regexes ){ $str =~ /^$re$/ and push @ok, $re; } @regexes = @ok; } $debug and print " abstract regexes which match: " . @regexes; $debug2 and print " abstract regexes: @regexes"; my @real; @regexes = map @{ $HoR{ $_ } }, @regexes; for my $str ( @_ ){ $debug and print " candidate regexes: " . @regexes; $debug2 and print " str:[$str]"; my @ok; for my $re ( @regexes ){ $str =~ $re and push @ok, $re; # HOT SPOT } @regexes = @ok; } $debug and print " regexes which match: " . @regexes; my $shortest = shift @regexes; my $strip = join "(.*)", map quotemeta, qw[ (?^:^ $) ]; map { s/^ $strip $/$1/x } $shortest; #^ map { s/^ \Q(?^:^\E //x, s/ \Q$)\E $//x } $shortest; # alternat +ive: OK print $shortest ? $shortest : "Impossible"; $time and print tv_interval( $t0 ); $debug and print '-' x 20; } ########## END MAIN sub combinations { my( $length, $min_set, @letters ) = @_; my $str = join '-', ( join '', @letters ) x $length; $debug2 and print $str; my $re = join join( '-', ( '.*?' ) x 2 ), ('(.)') x $length; $debug2 and print $re; my %combs; $str =~ /$re (?{ $combs{ join '', grep defined, $1, $2, $3, $4, $5 + } ++ }) (*FAIL)/x; my @combs = keys %combs; $debug2 and print "all combs: " . @combs; # @combs = grep { !/(.)\1/ } @combs; # $debug2 and print "all combs: " . @combs; @combs = grep { my %uniq; map $uniq{ $_ } ++, split //; $min_set <= keys %uniq; } @combs; $debug2 and print "unique combs: " . @combs; $debug2 and print "@combs"; return @combs; }
INPUT
* 2 1 a 3 gtc 1 0 3 1 g 2 gg 3 ggg 2 - tcac - gcac 2 - tac - gtc 2 - gag - tcag 2 - ga - ctca 2 - a - acaca 2 - aaa - cccc 3 - aaa - ccc - ttt 4 - a - c - g - t 4 - aaaaa - c - g - t 3 - acac - caca - g 1 - g 1 - gcgc 1 - gcgcg 1 - acacacacacacacacacac 1 - aaaaaaaaaaaaaaaaaaaaaaaaaac 1 - aaaaaaaaaacg 1 - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 1 - acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac 1 - acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +g 1 - acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +acacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacaca +cacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacacac +gt 9 - a - aaaaaaaaaaaaaaaaaaaaaaa - aa - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacccc - aac - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac - ac - aaaaaac - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 32 - a - c - t - g - aa - cc - tt - gg - a - c - t - g - aa - cc - tt - gg - a - c - t - g - aa - cc - tt - gg - a - c - t - g - aa - cc - tt - gg 2 - acgtgca - 2 - acgt - 1 - aaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaa +acccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccc +cctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaa +aaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccc +cccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaa +aaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccc +cccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaa +aaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccc +cccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaa +aaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccc +cccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaa +aaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacc +cccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccct +aaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaa +cccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccccccccccccc +ctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaa +aacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccccccccccc +ccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaa +aaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccccccccc +ccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaa +aaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccccccc +ccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaa +aaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccccc +ccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaa +aaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaaccc +ccccccccccctaaaaaaaaaaaaacccccccccccccctaaaaaaaaaaaaacccccccccccccct 1 - acggcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacggacggacgacgacga +gcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggac +ggacggagcggacggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcagcga +gcgacgacgacgacgagcgacggacggacggacggagcggacggacggacgacgacgagcgacggagcg +agcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggagcgg +acggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgac +gacgagcgacggacggacggacggagcggacggacggacgacgacgagcgacggagcgagcgacggacg +agcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacggacggacg +acgacgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacg +gacggacggacggagcggacggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacg +gcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacggacggacgacgacgagcga +cggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggac +ggagcggacggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcga +cgacgacgacgagcgacggacggacggacggagcggacggacggacgacgacgagcgacggagcgagcg +acggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacgg +acggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacg +agcgacggacggacggacggagcggacggacggacgacgacgagcgacggagcgagcgacggacgagcg +acgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacggacggacgacga +cgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacg +gacggacggagcggacggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcag +cgagcgacgacgacgacgagcgacggacggacggacggagcggacggacggacgacgacgagcgacgga +gcgagcgacggacgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggag +cggacggacggacgacgacgagcgacggagcgagcgacggacgagcgacgaacggcagcgagcgacgac +gacgacgagcgacggacggacggacggagcggacggacggacgacgacgagcgacggagcgagcgacgg +acgagcgacgaacggcagcgagcgacgacgacgacgagcgacggacggacggacggagcggacggacgg +acgacgacgagcgacggagcgagcgacggacgagcgacgat 1 - acgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacg 1 - acgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgt 2 - - 2 - - c 3 - - c - t 3 - - ccc - t 2 - acgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacgt - acgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgacgac +gacgacgacgacga 2 - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacgcgcgcgcgcgcgcgaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaa - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacgcgcgcgcgcgcgcgaaaaaaaaaaaa +aaaaaaaaaaaaaaaaaat 8 - aaaacgcgaaaa - aaaacgcgaaaat - aaaacgcgaaaatttttt - aaaacgcgaaaatttt - aaaacgcgcgaaaatttttt - aaaaaaacgcgaaaatttttt - cgcgaaaatttttt - aaaacgcgaaaattttttcgcg 1 - acg
OUTPUT:
(a|((gt)c)) (c*) (g*) ((((g|t)c)a)c) (((gt)|(ta))c) ((g|(tc))(ag)) ((g|((ct)c))a) ((a|c)*) ((a|c)*) ((t|(a|c))*) ((a|t)|(g|c)) ((((a|c)|g)|t)*) ((a|(c|g))*) g ((gc)*) ((c|g)*) ((ac)*) ((a*)c) ((a*)(cg)) ((a*)c) ((ac)*) (((ac)*)g) ((((ac)*)g)t) ((a|c)*) ((((a|c)|g)|t)*) ((((a|c)|g)|t)*) ((((ac)g)t)*) ((t|(a|c))*) (((c|(a|g))*)t) (((ac)g)*) ((((ac)g)*)t) (c*) (c*) ((c|t)*) ((c|t)*) (((cg)|(a|t))*) (((cg)|(a|t))*) (((cg)|(a|t))*) ((ac)g) real 0m0.616s user 0m0.584s sys 0m0.028s
OUTPUT verbose ($debug = 1;):
4 16 60 168 240 1 4 25 144 235 number of abstract regexes: 235 235 208 203 198 199 all regexes: 15264 8 76 960 3384 10836 candidate abstract regexes: 199 candidate abstract regexes: 60 abstract regexes which match: 45 candidate regexes: 2550 candidate regexes: 954 regexes which match: 292 (a|((gt)c)) -------------------- candidate abstract regexes: 199 abstract regexes which match: 27 candidate regexes: 746 regexes which match: 746 (c*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 60 candidate abstract regexes: 35 abstract regexes which match: 28 candidate regexes: 1194 candidate regexes: 528 candidate regexes: 318 regexes which match: 318 (g*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 114 abstract regexes which match: 114 candidate regexes: 6382 candidate regexes: 284 regexes which match: 249 ((((g|t)c)a)c) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 136 abstract regexes which match: 136 candidate regexes: 9366 candidate regexes: 669 regexes which match: 249 (((gt)|(ta))c) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 136 abstract regexes which match: 76 candidate regexes: 3210 candidate regexes: 277 regexes which match: 245 ((g|(tc))(ag)) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 114 abstract regexes which match: 67 candidate regexes: 2086 candidate regexes: 258 regexes which match: 246 ((g|((ct)c))a) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 60 abstract regexes which match: 34 candidate regexes: 906 candidate regexes: 462 regexes which match: 261 ((a|c)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 136 abstract regexes which match: 136 candidate regexes: 9366 candidate regexes: 318 regexes which match: 255 ((a|c)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 136 candidate abstract regexes: 136 abstract regexes which match: 136 candidate regexes: 9366 candidate regexes: 318 candidate regexes: 255 regexes which match: 243 ((t|(a|c))*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 60 candidate abstract regexes: 60 candidate abstract regexes: 60 abstract regexes which match: 60 candidate regexes: 3940 candidate regexes: 1444 candidate regexes: 466 candidate regexes: 306 regexes which match: 264 ((a|t)|(g|c)) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 136 candidate abstract regexes: 45 candidate abstract regexes: 45 abstract regexes which match: 45 candidate regexes: 2550 candidate regexes: 318 candidate regexes: 257 candidate regexes: 243 regexes which match: 240 ((((a|c)|g)|t)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 114 candidate abstract regexes: 114 abstract regexes which match: 29 candidate regexes: 1002 candidate regexes: 268 candidate regexes: 255 regexes which match: 243 ((a|(c|g))*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 60 candidate regexes: 3940 regexes which match: 1444 g -------------------- candidate abstract regexes: 199 abstract regexes which match: 114 candidate regexes: 6382 regexes which match: 293 ((gc)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 82 candidate regexes: 2922 regexes which match: 261 ((c|g)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 84 candidate regexes: 2062 regexes which match: 293 ((ac)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 114 candidate regexes: 6382 regexes which match: 283 ((a*)c) -------------------- candidate abstract regexes: 199 abstract regexes which match: 82 candidate regexes: 2922 regexes which match: 264 ((a*)(cg)) -------------------- candidate abstract regexes: 199 abstract regexes which match: 114 candidate regexes: 6382 regexes which match: 283 ((a*)c) -------------------- candidate abstract regexes: 199 abstract regexes which match: 85 candidate regexes: 2086 regexes which match: 293 ((ac)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 78 candidate regexes: 1962 regexes which match: 260 (((ac)*)g) -------------------- candidate abstract regexes: 199 abstract regexes which match: 91 candidate regexes: 2194 regexes which match: 258 ((((ac)*)g)t) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 60 candidate abstract regexes: 35 candidate abstract regexes: 28 candidate abstract regexes: 26 candidate abstract regexes: 26 candidate abstract regexes: 26 abstract regexes which match: 26 candidate regexes: 714 candidate regexes: 408 candidate regexes: 318 candidate regexes: 318 candidate regexes: 258 candidate regexes: 256 candidate regexes: 256 regexes which match: 256 ((a|c)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 60 candidate abstract regexes: 35 candidate abstract regexes: 35 candidate abstract regexes: 35 candidate abstract regexes: 35 candidate abstract regexes: 35 candidate abstract regexes: 35 abstract regexes which match: 35 candidate regexes: 2118 candidate regexes: 813 candidate regexes: 318 candidate regexes: 257 candidate regexes: 255 candidate regexes: 243 candidate regexes: 243 candidate regexes: 240 regexes which match: 240 ((((a|c)|g)|t)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 27 abstract regexes which match: 12 candidate regexes: 466 candidate regexes: 466 regexes which match: 240 ((((a|c)|g)|t)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 27 abstract regexes which match: 20 candidate regexes: 614 candidate regexes: 614 regexes which match: 256 ((((ac)g)t)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 86 candidate regexes: 2110 regexes which match: 243 ((t|(a|c))*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 80 candidate regexes: 2010 regexes which match: 243 (((c|(a|g))*)t) -------------------- candidate abstract regexes: 199 abstract regexes which match: 90 candidate regexes: 2170 regexes which match: 265 (((ac)g)*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 80 candidate regexes: 2010 regexes which match: 253 ((((ac)g)*)t) -------------------- candidate abstract regexes: 199 abstract regexes which match: 27 candidate regexes: 746 regexes which match: 746 (c*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 27 abstract regexes which match: 10 candidate regexes: 418 candidate regexes: 418 regexes which match: 304 (c*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 27 candidate abstract regexes: 10 abstract regexes which match: 10 candidate regexes: 418 candidate regexes: 418 candidate regexes: 304 regexes which match: 255 ((c|t)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 27 candidate abstract regexes: 19 abstract regexes which match: 10 candidate regexes: 418 candidate regexes: 418 candidate regexes: 304 regexes which match: 255 ((c|t)*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 80 abstract regexes which match: 80 candidate regexes: 2010 candidate regexes: 249 regexes which match: 243 (((cg)|(a|t))*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 83 abstract regexes which match: 70 candidate regexes: 1770 candidate regexes: 249 regexes which match: 243 (((cg)|(a|t))*) -------------------- candidate abstract regexes: 199 candidate abstract regexes: 85 candidate abstract regexes: 70 candidate abstract regexes: 70 candidate abstract regexes: 70 abstract regexes which match: 70 candidate regexes: 1770 candidate regexes: 249 candidate regexes: 243 candidate regexes: 243 candidate regexes: 243 regexes which match: 243 (((cg)|(a|t))*) -------------------- candidate abstract regexes: 199 abstract regexes which match: 136 candidate regexes: 9366 regexes which match: 669 ((ac)g) -------------------- real 0m0.678s user 0m0.668s sys 0m0.004s

Replies are listed 'Best First'.
Re: contest problem about shortest Regular expression
by choroba (Cardinal) on Nov 28, 2017 at 23:18 UTC
    See chapter 6.5 Regex String Generation in Higher Order Perl, the book by Mark Jason Dominus.
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thank you for this resource, haven't read this book so far. In 6.5 chapter I found explanations how to generate string stream - to generate all strings which are covered by given regex, nice explaning schemes included. So it looks as an opposite problem.
        Ouch, you're right. I misread the assignment, I had the impression it wants the strings. OK, so you just want to optimize a finite automaton generated by joining all the input strings by | :-)

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: contest problem about shortest Regular expression
by QM (Parson) on Dec 01, 2017 at 13:40 UTC
    Has anyone looked at Regexp::Optimizer lately? I played with it many years ago, and it worked well enough for my toy problem. I don't know what state it's in now. It seems not to have been updated for about 5 years, but that may be a good thing.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://1204472]
Approved by herveus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (6)
As of 2024-04-19 10:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found