Re: Regular Expression Builder
by tommyw (Hermit) on Aug 30, 2002 at 15:36 UTC
|
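#!/usr/bin/perl
$vowels='aeiouy';
$cons='bcdfghjklmnpqrstvwxzy'; # 'y' is listed in both classes
# C and V stand for "any consonant" and "any vowel"
%map=(C=>$cons, V=>$vowels);
# every other letter maps to the class(es) it belongs to
for $class ($vowels, $cons) {
  for (split //, $class) {
    $map{$_}.=$class;
  }
}
# one bracketed class per character of the word given as first argument
for $char (split //, shift) {
  $pat.="[$map{$char}]";
}
$re=qr/^${pat}$/i;
print "REGEX is $re\n";
# fall back to the system word list when run interactively without files
@ARGV='/usr/dict/words'
  if -t STDIN && !@ARGV;
while (<>) {
  print if /$re/;
}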
This takes a word and builds a template from it with the same pattern of vowels and consonants. (The original version is commented.) Extending this to handle digits should be easy. The cunning part will be collapsing runs of identical character classes down and using a quantifier instead.
This is, of course, left as an exercise for the reader ;-)
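For readers who want a starting point on that exercise, here is a minimal sketch of one way the collapsing could work. The helper names atom_for and collapsed_pattern are made up for this illustration and are not part of the script above; anything that is neither vowel nor consonant is simply escaped and matched literally, which is an assumption on my part.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $vowels = 'aeiouy';
my $cons   = 'bcdfghjklmnpqrstvwxzy';

# Hypothetical helper: the bracketed class (or escaped literal) for one character.
sub atom_for {
    my ($char) = @_;
    my $set = '';
    $set .= $vowels if index($vowels, $char) >= 0;
    $set .= $cons   if index($cons,   $char) >= 0;
    return length $set ? "[$set]" : quotemeta $char;
}

# Hypothetical helper: merge runs of identical atoms into atom{n}.
sub collapsed_pattern {
    my ($word) = @_;
    my @runs;    # list of [atom, repeat count]
    for my $atom (map { atom_for($_) } split //, lc $word) {
        if (@runs && $runs[-1][0] eq $atom) { $runs[-1][1]++ }
        else                                { push @runs, [$atom, 1] }
    }
    return join '',
        map { $_->[0] . ($_->[1] > 1 ? '{' . $_->[1] . '}' : '') } @runs;
}

print collapsed_pattern('street'), "\n";
# prints [bcdfghjklmnpqrstvwxzy]{3}[aeiouy]{2}[bcdfghjklmnpqrstvwxzy]
```

The result can be dropped into qr/^...$/i exactly like $pat in the script above.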
--
Tommy
Too stupid to live.
Too stubborn to die.
Re: Regular Expression Builder
by erikharrison (Deacon) on Aug 30, 2002 at 16:34 UTC
|
The challenge here is asking yourself, "What kind of regexes do I want my tool to generate?" This makes things a little harder and is one of the reasons that this kind of tool isn't on the market.
A computer program cannot read your mind, obviously. So the regexes generated from a single simple string will be rather simple: there isn't enough data to work with to create a complex expression. For example, should the regex retain length? When should a regex generalize a character into a character class, and when should it match exactly? If we generalize out to a character class, what about a character that could be placed in several different character classes?
While the tool could produce more useful regexes from additional data (such as multiple strings), the question remains: by what rules do we generate a regex from the given data? The rules will vary from project to project, so a tool that has rules built in will not be very useful to others, and as such won't be out there on the market. If you want a tool you can program regex-generating rules into, you get into a layer of abstraction which makes things harder, not easier, on the programmer; you'd be better off writing the regexes yourself.
Some tools that might help you out: Parse::RecDescent, Parse::Yapp and perhaps Regexp::English.
Cheers,
Erik
Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchett
| [reply] |
|
What kind of regexes do I want my tool to generate
E.g. give that generator a handful of strings, have it recognize a possible pattern behind them, and generate the regular expression that recognizes these strings. That would cut the number of possible solutions down to a reasonable amount.
The problem is the pattern recognition. Or is there a module for that?
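I don't know of a module, but the idea can be sketched in a few lines. What follows is only an illustration under strong assumptions: every sample is split into runs of letters, digits, whitespace or single literal characters, and a regex is produced only when all samples share the same sequence of run types. The names runs and generalize are invented for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: split a sample into [atom, length] runs.
sub runs {
    my ($s) = @_;
    my @runs;
    while ($s =~ /\G(?:([a-zA-Z]+)|(\d+)|(\s+)|(.))/gs) {
        my $atom = defined $1 ? '[a-zA-Z]'
                 : defined $2 ? '\d'
                 : defined $3 ? '\s'
                 :              quotemeta $4;
        push @runs, [ $atom, length $& ];
    }
    return @runs;
}

# Hypothetical helper: one regex covering all samples, or nothing
# if the samples do not share the same sequence of run types.
sub generalize {
    return unless @_;
    my @all   = map { [ runs($_) ] } @_;
    my $first = $all[0];
    for my $r (@all) {
        return if @$r != @$first;
        for my $i (0 .. $#$r) {
            return if $r->[$i][0] ne $first->[$i][0];
        }
    }
    my $pat = '';
    for my $i (0 .. $#$first) {
        my @len = map { $_->[$i][1] } @all;
        my ($lo, $hi) = (min(@len), max(@len));
        $pat .= $first->[$i][0]
              . ( $lo != $hi ? "{$lo,$hi}" : $lo > 1 ? "{$lo}" : '' );
    }
    return $pat;
}

print generalize('rich36', 'bob7', 'alice123'), "\n";
# prints [a-zA-Z]{3,5}\d{1,3}
```

Whether {3,5} or + is the right generalization is exactly the kind of rule that has to come from the user, which is the point made in the parent post.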
And it came to pass that in time the Great God Om spake unto Brutha, the Chosen One: "Psst!"
-- (Terry Pratchett, Small Gods)
Re: Regular Expression Builder
by demerphq (Chancellor) on Aug 30, 2002 at 16:19 UTC
|
I doubt that there is a robust way to do this, but here's a really simple way:
my $string="123 abcdef";
$string=~s{(\d+)|(\w+)|(\s+)}
{
defined($1) ? '\\d{'.length($1).'}'
: defined($2) ? '\\w{'.length($2).'}'
: '\\s{'.length($3).'}'
}ge;
print $string;
__END__
\d{3}\s{1}\w{6}
But I don't think this will scale very well... (and it probably has subtle problems anyway)
Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)
| [reply] [d/l] |
|
my $string=" \aabc123def!*#\n";
$string=~s{ ([[:digit:]]+)
|([[:alpha:]]+)
|([[:punct:]]+)
|([[:space:]]+)
|([[:cntrl:]]+)
|(.)
}
{
defined($1) ? '[[:digit:]]{'.length($1).'}'
: defined($2) ? '[[:alpha:]]{'.length($2).'}'
: defined($3) ? '[[:punct:]]{'.length($3).'}'
: defined($4) ? '[[:space:]]{'.length($4).'}'
: defined($5) ? '[[:cntrl:]]{'.length($5).'}'
: "\Q$+\E" # anything else?
}gex;
print $string;
But it still has problems (for example, \n is in both :space: and
:cntrl: so "\n\a" produces [[:space:]]{1}[[:cntrl:]]{1},
but "\a\n" produces [[:cntrl:]]{2}).
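One possible way around that particular order dependence, sketched under the assumption that a fixed priority between the classes is acceptable: classify each character on its own first, then count runs of equal classes. The names classify and describe are made up for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classes in priority order: a character gets the first class it belongs to.
my @classes = map {
    my $atom = '[[:' . $_ . ':]]';
    [ $atom, qr/\A$atom\z/ ]
} qw(digit alpha punct space cntrl);

# Hypothetical helper: the class atom for a single character.
sub classify {
    my ($ch) = @_;
    for my $class (@classes) {
        return $class->[0] if $ch =~ $class->[1];
    }
    return quotemeta $ch;    # anything else is matched literally
}

# Hypothetical helper: collapse runs of equal atoms into atom{n}.
sub describe {
    my ($s) = @_;
    my @runs;
    for my $atom (map { classify($_) } split //, $s) {
        if (@runs && $runs[-1][0] eq $atom) { $runs[-1][1]++ }
        else                                { push @runs, [ $atom, 1 ] }
    }
    return join '', map { $_->[0] . '{' . $_->[1] . '}' } @runs;
}

print describe("\a\n"), "\n";    # prints [[:cntrl:]]{1}[[:space:]]{1}
print describe("\n\a"), "\n";    # prints [[:space:]]{1}[[:cntrl:]]{1}
```

Here \n is always reported as :space: because :space: is tried before :cntrl:, whichever character comes first in the string. It is still a choice made by the tool, not by the user.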
| [reply] [d/l] [select] |
|
One quibble is that because \d is a subset of \w then a string such as "abc123def" will get \w{9} in your version.
Yup. But personally I consider that a feature, not a bug. :-) After all, ldkjdlkjf2098kklls probably isn't [[:alpha:]]+\d+[[:alpha:]]+
But we are both in agreement that there isn't a good way to do this, although as we both have shown there are a variety of bad ways to do it... BTW, is the . really necessary? I don't think it is, as the s/// will just skip the char if it doesn't match.
Oh, and I considered using something like what you posted here, but I figured that since I tend not to use the POSIX char classes that much, others probably wouldn't either.
:-)
Yves / DeMerphq
---
Software Engineering is Programming when you can't. -- E. W. Dijkstra (RIP)
Re: Regular Expression Builder
by Anonymous Monk on Aug 30, 2002 at 16:19 UTC
|
/\w{6}!/
/\w+!/
/[A-Z][a-z]{3}\d\d!/
/Rich36!/
/......./
/\S+/
/.*/
I mean, the tightest or least general thing it could produce when
given a $string is just /\Q$string\E/ and the most general
thing would be /.*/s, and between those is a rather large
space of candidates.
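To make that range concrete, here is a small sketch that produces a few candidates for one string, ordered from tightest to most general, and checks that each one really matches. The function name candidates and the choice of four levels are assumptions made for this illustration; a real tool could produce many more.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: a few regex candidates, tightest first.
sub candidates {
    my ($s) = @_;
    my $typed = '';
    while ($s =~ /\G(?:([A-Za-z]+)|(\d+)|(.))/gs) {
        $typed .= defined $1 ? '[A-Za-z]{' . length($1) . '}'
                : defined $2 ? '\d{' . length($2) . '}'
                :              quotemeta $3;
    }
    return (
        quotemeta($s),                 # exact string
        $typed,                        # same shape, typed runs
        '.{' . length($s) . '}',       # same length only
        '.*',                          # anything at all
    );
}

my $string = 'Rich36!';
for my $re (candidates($string)) {
    my $ok = $string =~ /\A(?:$re)\z/s ? 'matches' : 'FAILS';
    print "/$re/ $ok\n";
}
```

All four match the sample, so the sample alone cannot tell the tool which one was wanted.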
Re: Regular Expression Builder
by zentara (Archbishop) on Aug 30, 2002 at 16:17 UTC
|
There is a bash script at txt2regex that lets you make regexes based on a simple question and answer menu. It might give you an idea.
Re: Regular Expression Builder
by fruiture (Curate) on Aug 30, 2002 at 16:50 UTC
|
Well, 'rich36' could be translated to '\w{4}\d{2}' or to '\w{6}' or '.{6}' ... You need to specify that [a-zA-Z] must become \w and [0-9] must become \d ...
A try:
#!/usr/bin/perl
use strict;
use warnings;
{
  my @classes = (
    ['[a-zA-Z]' => '\w'],
    ['[0-9]' => '\d'],
    ['\w' => '_'], #that's why order matters
    ['.' => '.'],
  );
  sub make_regex {
    local $_ = @_ ? shift : $_;
    my $result = '';
    my $i = -1;
    while( ++$i < @classes ){
      my $p = pos($_) || 0;
      my ($re,$su) = @{ $classes[$i] };
      if( /\G($re+)/g ){
        $result .= $su . '{' . length($1) . '}';
        $i = -1;
      }
      else {
        pos($_) = $p;
      }
    }
    $result
  }
}
printf "%s => %s\n",$_,make_regex for (
  'abc12','123','#+#+#',
)
update: corrected [ and ] again (twice)...
--
http://fruiture.de
Re: Regular Expression Builder
by bart (Canon) on Aug 30, 2002 at 17:41 UTC
|
Just a thought: replace all letters by "A" and all digits by "9". Then apply the Regex::PreSuf thing — or just quotemeta(). And in that result, replace "A" with '\w' and "9" with '\d'.
Intermediate steps, as an example:
@foo23 -> @AAA99 -> \@AAA99 -> \@\w\w\w\d\d
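Spelled out as code, the intermediate steps might look like this. It is only a sketch using quotemeta(); the function name mask_to_regex is made up, and only ASCII letters and digits are treated as maskable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper following the steps above:
# mask letters and digits, escape the rest, then unmask.
sub mask_to_regex {
    my ($s) = @_;
    (my $mask = $s) =~ s/[A-Za-z]/A/g;    # @foo23 -> @AAA23
    $mask =~ s/[0-9]/9/g;                 #        -> @AAA99
    my $re = quotemeta $mask;             #        -> \@AAA99
    $re =~ s/A/\\w/g;                     #        -> \@\w\w\w99
    $re =~ s/9/\\d/g;                     #        -> \@\w\w\w\d\d
    return $re;
}

print mask_to_regex('@foo23'), "\n";      # prints \@\w\w\w\d\d
```

The masking is safe because after the first two substitutions the only letters and digits left in the string are the placeholders themselves.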
| [reply] [d/l] |
|
Clever, but add the step
(actually, merge it with the A9 -> metachar translation):
s%((?:\\w)+)%'\w{'. length($1)/2 .'}'%eg;
...
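The same trick can cover \d as well with a backreference. This is a sketch, not a drop-in: the function name collapse_repeats is invented, and it assumes its input contains no escaped backslash directly followed by w or d (true for the output of the letter/digit masking described above).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn runs of \w or \d into \w{n} or \d{n}.
sub collapse_repeats {
    my ($re) = @_;
    # $2 is one escape (two characters), $1 the whole run of it
    $re =~ s{((\\[wd])\2+)}{ $2 . '{' . length($1) / 2 . '}' }ge;
    return $re;
}

print collapse_repeats('\@\w\w\w\d\d'), "\n";    # prints \@\w{3}\d{2}
```

A single \w or \d is left alone, since the pattern requires at least one repetition.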
--
perl -pew "s/\b;([mnst])/'$1/g"
| [reply] [d/l] |
|
USER: @foo29
RE: /\@foo2\d/
USER: @zzz99
RE: /\@[a-z]{3}\d{2}/
USER: @AAA99
RE: /\@[a-zA-Z]{3}\d{2}/ #Note that 'A' becomes
#[a-zA-Z] rather than [a-z] with /i
#because there may later be a 'z'
#in your user's pattern :)
The code for parsing this shouldn't be too hard to create, but I'd suggest wrapping the following comment in at an earlier stage and parsing the user's pattern looking for repeats as you go.
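A rough sketch of such a parser, handling repeats as it goes. The convention is inferred from the three examples above and is an assumption: '9' means any digit, 'z' any lowercase letter, 'A' any letter, and everything else is literal. The names %wild and pattern_to_regex are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed placeholder characters and what they generalize to.
my %wild = ( 9 => '\d', z => '[a-z]', A => '[a-zA-Z]' );

# Hypothetical helper: translate a user pattern, one run of equal characters at a time.
sub pattern_to_regex {
    my ($pat) = @_;
    my $re = '';
    while ($pat =~ /\G((.)\2*)/gs) {
        my ($run, $ch) = ($1, $2);
        if (exists $wild{$ch}) {
            my $n = length $run;
            $re .= $wild{$ch} . ($n > 1 ? '{' . $n . '}' : '');
        }
        else {
            $re .= quotemeta $run;    # literal characters stay as they are
        }
    }
    return $re;
}

print "/", pattern_to_regex($_), "/\n" for '@foo29', '@zzz99', '@AAA99';
# prints:
# /\@foo2\d/
# /\@[a-z]{3}\d{2}/
# /\@[a-zA-Z]{3}\d{2}/
```

The obvious cost is that a literal 9, z or A can no longer be written in a user pattern without some extra escape rule.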
Re: Regular Expression Builder
by hiseldl (Priest) on Aug 30, 2002 at 18:46 UTC
|
Re: Regular Expression Builder
by Boots111 (Hermit) on Aug 30, 2002 at 15:51 UTC
|
All~
Komodo from ActiveState includes a regular expression toolkit that lets you see what a regex does as you write it, against a sample string.
I know this is not exactly what you are looking for but it might be helpful...
Boots
---
Computer science is merely the post-Turing decline of formal systems theory.
--???
Re: Regular Expression Builder
by mojotoad (Monsignor) on Aug 31, 2002 at 20:58 UTC
|