Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

build regexp on a list of patterns

by mod_alex (Beadle)
on Apr 06, 2005 at 08:07 UTC ( [id://445205]=perlquestion: print w/replies, xml ) Need Help??

mod_alex has asked for the wisdom of the Perl Monks concerning the following question:

Dear perl folk I would greatly appreciate if you tell your ideas for the next task :
TASK: Build a list of regexps on a given list of patterns
---------------- example : INPUT : @list = ('12341', '12342', '12343', '13341', '13342') RESULT : @res=('1[23]34[12]', '12343') ----------------
I hope you got an idea.
thanx in advance
Alex

Replies are listed 'Best First'.
Re: build regexp on a list of patterns
by borisz (Canon) on Apr 06, 2005 at 08:20 UTC
    use Regexp::Assemble.
    use Regexp::Assemble; my @list = ('12341', '12342', '12343', '13341', '13342'); my $ra = Regexp::Assemble->new; $ra->add( @list ); print $ra->re __OUTPUT__ (?-xism:1(?:234[123]|334[12]))
    Boris
Re: build regexp on a list of patterns
by kvale (Monsignor) on Apr 06, 2005 at 08:21 UTC
    Regex::PreSuf can combine a list of words into a more compact set of alternations and character classes:
    use Regex::PreSuf; my $re = presuf(qw(foobar fooxar foozap)); # $re should be now 'foo(?:zap|[bx]ar)'

    -Mark

Re: build regexp on a list of patterns
by bart (Canon) on Apr 06, 2005 at 09:09 UTC
    Apart from the classic Regex::PreSuf, and, apparently (thanks Borisz) Regexp::Assemble, there's Regexp::List too, part of the Regexp::Optimizer package.

    I wonder on the use of the availability of so many similar but not quite identical modules. I'm thinking that somehow, these should be unified into one, containing the best of all worlds, package.

    Or maybe not, thanks to demerphq's recent work that got integrated into Perl 5.9.2 — and likely soon in a release for the general public, either 5.8.x or 5.10.x, too.

      I wonder on the use of the availability of so many similar but not quite identical modules

      My rationale for writing Regexp::Assemble was that when I started looking around for something to do what I needed to be done, I didn't find what I wanted. Regex::PreSuf does not recognise metacharacters, so a\dz and a\sb produces something odd, like a\(?:dz|sb), which is uncompileable.

      Regexp::Optimizer came closer to what I wanted, in that it understands metacharacters, but it doesn't do what I call tail folding. I have a large list of regular expressions that tend to share common tails and I was interested in producing the shortest expression possible (even though it can result in a more complex pattern that may actually perform a bit more slowly than if the tails were left alone).

      For instance, on a sample of 4000 dictionary words weighing about 46Kb, the resulting R::A pattern is 23Kb long, whereas the R::O pattern is 28Kb long. For me the gain of 5Kb was an important consideration.

      Another important issue is that while I consider Regexp::Assemble to be slow, it is no slouch. It assembles 3000 complex patterns into one in about 1.2 seconds. I fed the same set of patterns to Regexp::Optimizer, and 850 cpu seconds later it is still grinding away. (update: still chugging away, now at 11500 seconds. I am beginning to wonder whether it will terminate. 5Mb of core, so it's not swapping... later: 11 CPU hours later, Regexp::Optimize is still running. One may reasonably consider that it doesn't work on large patterns. There must be something exponential happening).

      I look on in interest at demerphq's work in implementing tries in the regular expression motor, but as it doesn't deal with metacharacters at the moment, it's usefulness is somewhat limited for what I'm doing.

      - another intruder with the mooring in the heart of the Perl

        It would be very helpful, if you add this rationale to the pod of Regexp::Assemble.
        Boris

        IMO the only reason R::L beats R::A is because of the char class assertion that is used in R::L. I suspect that you would find that R::A performs equally well once you do this. The other thing is that R::A will change the order that the words are matched, and R::L wont. Im not sure if this is intentional or not as I can see it being a nice idea to sort the strings by order of frequence of their leading char, at least when matching some kind of "normal" text.

        As for my trie work and the upcoming Aho-Corasick patch, you are right it doesnt deal with metacharacters and it probably never will. This is why modules like yours will always have a place on CPAN, there are too many types of pattern manipulation that can occur during optimisation to put them all in perl. The cost of such optimisations are shared across every regex compiled or executed so having rarely useful optimisations built in doesnt make sense, the few cases that get improved are outweighed by the cases that are slowed down by the extra optimisation logic. Modules like yours however only come into play at the users request and as such can be far more "aggressive" in what they do.

        Having said that, I really hope you take the time to educate R::A about the new trie optimisation so that when it can take advantage of the optimisation it does. A good example would be converting a list of the following patterns: ('foo','ba[rz]') into /foo|bar|baz/ which will match much faster.

        ---
        demerphq

      Thanx a lot for your responses. I made a test for all this mentioned above modules.
      Without providing any numbers just want to say I've found most fast and sofisticated Regexp::List
      Alex
Re: build regexp on a list of patterns
by Zaxo (Archbishop) on Apr 06, 2005 at 08:47 UTC

    The regexen we might build for that are neither unique nor equivalent to each other. If you want a regex that only matches something literal in the list,

    my $re = do { local $" = '|'; qr/^@{[map {quotemeta} @list]}$/; };
    Automatically generated regexes are often inefficient or unwieldy in the interest of simpler generating code. Would you take qr/^[11111][22333][33333][44444][12312]$/ as a solution?
    my @sets; for (@list) { my $str = $_; for (0..(length-1)) { $sets[$_] .= substr $str, $_, 1; } } my $re = qr/^${\join '', map {"[$_]"} @sets}$/;
    This is a pretty open-ended problem. You should look carefully at what you expect from your data and what kind of solution you want.

    Update: If the real problem is to construct an efficient test of whether a string is in @list, use a hash:

    { my %test_hash; @test_hash{@list} = (); sub efficient_test { exists $test_hash{+shift}; } }

    After Compline,
    Zaxo

      my $re = do { local $" = '|'; qr/^@{[map {quotemeta} @list]}$/; };
      In this case, it would appear that they are all the same length, but in general, the list should be sorted longest-first, so that @list=qw/foo bar foobar/ will ever match foobar.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://445205]
Approved by tlm
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (6)
As of 2024-04-19 07:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found