Apply A Set Of Regexes To A String

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Apply A Set Of Regexes To A String by tmoertel (Chaplain) on Oct 11, 2004 at 05:37 UTC
I often prefer solutions that are declarative in nature. Rather than writing code to do the work, I write code to interpret or compile a description of the work into the code that does the actual work. In your problem, for example, we have the following situation: We have regex operations. We want to apply certain of the operations to certain pages. Given a page, we want to know which regex operations to apply, and then we want to apply them. Since I don't know the specifics of your situation, let's say that you're working on books and that you deal with three kinds of pages: front matter, body, and index. Let's further say that each page has two properties: (1) its content (the text to appear on the page) and (2) its page type (one of the three we listed earlier). Now, let's say that we have the following rules for processing the pages: All pages are expected to have a ::PAGENUM:: placeholder that shall be replaced by the page number during processing. On front-matter pages, however, the page number shall be displayed as a roman numeral. Front-matter pages may may contain ::COPYRIGHT:: and ::PRINTING:: placeholders that shall be replaced by copyright and printing information. These placeholders are ignored on other kinds of pages. Body pages require no additional processing for now (but might later). Index pages require no additional processing for now (but might later). I would probably convert the rules into a simple text-based specification that is easy for humans to understand and edit: `body: +all_pages front_matter: s/::PAGENUM::/roman_numeral($page_number)/eg; s/::COPYRIGHT::/Copyright 2004 blah, blah/g; s/::PRINTING::/1st printing, Blah Blah Press/g; +all_pages index: +all_pages all_pages: s/::PAGENUM::/$page_number/eg;` [download] The spec's meaning is straightforward. Each page type is represented by a labeled section. Each section contains a bit of Perl code that gives the substitutions to be performed on pages of that type. Further, to make reuse easy, we define lines of the form `+`label to mean "and now do the stuff specified in the section labeled label, too." The idea is to be able to convert this specification into an engine that makes it easy process pages given their page types. For example, to process and print out a book, this is all the more complicated we should need to get: `my $page_engine = make_regex_engine_from_spec( $spec_fh ); my $page_number = 1; for my $page (@book_pages) { print $page_engine->( @$page{'content','page_type'} ), "\n"; $page_number++; }` [download] That's pretty simple, right? But like most things in life this simplicity comes as a price: We must write the code that reads the spec and converts it into an engine for us. Fortunately, the price is isn't too high: sub make_regex_engine_from_spec { my $fh = shift; # filehandle contains spec my %sections; my $label; # read in spec while (<$fh>) { chomp; next unless /\S/; # skip blanks if (/^(\w+):/) { $label = $1; } else { die "syntax error: need a section label\n" unless $label; push @{$sections{$label}}, $_; } } # compile spec into code my $interpret = sub { local $_ = shift; if ( /^ \s* \+ (\w+) /x ) { if ($sections{$1}) { return '$sections{'.$1.'}->();'; } die "there is no section named '$1'"; } return $_; }; while (($label, my $section) = each %sections) { my $generated_code = join "\n", 'sub {', (map $interpret->($_), @$section), "}\n"; $sections{$label} = eval $generated_code or die "couldn't eval section $label: $@"; } # return processor engine that embodies compiled spec return sub { # args: page content, page type (local $_, my $page_type) = @_; my $processor = $sections{$page_type}; $processor->() if $processor; return $_; } } [download] That might seem like a lot of code. However, it's of constant size and won't change as our regex needs grow and become more complicated. All we'll need to do is change our spec, which we expect will be easier than writing the equivalent code by hand. We're hoping that the simplicity and cost savings of the specification language more than pay for the one-time cost of having to write that function above. To test out the spec-based system, let's create some pages of various types: `my @book_pages = ( { page_type => 'front_matter', content => "This is the copyright page (::PAGENUM::).\n" . "::COPYRIGHT::\n" . "::PRINTING::\n" }, { page_type => 'body', content => "This is a body page (::PAGENUM::).\n" }, { page_type => 'index', content => "This an index page (::PAGENUM::).\n" }, );` [download] And here's what the pages look like when processed sequentially as a book using the `for` loop from earlier: `This is the copyright page (i). Copyright 2004 blah, blah 1st printing, Blah Blah Press This is a body page (2). This an index page (3).` [download] Each of the page types was processed as expected. All of the expected placeholders were replaced on all pages. The copyright page (which is front matter) has a roman-numeral page number. Looks like we're ready to print our book. :) So that's how I might do it: (1) Write a spec. (2) Write code to convert the spec into worker code. (3) Use the worker code to do the work. Cheers, Tom P.S. The complete code, ready to run, is below for your convenience: Read more... (3 kB) Tom Moertel : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder	[reply] [d/l] [select]
Re: Apply A Set Of Regexes To A String by tachyon (Chancellor) on Oct 11, 2004 at 02:37 UTC
I would suggest this. You can simplify the hash if you are happy to use the substitution pattern as the top level key. The guts of the approach is to build an alternation RE dynamically and use the match value to lookup the replacement value in a hash. This is typically the fastest approach as you leverage the C code in the regex and hashing engines effectively. Note the sort by longest first so we match on the full 'foobar' not 'foo' or 'bar'. `my $res = { re1 => { foo => 'foo_new' }, re2 => { bar => 'bar_new' }, re3 => { qux => 'bar_new' }, re4 => { foobar => 'foobar_new' } }; my @required = qw ( re1 re2 re4 ); my %active_re = map{ each %{$res->{$_}} } @required; my $match = join '\|', sort{ length $b <=> length $a } keys %active_re; $match = qr/($match)/; $str = 'foo bar baz foobar'; $str =~ s/$match/$active_re{$1}/g; print $str;` [download] cheers tachyon	[reply] [d/l]
Re^2: Apply A Set Of Regexes To A String by Cody Pendant (Prior) on Oct 12, 2004 at 00:22 UTC
Thanks Tachyon, I think I'm going to go with something like that solution. If you've got a moment, can you say just a little more about why it would be particularly efficient, with regard to the "C code in the regex and hashing engines"? Thanks. ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print	[reply]
Re^3: Apply A Set Of Regexes To A String by tachyon (Chancellor) on Oct 12, 2004 at 06:11 UTC
One of the main benefits of alternation is that you can compile the RE. Essentially an alternation RE is very similar to the loop, but the loop code has been optimised to the task and is in C with alternation, but generalised and less efficient if you do it in perl. The difference is significant, at least that is what this Benchmark shows..... use Benchmark 'cmpthese'; $iterations = 1000000; %re = ( foo => 'foo1', bar => 'bar1', ); $re = join '\|', keys %re; $re = qr/($re)/; $name1 = "RE"; $code1 = << 'END_CODE1'; $_ = 'foo bar'; s/$re/$re{$1}/g; END_CODE1 $name2 = "Loop"; $code2 = << 'END_CODE2'; $_ = 'foo bar'; for my $sub( keys %re ) { s/$sub/$re{$sub}/g; } END_CODE2 cmpthese( $iterations, {$name1 => $code1, $name2 => $code2} ); __END__ Benchmark: timing 1000000 iterations of Loop, RE... Loop: 24 wallclock secs (24.77 usr + 0.00 sys = 24.77 CPU) @ 40 +377.94/s (n=1000000) RE: 8 wallclock secs ( 7.27 usr + 0.00 sys = 7.27 CPU) @ 13 +7551.58/s (n=1000000) Rate Loop RE Loop 40378/s -- -71% RE 137552/s 241% -- [download] But if I change that to a more real world situation by making the string a 14Kb one (approximately a web page size) `$_ = 'foo bar' x 2000; Benchmark: timing 10000 iterations of Loop, RE... Loop: 20 wallclock secs (19.90 usr + 0.00 sys = 19.90 CPU) @ 50 +2.56/s (n=10000) RE: 30 wallclock secs (29.44 usr + 0.00 sys = 29.44 CPU) @ 33 +9.64/s (n=10000) Rate RE Loop RE 340/s -- -32% Loop 503/s 48% -` [download] And now the loop is faster. In fact try this case: `$_ = 'fo ba' x 1000 . 'foo bar'; Benchmark: timing 10000 iterations of Loop, RE... Loop: 1 wallclock secs ( 0.71 usr + 0.00 sys = 0.71 CPU) @ 14 +064.70/s (n=10000) RE: 12 wallclock secs (11.10 usr + 0.00 sys = 11.10 CPU) @ 90 +1.23/s (n=10000) Rate RE Loop RE 901/s -- -94% Loop 14065/s 1461% --` [download] ~This is a purpose designed worst case for alternation as it requires continouous back tracking. So I have shattered my own delusions! Perl loops are faster than RE alternation. cheers tachyon	[reply] [d/l] [select]
Re: Apply A Set Of Regexes To A String by atcroft (Abbot) on Oct 11, 2004 at 02:42 UTC
If the regexen are mutually exclusive (ie, they do not affect the same things), then perhaps something along the lines of the following: my %regexen = ( '' => [ { 'replace' => qr/blah/, 'with' => 'duh' }, { 'replace' => qr/baz/, 'with' => 'quuz' }, { 'replace' => qr/(b)ar/, 'with' => '$1um' } ], 'cgi-bin' => [ { 'replace' => qr!usr/local/bin/perl!, 'with' => 'usr/bin/perl' } ] ); foreach my $directory ( '', 'cgi-bin' ) { foreach my $file ( glob("$directory/") ) { next if ( $file =~ m/\.(jpg\|png\|gif\|zip\|gz\|jar\|pl\|bak\|\~)$/i ); print "Processing... $file (old file as ${file}.bak\n"; open( INF, $file ) or die("Can't open $file for input: $!\n"); open( OUTF, '>' . $file . '.new' ) or die("Can't open $file.new for output: $!\n"); binmode(INF); binmode(OUTF); while ( my $line = <INF> ) { foreach my $rx ( @{ $regexen{$directory} } ) { $line =~ s/$rx->{'replace'}/$rx->{'with'}/; } print OUTF $line; } close(INF); close(OUTF); rename( $file, $file . '.bak' ); rename( $file . '.new', $file ); } } [download] This would loop thru a directory at a time, making changes to each file (and saving the original with a .bak extension). The only problem with the code above is if the modifications affect the same things, in which case if all modifications affect all files, you would add them under the '' set, otherwise you would have to add them to the set for each of the directories (or files, if you need to break it down in that way). I'm not sure if this is necessarily the best way, but it is an idea. Hope it helps.	[reply] [d/l]
Re: Apply A Set Of Regexes To A String by mvc (Scribe) on Oct 11, 2004 at 14:05 UTC
why not use `$_`? for ($html) { s/foo/bar/; s/baz/quux/; s/monkey/pants/; }	[reply] [d/l]
Re: Apply A Set Of Regexes To A String by TedPride (Priest) on Oct 11, 2004 at 19:50 UTC
You can assign a processed (intepreted? compiled?) regex to a variable and then reuse it as many times as you want. Seems to me that the simplest thing to do would be set up all the regexes in advance and assign each file a list of them to run through. Syntax is as follows: http://perlmonks.thepen.com/389503.html Is this associated with perlmonks.com?	[reply]
Re: Apply A Set Of Regexes To A String by ambrus (Abbot) on Oct 11, 2004 at 18:41 UTC
Flex(1) might be a solution here. You just add rules for all the regexps you want to search for, but REJECT the ones you do not want to match.	[reply]
Re: Apply A Set Of Regexes To A String by Cody Pendant (Prior) on Oct 12, 2004 at 00:19 UTC
Thanks everyone, that's given me a lot to think about -- far more than I ever expected! ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print	[reply]
Re: Apply A Set Of Regexes To A String by paulbort (Hermit) on Oct 12, 2004 at 15:17 UTC
Have you looked at Parse::RecDescent? For large parsing jobs with lots of options it seems to make a lot of sense. The BNF is human-readable (depending on your human). -- Spring: Forces, Coiled Again!	[reply]