PerlMonks  

Re^4: phrase match

by ambrus (Abbot)
on Dec 13, 2009 at 12:58 UTC


in reply to Re^3: phrase match
in thread phrase match

That is useful sometimes, but here it's not needed, because a lookahead is enough.

Run this:

use warnings;

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
my @phrases = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;
$sentence =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g;
print $sentence, "\n";

You get the output

kinase #inhibitor# #SET6# activates #p16(INK4A)# in cell-wall.

Update: Just as a curiosity, there are also ways to do this kind of thing without lookaheads or lookbehinds. Replace the substitution statement above with either

$sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;
or
use 5.010;
given ($sentence) {
    s/ /  /g;    # double every space
    s/(^| )($phrases_re)( |$)/$1#$2#$3/g;
    s/  / /g;    # collapse the doubled spaces again
}
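To see why the second variant can get by with a single pass, it may help to look at the string after each step. The sketch below assumes the first and last substitutions are meant to double every space and then collapse the doubles again. The sub name mark_by_doubling is made up for illustration, and a plain for loop stands in for given as the topicalizer.

```perl
use strict;
use warnings;

my @phrases    = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# Hypothetical helper: returns the string as it looks after each of the
# three steps, so the intermediate states can be inspected.
sub mark_by_doubling {
    my ($sentence) = @_;
    my @steps;
    for ($sentence) {    # aliases $_ to $sentence
        s/ /  /g;        # 1. every space becomes two spaces
        push @steps, $_;
        s/(^| )($phrases_re)( |$)/$1#$2#$3/g;    # 2. each match consumes one space per side
        push @steps, $_;
        s/  / /g;        # 3. collapse the doubled spaces again
        push @steps, $_;
    }
    return @steps;
}

print "$_\n" for mark_by_doubling('kinase inhibitor SET6 activates p16(INK4A) in cell-wall.');
```

Because every gap holds two spaces after step 1, a match can consume one of them and still leave one for the neighbouring phrase, so adjacent phrases such as inhibitor and SET6 are both found in the same pass.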

Update: One more alternative is below.

my %phrase;
$phrase{$_}++ for @phrases;
my @sentence = split /( +)/, $sentence;
for (@sentence) {
    $phrase{$_} and $_ = "#" . $_ . "#";
}
$sentence = join "", @sentence;
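The split-based alternative works because a capturing group in the split pattern makes split return the separators too, so joining the pieces restores the original spacing. Here is a small sketch of that idea. The sub names tokens and mark_tokens are invented for illustration. One consequence worth noting: a phrase that itself contains a space can never equal a single token, so this variant only marks single-word phrases.

```perl
use strict;
use warnings;

# Hypothetical helper: split on runs of spaces, keeping the runs themselves
# in the result thanks to the capturing parentheses.
sub tokens {
    my ($text) = @_;
    return split /( +)/, $text;
}

# Hypothetical helper: wrap every token that is exactly one of the phrases.
sub mark_tokens {
    my ($sentence, @phrases) = @_;
    my %is_phrase = map { $_ => 1 } @phrases;
    return join '', map { $is_phrase{$_} ? "#" . $_ . "#" : $_ } tokens($sentence);
}

print join('|', tokens('a  b c')), "\n";
print mark_tokens('kinase inhibitor SET6', 'kinase inhibitor', 'SET6'), "\n";
```

In the second call only SET6 gets marked, since 'kinase inhibitor' spans two tokens.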

Update: Oh, let's not forget this one either.

$sentence =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
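The double negative in (?<![^ ]) reads as "not preceded by a non-space character", which holds both at the very start of the string and right after a space, so it replaces the (^| ) capture without consuming anything. A minimal sketch comparing the two forms follows. The sub names mark_lookbehind and mark_capture are made up for illustration.

```perl
use strict;
use warnings;

my @phrases    = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# Zero-width on both sides: nothing around the phrase is consumed.
sub mark_lookbehind {
    my ($s) = @_;
    $s =~ s/(?<![^ ])($phrases_re)(?= |$)/#$1#/g;
    return $s;
}

# Leading space is captured and put back; trailing side is a lookahead.
sub mark_capture {
    my ($s) = @_;
    $s =~ s/(^| )($phrases_re)(?= |$)/$1#$2#/g;
    return $s;
}

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
print mark_lookbehind($sentence), "\n";
print mark_capture($sentence),    "\n";
```

Both calls should print the same marked-up sentence as the original lookahead version.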

Re^5: phrase match
by JadeNB (Chaplain) on Dec 13, 2009 at 18:29 UTC

    Thanks for pointing out the error in my ‘fixed’ code!

    $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 0, 1;

    I wanted to point out a non-error in your correction above, since it took me a minute to understand its purpose. If you did the global replacement without the for modifier, you'd have the same problem that Crackers2 pointed out with my original: overlapping matches wouldn't be handled, because the leading space of the trailing match would already have been gobbled up as the trailing space of the leading match. If I'm understanding correctly, the for 0, 1 just makes another pass to pick up any matches that were missed this way.
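The effect described here can be checked by running the space-consuming substitution once and then twice on the sample sentence from the parent node. This is only a sketch: the sub name mark_consuming and its $passes parameter are invented for illustration.

```perl
use strict;
use warnings;

my @phrases    = ('kinase i', 'inhibitor', 'tor SET6', 'SET6', 'p16(INK4A)', 'cell');
my $phrases_re = join '|', map { quotemeta } @phrases;

# Hypothetical helper: apply the space-consuming substitution $passes times.
sub mark_consuming {
    my ($sentence, $passes) = @_;
    $sentence =~ s/(^| )($phrases_re)( |$)/$1#$2#$3/g for 1 .. $passes;
    return $sentence;
}

my $sentence = 'kinase inhibitor SET6 activates p16(INK4A) in cell-wall.';
print mark_consuming($sentence, 1), "\n";    # SET6 is skipped: its leading space was consumed
print mark_consuming($sentence, 2), "\n";    # the second pass picks SET6 up
```

After one pass the space between inhibitor and SET6 has been eaten by the inhibitor match, so SET6 stays unmarked until the second pass.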
