Hello ScarletRoxanne,
Here’s an approach which replaces each phrase/term with a temporary marker, then removes stopwords, then replaces the markers with their original terms:
use strict;
use warnings;
use Const::Fast;
use Data::Dump;
const my $DELIM => "\034"; # ASCII FS control character (needs double quotes)
my %stops = map { lc $_ => 1 } qw( I am the of and you are );
my @terms = ('manager of sales', 'chairman of the board');
@terms = sort { length $b <=> length $a } @terms; # longest first
my $file3 = 'I am the Senior Manager of Sales and of Marketing. ' .
'You are the Chairman of the Board of Directors.';
$file3 =~ tr/A-Z/a-z/; # convert to lower case
# replace terms with temporary markers
$file3 =~ s{\Q$terms[$_]\E}{$DELIM$_$DELIM}gi for 0 .. $#terms; # \Q...\E: match terms literally
my @file3 = split /\s+/, $file3;
@file3 = grep { ! exists $stops{$_} } @file3;
for my $entry (@file3)
{
    if ($entry =~ /\Q$DELIM\E(\d+)\Q$DELIM\E/)
    {
        $entry = '*' . $terms[$1] . '*'; # restore the original term
    }
    else
    {
        $entry =~ s{[[:punct:]]}{}g; # remove punctuation
    }
}
print "$_\n" for @file3;
Output:
17:35 >perl 1997_SoPW.pl
senior
*manager of sales*
marketing
*chairman of the board*
directors
17:35 >
Hope that helps,