Split a sentence into words

ThreeMonks has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks. I have a vocabulary and sentence, and want to split the sentence into words in the vocabulary. For example,

my @vocabulary = qw(a abc abcd abd bc);
my $sentence = 'abdaabc';
[download]

How to get the list (abd, a, abc)? I tried the following regexes. Each of them match the sentence, but none of them gives the right answer.

my $pattern = join '|', @vocabulary;

$sentence =~ /^($pattern)+$/;
print $1;

my @list1;
$sentence =~ /^(($pattern)(?{push @list1, $^N}))+$/;
print (join ",", @list1), "\n";

use Regexp::DeferredExecution;
my @list2;
$sentence =~ /^(($pattern)(?{push @list2, $^N}))+$/;
print (join ",", @list2), "\n"
[download]

In the first and the third cases, only the last matched word get printed, while in the second, a list much longer than expected.

Comment on Split a sentence into words Select or Download Code

Replies are listed 'Best First'.
Re: Split a sentence into words by ikegami (Patriarch) on May 30, 2009 at 07:05 UTC
Don't use `my` variables declared outside the regex pattern from within `(?{})`. The problem you are having is that one of the patterns matches, then gets added to `@list1`, then gets unmatched by backtracking. But you never remove it from `@list1` on backtracking. A simple example of this: `>perl -le"'abc1def2' =~ /(?:([a-z])(?{ print $^N }))+2/" a b c b c c d e f` [download] The solution is to use `$^R`. `use strict; use warnings; my @vocabulary = qw( a abc abcd abd bc ); my $sentence = 'abdaabc'; my ($pattern) = map qr/$_/, join '\|', map quotemeta, sort { length($b) <=> length($a) } # optional @vocabulary; use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ @list = @{$^R} }) /x or die("No solution\n"); print( join('-', @list), "\n" ); # abd-a-abc` [download] Without the sort, you'd get `abd-a-a-bc`. If you want all possible solutions: `... use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ push @list, join('-', @{$^R}) }) (?!) /x; die("No solution\n") if !@list; print("$_\n") for @list;` [download] `abd-a-a-bc abd-a-abc` [download]	[reply] [d/l] [select]
Re: Split a sentence into words by akho (Hermit) on May 30, 2009 at 04:45 UTC
`my @vocabulary = qw(abd abcd abc a bc); my $sentence = 'abdaabc'; my $pattern = join '\|', @vocabulary; my @words = $sentence =~ /($pattern)/g;` [download] note that `@vocabulary` has to be sorted in such a way that "longer" words come earlier; i.e. if word x is a prefix of word y, word y must come earlier in the list. Upd Does not actually work; i.e. it works only for some vocabularies; say (abcd, abc, de) will not split 'abcde' right. Things get complicated and computer-sciencey. See bart and ikegami's replies below.	[reply] [d/l] [select]
Re^2: Split a sentence into words by bart (Canon) on May 30, 2009 at 12:27 UTC
note that @vocabulary has to be sorted in such a way that "longer" words come earlier If you depend on a module like Regex::PreSuf, not only will it have the same effect, i.e. matching the longest match possible, but likely, it'll match faster, at least for longer lists, and pre 5.10 perl.	[reply]
Re: Split a sentence into words by bart (Canon) on May 30, 2009 at 12:24 UTC
That reminds me of Abigail's (in)famous "determine if a number is prime using a regex" hack. Where I think that akho's Re: Split a sentence into words falls short, is that I think you ought to be using Perl's backtracking mechanism to make it fit anyway. It's not because parts match, that it'll match as a whole. Here is a dumb, straightforward approach: `# your data my @vocabulary = qw(a abc abcd abd bc); my $sentence = 'abdaabc'; # regex for words my $re = join '\|', @vocabulary; # does it match in any way my $success = $sentence =~ /^(?:$re)$/; # show result print $success \|\| 0;` [download] That displays a rather disappointing yet for a start, encouraging result: `1` [download] Okay, so it matches, but we have no idea how*. We should find a way to mark where it matches, and where it backtracks. But after that, I'm lost. I've tried `(??{code())` in various combinations, trying to capture intermediate state of the regex, including $^R, pos, @- and @+, but I don't get any intuitive results... maybe somebody else can take it up from here? The only other thing that yields interesting results, is `use re 'debug';` [download] but it's not something you can use in a script.	[reply] [d/l] [select]