Re: Determing what part of a regex matched.

in reply to Determing what part of a regex matched.

Here is how I would tokenize it. Note that \d matches a subset of what \w matches, so any tokenizer that uses both as separate token classes is probably broken.
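To make the overlap concrete, here is a minimal sketch (the sample string 'abc123' and the variable names are just illustrations, not from the original question). It shows \w+ swallowing the digits that a later \d+ rule would have wanted, while an explicit letter class stops where the digits begin:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sample = 'abc123';    # hypothetical input: letters followed by digits

# \w covers letters, digits and underscore, so it runs through the digits.
my ($with_w) = $sample =~ /\A(\w+)/;

# An explicit letter class stops at the first digit.
my ($with_class) = $sample =~ /\A([a-zA-Z_]+)/;

# Every ASCII digit is matched by \d and by \w alike.
my @both = grep { /\A\d\z/ && /\A\w\z/ } 0 .. 9;

print "\\w+ captured:         $with_w\n";
print "[a-zA-Z_]+ captured:  $with_class\n";
print "digits matched by both: ", scalar(@both), "\n";
```

This is why the code below uses [a-zA-Z_] for words and keeps \d for numbers, so the two classes never compete for the same character.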

#!/usr/bin/perl -wT
use strict;

my $text = 'The world is foo 2!';

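my (@words,@numbers,@spaces,@others);
# pos() is undef before the first match, so treat that as offset 0;
# compare numerically, since both sides are offsets
while((pos($text)||0) < length($text)) {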
  if ($text =~ /\G([a-zA-Z_]+)/gc) {
    push @words, $1;  # or call whatever handler you want
  } elsif ($text =~ /\G(\d+)/gc) {
    push @numbers, $1;
  } elsif ($text =~ /\G(\s+)/gc) {
    push @spaces, $1;
  } elsif ($text =~ /\G([^\w\s]+)/gc) {
    push @others, $1;
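  } else {
    # Nothing matched here (possible for a \w character outside
    # [a-zA-Z_0-9]); pos() does not advance, so stop instead of looping forever.
    die "tokenizer is broken at offset ", pos($text) || 0, "\n";
  }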

}

print "W: @words\n";
print "N: @numbers\n";
print "S: @spaces\n";
print "O: @others\n";

__END__
W: The world is foo
N: 2
S:        
O: !
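
If you need to know which class each token came from, in order, the same \G and /gc technique can be driven from a table. This is only a sketch under my own assumptions: the tokenize() helper, the @rules table and the type names are hypothetical, not anything from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered table of [token type, pattern]; the classes do not overlap.
my @rules = (
    [ word   => qr/[a-zA-Z_]+/ ],
    [ number => qr/\d+/        ],
    [ space  => qr/\s+/        ],
    [ other  => qr/[^\w\s]+/   ],
);

# Hypothetical helper: returns a list of [type, text] pairs in input order.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    pos($text) = 0;    # start at the beginning, avoids an undef pos()
  TOKEN:
    while (pos($text) < length $text) {
        for my $rule (@rules) {
            my ($type, $re) = @$rule;
            # \G anchors at pos(); /c keeps pos() in place when a rule fails
            if ($text =~ /\G($re)/gc) {
                push @tokens, [ $type, $1 ];
                next TOKEN;
            }
        }
        # No rule matched, so pos() cannot advance: bail out.
        die "no rule matches at offset " . pos($text) . "\n";
    }
    return @tokens;
}

my @tokens = tokenize('The world is foo 2!');
print join(' ', map { "$_->[0]($_->[1])" } @tokens), "\n";
```

Each token carries its type, so a handler can dispatch on $_->[0] instead of keeping four separate arrays.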

-Blake
