http://qs321.pair.com?node_id=11141614


in reply to Re: Splitting multiline string into words, the stuff between words, and newlines
in thread Splitting multiline string into words, the stuff between words, and newlines

This looks to me like it should work, but it splits the strings of non-words into separate characters!

"For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}
Re^3: Splitting multiline string into words, the stuff between words, and newlines
by salva (Canon) on Feb 25, 2022 at 09:22 UTC
    That is because \b{wb} matches between each of those punctuation characters, so every dot ends up as its own field.
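A minimal sketch of that behaviour, assuming Perl 5.22 or later (the first release with \b{wb}). The sample string is the one from the question; the variable names are only illustrative. Splitting on \b{wb} alone should reproduce the output quoted above, with a boundary between each pair of adjacent dots:

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} was introduced in Perl 5.22

my $line   = "For example ...\n";
my @pieces = split /\b{wb}/, $line;

# Make whitespace visible, as in the question: space -> _, newline -> $
for my $piece (@pieces) {
    (my $shown = $piece) =~ tr/ \n/_$/;
    print "{$shown}";
}
print "\n";
```

Nothing is lost by the split: joining @pieces with the empty string gives back the original line, so the only problem is how finely the punctuation is cut.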

    This seems to solve the issue:

    my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

    But my knowledge of Unicode and of the \b{wb} semantics is rather limited, so it may have other issues.
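As a rough check of the one-liner above, here is a small self-contained run. The contents of $book are an assumption (the original text is not shown in the thread), and Perl 5.22+ is assumed for \b{wb}. The capture group returns each word or newline run, and whatever lies between two captures comes back as a single separator field:

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl 5.22 or later

# Sample text is an assumption; the original $book is not shown in the thread.
my $book = "For example ...\nthat's it\n";

# Capture either a "word" (a \w character up to the next word boundary)
# or a run of newlines; text between two captures is kept as one field.
my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

# Show spaces as _ and newlines as $, separated by |
print join('|', map { my $s = $_; $s =~ tr/ \n/_$/; $s } @fragments), "\n";
```

With this input the space and the three dots should come back together as one fragment, and "that's" should stay whole. The grep length is there to drop the empty fields that split produces when two captures are adjacent.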

      Not sure 'cause that's 'bout words also including non \w characters.

      And some of 'em even start on apostrophe ;)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Well, specifically for the apostrophe, \b{wb} doesn't seem to treat a leading one as part of the word. It breaks your sample sentence as follows:

        {Not} {_} {sure} {_'} {cause} {_} {that's} {_'} {bout} {_} {words} {_} {also} {_} {including}

        But I agree with you that there are probably other cases of words (as defined by \b{wb}) that don't start with a character matching \w.

        In the end, my conclusion is that the only way to handle the OP's problem in a way fully consistent with \b{wb} semantics is to just split on it, and maybe repack the non-word fragments afterwards:

        use strict;
        use warnings;

        my $book = "Not sure 'cause that's 'bout words also including ...\n.\n +_\n\n...";

        my @fragments;
        my $last_was_symbol;
        for (split /\b{wb}/, $book) {
            if (/[\w\n]/) {
                # Words and newlines always become fragments of their own.
                $last_was_symbol = 0;
                push @fragments, $_;
            }
            else {
                if ($last_was_symbol) {
                    # Glue consecutive non-word pieces back together.
                    $fragments[-1] .= $_;
                }
                else {
                    push @fragments, $_;
                    $last_was_symbol = 1;
                }
            }
        }

        # Print a fragment with newlines shown as $ and spaces as _.
        sub show {
            my $str = shift;
            $str =~ tr/\n/$/;
            $str =~ tr/ /_/;
            print "{$str} ";
        }

        show $_ for @fragments;
        print "\n";
      For my purposes, this is fine. I'm mainly interested in capturing possessives and contractions.
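To illustrate the point about possessives and contractions, here is a small comparison of plain \b with \b{wb}. The sample sentence is made up for this sketch, and Perl 5.22+ is assumed:

```perl
use strict;
use warnings;
use 5.022;    # \b{wb} needs Perl 5.22 or later

my $text = "that's the dog's bowl";    # illustrative sample, not from the thread

my @plain = split /\b/,     $text;     # classic \w versus \W boundary
my @wb    = split /\b{wb}/, $text;     # Unicode word boundary

print join('|', @plain), "\n";
print join('|', @wb),    "\n";
```

Plain \b treats the apostrophe as a non-word character and cuts "that's" into three pieces, whereas \b{wb} keeps an apostrophe that sits between two letters inside the word.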