PerlMonks  

Re^2: Splitting multiline string into words, the stuff between words, and newlines

by ibm1620 (Friar)
on Feb 24, 2022 at 12:50 UTC (#11141614)


in reply to Re: Splitting multiline string into words, the stuff between words, and newlines
in thread Splitting multiline string into words, the stuff between words, and newlines

This looks to me like it should work, but it splits each run of non-word characters into separate characters!

"For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}

Replies are listed 'Best First'.
Re^3: Splitting multiline string into words, the stuff between words, and newlines
by salva (Canon) on Feb 25, 2022 at 09:22 UTC
    That is because \b{wb} matches between those signs.
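    A minimal sketch of that behaviour, assuming Perl 5.22 or later (where \b{wb} is available); the variable names are only illustrative. Splitting on the bare boundary puts every punctuation character in its own field, because a boundary sits between each pair of them:

```perl
use v5.22;       # \b{wb} needs Perl 5.22+; this also enables strict
use warnings;

my $line   = "For example ...\n";
my @pieces = split /\b{wb}/, $line;

# Make blanks and newlines visible, using the same notation as the OP
for my $piece (@pieces) {
    (my $shown = $piece) =~ tr/ \n/_$/;
    print "{$shown}";
}
print "\n";
```

    This should reproduce the breakdown reported above: {For}{_}{example}{_}{.}{.}{.}{$}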

    This seems to solve the issue:

    my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

    But my knowledge of Unicode and of the \b{wb} semantics is rather limited, so it may have other issues.
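    As a quick check of the one-liner above, here is a small sketch run on a made-up sample string (the sample text is an assumption for illustration, not taken from the OP's data). The capture group makes split return each word and each run of newlines, and whatever lies between them comes back as a single field:

```perl
use v5.22;       # \b{wb} needs Perl 5.22+; this also enables strict
use warnings;

# Hypothetical sample text with a contraction and a possessive
my $book = "For example ...\nthat's Bob's\n";

# Captured words and newline runs are returned along with the text between them;
# grep length drops the empty fields split produces between adjacent matches
my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

for my $fragment (@fragments) {
    (my $shown = $fragment) =~ tr/ \n/_$/;
    print "{$shown} ";
}
print "\n";
```

    With this input it should print {For} {_} {example} {_...} {$} {that's} {_} {Bob's} {$}, so the dots stay together and the apostrophes stay inside their words.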

      Not sure 'cause that's 'bout words also including non \w characters.

      And some of 'em even start on apostrophe ;)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Well, specifically for the apostrophe, \b{wb} doesn't seem to take initial ones as part of words. It breaks your sample sentence as follows: {Not} {_} {sure} {_'} {cause} {_} {that's} {_'} {bout} {_} {words} {_} {also} {_} {including}

        But I agree with you that there are probably other cases of words (as defined by \b{wb}) that don't start with a character matching \w.

        In the end, my conclusion is that the only way to handle the OP's problem in a way fully consistent with \b{wb} semantics is to just split using it, and maybe repack non-word fragments afterwards:

        my $book = "Not sure 'cause that's 'bout words also including ...\n.\n +_\n\n...";
        my @fragments;
        my $last_was_symbol;
        for (split /\b{wb}/, $book) {
            if (/[\w\n]/) {
                # Words and newlines always get a fragment of their own
                $last_was_symbol = 0;
                push @fragments, $_;
            }
            elsif ($last_was_symbol) {
                # Repack: glue consecutive non-word pieces together
                $fragments[-1] .= $_;
            }
            else {
                push @fragments, $_;
                $last_was_symbol = 1;
            }
        }

        # Print a fragment with newlines shown as $ and blanks as _
        sub show {
            my $str = shift;
            $str =~ tr/\n/$/;
            $str =~ tr/ /_/;
            print "{$str} ";
        }
        show $_ for @fragments;
        print "\n";
      For my purposes, this is fine. I'm mainly interested in capturing possessives and contractions.
