PerlMonks  

Re^4: Splitting multiline string into words, the stuff between words, and newlines

by LanX (Sage)
on Feb 25, 2022 at 10:22 UTC (#11141638)


in reply to Re^3: Splitting multiline string into words, the stuff between words, and newlines
in thread Splitting multiline string into words, the stuff between words, and newlines

Not sure 'cause that's 'bout words also including non \w characters.

And some of 'em even start on apostrophe ;)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^5: Splitting multiline string into words, the stuff between words, and newlines
by salva (Canon) on Feb 25, 2022 at 11:00 UTC
    Well, specifically for the apostrophe, \b{wb} doesn't seem to take initial ones as part of words. It breaks your sample sentence as follows: {Not} {_} {sure} {_'} {cause} {_} {that's} {_'} {bout} {_} {words} {_} {also} {_} {including}

    But I agree with you that there are probably other cases of words (as defined by \b{wb}) that don't start with a character matching \w.

    In the end, my conclusion is that the only way to handle the OP's problem in a way fully consistent with \b{wb} semantics is to just split using it, and maybe repack non-word fragments afterwards:

    my $book = "Not sure 'cause that's 'bout words also including ...\n.\n +_\n\n...";
    my @fragments;
    my $last_was_symbol;
    for (split /\b{wb}/, $book) {
        if (/[\w\n]/) {
            $last_was_symbol = 0;
            push @fragments, $_;
        }
        else {
            if ($last_was_symbol) {
                $fragments[-1] .= $_;
            }
            else {
                push @fragments, $_;
                $last_was_symbol = 1;
            }
        }
    }

    sub show {
        my $str = shift;
        $str =~ tr/\n/$/;
        $str =~ tr/ /_/;
        print "{$str} ";
    }

    show $_ for @fragments;
    print "\n";
      > \b{wb} doesn't seem to take initial ones as part of words

      good catch!

      > my conclusion is that the only way to handle the OP problem in a way fully consistent with \b{wb} semantics is to just split using it, and maybe repack non-word fragments afterwards

      My intuition says split on non-words like whitespace, reject "words" without \w or equivalent characters and repack the rest afterwards.

      I doubt it's possible to cover all desirable edge cases with \b{wb}; this will depend on the user's perspective, especially when considering multi-language environments and Unicode.
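A minimal sketch of the whitespace-first idea, not anyone's tested solution. It assumes "non-words" means single newlines and runs of other whitespace, that a "word" is any token containing at least one \w character, and that newlines should stay separate fragments as in the \b{wb} version. The names `@tokens` and `$last_was_filler` are made up for illustration.

```perl
use strict;
use warnings;

my $book = "Not sure 'cause that's 'bout ...\n.\n";

# Split on single newlines or runs of other whitespace. The capture group
# makes split return the separators as well; grep drops empty fields.
my @tokens = grep { length } split /(\n|[^\S\n]+)/, $book;

my @fragments;
my $last_was_filler = 0;
for my $tok (@tokens) {
    if ($tok eq "\n" or $tok =~ /\w/) {
        # Newlines and tokens containing a \w character stand alone.
        push @fragments, $tok;
        $last_was_filler = 0;
    }
    elsif ($last_was_filler) {
        # Repack: glue consecutive non-word tokens together.
        $fragments[-1] .= $tok;
    }
    else {
        push @fragments, $tok;
        $last_was_filler = 1;
    }
}

# Show fragments separated by '|', with newlines rendered as '$'.
print join('|', map { s/\n/\$/gr } @fragments), "\n";
```

Unlike the \b{wb} split, this keeps a leading apostrophe attached to its word ('cause, 'bout), but it also keeps trailing punctuation attached, so a word followed directly by a comma would come out as one fragment. Which behaviour is preferable depends on what the caller counts as a word.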

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        This seems to work too:
        my @fragments = $book =~ /\G(?:[^\n\w]+?\b{wb})+|.+?\b{wb}/sg;
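One way to read that pattern, offered as a sketch: the first alternative chains runs of characters that are neither \w nor a newline, each run ending on a word boundary, so neighbouring symbol fragments end up packed together; the second alternative takes the shortest stretch up to the next boundary, which covers words and newlines. The check below only confirms that the fragments concatenate back to the input. The sample string is shortened for illustration, and \b{wb} needs Perl 5.22 or later.

```perl
use strict;
use warnings;
use v5.22;    # \b{wb} is available from Perl 5.22 on

my $book = "Not sure 'cause that's 'bout words\n.\n\n...";

# In list context with /g, the match returns every matched fragment.
my @fragments = $book =~ /\G(?:[^\n\w]+?\b{wb})+|.+?\b{wb}/sg;

# Every character should land in exactly one fragment.
my $roundtrip = join '', @fragments;
print $roundtrip eq $book ? "round trip ok\n" : "round trip FAILED\n";
```

A round trip does not prove the fragments match the loop-based version on every input; comparing the two result lists on real text would be the next thing to try.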
