Re^2: Splitting multiline string into words, the stuff between words, and newlines

in reply to Re: Splitting multiline string into words, the stuff between words, and newlines
in thread Splitting multiline string into words, the stuff between words, and newlines

This looks to me like it should work, but it splits the strings of non-words into separate characters!

"For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}
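The code from the parent node is not quoted in this reply, so the following is only a minimal sketch that assumes the split was done on a bare `\b{wb}`. The names `$text` and `@parts` are illustrative. It should show each dot coming out as its own fragment:

```perl
use v5.22;       # \b{wb} needs Perl 5.22 or later
use warnings;

my $text  = "For example ...\n";

# Assumption: split on every Unicode word boundary, nothing else
my @parts = split /\b{wb}/, $text;

# Show each piece in braces, with space as _ and newline as $
for my $part (@parts) {
    (my $shown = $part) =~ tr/ \n/_$/;
    print "{$shown}";
}
print "\n";
```

Joining `@parts` back together gives the original string, so nothing is lost; the pieces are just finer than wanted.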


Replies are listed 'Best First'.
Re^3: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 25, 2022 at 09:22 UTC
That is because `\b{wb}` matches between each of those signs. This seems to solve the issue: `my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;` But my knowledge of Unicode and of the `\b{wb}` semantics is rather limited, so it may have other issues.
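As a quick check, here is a small sketch that runs that one-liner on the sample string from the question. The display loop is only an illustration (space shown as `_`, newline as `$`), and `$book` holds the sample rather than a real book:

```perl
use v5.22;       # \b{wb} needs Perl 5.22 or later
use warnings;

my $book = "For example ...\n";

# Capture words (starting with a \w character) and runs of newlines;
# whatever lies between the captures comes back as ordinary split fields.
my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

for my $fragment (@fragments) {
    (my $shown = $fragment) =~ tr/ \n/_$/;
    print "{$shown}";
}
print "\n";
```

The space and the three dots before the newline should now stay together in a single fragment, because only fragments that start with a `\w` character are matched as words.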
Re^4: Splitting multiline string into words, the stuff between words, and newlines by LanX (Saint) on Feb 25, 2022 at 10:22 UTC
Not sure 'cause that's 'bout words also including non \w characters. And some of 'em even start on apostrophe ;)

Cheers Rolf
(addicted to the Perl Programming Language :)
Re^5: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 25, 2022 at 11:00 UTC
Well, specifically for the apostrophe, `\b{wb}` doesn't seem to take initial ones as part of words. It breaks your sample sentence as follows:

    {Not} {_} {sure} {_'} {cause} {_} {that's} {_'} {bout} {_} {words} {_} {also} {_} {including}

But I agree with you that there are probably other cases of words (as defined by `\b{wb}`) that don't start with a character matching `\w`. In the end, my conclusion is that the only way to handle the OP's problem in a way fully consistent with `\b{wb}` semantics is to just split on it, and maybe repack the non-word fragments afterwards:

    my $book = "Not sure 'cause that's 'bout words also including ...\n.\n +_\n\n...";

    my @fragments;
    my $last_was_symbol;
    for (split /\b{wb}/, $book) {
        if (/[\w\n]/) {
            # a word or a newline: always its own fragment
            $last_was_symbol = 0;
            push @fragments, $_;
        }
        else {
            if ($last_was_symbol) {
                # glue consecutive non-word pieces together
                $fragments[-1] .= $_;
            }
            else {
                push @fragments, $_;
                $last_was_symbol = 1;
            }
        }
    }

    # print a fragment in braces, with newline shown as $ and space as _
    sub show {
        my $str = shift;
        $str =~ tr/\n/$/;
        $str =~ tr/ /_/;
        print "{$str} ";
    }

    show $_ for @fragments;
    print "\n";
Re^6: Splitting multiline string into words, the stuff between words, and newlines by LanX (Saint) on Feb 25, 2022 at 11:11 UTC
Re^7: Splitting multiline string into words, the stuff between words, and newlines by salva (Canon) on Feb 25, 2022 at 11:19 UTC
Re^4: Splitting multiline string into words, the stuff between words, and newlines by ibm1620 (Hermit) on Feb 26, 2022 at 21:00 UTC
For my purposes, this is fine. I'm mainly interested in capturing possessives and contractions.
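For anyone who wants to confirm that on their own Perl, a small sketch along these lines counts how many pieces `\b{wb}` produces for a few sample words. The word list and the `%pieces` hash are just illustrations:

```perl
use v5.22;       # \b{wb} needs Perl 5.22 or later
use warnings;

# Hypothetical sample words: three with an inner apostrophe, one with a leading one
my @words = ("that's", "don't", "dog's", "'cause");

my %pieces;
for my $word (@words) {
    my @parts = split /\b{wb}/, $word;
    $pieces{$word} = scalar @parts;
    printf "%-8s -> %d piece(s)\n", $word, $pieces{$word};
}
```

Words with an apostrophe between letters should come out as one piece, while a leading apostrophe is split off, which agrees with the breakdown salva posted above.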

In Section Seekers of Perl Wisdom