
Splitting multiline string into words, the stuff between words, and newlines

by ibm1620 (Hermit)
on Feb 24, 2022 at 00:15 UTC ( [id://11141603]=perlquestion )

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

I want to split up an ASCII text document into words (as recognized by \b{wb}), strings of the non-word characters between words, and strings of newlines.

The following code almost works, but instead of treating the newlines as separate tokens, it lumps them in with the adjacent non-word characters:

#!/usr/bin/env perl
use strict;
use warnings;

my $book = do {local $/; <DATA>};    # slurp the book

# Split book into words (delimited by \b{wb}), sequences of newlines,
# and sequences of anything else.
while ($book =~ /( ( \W+ ) | ( \b{wb}.+?\b{wb} ) | ( \n+ ) ) /xg) {
    show($1);
}
print "\n";

# show(): make spaces and newlines visible
sub show {
    my $str = shift;
    $str =~ tr/\n/$/;
    $str =~ tr/ /_/;
    print "{$str}\n";
}

__DATA__
--First paragraph--
Second one's followed by only one newline. "Hello," she said, "How's tricks?"

Third paragraph doesn't end with any punctuation ... and the splitting works

4th one is separated by two newlines.

         The End.

The output is:

{--}
{First}
{_}
{paragraph}
{--$}             <- The newline ('$') should be separate group
{Second}
{_}
{one's}
{_}
{followed}
{_}
{by}
{_}
{only}
{_}
{one}
{_}
{newline}
{._"}
{Hello}
{,"_}
{she}
{_}
{said}
{,_"}
{How's}
{_}
{tricks}
{?"$$}            <- the two newlines should be a separate group
{Third}
{_}
{paragraph}
{_}
{doesn't}
{_}
{end}
{_}
{with}
{_}
{any}
{_}
{punctuation}
{_..._}
{and}
{_}
{the}
{_}
{splitting}
{_}
{works}           <- Correctly
{$$}              <- split
{4th}
{_}
{one}
{_}
{is}
{_}
{separated}
{_}
{by}
{_}
{two}
{_}
{newlines}
{.$$_________}    <- should be three separate groups
{The}
{_}
{End}
{.$}

I'm wondering what I'm doing wrong, and whether there's a better solution. (Would the split function be preferable?)

Replies are listed 'Best First'.
Re: Splitting multiline string into words, the stuff between words, and newlines
by LanX (Saint) on Feb 24, 2022 at 00:55 UTC
    > I'm wondering what I'm doing wrong,

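    \W is the negation of \w, so it still matches \n. Because the \W+ alternative is tried first, it swallows the newlines before \n+ gets a chance.

    I excluded both word characters and newlines with [^\w\n]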
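A minimal sketch of that change, applied to the loop from the question. The helper name tokenize() and the shortened sample text are assumptions made for this illustration; the only change to the pattern is [^\w\n]+ in place of \W+.

```perl
#!/usr/bin/env perl
use v5.22;          # \b{wb} needs Perl 5.22 or later
use strict;
use warnings;

# tokenize() is a hypothetical helper name. The pattern is the one from the
# question with \W+ swapped for [^\w\n]+, so newlines are left for \n+.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /( [^\w\n]+ | \b{wb}.+?\b{wb} | \n+ )/xg) {
        push @tokens, $1;
    }
    return @tokens;
}

# show(): same display convention as the question ('_' = space, '$' = newline)
sub show {
    my ($str) = @_;
    $str =~ tr/ \n/_$/;
    return "{$str}";
}

# Shortened sample text, made up for this sketch
my $sample = "--First paragraph--\nHow's tricks?\n\nThe End.\n";
say join ' ', map { show($_) } tokenize($sample);
# prints: {--} {First} {_} {paragraph} {--} {$} {How's} {_} {tricks} {?} {$$} {The} {_} {End} {.} {$}
```

Each run of newlines now comes out as its own token, while punctuation and spaces stay grouped.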

    That's the result you wanted? Can't really comment on the rest, looks weird to me.

    Cheers Rolf
    (addicted to the Perl Programming Language :)

      Yes, that's exactly what I wanted. Thank you.

      (I'm just playing with different ways of building Markov chains, a la dissociated-press.)

Re: Splitting multiline string into words, the stuff between words, and newlines
by salva (Canon) on Feb 24, 2022 at 09:06 UTC
    You can also use split, which saves you from writing a regular expression for the non-word runs:
    my @fragments = grep length, split /(\b{wb}.+?\b{wb}|\n+)/, $book;
    So, you get words, sequences of newlines, and then everything else.
      This looks to me like it should work, but it splits the strings of non-words into separate characters!

      "For example ...\n" -> {For}{_}{example}{_}{.}{.}{.}{$}
        That is because \b{wb} also matches between consecutive punctuation characters, so each one is captured as a separate fragment.

        This seems to solve the issue:

        my @fragments = grep length, split /(\b{wb}\w.*?\b{wb}|\n+)/, $book;

        But my knowledge of Unicode and the \b{wb} semantics is rather limited, so it may have other issues.
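To compare the two split patterns side by side, here is a small sketch run on the "For example ...\n" string from this thread. The helper names fragments() and show() and the variables $any_span and $word_span are assumptions for the illustration, not part of the posted code.

```perl
#!/usr/bin/env perl
use v5.22;          # \b{wb} needs Perl 5.22 or later
use strict;
use warnings;

# fragments(): hypothetical helper. Splits on a capturing pattern so the
# separators are returned too, then drops the empty fields.
sub fragments {
    my ($separator, $text) = @_;
    return grep { length } split $separator, $text;
}

# show(): '_' stands for a space, '$' for a newline
sub show {
    my ($str) = @_;
    $str =~ tr/ \n/_$/;
    return "{$str}";
}

my $any_span  = qr/(\b{wb}.+?\b{wb}|\n+)/;     # first attempt
my $word_span = qr/(\b{wb}\w.*?\b{wb}|\n+)/;   # revised: span must start with \w

for my $pattern ($any_span, $word_span) {
    say join '', map { show($_) } fragments($pattern, "For example ...\n");
}
# prints:
# {For}{_}{example}{_}{.}{.}{.}{$}
# {For}{_}{example}{_...}{$}
```

Requiring \w at the start of the captured span means punctuation can no longer be captured as a separator, so it falls through to split's ordinary fields and stays together with the surrounding spaces.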
