How to split into paragraphs?

jrw has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to split into paragraphs? by ikegami (Patriarch) on Nov 16, 2006 at 05:39 UTC
If you're reading from a file, `$/ = ''` sets paragraph mode. `local $/ = ''; print OUT ("<$_>") while <IN>;` Alternatively, here's a solution that works for strings: `$out = join '', map { "<$_>" } map { /\G((?:(?!\n\n).)*\n+|.+\z)/sg } $in;`
Re^2: How to split into paragraphs? by jrw (Monk) on Nov 16, 2006 at 12:36 UTC
ikegami: I hadn't thought of using map to generate a list when given only a single input -- that's an interesting idea. One thing I'm trying to do is avoid repeating the pattern used for detecting the start of each substring. When I try to capture using `m//`, I end up having to repeat the pattern to stop each match: `/(START_PATTERN.*?)(?!START_PATTERN)/g` `split` seems to say what I want: "here is the thing that separates the paragraphs from each other". But then I have to piece the parts back together again (see my original post's code) and I'm trying to avoid that.
Re^3: How to split into paragraphs? by ikegami (Patriarch) on Nov 16, 2006 at 14:02 UTC
Ah, I see. Well, I've already provided the building blocks, but they are well hidden. Let me expose them. You need something along the lines of `/[^$chars]/`, but instead of negatively matching chars, you want to negatively match a regexp. The direct equivalent of `/[^$chars]/` for regexps is `/(?:(?!$re).)/`. In context: `# Input the string. my $in = do { local $/; <DATA> }; # Must move "pos" on a match. # Zero-width match won't work. my $start_pat = qr/^\S+/m; # Break the input into paragraphs. my @paras = $in =~ / \G ( $start_pat (?: (?!$start_pat). )* ) /xgs; # Manipulate the paragraphs. @paras = map { "<$_>" } @paras; # Recombine the paragraphs. my $out = join '', @paras; # Output the string. print($out); __DATA__ abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5`
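A minimal, self-contained sketch of the "negatively match a regexp" idea from the post above. The sample string and its two-space indentation are assumptions made for illustration (the thread's real data layout is not preserved here), and the sample deliberately has no blank lines so that only the start pattern decides where a paragraph begins.

```perl
use strict;
use warnings;

# Hypothetical sample: every paragraph starts with an unindented label,
# continuation lines are indented. No blank lines are needed.
my $in = "abc:\n  asdf1\n  asdf2\ndef:\n  asdf3\nghi:\n  asdf4\n  asdf5\n";

# Start-of-paragraph pattern. It must consume at least one character,
# otherwise \G would never advance.
my $start_pat = qr/^\S+/m;

# One paragraph = a start match, followed by every character at which
# a new start match does NOT begin.
my @paras = $in =~ /\G($start_pat(?:(?!$start_pat).)*)/gs;

print "<$_>" for @paras;
```

The start pattern is written once, in `$start_pat`, and reused for both the opening match and the negative lookahead, which is the repetition jrw wanted to avoid spelling out by hand.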
Re^2: How to split into paragraphs? by jrw (Monk) on Nov 16, 2006 at 12:25 UTC
Ikegami, see clarification above. I am partitioning based on being able to detect the start of each substring, not based on a separator between substrings.
Re: How to split into paragraphs? by BrowserUk (Patriarch) on Nov 16, 2006 at 05:48 UTC
Try setting `$/ = '';`. `#! perl -slw use strict; $/ = ''; # paragraph mode print "'$_'" while <DATA>; __DATA__ abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5` Prints `c:\test>junk 'abc: asdf1 asdf2 ' 'def: asdf3 ' 'ghi: asdf4 asdf5 '` Setting `$/ = "\n\n";` would also work for your data if there is exactly one 'blank line' between the paragraphs, but the magical setting of `$/ = '';` is more flexible. Note this quote from perlvar: "Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline." Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.
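To make the perlvar quote concrete, here is a small sketch comparing the two settings on a string with two consecutive empty lines. It reads from an in-memory filehandle only to keep the snippet self-contained, which assumes Perl 5.8 or later; the helper name `read_records` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: THREE newlines (two empty lines) between paragraphs.
my $text = "abc:\nasdf1\n\n\ndef:\nasdf3\n";

# Read every record from an in-memory filehandle using the given
# input record separator.
sub read_records {
    my ($sep, $string) = @_;
    local $/ = $sep;
    open my $fh, '<', \$string or die "cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @para_mode = read_records('',     $text);   # paragraph mode
my @literal   = read_records("\n\n", $text);   # literal two-newline separator

# Show each record with its newlines made visible.
for my $set ( [ 'paragraph mode', \@para_mode ], [ 'literal \n\n', \@literal ] ) {
    my ($label, $records) = @$set;
    for my $r (@$records) {
        (my $shown = $r) =~ s/\n/\\n/g;
        print "$label: [$shown]\n";
    }
}
```

With the literal separator, the extra newline is left at the front of the second record; in paragraph mode it is swallowed along with the rest of the blank run.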
Re^2: How to split into paragraphs? by jrw (Monk) on Nov 16, 2006 at 12:27 UTC
BrowserUk, see clarification above. I already have the entire string in a variable.
Re^3: How to split into paragraphs? by BrowserUk (Patriarch) on Nov 16, 2006 at 12:39 UTC
This works. `print "'$_'" for split m[(?=^\w+?:)]sm, $data;`
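A runnable sketch of the same lookahead `split`, with the start pattern held in one variable. The sample `$data` (labels followed by indented lines, blank lines in between) is an assumption about the original data, not a copy of it.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $data = "abc:\n  asdf1\n  asdf2\n\ndef:\n  asdf3\n\nghi:\n  asdf4\n  asdf5\n";

# What the start of a paragraph looks like: a word and a colon at line start.
my $start = qr/^\w+:/m;

# Split at every position where a paragraph start follows. The lookahead
# consumes nothing, so no text is lost and nothing has to be glued back.
my @paras = split /(?=$start)/, $data;

print "'$_'\n" for @paras;
```

Because the separator is zero-width, joining the pieces gives back the original string, so there is no need to reattach delimiters as with a capturing `split`.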
Re: How to split into paragraphs? by Samy_rio (Vicar) on Nov 16, 2006 at 05:22 UTC
Hi jrw, this may help you if I understood your question correctly. `use strict; use warnings; my $str; while (<DATA>){ chomp; ($_ =~ m/^\w/) ? ($str .= "<\/p>\n<p>$_\n") :($str.="$_\n"); } $str =~ s/^(<\/p>)(.+)$/$2$1/gsi; print $str; __DATA__ abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5` Regards, Velusamy R. eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';
Re^2: How to split into paragraphs? by jrw (Monk) on Nov 16, 2006 at 12:22 UTC
Velusamy, see clarification above. I already have the entire string in a variable and want to split it into substrings.
Re^3: How to split into paragraphs? by Samy_rio (Vicar) on Nov 16, 2006 at 13:03 UTC
Try it like this: `use strict; use warnings; my $str = <<EOF; abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5 EOF my @str = split/(?=\n+\w)/, $str; print "$_" for @str; #or $str =~ s/(^|\n+)(\w)/$1<\/p>\n<p>$2/gsi; $str =~ s/^(<\/p>)\n*(.+)$/$2$1/gsi; print $str;` Regards, Velusamy R.
Re: How to split into paragraphs? by graff (Chancellor) on Nov 17, 2006 at 03:23 UTC
If you split on the paragraph separators (two or more consecutive linefeeds), and use capturing parens in the split, it's pretty easy: `use strict; use warnings; $_ = <<EOF; abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5 EOF my @tkns = split /(\n{2,})/; my @pars; for ( @tkns ) { if ( /^\n+$/ ) { $pars[$#pars] .= $_; } else { push @pars, $_; } } printf "found %d paragraphs:\n", scalar @pars; print "<", join( "><", @pars ), ">\n";` That prints: `found 3 paragraphs: <abc: asdf1 asdf2 ><def: asdf3 ><ghi: asdf4 asdf5 >`
Re: How to split into paragraphs? by gt8073a (Hermit) on Nov 16, 2006 at 17:56 UTC
Here's what I came up with. This shouldn't fail even if the asdf lines contain colons. Oops, I noticed I'd miss abc: lines if there were no asdf lines. `^((\w+):\n((?:[^\n]+\n)+)) -> ^((\w+):\n((?:[^\n]+\n)+))` `while ( $data =~ /^((\w+):\n((?:[^\n]+\n)+))/gm ) { my ( $key, $val ) = ( $2, $3 ); chomp $val; ## remove pesky \n\n doSomething( $key, $val ); ## store it, print it, ignore it.. }` JJ
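For anyone who wants to run the loop above, here is a hedged sketch that stores each label and its body in a hash instead of calling `doSomething`. The hash name `%body_for` and the sample `$data` are hypothetical, and the sketch assumes paragraphs are separated by a blank line; without one, `[^\n]+\n` would run on into the next label line.

```perl
use strict;
use warnings;

# Hypothetical sample data: label lines, body lines, blank-line separators.
my $data = "abc:\n  asdf1\n  asdf2\n\ndef:\n  asdf3\n\nghi:\n  asdf4\n  asdf5\n";

my %body_for;

# Each iteration matches one "label:" line plus the non-empty lines after it.
while ( $data =~ /^(\w+):\n((?:[^\n]+\n)+)/gm ) {
    my ( $key, $val ) = ( $1, $2 );
    chomp $val;                  # drop the final newline of the body
    $body_for{$key} = $val;
}

print "$_ => [$body_for{$_}]\n" for sort keys %body_for;
```

Colons inside the body lines are harmless here, because only the first line of each match has to look like `word:`.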
Re: How to split into paragraphs? by Firefly258 (Beadle) on Nov 23, 2006 at 23:11 UTC
My personal favourite, least complex way is reading from an "in-memory" file. `$_ = q| abc: asdf1 asdf2 def: asdf3 ghi: asdf4 asdf5 |; local $/ = ""; # or $/ = "\n\n"; open IN, '<', \$_ or warn "opening \$_ failed"; my $n; $n .= $_ for <IN>; print "exact" if $n eq $_;` If I wasn't too particular about the double-newlines, I would use split instead.
Re^2: How to split into paragraphs? by jrw (Monk) on Nov 28, 2006 at 01:45 UTC
Unfortunately, my code has to be compatible with older versions of perl which don't support this, but I agree that it's cool!

