PerlMonks  

How to split into paragraphs?

by jrw (Monk)
on Nov 16, 2006 at 04:57 UTC ( [id://584367]=perlquestion )

jrw has asked for the wisdom of the Perl Monks concerning the following question:

Clarification: This is not a question about reading one line at a time from a file or about using $/. I already have the entire string in a variable. The string can be thought of as a list of substrings ("paragraphs"). If I can detect the start of each substring ("paragraph"), how can I partition the string into a list which is exactly equivalent to the original string if I join the substrings back together?

Say you have a string like this in $_:

abc:
  asdf1
  asdf2

def:
  asdf3

ghi:
  asdf4
  asdf5
How would you write code to split this into a list of three paragraphs so that you get the original string back if you join the three paragraphs together? Assume you can use a line ending with : to recognize the start of a paragraph. I'm hoping there's a simpler idiom than the one in the test program below:
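use strict;
use warnings;

$_ = <<EOF;
abc:
  asdf1
  asdf2

def:
  asdf3

ghi:
  asdf4
  asdf5
EOF

# The capturing parens make split return the headers as well.
my @tmp = split /^(\w+:)/m, $_, -1;

# Anything before the first header (usually the empty string).
my $hdr = shift @tmp;
my @list;
push @list, $hdr if defined $hdr && $hdr ne "";

# Glue each header back onto its body.
push @list, shift(@tmp) . shift(@tmp) while @tmp;

my $cnt = @list;
$" = "><";
print "cnt=$cnt, values=<@list>\n";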

Replies are listed 'Best First'.
Re: How to split into paragraphs?
by ikegami (Patriarch) on Nov 16, 2006 at 05:39 UTC

    If you're reading from a file, $/ = '' sets paragraph mode.

    local $/ = '';
    print OUT ("<$_>") while <IN>;

    Alternatively, here's a solution that works for strings:

    $out = join '', map { "<$_>" } map { /\G((?:(?!\n\n).)*\n+|.+\z)/sg } $in;
      ikegami: I hadn't thought of using map to generate a list when given only a single input -- that's an interesting idea.

      One thing I'm trying to do is avoid repeating the pattern used for detecting the start of each substring. When I try to capture using m//, I end up having to repeat the pattern to stop each match:

      /(START_PATTERN.*?)(?!START_PATTERN)/g

      split seems to say what I want: "here is the thing that separates the paragraphs from each other". But then I have to piece the parts back together again (see my original post's code) and I'm trying to avoid that.
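For comparison, the m// form can be sketched as below. This is only an illustration: it assumes a positive lookahead (with \z so that the last paragraph also terminates) is what ends each match, and the sample data and the $start name are made up for the example. It shows the repetition of the start pattern that the question is trying to avoid.

```perl
use strict;
use warnings;

# Illustrative sample; the layout is an assumption, not the thread's exact data.
my $data = "abc:\n  asdf1\n  asdf2\n\ndef:\n  asdf3\n\nghi:\n  asdf4\n  asdf5\n";

# Start-of-paragraph pattern: a line beginning with word characters and a colon.
my $start = qr/^\w+:/m;

# The start pattern appears twice: once to begin a match and once,
# inside the lookahead, to end it before the next paragraph.
my @paras = $data =~ /($start.*?)(?=$start|\z)/gs;

printf "found %d paragraphs\n", scalar @paras;
print "round trip ok\n" if join('', @paras) eq $data;
```

Note that any text before the first header would be skipped by this form, so the round trip only holds when the string begins with a header.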

        Ah, I see. Well, I've already provided the building blocks, but they are well hidden. Let me expose them.

        You need something along the lines of /[^$chars]*/, but instead of negatively matching chars, you want to negatively match a regexp.

        The direct equivalent of
        /[^$chars]*/
        for regexps is
        /(?:(?!$re).)*/

        In context,

        # Input the string.
        my $in = do { local $/; <DATA> };

        # Must move "pos" on a match.
        # Zero-width match won't work.
        my $start_pat = qr/^\S+/m;

        # Break the input into paragraphs.
        my @paras = $in =~ /
            \G
            (
                $start_pat
                (?: (?!$start_pat). )*
            )
        /xgs;

        # Manipulate the paragraphs.
        @paras = map { "<$_>" } @paras;

        # Recombine the paragraphs.
        my $out = join '', @paras;

        # Output the string.
        print($out);

        __DATA__
        abc:
          asdf1
          asdf2

        def:
          asdf3

        ghi:
          asdf4
          asdf5
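The correspondence between /[^$chars]*/ and /(?:(?!$re).)*/ can be seen in a small sketch. The sample string and variable names are assumptions chosen for illustration: the character class stops at the first '-', while the regexp form stops only where the whole pattern /--/ would begin.

```perl
use strict;
use warnings;

# Hypothetical sample input; not taken from the thread.
my $text = "foo-bar--baz";

# Character-class form: a run of characters that are not '-'.
my ($by_class) = $text =~ /^([^-]*)/;

# Regexp form: a run of characters at which /--/ does not start.
my $re = qr/--/;
my ($by_regex) = $text =~ /^((?:(?!$re).)*)/s;

print "class: $by_class\n";   # prints "class: foo"
print "regex: $by_regex\n";   # prints "regex: foo-bar"
```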
      Ikegami, see clarification above. I am partitioning based on being able to detect the start of each substring, not based on a separator between substrings.
Re: How to split into paragraphs?
by BrowserUk (Patriarch) on Nov 16, 2006 at 05:48 UTC

    Try setting $/ = '';.

    #! perl -slw
    use strict;

    $/ = ''; # paragraph mode

    print "'$_'" while <DATA>;

    __DATA__
    abc:
      asdf1
      asdf2

    def:
      asdf3

    ghi:
      asdf4
      asdf5

    Prints

    c:\test>junk
    'abc:
      asdf1
      asdf2

    '
    'def:
      asdf3

    '
    'ghi:
      asdf4
      asdf5
    '

    Setting $/ = "\n\n"; would also work for your data if there is exactly one 'blank line' between the paragraphs, but the magical setting of $/ = ''; is more flexible.

    Note this quote from perlvar

    Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline.
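The difference described in that quote can be sketched on a string with two consecutive empty lines. This is an illustration only: read_records is a hypothetical helper, the sample text is made up, and the in-memory filehandle needs perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical sample: two paragraphs separated by two consecutive empty lines.
my $text = "a\nb\n\n\n\nc\nd\n";

# Read every record from an in-memory filehandle using the given separator.
sub read_records {
    my ($sep, $string) = @_;
    local $/ = $sep;
    open my $fh, '<', \$string or die "cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @para_mode = read_records('',     $text);   # paragraph mode
my @two_nl    = read_records("\n\n", $text);   # literal "\n\n" separator

# Paragraph mode swallows the extra newlines, so joining the records
# does not reproduce the input; the "\n\n" separator keeps every character.
print "paragraph mode round trip: ", (join('', @para_mode) eq $text ? "yes" : "no"), "\n";
print '"\n\n" round trip: ',         (join('', @two_nl)    eq $text ? "yes" : "no"), "\n";
```

For the round-trip requirement in the original question, this suggests paragraph mode is only safe when paragraphs are separated by exactly one empty line.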

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk, see clarification above. I already have the entire string in a variable.

        This works.

        print "'$_'" for split m[(?=^\w+?:)]sm, $data;
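A minimal sketch that checks the round trip for this approach follows. The sample layout is assumed for illustration, and the pattern is simplified slightly from the one above (no /s, greedy \w+). Because the lookahead is zero-width, split consumes nothing, so joining the pieces should give back the original string.

```perl
use strict;
use warnings;

# Illustrative sample; the layout is an assumption, not the thread's exact data.
my $data = "abc:\n  asdf1\n  asdf2\n\ndef:\n  asdf3\n\nghi:\n  asdf4\n  asdf5\n";

# Split just before every header line; the lookahead consumes no characters.
my @paras = split /(?=^\w+:)/m, $data;

printf "found %d paragraphs\n", scalar @paras;
print join('', @paras) eq $data ? "round trip ok\n" : "round trip FAILED\n";
```

The start pattern is written only once here, and any text before the first header simply becomes the first list element.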

Re: How to split into paragraphs?
by Samy_rio (Vicar) on Nov 16, 2006 at 05:22 UTC

    Hi jrw, this may help you if I understood your question correctly.

    use strict;
    use warnings;

    my $str;
    while (<DATA>){
        chomp;
        ($_ =~ m/^\w/) ? ($str .= "<\/p>\n<p>$_\n") : ($str .= "$_\n");
    }
    $str =~ s/^(<\/p>)(.+)$/$2$1/gsi;
    print $str;

    __DATA__
    abc:
      asdf1
      asdf2

    def:
      asdf3

    ghi:
      asdf4
      asdf5

    Regards,
    Velusamy R.


    eval"print uc\"\\c$_\""for split'','j)@,/6%@0%2,`e@3!-9v2)/@|6%,53!-9@2~j';

      Velusamy, see clarification above. I already have the entire string in a variable and want to split it into substrings.

        Try like this,

        use strict;
        use warnings;

        my $str = <<EOF;
        abc:
          asdf1
          asdf2

        def:
          asdf3

        ghi:
          asdf4
          asdf5
        EOF

        my @str = split/(?=\n+\w)/, $str;
        print "$_" for @str;

        #or

        $str =~ s/(^|\n+)(\w)/$1<\/p>\n<p>$2/gsi;
        $str =~ s/^(<\/p>)\n*(.+)$/$2$1/gsi;
        print $str;

        Regards,
        Velusamy R.



Re: How to split into paragraphs?
by graff (Chancellor) on Nov 17, 2006 at 03:23 UTC
    If you split on the paragraph separators (two or more consecutive linefeeds), and use capturing parens in the split, it's pretty easy:
    use strict;
    use warnings;

    $_ = <<EOF;
    abc:
      asdf1
      asdf2

    def:
      asdf3

    ghi:
      asdf4
      asdf5
    EOF

    my @tkns = split /(\n{2,})/;
    my @pars;
    for ( @tkns ) {
        if ( /^\n+$/ ) {
            $pars[$#pars] .= $_;
        }
        else {
            push @pars, $_;
        }
    }
    printf "found %d paragraphs:\n", scalar @pars;
    print "<", join( "><", @pars ), ">\n";
    That prints:
    found 3 paragraphs:
    <abc:
      asdf1
      asdf2

    ><def:
      asdf3

    ><ghi:
      asdf4
      asdf5
    >
Re: How to split into paragraphs?
by gt8073a (Hermit) on Nov 16, 2006 at 17:56 UTC

    Here's what I came up with. This shouldn't fail even if the asdf lines contain colons.

    Oops, I noticed I'd miss abc: lines if there were no asdf lines, so I changed the pattern:

    ^((\w+):\n((?:[^\n]+\n)+)) -> ^((\w+):\n((?:[^\n]+\n)+)*)

    while ( $data =~ /^((\w+):\n((?:[^\n]+\n)+)*)/gm ) {
        my ( $key, $val ) = ( $2, $3 );
        chomp $val; ## remove pesky \n\n
        doSomething( $key, $val ); ## store it, print it, ignore it..
    }
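As one possible reading of "store it", the loop could fill a hash keyed by header. This is a sketch under assumptions: the %section name and the sample data are made up, and the pattern is simplified so that the body capture is always defined (an empty string for a header with no body lines), which avoids calling chomp on an undefined value.

```perl
use strict;
use warnings;

# Illustrative sample with one empty section; not the thread's exact data.
my $data = "abc:\n  asdf1\n  asdf2\n\ndef:\n\nghi:\n  asdf4\n";

my %section;
while ( $data =~ /^(\w+):\n((?:[^\n]+\n)*)/gm ) {
    my ( $key, $val ) = ( $1, $2 );
    chomp $val;                 # drop the final newline of the body
    $section{$key} = $val;
}

for my $key ( sort keys %section ) {
    printf "%s => %d characters of body\n", $key, length $section{$key};
}
```

Unlike the split-based answers, this form discards the blank lines between paragraphs, so it suits parsing better than exact reconstruction of the string.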
    JJ
Re: How to split into paragraphs?
by Firefly258 (Beadle) on Nov 23, 2006 at 23:11 UTC
    My personal favourite, least complex way is reading from an "in-memory" file.
    $_ = q|
    abc:
      asdf1
      asdf2

    def:
      asdf3

    ghi:
      asdf4
      asdf5
    |;

    local $/ = ""; # or $/ = "\n\n";
    open IN, '<', \$_ or warn "opening \$_ failed";
    my $n;
    $n .= $_ for <IN>;
    print "exact" if $n eq $_;
    If I weren't too particular about the double newlines, I would use split instead.
      Unfortunately, my code has to be compatible with older versions of perl which don't support this, but I agree that it's cool!

Node Type: perlquestion [id://584367]
Approved by Samy_rio