http://qs321.pair.com?node_id=556880

rsriram has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a text file in which the start and end of each list are marked with <list> and </list>. I want every paragraph that appears inside a list to be wrapped in <item> and </item>.

For example
Input format

The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
<list>The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.</list>
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Output should be:

The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
<list><item>The quick brown fox jumps over the lazy dog.</item>
<item>The quick brown fox jumps over the lazy dog.</item>
<item>The quick brown fox jumps over the lazy dog.</item></list>
The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.

Can anyone give a regex pattern to replace this?

Sriram

20060622 Janitored by Corion: Added code tags around data, as per Writeup Formatting Tips

Re: Regex to replace a particular part of content
by GrandFather (Saint) on Jun 22, 2006 at 11:18 UTC

    You can't do it easily with a single regex, but you can do it with a modest amount of code:

    use strict;
    use warnings;

    while (<DATA>) {
        if (m|<list>| .. m|</list>|) {
            s/^(?!<list>)|(?<=<list>)/<item>/g;
            s-(?=</list>)|(?<!</list>)(?=\n)-</item>-g;
        }
        print;
    }

    __DATA__
    The first quick brown fox jumps over the lazy dog.
    <list>The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.</list>
    The last quick brown fox jumps over the lazy dog.

    Prints:

    The first quick brown fox jumps over the lazy dog.
    <list><item>The quick brown fox jumps over the lazy dog.</item>
    <item>The quick brown fox jumps over the lazy dog.</item>
    <item>The quick brown fox jumps over the lazy dog.</item></list>
    The last quick brown fox jumps over the lazy dog.
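The part of the code above that does the "are we inside a list?" bookkeeping is the range operator in scalar context, which behaves as a flip-flop. As a minimal sketch of just that piece, assuming the same <list> and </list> markers (the helper name lines_in_list and the sample lines are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns only the lines that fall inside
# <list>...</list>, including the lines carrying the markers.
sub lines_in_list {
    my @lines = @_;
    my @inside;
    for (@lines) {
        # In scalar context `..` is a flip-flop: it is false until the
        # left match succeeds, then stays true up to and including the
        # line on which the right match succeeds.
        push @inside, $_ if m|<list>| .. m|</list>|;
    }
    return @inside;
}

my @sample = ("before\n", "<list>one\n", "two\n", "three</list>\n", "after\n");
print lines_in_list(@sample);
```

Note that the flip-flop keeps its state between iterations, so a <list> that is never closed leaves it "on" for the rest of the input.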

    DWIM is Perl's answer to Gödel
      Here is a single regex that seems to do the trick. It does it by using look-around assertions before and after the "The quick ... dog. " phrase we want to modify and by using alternation. Here is the code

      use strict;
      use warnings;

      my $qbf = q{The quick brown fox jumps over the lazy dog. };
      (my $qbfForRX = $qbf) =~ s{\s}{\\s}g;

      my $message =
          $qbf x 7 . qq{\n<list>} .
          ($qbf. qq{\n}) x 2 . $qbf . qq{</list>\n} .
          $qbf x 6 . qq{\n};
      print $message, qq{\n};

      my $rxToChange = qr
          {(?x)                             # Use extended syntax
              (?:                           # Non-capture group
                                            # for alternation
                                            # Either ...
                  (?<=\n)(?=$qbfForRX)      # Move engine to point
                                            # preceded by newline
                                            # and followed by
                                            # phrase
                  |                         # ... or ...
                  (?<=<list>)(?=$qbfForRX)  # Move engine to point
                                            # preceded by <list>
                                            # and followed by
                                            # phrase
              )                             # Close non-capture
                                            # group
              ($qbfForRX)                   # Capture phrase
              (?=(?:\n|</list>))            # If followed by either
                                            # newline or </list>
          };

      $message =~ s{$rxToChange}{<item>$^N</item>}g;
      print $message;
      and when run it produces

      You were right when you said "You can't do it easily ...". Problems like this would become easier if look-behind assertions were able to accept variable width patterns. I wasted some time trying to work around that before thinking of moving the alternation outside the look-behind. I wasted even more time because I was interpolating the phrase directly into the regex not realising that the 'x' flag was eating the spaces in it. Doh!
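The 'x' flag pitfall mentioned above is easy to reproduce. This is a small sketch, using a made-up phrase rather than the code from the post, of how a space in an interpolated string is ignored under /x and how escaping it with quotemeta avoids that:

```perl
use strict;
use warnings;

my $phrase = 'lazy dog';   # illustrative phrase, not from the post

# Interpolated as-is, the space in $phrase is treated as insignificant
# whitespace under /x, so this behaves like /lazydog/.
my $naive = qr/$phrase/x;

# quotemeta backslash-escapes the space, and /x leaves escaped
# whitespace alone, so this matches the phrase literally.
my $quoted = quotemeta $phrase;
my $safe   = qr/$quoted/x;

for my $text ('the lazy dog', 'the lazydog') {
    printf "%-16s naive=%d safe=%d\n", "'$text'",
        ($text =~ $naive ? 1 : 0), ($text =~ $safe ? 1 : 0);
}
```

Replacing whitespace with \s, as the code in the post does, also works; quotemeta has the side effect of escaping the full stops as well, which may or may not be what you want.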

      Cheers,

      JohnGG

        Now do it when you don't know what the text is up front and the text varies from line to line.

        With this version you might as well just have used a print statement containing the literal output. I'm sure that in the OP's real problem the text is not as shown. A better sample might have used words like "foo bar baz" and been much shorter, but there were enough lines there that you should have been able to read between them. :)


        DWIM is Perl's answer to Gödel
Re: Regex to replace a particular part of content
by shmem (Chancellor) on Jun 22, 2006 at 11:26 UTC
    First off, what have you tried so far? Please read Is PM a good place to get answers for homework?.

    Then, what you are calling paragraphs in your example are lines. This may seem picky, but the distinction matters: Perl has a paragraph mode for reading files. See the -0 switch in perlrun and its special value 00.

    blah blah blah   #line \
    blah blah blah   #line  - paragraph
    blah blah blah   #line /
                            <--- paragraph separator
    blah blah blah   #line \
    blah blah blah   #line  - paragraph
    blah blah blah   #line /
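Inside a script, the equivalent of -00 is setting $/ to the empty string. The following is a minimal sketch of paragraph mode; the helper name read_paragraphs and the sample text are invented for illustration, and it reads from an in-memory filehandle only to keep the example self-contained:

```perl
use strict;
use warnings;

# Hypothetical helper: splits a string into paragraphs by reading it
# through a filehandle in paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    local $/ = '';     # paragraph mode: a record ends at one or more blank lines
    my @paragraphs = <$fh>;
    close $fh;
    return @paragraphs;
}

my $text  = "line 1\nline 2\n\nline 3\nline 4\n";
my @paras = read_paragraphs($text);
print scalar(@paras), " paragraphs\n";
```

Each element still contains the newlines between its lines, so a paragraph can be split into lines afterwards if needed.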

    I would do this:

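    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift;
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    local $/;       # slurp mode: read the whole file in one go
    $_ = <$in>;
    # /e evaluates the replacement as code, /s lets . match newlines
    s|<list>(.*?)</list>|"<list>".join("\n",map{"<item>$_</item>"} split"\n",$1)."</list>"|ges;
    print;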
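    Note that "\n" is Perl's logical newline: text-mode I/O maps it to the platform's native line ending (for example "\r\n" on Windows), so the "\n" in join and split needs changing only if the file uses line endings foreign to the system the script runs on.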

    You can stuff that into a one-liner:

    perl -p00 -e 's|<list>(.*?)</list>|"<list>".join("\n",map{"<item>$_</item>"}split"\n",$1)."</list>"|ges;' textfile
    Now go and read perlre and perlvar.
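To make the steps packed into that substitution easier to follow, here is a sketch of the same idea written as a named subroutine. The name tag_list_items and the sample text are made up for illustration, and it assumes, as the code above does, that lists are not nested and items are separated by "\n":

```perl
use strict;
use warnings;

# Hypothetical helper: wraps every line inside each <list>...</list>
# block in <item>...</item> and leaves everything else untouched.
sub tag_list_items {
    my ($text) = @_;
    $text =~ s{<list>(.*?)</list>}{
        my @lines = split /\n/, $1;                    # one entry per line of the list
        my @items = map { "<item>$_</item>" } @lines;  # wrap each line
        '<list>' . join("\n", @items) . '</list>';     # last value is the replacement
    }gse;
    return $text;
}

print tag_list_items("intro\n<list>foo\nbar baz</list>\noutro\n");
```

Because the text inside the list is captured rather than spelled out in the pattern, this works when the content is not known up front and varies from line to line.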

    greets,
    --shmem

    _($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                                  /\_/(q    /
    ----------------------------  \__(m.====.(_("always off the crowd"))."
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Regex to replace a particular part of content
by reneeb (Chaplain) on Jun 22, 2006 at 11:25 UTC
    my $bla = qq+The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.
    <list>The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.
    The quick brown fox jumps over the lazy dog.</list>
    The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog. The quick brown fox jumps over the lazy dog.+;

    $bla =~ s~(<list>)(.*?)(</list>)~$1.subst($2).$3~es;
    print $bla;

    # Wrap each line of the list body in <item>...</item>.
    sub subst{
        my ($string) = @_;
        # join, rather than appending "\n" after every item, so that
        # </list> follows the last </item> directly as in the requested output
        return join "\n", map { '<item>'.$_.'</item>' } split(/\n/,$string);
    }
Re: Regex to replace a particular part of content
by Moron (Curate) on Jun 22, 2006 at 11:47 UTC
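    use strict;
    use warnings;

    # Assumes <list> starts a line and </list> ends one, as in the sample.
    my $inlist = 0;
    while (<>) {
        chomp;
        my $open = s{^<list>}{} ? '<list>' : '';    # list starts on this line
        $inlist ||= $open ne '';
        my $close = ($inlist && s{</list>$}{}) ? '</list>' : '';   # list ends here
        $_ = "<item>$_</item>" if $inlist;
        $inlist = 0 if $close;
        print "$open$_$close\n";
    }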

    -M

    Free your mind