PerlMonks  

Extracting blocks of text

by walker (Initiate)
on Jan 30, 2004 at 13:56 UTC

walker has asked for the wisdom of the Perl Monks concerning the following question:

Help ... I'm brand new to Perl, so I apologize if this is a very elementary problem (and please add as many comments to your reply as you can).

I need to extract blocks of text from a large file. Each text block starts with a key word ("head") and, after one or more lines, a line will end with "tail".

I need every line between the 2 key words, including the lines the key words are on.

I've attempted to apply several of the examples, but with no success. Thanks in advance for your assistance.

Replies are listed 'Best First'.
Re: Extracting blocks of text
by Rhose (Priest) on Jan 30, 2004 at 14:38 UTC
    You could also use the range (flip-flop) operator. The sample below will print lines from the line which starts with "head" (^ anchors to the start) to the one which ends with "tail" (\s*$ allows some white space after tail.)

    #!/usr/bin/perl
    use strict;
    use warnings;

    while(<DATA>) {
        print if /^head/i../tail\s*$/i;
    }

    __DATA__
    HEAD gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus

    Output

    HEAD gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail

    Update
    If you have the camel book, you can find a discussion on this starting on page 90 (2nd Edition).
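If you want the blocks collected rather than printed, the same range operator can drive that too. What follows is only a sketch under a few assumptions: `extract_blocks` and the sample lines are made up for illustration, and the patterns are the ones from the reply above. In scalar context the range operator returns a sequence number, and the number on the closing line has "E0" appended, which tells you where a block ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes a list of lines and
# returns one array reference per head..tail block.
sub extract_blocks {
    my @lines = @_;
    my (@blocks, @current);
    for my $line (@lines) {
        # Scalar-context range operator: false outside a block, a sequence
        # number inside it, with "E0" appended on the closing line.
        my $state = ($line =~ /^head/i) .. ($line =~ /tail\s*$/i);
        next unless $state;
        push @current, $line;
        if ($state =~ /E0$/) {
            push @blocks, [@current];
            @current = ();
        }
    }
    return @blocks;
}

# Invented sample input, one element per line of the file.
my @sample = (
    "noise\n",
    "head one\n",
    "middle\n",
    "end tail\n",
    "noise\n",
    "head two tail\n",
);

for my $block (extract_blocks(@sample)) {
    print "--- block ---\n", @$block;
}
```

Note that the range operator keeps its state between calls, so this sketch assumes the input does not stop in the middle of a block.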

Re: Extracting blocks of text
by BrowserUk (Patriarch) on Jan 30, 2004 at 15:01 UTC

    You can use $/ (see perlvar) and set it to a string to control what the diamond operator sees as a line ending. By setting this to 'head' and then 'tail' alternately, you can move through your large file in chunks, discarding the 1st, 3rd, 5th and printing the 2nd, 4th, 6th, etc.

    #! perl -slw
    use strict;

    open IN, '<', $ARGV[ 0 ] or die $!;
    $/ = 'head';
    while( <IN> ) {
        local $/ = 'tail';
        print scalar <IN>;
    }
    close IN;
    __END__
    P:\test>type junk.txt
    The quick brown fox jumps over the lazy dog 0001 head The quick brown fox jumps over the lazy dog 0002 The quick brown fox jumps over the lazy dog 0003 The quick brown fox jumps over the lazy dog 0004 The quick brown fox jumps over the lazy dog 0005 tail The quick brown fox jumps over the lazy dog 0006 The quick brown fox jumps over the lazy dog 0007 The quick brown fox jumps over the lazy dog 0008 headThe quick brown fox jumps over the lazy dog 0009 The quick brown fox jumps over the lazy dog 0010 tail The quick brown fox jumps over the lazy dog 0011 The quick brown fox jumps over the lazy dog 0012
    P:\test>235232 junk.txt
    The quick brown fox jumps over the lazy dog 0002 The quick brown fox jumps over the lazy dog 0003 The quick brown fox jumps over the lazy dog 0004 The quick brown fox jumps over the lazy dog 0005 tail The quick brown fox jumps over the lazy dog 0009 The quick brown fox jumps over the lazy dog 0010 tail

    The caveat is that if the chunks you are discarding (between a 'tail' marker and the next 'head' marker) are very large, they will consume large amounts of memory.

    As implemented above, the 'head' marker is discarded, but the 'tail' marker is printed. Add or delete as necessary.

    This also assumes that by "including the lines the key words are on.", you do not mean that you want any text preceding the 'head' marker, if the head marker is in the middle of a line, nor anything after the 'tail' marker if it can appear in the middle of a line.
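For the "add or delete as necessary" part, here is one way the head marker could be put back. Treat it as a sketch and not as the code above: the sample text is invented, and reading from an in-memory filehandle (opening a reference to a scalar) is my own assumption so that the example runs without a file on disk.

```perl
use strict;
use warnings;

# Invented sample text; the words 'head' and 'tail' act as the markers.
my $text = "skip me\nhead line A\nline B tail\nskip too\nhead line C tail\ntrailer\n";

# Open the scalar as an in-memory file so no real file is needed.
open my $in, '<', \$text or die "Cannot open in-memory file: $!";

my @blocks;
while (1) {
    # Read and discard everything up to and including the next 'head'.
    my $skipped = do { local $/ = 'head'; <$in> };
    last unless defined $skipped && $skipped =~ /head\z/;

    # Read the block body up to and including the next 'tail'.
    # If the file ends first, whatever remained is kept as the body.
    my $body = do { local $/ = 'tail'; <$in> };
    last unless defined $body;

    # Put the consumed 'head' marker back in front of the body.
    push @blocks, "head$body";
}
close $in;

print "[$_]\n" for @blocks;
```

The memory caveat still applies, because each discarded chunk is read in full before it is thrown away.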


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Timing (and a little luck) are everything!

      This has been an educating discussion... how about a twist? I am looking to parse a large file and extract blocks of text that begin with the word term. I can't always anticipate how the block will end, other than by stating that whenever the word term appears, a new block begins. Is there a way to create an array where each element is a text block that begins with the word term, and that element ends immediately before the next occurrence of the word term?
      example file: term { yada yada 12345 () ... } term only occurs here { could be 30 lines here but never that word again until another block starts yadada } term, etc. _END_
      so, this file would hopefully result in an array with 3 elements. another challenge, is that the last text block will not have the word term at the end of it. thanks in advance :-) ad3

        Assuming the file is small enough to slurp, then split does the job nicely:

        #! perl -slw
        use strict;

        my @array = split 'term', do{ local $/; <DATA> };
        shift @array; ## Discard leading null
        print '---', "\n", $_, "\n" for @array;

        __DATA__
        term { yada yada 12345 () ... } term only occurs here { could be 30 lines here but never that word again until another block starts yadada } term, etc.

        That discards the term itself. If you want to retain the term in each element, then perhaps the simplest way is to just put it back after the split. Just substitute this line into the above.

        my @array = map{ "term$_" } split 'term', do{ local $/; <DATA> };
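Another option, offered as a sketch and not as part of the reply above, is to split on a zero-width lookahead so that the word is never consumed in the first place. The sample text is invented, the `\b` word boundaries are my own addition so that words merely containing "term" do not start a block, and the `grep` drops anything that comes before the first term.

```perl
use strict;
use warnings;

# Invented sample text standing in for the slurped file.
my $text = "preamble\nterm {\n yada\n}\nterm only occurs here {\n more\n}\nterm, etc.\n";

# Split at every position that is followed by the whole word 'term'.
# The lookahead matches without consuming, so each piece keeps its 'term'.
my @blocks = grep { /\Aterm\b/ } split /(?=\bterm\b)/, $text;

printf "%d blocks\n", scalar @blocks;
print "---\n$_" for @blocks;
```

As with the slurping version, this assumes the file is small enough to hold in memory.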

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Extracting blocks of text
by pelagic (Priest) on Jan 30, 2004 at 14:17 UTC
    Here is an easy solution:

    #!/usr/bin/perl
    use strict;

    my $inputfile = shift;
    my $withinBlock = 0;

    open (IN, "<$inputfile") || die "could not open $inputfile\n";
    while (<IN>) {
        if (/head/) {
            $withinBlock = 1;
            print $_;
            if (/tail/) {
                $withinBlock = 0;
                print "\n";
            }
        }
        if ($withinBlock) {
            print $_;
            if (/tail/) {
                $withinBlock = 0;
                print "\n";
            }
        }
    }
    close (IN);

    I ran it with this file:
    bla head gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus
    and it showed
    bla head gugus gugus bla bla gugus gugus bla head gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla tail gugus bla bla gugus gugus bla bla gugus head bla bla gugus gugus tail bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus head bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus bla bla gugus gugus tail gugus gugus
    It does not work properly if a head follows a tail on the same line.
    pelagic
      This one worked GREAT !!! I need to print 5 lines after the "tail" key word... and I don't understand why there are 2 tests for tail and 2 print commands?
        I need to print 5 lines after the "tail" key word...

        Why didn't you say so in the first place? That would change how people answer the question.

        and I don't understand why there are 2 tests for tail and 2 print commands?

        Well, actually, there's no need for the duplication. The following would work just as well -- and would cover your little "amendment" to the original spec:

        #!/usr/bin/perl
        use strict;

        my $inputfile = shift;
        my $withinBlock = 0;

        open (IN, "<$inputfile") || die "could not open $inputfile\n";
        while (<IN>) {
            if (/head/) {
                $withinBlock = 6;    # 6 means "inside a block"
            }
            if ($withinBlock) {
                print $_;
                $withinBlock-- unless $withinBlock == 6;    # count down only after a tail
            }
            if (/tail/) {
                $withinBlock = 5;    # print the next five lines
            }
        }
        close (IN);

        Note that if there is a new "head" line within the five lines that follow a "tail", the $withinBlock state variable gets reset to 6, and will stay there till the next "tail". If there is no "head" within the next five lines, it will decrement to 0, turning off the output.

        Another "feature" of this version is that if there is a "tail" line without a previous "head", the five lines following "tail" will still get printed. One more thing: since the head and tail regexes are not anchored, the logic will fire whenever these words happen to show up in the data -- e.g:

        blah blah head This is a bunch of text in a target block. It includes excerpts from a book on animals, which have tails. So this line will cause the output to be turned off after the next five lines, i.e. here. So you won't get to see this line or this one. tail But you'll see this one and these lines too. Now the output is off again, but since we're talking about animals, which all have heads, the output is now on again, and you see the previous and current lines, as well as this and the next two...
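If the stray matches are a problem, anchoring the patterns is one way around them. This is a sketch under my own assumptions, not a drop-in replacement: "head" is taken to start a line and "tail" to end one, as in the original question, and `select_lines` with its `$extra` parameter is a made-up helper. It also ignores a "tail" that has no preceding "head".

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of each head..tail block plus
# $extra lines of trailing context after every tail line.
sub select_lines {
    my ($extra, @lines) = @_;
    my $in_block  = 0;    # true between a head line and its tail line
    my $remaining = 0;    # context lines still owed after a tail line
    my @out;
    for my $line (@lines) {
        # Assumption: the key word 'head' starts the line.
        if (!$in_block && $line =~ /^head\b/) {
            $in_block  = 1;
            $remaining = 0;
        }
        if ($in_block) {
            push @out, $line;
            # Assumption: the key word 'tail' ends the line.
            if ($line =~ /\btail\s*$/) {
                $in_block  = 0;
                $remaining = $extra;
            }
        }
        elsif ($remaining > 0) {
            push @out, $line;
            $remaining--;
        }
    }
    return @out;
}

# Invented sample: 'tails' and 'heads' inside the block are left alone.
print select_lines(2,
    "intro\n",
    "head start\n",
    "animals have tails and heads\n",
    "end tail\n",
    "c1\n", "c2\n", "c3\n",
);
```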
Re: Extracting blocks of text
by mr_mischief (Monsignor) on Jan 30, 2004 at 14:41 UTC
    This is a classic case for use of a flag variable.

    # init variable to show we're not in the block
    my $in_block = 0;

    while ( <> )                     # process line by line
    {
        $in_block = 1 if /^head/;    # test for start of block and
                                     # set flag true if needed
        print if $in_block;          # print if we're in the block
        $in_block = 0 if /tail$/;    # test for end of block and
                                     # set flag false if needed
    }

    Sorry if I misunderstood your question, but according to the way I read it I think this is close. Given this file:

    fvewvwef vfewejmnvwev evfjerwvnrevjwe wervkjvwe wevrjvrenwvr head vfjlevnerojvnerve head refejrverjvnerjovnerojvn ercjncer rljnelrkvnervervekjnve tail fknvbekjev nweclkneclknerclkernclenelrknclencekn cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail vjenvlejnvlejnrvlejnvejnvejnvejnvejvnejv head efcjonecjnercjnerjcnerjnc crjencerjncejlrcn tail

    I get this output:

    head refejrverjvnerjovnerojvn ercjncer rljnelrkvnervervekjnve tail fknvbekjev nweclkneclknerclkernclenelrknclencekn cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail head efcjonecjnercjnerjcnerjnc crjencerjncejlrcn tail

    Sometimes a simple procedural style works really well, even if you have bells and whistles available. This could be written the same in almost any language. Perl just makes it easier.



    Christopher E. Stith
