PerlMonks  

[Solved]Need to extract a particular block of lines between two patterns

by chengchl (Acolyte)
on Nov 09, 2017 at 00:39 UTC (#1202989=perlquestion)

chengchl has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have seen the post here about "Getting lines in a file between two patterns" (http://www.perlmonks.org/?node_id=979946)

I am curious: what if I need to extract only the second block of lines that matches the pattern?

Is there something we can do via (/START/ .. /END/) pattern matching? Many thanks.

That is:

I have a text file -

abc
efg
...
START
lines not to be extracted
END
START
lines to be extracted
END
START
lines not to be extracted
END
...

I think I found a way. Thanks, guys.

my $count = 0;
while (<$fh_r>) {
    if (/START/ .. /END/) {
        $count++ if /START/;
        print if ($count == 2);
    }
}

--- Updated on Nov. 10, 15pm PST Thank you so much guys. I read through all your answers and appreciate your help. Thank you all and have a nice day.
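
For readers who want to try the counting approach from the question without a file on disk, here is a minimal self-contained sketch. The sub name nth_block and the in-memory sample data are assumptions made for illustration and are not part of the original post. Like the snippet in the question, it keeps the START and END marker lines.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the lines of the
# $wanted-th START..END block, marker lines included.
sub nth_block {
    my ($fh, $wanted) = @_;
    my ($count, @lines) = (0);
    while (my $line = <$fh>) {
        if ($line =~ /START/ .. $line =~ /END/) {
            $count++ if $line =~ /START/;            # a new block begins
            push @lines, $line if $count == $wanted;
        }
    }
    return @lines;
}

# Assumed sample data, one token per line, read through an in-memory filehandle.
my $text = join "\n", qw(abc START a END START b END START c END), '';
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print nth_block($fh, 2);
```

One caveat with this sketch: each flip-flop keeps its state between calls, so input that ends in the middle of a block would leave it "on" for the next call.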

Replies are listed 'Best First'.
Re: [Solved]Need to extract a particular block of lines between two patterns
by Cristoforo (Curate) on Nov 09, 2017 at 03:34 UTC
    This solution uses the flip-flop operator: next if 1 .. /^END$/; skips everything from line 1 of the file through the first END (a constant operand such as 1 is compared with the input line number $.). The next time it encounters a START ... END block, it performs the actions in the code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<>) {
        next if 1 .. /^END$/;
        if (/^START$/ .. /^END$/) {
            next if /^START$/;
            last if /^END$/;
            print;
        }
    }

    This works for the data sample you provided.

      Hi Cristoforo,

      Thank you so much for your help. Please correct me if I have understood it wrong: the code will skip the first START ... END pattern but will output all the following patterns, right? That is to say, the third START .. END pattern will be printed out as well, even if it's not wanted?

      Thank you so much for the help again!

        Yes, it will skip the first block. It will exit the while loop (last if /^END$/) when it reaches the END for the second block.
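
The behaviour discussed in this sub-thread can be checked with a small sketch. It spells out the constant 1 as an explicit comparison with $. (which is what Perl does for a constant flip-flop operand) and collects the lines instead of printing them. The sub name second_block_body and the sample data are assumptions for illustration, not part of the reply.

```perl
use strict;
use warnings;

# Hypothetical helper: return the body of the second START..END block only.
sub second_block_body {
    my ($fh) = @_;
    my @body;
    while (my $line = <$fh>) {
        # Same as "1 .. /^END$/": true from input line 1 through the first END.
        next if $. == 1 .. $line =~ /^END$/;
        if ($line =~ /^START$/ .. $line =~ /^END$/) {
            next if $line =~ /^START$/;   # drop the opening marker
            last if $line =~ /^END$/;     # stop reading once this block ends
            push @body, $line;
        }
    }
    return @body;
}

# Assumed sample data with three blocks; only the second should come back.
my $text = "abc\nSTART\nskip\nEND\nSTART\nkeep 1\nkeep 2\nEND\nSTART\nskip too\nEND\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print second_block_body($fh);
```

Because of the last, the third block is never read, which matches the answer above.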
Re: Need to extract a particular block of lines between two patterns
by kcott (Bishop) on Nov 10, 2017 at 09:20 UTC

    G'day chengchl,

    Welcome to the Monastery.

    Here's a generic solution for your problem. It handles:

    • Extraction of any block (i.e. there's no hard-coded or constant block number).
    • Extraction of multiple blocks.
    • Blocks of lines actually containing (plural) lines.
    • Rogue START or END tokens within START-END blocks.
    • Specification of wanted blocks in any order.
    • Invalid block specifications (e.g. out of range and non-integer identifiers).

    In production code, you may want to add some form of validation and sanity checking, such that the function is short-circuited if no valid blocks are specified (which could mean not even having to open the input file).

    The following shows the technique (specifically for testing via the command line); you'll need to adapt this to your needs (e.g. change <DATA> to <$fh_r>). I've embedded test data to check all the things I've said it handles; you should create your own test data, which more realistically reflects your actual data, and use that for any proof-of-concept or regression tests.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %print_block = map { $_ => 1 } @ARGV;
    my $found_block = 0;

    while (<DATA>) {
        next unless /^START$/ .. /^END$/;
        ++$found_block, next if /^START$/;
        next if /^END$/;
        print if $print_block{$found_block};
    }

    __DATA__
    ... line BEFORE any wanted blocks ...
    START
    block A line 1
    block A line 2 with rogue END token
    block A line 3
    block A line 4 with rogue START token
    block A line 5
    END
    ... line BETWEEN any wanted blocks ...
    START
    block B line 1
    block B line 2 with rogue START token
    block B line 3
    block B line 4 with rogue END token
    block B line 5
    END
    ... line BETWEEN any wanted blocks ...
    START
    block C line 1
    block C line 2 with rogue END token
    block C line 3
    block C line 4 with rogue START token
    block C line 5
    END
    ... line BETWEEN any wanted blocks ...
    START
    block D line 1
    block D line 2 with rogue START token
    block D line 3
    block D line 4 with rogue END token
    block D line 5
    END
    ... line AFTER any wanted blocks ...

    Some example test runs (the script name is pm_1202989_flip_flop_selection.pl):

    $ pm_1202989_flip_flop_selection.pl
    $ pm_1202989_flip_flop_selection.pl 99
    $ pm_1202989_flip_flop_selection.pl A B C
    $ pm_1202989_flip_flop_selection.pl 1
    block A line 1
    block A line 2 with rogue END token
    block A line 3
    block A line 4 with rogue START token
    block A line 5
    $ pm_1202989_flip_flop_selection.pl 1 4
    block A line 1
    block A line 2 with rogue END token
    block A line 3
    block A line 4 with rogue START token
    block A line 5
    block D line 1
    block D line 2 with rogue START token
    block D line 3
    block D line 4 with rogue END token
    block D line 5
    $ pm_1202989_flip_flop_selection.pl 3 4 2    # NOTE: specified order irrelevant
    block B line 1
    block B line 2 with rogue START token
    block B line 3
    block B line 4 with rogue END token
    block B line 5
    block C line 1
    block C line 2 with rogue END token
    block C line 3
    block C line 4 with rogue START token
    block C line 5
    block D line 1
    block D line 2 with rogue START token
    block D line 3
    block D line 4 with rogue END token
    block D line 5
    $

    [Side note: As you're new here, you may have been surprised by certain responses. You can safely ignore these; a quick perusal of the "Worst Nodes" page should explain why.]

    — Ken
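
The validation and short-circuit that Ken suggests for production code might look roughly like the sketch below. The sub names valid_block_ids and extract_blocks, and the decision to treat only positive integers as valid, are assumptions for illustration. An out-of-range number such as 99 cannot be rejected up front, because the number of blocks is only known after reading the input.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only positive-integer block numbers, de-duplicated.
sub valid_block_ids {
    my %seen;
    return grep { /\A[1-9][0-9]*\z/ && !$seen{$_}++ } @_;
}

# Hypothetical wrapper around the loop shown above. It returns the wanted
# lines, or an empty list (without ever opening $path) when no usable
# block number was given.
sub extract_blocks {
    my ($path, @ids) = @_;
    my %print_block = map { $_ => 1 } valid_block_ids(@ids);
    return unless %print_block;          # short-circuit: file never opened

    open my $fh, '<', $path or die "Cannot open '$path': $!";
    my ($found_block, @out) = (0);
    while (my $line = <$fh>) {
        next unless $line =~ /^START$/ .. $line =~ /^END$/;
        ++$found_block, next if $line =~ /^START$/;
        next if $line =~ /^END$/;
        push @out, $line if $print_block{$found_block};
    }
    close $fh;
    return @out;
}

# Assumed sample data; passing a scalar reference opens an in-memory file.
my $demo = "START\nfirst\nEND\nSTART\nsecond\nEND\n";
print extract_blocks(\$demo, 2);
```

Passing a file name instead of a scalar reference reads from disk; with arguments such as A B C the sub returns before open is ever called.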

      Hi Ken

      Thank you so much for the help and the kind side note. I really appreciate it!

Re: [Solved]Need to extract a particular block of lines between two patterns
by runrig (Abbot) on Nov 09, 2017 at 21:50 UTC
    Use the return value of the flip-flop:
    my $count;
    while (<$fh_r>) {
        if (my $status = /START/ .. /END/) {
            $count++ if $status == 1;
            print if $count == 2;
        }
    }

      Hi Runrig,

      Thank you so much for the clear explanation! It works perfectly on my side. Out of curiosity, I also modified the code to print $status for each line of the matched block, using print "$status\t$count\t$_" if $count == 2; and got this output:

      1    2    START
      2    2    lines to be extracted
      3E0  2    END

      Do you by any chance know why the third line is 3E0 and what does that stand for? Thank you in advance!

        ... why the third line is 3E0 and what does that stand for?

        From the discussion of Range Operators in scalar context (the "flip-flop" operator):

        The right operand is not evaluated while the operator is in the "false" state, and the left operand is not evaluated while the operator is in the "true" state. ... The value returned is either the empty string for false, or a sequence number (beginning with 1) for true. The sequence number is reset for each range encountered. The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.
        [Emphases added]
        So IOW, the final sequence number matches qr{ \A \d+ E0 \z }xms. See also The Scalar Range Operator ("Excluding Markers" section) and Flipin good, or a total flop? (specifically Re: Flipin good, or a total flop?), both in the Monastery's Tutorials section.


        Give a man a fish:  <%-{-{-{-<
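
As a rough illustration of "something to search for if you want to exclude the endpoint", the sketch below combines the flip-flop return value with the E0 suffix to return only the lines between the markers. The sub name nth_block_body and the sample data are assumptions for illustration and do not come from any reply in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: lines of the $wanted-th block, without START/END.
sub nth_block_body {
    my ($fh, $wanted) = @_;
    my ($count, @body) = (0);
    while (my $line = <$fh>) {
        my $status = $line =~ /START/ .. $line =~ /END/;
        next unless $status;                 # '' means outside any block
        $count++ if $status == 1;            # sequence number 1: block start
        next if $status == 1                 # skip the START line
             or $status =~ /E0\z/;           # skip the END line ("3E0", ...)
        push @body, $line if $count == $wanted;
    }
    return @body;
}

# Assumed sample data, read through an in-memory filehandle.
my $text = "pre\nSTART\na\nEND\nSTART\nb1\nb2\nEND\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print nth_block_body($fh, 2);
```

The numeric comparison is safe under warnings because "3E0" is a valid number, and the empty string is filtered out before any comparison takes place.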
