walker has asked for the wisdom of the Perl Monks concerning the following question:
Help ... I'm brand new to perl so I apologize if this is a very elementary problem (and please add as many comments to your reply as you can).
I need to extract blocks of text from a large file.
The text block starts with a key word ("head") and, after one or more lines, a line ends with "tail".
I need every line between the 2 key words including the lines the key words are on.
I've attempted to apply several of the examples, but with no success.
Thanks in advance for your assistance.
Re: Extracting blocks of text
by Rhose (Priest) on Jan 30, 2004 at 14:38 UTC
|
You could also use the range (flip-flop) operator. The sample below will print lines from the line which starts with "head" (^ anchors to the start) to the one which ends with "tail" (\s*$ allows some white space after tail.)
#!/usr/bin/perl
use strict;
use warnings;
while(<DATA>)
{
print if /^head/i../tail\s*$/i;
}
__DATA__
HEAD gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla tail gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus head bla bla gugus gugus tail
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus head
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
tail gugus gugus
Output
HEAD gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla tail gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus head bla bla gugus gugus tail
Update
If you have the camel book, you can find a discussion on this starting on page 90 (2nd Edition).
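If you would rather collect each block than print it straight away, the value returned by the range operator can help. In scalar context it returns a sequence number while the range is active, and the number for the last line of a range has "E0" appended. What follows is only a sketch under some assumptions: the sample text, the in-memory filehandle and the names @blocks, $current and $seq are made up for illustration and are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle
my $text = <<'END';
junk
HEAD a
b
c tail
junk
x head y tail
HEAD z tail
end
END
open my $in, '<', \$text or die "cannot open in-memory file: $!";

my @blocks;           # one string per extracted block
my $current = '';     # lines of the block being collected

while ( my $line = <$in> ) {
    # False outside a range; 1, 2, 3, ... inside it.
    # The last line of a range gets "E0" appended, e.g. "3E0".
    my $seq = ( $line =~ /^head/i .. $line =~ /tail\s*$/i );
    next unless $seq;

    $current .= $line;
    if ( $seq =~ /E0$/ ) {          # this line closed the block
        push @blocks, $current;
        $current = '';
    }
}
push @blocks, $current if length $current;   # block with no closing tail

printf "block %d:\n%s", $_ + 1, $blocks[$_] for 0 .. $#blocks;
```

With the two-dot form both tests run on the same line, so a line that starts with "head" and ends with "tail" forms a one-line block of its own.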
Re: Extracting blocks of text
by BrowserUk (Patriarch) on Jan 30, 2004 at 15:01 UTC
|
You can use $/ (see perlvar) and set it to a string to control what the diamond operator sees as a line ending. By setting this to 'head' and then 'tail' alternately, you can move through your large file in chunks, discarding the 1st, 3rd, 5th and printing the 2nd, 4th, 6th, etc.
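#! perl -slw
use strict;
open IN, '<', $ARGV[ 0 ] or die $!;
$/ = 'head';                        # read (and discard) up to and including 'head'
while( <IN> ) {
    local $/ = 'tail';              # then read up to and including the next 'tail'
    my $chunk = <IN>;
    print $chunk if defined $chunk; # undef when the file ends before another block
}
close IN;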
__END__
P:\test>type junk.txt
The quick brown fox jumps over the lazy dog 0001
head The quick brown fox jumps over the lazy dog 0002
The quick brown fox jumps over the lazy dog 0003
The quick brown fox jumps over the lazy dog 0004
The quick brown fox jumps over the lazy dog 0005
tail The quick brown fox jumps over the lazy dog 0006
The quick brown fox jumps over the lazy dog 0007
The quick brown fox jumps over the lazy dog 0008
headThe quick brown fox jumps over the lazy dog 0009
The quick brown fox jumps over the lazy dog 0010
tail The quick brown fox jumps over the lazy dog 0011
The quick brown fox jumps over the lazy dog 0012
P:\test>235232 junk.txt
The quick brown fox jumps over the lazy dog 0002
The quick brown fox jumps over the lazy dog 0003
The quick brown fox jumps over the lazy dog 0004
The quick brown fox jumps over the lazy dog 0005
tail
The quick brown fox jumps over the lazy dog 0009
The quick brown fox jumps over the lazy dog 0010
tail
The caveat is that if the chunks you are discarding (between a 'tail' and the next 'head' marker) are very large, they will consume large amounts of memory.
As implemented above, the 'head' marker is discarded, but the 'tail' marker is printed. Add or delete as necessary.
This also assumes that by "including the lines the key words are on.", you do not mean that you want any text preceding the 'head' marker, if the head marker is in the middle of a line, nor anything after the 'tail' marker if it can appear in the middle of a line.
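As one possible way of "adding" the marker back, the sketch below keeps the same alternating $/ idea but stores each chunk with 'head' put back on the front. It is only an illustration: the sample text, the in-memory filehandle and the names @chunks, $skipped and $chunk are my own assumptions, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle
my $text = "skip this\nhead one\ntwo\ntail three\nskip\nhead four tail\nrest\n";
open my $in, '<', \$text or die "cannot open in-memory file: $!";

my @chunks;
{
    local $/ = 'head';                      # records end at the head marker
    while ( defined( my $skipped = <$in> ) ) {
        # At end of file the record may not end with the marker at all
        last unless $skipped =~ /head\z/;

        local $/ = 'tail';                  # next record ends at the tail marker
        my $chunk = <$in>;
        last unless defined $chunk;
        push @chunks, "head$chunk";         # put the discarded marker back
    }
}
close $in;

print "$_\n---\n" for @chunks;
```

The inner local is undone at the end of every pass through the loop body, so the while condition always reads with $/ set to 'head' again.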
|
this has been an educating discussion...how about a twist?
I am looking to parse a large file, and extract blocks of text that begin with the word term. I can't always anticipate how the block will end, other than by stating that whenever the word term appears, a new block begins.
is there a way to create an array where each element is a text block that begins with the word term, and that element ends immediately before the next occurrence of the word term?
example file:
term {
yada yada
12345
() ...
}
term only occurs here {
could be 30 lines here
but never that word again until
another block starts
yadada
}
term, etc.
_END_
so, this file would hopefully result in an array with 3 elements. another challenge, is that the last text block will not have the word term at the end of it.
thanks in advance :-)
ad3
|
#! perl -slw
use strict;
my @array = split 'term', do{ local $/; <DATA> };
shift @array; ## Discard leading null
print '---', "\n", $_, "\n" for @array;
__DATA__
term {
yada yada
12345
() ...
}
term only occurs here {
could be 30 lines here
but never that word again until
another block starts
yadada
}
term, etc.
That discards the term itself. If you want to retain the term in each element, then perhaps the simplest way is to just put it back after the split. Just substitute this line into the above.
my @array = map{ "term$_" } split 'term', do{ local $/; <DATA> };
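Another option, offered as a sketch rather than as the method above, is to split on a zero-width lookahead so that the word is never consumed in the first place. The pattern, the shortened sample text and the name @blocks are my own assumptions; the \b and the ^ with /m simply restrict the match to the word "term" at the start of a line.

```perl
use strict;
use warnings;

# Hypothetical, shortened version of the example file
my $text = <<'END';
term {
yada yada
}
term only occurs here {
could be 30 lines here
}
term, etc.
END

# Split *before* each line that starts with the word "term".
# The lookahead matches without consuming, so "term" stays in each element.
my @blocks = split /^(?=term\b)/m, $text;

printf "%d blocks\n", scalar @blocks;
print "---\n$_" for @blocks;
```

Because a zero-width match at the very start of the string does not produce an empty leading field, there is nothing to shift off afterwards.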
Re: Extracting blocks of text
by pelagic (Priest) on Jan 30, 2004 at 14:17 UTC
|
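#!/usr/bin/perl
use strict;
my $inputfile = shift;                 # file name comes from the command line
my $withinBlock = 0;                   # flag: 1 while we are inside a head..tail block
open (IN, "<$inputfile") || die "could not open $inputfile\n";
while (<IN>) {
    if (/head/) {                      # a line containing "head" starts a block
        $withinBlock = 1;
        print $_;
        if (/tail/) {                  # head and tail on the same line: block ends at once
            $withinBlock = 0;
            print "\n";
        }
    }
    if ($withinBlock) {                # still inside a block (this also re-prints a head line)
        print $_;
        if (/tail/) {                  # a line containing "tail" ends the block
            $withinBlock = 0;
            print "\n";
        }
    }
}
close (IN);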
I ran it with this file:
bla head gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla tail gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus head bla bla gugus gugus tail
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus head
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
tail gugus gugus
and it printed:
bla head gugus gugus bla bla gugus gugus
bla head gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla tail gugus bla bla gugus gugus
bla bla gugus head bla bla gugus gugus tail
bla bla gugus gugus bla bla gugus gugus head
bla bla gugus gugus bla bla gugus gugus head
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
bla bla gugus gugus bla bla gugus gugus
tail gugus gugus
It does not work properly if there is a head after a tail on the same line ...
pelagic
|
This one worked GREAT !!!
I need to print 5 lines after the "tail" key word...and I don't understand why there are 2 tests for tail and 2 print commands ?
|
I need to print 5 lines after the "tail" key word...
Why didn't you say so in the first place? That would change how people answer the question.
and I don't understand why there are 2 tests for tail and 2 print commands ?
Well, actually, there's no need for the duplication. The following would work just as well -- and would cover your little "amendment" to the original spec:
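#!/usr/bin/perl
use strict;
my $inputfile = shift;
my $withinBlock = 0;    # 6 = inside a block; 5..1 = lines left to print after "tail"; 0 = off
open (IN, "<$inputfile") || die "could not open $inputfile\n";
while (<IN>) {
    if (/head/) {
        $withinBlock = 6;       # (re)start a block
    }
    if ($withinBlock) {
        print $_;
        $withinBlock-- unless $withinBlock == 6;    # count down only after a "tail"
    }
    if (/tail/) {
        $withinBlock = 5;       # print five more lines, then stop
    }
}
close (IN);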
Note that if there is a new "head" line within the five lines that follow a "tail", the $withinBlock state variable gets reset to 6, and will stay there till the next "tail". If there is no "head" within the next five lines, it will decrement to 0, turning off the output.
Another "feature" of this version is that if there is a "tail" line without a previous "head", the five lines following "tail" will still get printed. One more thing: since the head and tail regexes are not anchored, the logic will fire whenever these words happen to show up in the data -- e.g:
blah blah
head
This is a bunch of text in a target block.
It includes excerpts from a book on animals,
which have tails. So this line will cause the
output to be turned off
after the next
five
lines,
i.e. here.
So you won't get to see this line
or this one.
tail
But you'll see this one
and
these
lines
too.
Now the output is off again, but since we're talking
about animals, which all have heads, the output is now
on again, and you see the previous and current lines,
as well as this and the next two...
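If stray words like "tails" and "heads" are a concern, one way around it (a sketch under my own assumptions, not a drop-in replacement for the code above) is to match whole words with \b and to keep the "inside a block" state separate from the "lines still owed after tail" counter. The sample text and the names $in_block, $trailing and @printed are invented for illustration. Unlike the version above, a "tail" with no preceding "head" is ignored here.

```perl
use strict;
use warnings;

# Hypothetical sample input, read through an in-memory filehandle
my $text = <<'END';
intro
head
animals have tails and heads
tail
after 1
after 2
after 3
after 4
after 5
after 6
END
open my $in, '<', \$text or die "cannot open in-memory file: $!";

my $in_block = 0;   # true between a head line and its tail line
my $trailing = 0;   # lines still to print after the tail line
my @printed;

while ( my $line = <$in> ) {
    # \b makes "head" match only as a whole word, so "heads" is ignored
    if ( !$in_block && $line =~ /\bhead\b/ ) {
        $in_block = 1;
        $trailing = 0;
    }
    if ($in_block) {
        push @printed, $line;
        if ( $line =~ /\btail\b/ ) {    # whole word only, so "tails" is ignored
            $in_block = 0;
            $trailing = 5;              # five more lines after the tail line
        }
    }
    elsif ($trailing) {
        push @printed, $line;
        $trailing--;
    }
}
close $in;

print @printed;
```

Whether whole-word matching is the right rule depends on the real data; anchoring with ^ and $ as in the other replies is the stricter alternative.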
|
Re: Extracting blocks of text
by mr_mischief (Monsignor) on Jan 30, 2004 at 14:41 UTC
|
This is a classic case for use of a flag variable.
# init variable to show we're not in the block
my $in_block = 0;
while ( <> ) # process line by line
{
$in_block = 1 if /^head/; # test for start of block and
# set flag true if needed
print if $in_block; # print if we're in the block
$in_block = 0 if /tail$/; # test for end of block and
# set flag false if needed
}
Sorry if I misunderstood your question, but according to the way I read it I think this is close. Given this file:
fvewvwef vfewejmnvwev evfjerwvnrevjwe
wervkjvwe wevrjvrenwvr head
vfjlevnerojvnerve
head refejrverjvnerjovnerojvn ercjncer
rljnelrkvnervervekjnve tail fknvbekjev
nweclkneclknerclkernclenelrknclencekn
cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail
vjenvlejnvlejnrvlejnvejnvejnvejnvejvnejv
head efcjonecjnercjnerjcnerjnc
crjencerjncejlrcn
tail
I get this output:
head refejrverjvnerjovnerojvn ercjncer
rljnelrkvnervervekjnve tail fknvbekjev
nweclkneclknerclkernclenelrknclencekn
cwlknelcnlcwnejnrjnrjcnjcncjncnccjn tail
head efcjonecjnercjnerjcnerjnc
crjencerjncejlrcn
tail
Sometimes a simple procedural style works really well, even if you have bells and whistles available. This could be written the same in almost any language. Perl just makes it easier.