REGEX on multiple lines

igotlongestname has asked for the wisdom of the Perl Monks concerning the following question:

I posted here once before with a question about regexes, and you guys were fantastically helpful. I can now write (probably very ugly but) effective scripts for getting the info that I need, so first off, thanks to you all. My problem now is that I want to grab multiple lines of data that occur repeatedly in the file. For example:

0AVERAGE COMPOSITION IN PINS.  NUMBER DENSITIES IN 1.0E+24/CM3,  WT% P
+ER MASS INITIAL HEAVY ISOTOPES.
 ----------------------------  FOR BA-ELEMENTS WITH EID>99100, WT% IS 
+THE PERCENTAGE LEFT (FRACTION).
0 EID:     Cm-243
  ND : 8.7352E-08
      0 
      0  3822 
      0  3278     0 
      0  3260     0   +++ 
      0  3242     0   +++   +++ 
      0  3157     0     0     0   +++ 
      0  3096     0     0     0   +++   +++ 
      0  3170     0     0     0     0     0     0 
      0  3772  3170  3096  3157  3242  3260  3278  3822*
      0     0     0     0     0     0     0     0     0     0 
1GE 12 Bundle VOID=0%                                                 
+ >> PHOENUT /1.2.8   / << CORE MASTER 
    9  COMPOS                      CASE= 1 RP=  5 V= 2.9 CO= 0 B= 3307
+ 2007-01-30  13.38.50  Page 668  Job0000
 
 
0AVERAGE COMPOSITION IN PINS.  NUMBER DENSITIES IN 1.0E+24/CM3,  WT% P
+ER MASS INITIAL HEAVY ISOTOPES.
 ----------------------------  FOR BA-ELEMENTS WITH EID>99100, WT% IS 
+THE PERCENTAGE LEFT (FRACTION).
0 EID:     Pu-238
  ND : 7.0913E-06
      1 
      1  3667 
      0  3283     0 
      0  3266     0   +++ 
      0  3250     0   +++   +++ 
      0  3192     0     0     0   +++ 
      0  3151     0     0     0   +++   +++ 
      0  3204     0     0     0     0     0     0 
      1  3630  3204  3151  3192  3250  3266  3283  3667*
      1     1     0     0     0     0     0     0     1     1 
1GE 12 Bundle VOID=0%                                                 
+ >> PHOENUT /1.2.8   / << CORE MASTER 
    9  COMPOS                      CASE= 1 RP=  5 V= 2.9 CO= 0 B= 3307
+ 2007-01-30  13.38.50  Page 669  Job0000

In this example I want to grab all the info about Pu-238, where information for many other elements occurs before and after Pu-238. In addition, there are multiple statepoints throughout the file, and therefore multiple occurrences of Pu-238. I know that Pu-238 (or whatever isotope I want to search for) is a unique identifier; my problem is grabbing all the numerical data in the format it already has in the file. I started some code, attached below, but it is definitely not complete, since I wasn't sure of the best way to grab multiple lines and then write them to an output file. Any suggestions? Thanks!

#!/usr/local/bin/perl -w
use IO::File;
my $file = IO::File->new;
print "Enter the output file you would like to analyze: ";
chomp ($filename = <STDIN>);
print "Enter the isotope you want to extract (ex: Am-241): ";
chomp ($iso= <STDIN>);
$file->open("< $filename") or die("Can't read the source:$!");

open(OUT, ">Comp_$filename");

select (OUT);

@iso=();
until ($file->eof) {
   my $line = $file->getline();
   if($line =~ /"$iso"/) {
      $line = $file->getline();
      chomp($line);
      @col1 = split(qr/\s+/s, $line);
      push(@iso,"$col1[1] $col[2] $col[3]");
      $line = $file->getline();
      chomp($line);
      @col1 = split(qr/\s+/s, $line);
      push(@iso,"$col1[1]");
#I INTENDED TO DO THIS SAME PROCESS OVER AND OVER UNTIL THE FINAL LINE
# WAS PROCESSED, THEN LET THE REGEX SEARCH FOR THE NEXT INSTANCE OF
# WHATEVER IS DESIRED
   }
} # end of until
#for($i=1; $i<=28; $i++){
#      print "UNSURE WHAT THE BEST WAY TO PRINT IN ORDER IS";
#}

close(OUT);

Looking at how I'm approaching it, I feel there must be a better way to grab multiple lines and save them in the form they're already in for later access, but I'm unsure how to do this, or whether some other approach would work well. Also, does using a regex this way work (meaning putting a variable into it, as in =~ /"$iso"/)? Any help would be appreciated ... thanks!

Re: REGEX on multiple lines by GrandFather (Saint) on Jan 30, 2007 at 20:02 UTC
A "trick" for reading regularly formatted material like this is to change the input record separator to the header string for each block. Consider:

use strict;
use warnings;

$/ = '0AVERAGE COMPOSITION IN PINS.';

while (<DATA>) {
    next if ! /EID:\s+Pu-238/;
    chomp;
    print "$/$_";
}

__DATA__
(data per OP's sample)

Prints:

0AVERAGE COMPOSITION IN PINS.  NUMBER DENSITIES IN 1.0E+24/CM3,  WT% P
+ER MASS INITIAL HEAVY ISOTOPES.
 ----------------------------  FOR BA-ELEMENTS WITH EID>99100, WT% IS 
+THE PERCENTAGE LEFT (FRACTION).
0 EID:     Pu-238
  ND : 7.0913E-06
      1 
      1  3667 
      0  3283     0 
      0  3266     0   +++ 
      0  3250     0   +++   +++ 
      0  3192     0     0     0   +++ 
      0  3151     0     0     0   +++   +++ 
      0  3204     0     0     0     0     0     0 
      1  3630  3204  3151  3192  3250  3266  3283  3667*
      1     1     0     0     0     0     0     0     1     1 
1GE 12 Bundle VOID=0%
+ >> PHOENUT /1.2.8   / << CORE MASTER 
    9  COMPOS                      CASE= 1 RP=  5 V= 2.9 CO= 0 B= 3307
+ 2007-01-30  13.38.50  Page 669  Job0000

DWIM is Perl's answer to Gödel
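Assigning to `$/` at file scope changes it for every later read in the program, including reads from STDIN. A minimal sketch of the same separator approach with the change confined to a sub via `local`; the sub name `extract_records` and the two-record sample data are made up for illustration, and the header string is assumed to be the one from the OP's sample:

```perl
use strict;
use warnings;

# Hypothetical helper: return every header-delimited record matching $pattern.
sub extract_records {
    my ($fh, $pattern) = @_;
    my $header = '0AVERAGE COMPOSITION IN PINS.';
    local $/ = $header;    # previous value is restored when the sub returns
    my @found;
    while (my $record = <$fh>) {
        chomp $record;     # chomp removes a trailing $/, i.e. the next header
        push @found, $header . $record if $record =~ $pattern;
    }
    return @found;
}

# In-memory filehandle standing in for the real report file (made-up data).
my $sample = join '',
    map { "0AVERAGE COMPOSITION IN PINS.\n0 EID:     $_\n  ND : 1.0E-06\n" }
    qw(Cm-243 Pu-238);
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
print extract_records($fh, qr/EID:\s+Pu-238/);
close $fh;
```

Passing the pattern in as a `qr//` object keeps the sub reusable for other isotopes without touching the record-splitting logic.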
Re^2: REGEX on multiple lines by johngg (Canon) on Jan 30, 2007 at 20:22 UTC
Possibly a minor point but, from what I remember of my Fortran programming days and line printer control characters, I suspect that the actual header string for each block is the `1GE 12 Bundle VOID=0%`. If memory serves, "1" meant throw a page, "0" meant double-line spacing, " " meant single-line spacing and "+" meant over-print. So I think that would mean that the "Pu-238" data actually appeared on Page 668, just in case the stuff at the top of the page has relevance to the OP's problem.

Cheers,

JohnGG
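The carriage-control convention recalled above can be sketched in Perl as a simple lookup on the first column. This is only an illustration, assuming every line of the report really does start with a control character; the sub name `describe_carriage_control` is hypothetical and the sample lines are taken from the OP's data:

```perl
use strict;
use warnings;

# Fortran carriage-control characters (first column of each line) mapped
# to the printer action described in the post above.
my %action = (
    '1' => 'new page',
    '0' => 'double spacing',
    ' ' => 'single spacing',
    '+' => 'overprint',
);

# Hypothetical helper: split a line into its control action and its text.
sub describe_carriage_control {
    my ($line) = @_;
    my $ctl  = substr($line, 0, 1);
    my $text = length($line) > 1 ? substr($line, 1) : '';
    return ($action{$ctl} // 'unknown', $text);
}

for my $line ('1GE 12 Bundle VOID=0%', '0 EID:     Pu-238', '  ND : 7.0913E-06') {
    my ($what, $text) = describe_carriage_control($line);
    print "$what: $text\n";
}
```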
Re^2: REGEX on multiple lines by igotlongestname (Acolyte) on Jan 31, 2007 at 15:14 UTC
Thank you sir, that was exactly what I needed. I attempted to incorporate what Graff said below, but quickly found that what he said is above my level, at least at the moment. I am still quite new and learning things, so thank you so much for your suggestion and coding. I'll attach what I ended up with here in case you see something "bad" or whatever. I simply added what I wanted before it, and allowed a variable in the regex search. I understand what Graff was saying well enough to catch that what I'm doing is a bad idea, but in time I'll learn better ways, and I'll make do with my limited knowledge for now. Thanks to you and graff both!

#!/usr/local/bin/perl -w
use strict;
use warnings;

print "Enter the output file you would like to analyze: ";
chomp (my $filename = <STDIN>);
print "Enter the isotope you want to extract (ex: Am-241): ";
chomp (my $iso= <STDIN>);
open(IN, "<", $filename) or die("Can't read the source:$!");

open(OUT, ">Comp_$filename");

select (OUT);

$/ = '0AVERAGE COMPOSITION IN PINS.';

while (<IN>) {
    next if ! /$iso/;
    chomp;
    print "$/$_";
}
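On the earlier question of putting a variable in a regex: Perl interpolates `$iso` directly, so the double quotes in `/"$iso"/` are matched literally and are not wanted. A minimal sketch, assuming the isotope always follows `EID:` as in the sample; `\Q...\E` escapes any regex metacharacters in the user's input, and the sub name `record_matches` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the record's EID line names exactly $iso.
sub record_matches {
    my ($record, $iso) = @_;
    # \Q...\E quotes metacharacters in $iso; (?!\w) stops Pu-23 matching Pu-238.
    return $record =~ /EID:\s+\Q$iso\E(?!\w)/ ? 1 : 0;
}

my $record = "0 EID:     Pu-238\n  ND : 7.0913E-06\n";
for my $iso ('Pu-238', 'Pu-23', '"Pu-238"') {
    printf "%-10s %s\n", $iso, record_matches($record, $iso) ? 'matches' : 'no match';
}
```

Here only `Pu-238` matches; `Pu-23` is rejected by the lookahead, and the quoted form fails because the record contains no quote characters.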
Re^3: REGEX on multiple lines by GrandFather (Saint) on Jan 31, 2007 at 19:39 UTC
Mostly what Graff was suggesting was that you should get your input from the command line parameters rather than prompt the user for them. That makes it easier to use the script in an automated context, where you may wish to do several runs with different parameters. In essence it means replacing:

print "Enter the output file you would like to analyze: ";
chomp (my $filename = <STDIN>);
print "Enter the isotope you want to extract (ex: Am-241): ";
chomp (my $iso= <STDIN>);

with something like:

# Validate parameters
@ARGV == 2 or error ("Too few parameters");
$ARGV[0] =~ /[A-Z][a-z]?-\d{1,3}/ or error ("Expected an isotope first");
-f $ARGV[1] or error ("<$ARGV[1]> is not a file");

# Extract parameters
my ($iso, $filename) = @ARGV;

...

sub error {
    my $msg = shift;

    print <<"USAGE";
Error: $msg.

ExtractIso parses a given dibbly report file and extracts the record for a
given isotope. Use as:

    ExtractIso <iso> <filename>

<iso> is the isotope to be extracted. For example Am-214.
<filename> is the filename (with path as required) of the record file.

For example:

    ExtractIso Am-214 dibbly.dat

would extract and print the record for Am-214 from the dibbly.dat file in the
current directory.
USAGE
    exit -1;
}

The error sub uses a HEREDOC to provide an error diagnostic and usage information.

DWIM is Perl's answer to Gödel
Re: REGEX on multiple lines by graff (Chancellor) on Jan 31, 2007 at 03:40 UTC
In addition to what the other replies above have said (which is the way I would do it), I would strongly recommend using @ARGV to get your input parameters (input file name and pattern to look for). Something like this:

#!/usr/local/bin/perl
use strict;
use warnings;

my $Usage = "Usage: $0 search-pattern data.file\n";
( @ARGV == 2 and -f $ARGV[1] ) or die $Usage;

my ( $pattern, $ifile ) = @ARGV;

open( IN, "<", $ifile ) or die "$ifile: $!\n$Usage";
$/ = "\n1";    # use FORTRAN page break as input record separator

my $opattern = $pattern;
$opattern =~ s/[^\w.-]/_/g;        # limit range of characters to be
my $ofile = "$ifile.$opattern";    # used in output file name
open( OUT, ">", $ofile ) or die "$ofile: $!\n";

while (<IN>) {
    print OUT if ( /\n0 EID:\s+$pattern\s+/ );
}
close OUT;

When you do it that way, it's easier to automate multiple runs (extract a given pattern from several files, or several patterns from one file, or several patterns from several files, etc...), as well as being easier, quicker and less error-prone to run the script manually.

