PerlMonks  

REGEX on multiple lines

by igotlongestname (Acolyte)
on Jan 30, 2007 at 19:47 UTC ( [id://597434]=perlquestion )

igotlongestname has asked for the wisdom of the Perl Monks concerning the following question:

I posted here once before with a question about regexes, and you were all fantastically helpful. Since then I have gotten to the point where I can write (probably very ugly) but effective scripts for getting the info that I need, so first off, thanks to you all. My problem now is that I want to grab multiple lines of data that occur repeatedly in the file. For example:

0AVERAGE COMPOSITION IN PINS. NUMBER DENSITIES IN 1.0E+24/CM3, WT% PER MASS INITIAL HEAVY ISOTOPES.
---------------------------- FOR BA-ELEMENTS WITH EID>99100, WT% IS THE PERCENTAGE LEFT (FRACTION).
0 EID: Cm-243 ND : 8.7352E-08
0
0 3822
0 3278 0
0 3260 0 +++
0 3242 0 +++ +++
0 3157 0 0 0 +++
0 3096 0 0 0 +++ +++
0 3170 0 0 0 0 0 0
0 3772 3170 3096 3157 3242 3260 3278 3822*
0 0 0 0 0 0 0 0 0 0
1GE 12 Bundle VOID=0% >> PHOENUT /1.2.8 / << CORE MASTER 9 COMPOS CASE= 1 RP= 5 V= 2.9 CO= 0 B= 3307 2007-01-30 13.38.50 Page 668 Job0000
0AVERAGE COMPOSITION IN PINS. NUMBER DENSITIES IN 1.0E+24/CM3, WT% PER MASS INITIAL HEAVY ISOTOPES.
---------------------------- FOR BA-ELEMENTS WITH EID>99100, WT% IS THE PERCENTAGE LEFT (FRACTION).
0 EID: Pu-238 ND : 7.0913E-06
1
1 3667
0 3283 0
0 3266 0 +++
0 3250 0 +++ +++
0 3192 0 0 0 +++
0 3151 0 0 0 +++ +++
0 3204 0 0 0 0 0 0
1 3630 3204 3151 3192 3250 3266 3283 3667*
1 1 0 0 0 0 0 0 1 1
1GE 12 Bundle VOID=0% >> PHOENUT /1.2.8 / << CORE MASTER 9 COMPOS CASE= 1 RP= 5 V= 2.9 CO= 0 B= 3307 2007-01-30 13.38.50 Page 669 Job0000

In this example I want to grab all the info about Pu-238, where information for many other elements occurs before and after Pu-238. In addition, there are multiple statepoints throughout the file, and therefore multiple occurrences of Pu-238. I know that Pu-238 (or whatever isotope I want to search for) is a unique identifier; my problem is grabbing all the numerical data in the format it already has in the file. I started some code, which is attached below, but it is definitely not complete, since I wasn't sure what the best way is to grab multiple lines and then write them to an output file. Any suggestions? Thanks!

#!/usr/local/bin/perl -w
use IO::File;

my $file = IO::File->new;
print "Enter the output file you would like to analyze: ";
chomp ($filename = <STDIN>);
print "Enter the isotope you want to extract (ex: Am-241): ";
chomp ($iso= <STDIN>);
$file->open("< $filename") or die("Can't read the source:$!");
open(OUT, ">Comp_$filename");
select (OUT);
@iso=();
until ($file->eof) {
    my $line = $file->getline();
    if($line =~ /"$iso"/) {
        $line = $file->getline();
        chomp($line);
        @col1 = split(qr/\s+/s, $line);
        push(@iso,"$col1[1] $col[2] $col[3]");
        $line = $file->getline();
        chomp($line);
        @col1 = split(qr/\s+/s, $line);
        push(@iso,"$col1[1]");
        #I INTENDED TO DO THIS SAME PROCESS OVER AND OVER UNTIL THE FINAL LINE WAS PROCESSED, THEN LET THE REGEX SEARCH FOR THE NEXT INSTANCE OF WHATEVER IS DESIRED
    }
} # end of until
#for($i=1; $i<=28; $i++){
#    print "UNSURE WHAT THE BEST WAY TO PRINT IN ORDER IS";
#}
close(OUT);

Looking at how I'm approaching it, I feel there must be a better way to grab multiple lines and save them in the form they're already in for later use, but I'm unsure how to do this, or whether some other approach would work well. Also, does using a regex this way work (meaning interpolating a variable into it, as in =~ /"$iso"/)? Any help would be appreciated ... thanks!

Replies are listed 'Best First'.
Re: REGEX on multiple lines
by GrandFather (Saint) on Jan 30, 2007 at 20:02 UTC

    A "trick" for reading regularly formatted material like this is to change the input record separator ($/) to the header string for each block. Consider:

    use strict;
    use warnings;

    $/ = '0AVERAGE COMPOSITION IN PINS.';

    while (<DATA>) {
        next if ! /EID: Pu-238/;
        chomp;
        print "$/$_";
    }

    __DATA__

    Prints:

    0AVERAGE COMPOSITION IN PINS. NUMBER DENSITIES IN 1.0E+24/CM3, WT% PER MASS INITIAL HEAVY ISOTOPES.
    ---------------------------- FOR BA-ELEMENTS WITH EID>99100, WT% IS THE PERCENTAGE LEFT (FRACTION).
    0 EID: Pu-238 ND : 7.0913E-06
    1
    1 3667
    0 3283 0
    0 3266 0 +++
    0 3250 0 +++ +++
    0 3192 0 0 0 +++
    0 3151 0 0 0 +++ +++
    0 3204 0 0 0 0 0 0
    1 3630 3204 3151 3192 3250 3266 3283 3667*
    1 1 0 0 0 0 0 0 1 1
    1GE 12 Bundle VOID=0% >> PHOENUT /1.2.8 / << CORE MASTER 9 COMPOS CASE= 1 RP= 5 V= 2.9 CO= 0 B= 3307 2007-01-30 13.38.50 Page 669 Job0000
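
The separator trick above can be tried on a small in-memory string. The sketch below is only an illustration: `extract_records`, the `HEADER` marker and the miniature report are made up for the example, and `local $/` is an added precaution so the changed separator does not leak into the rest of a larger script.

```perl
use strict;
use warnings;

# Hypothetical helper: return every record in $text whose EID field names
# $isotope. $header is the block header used as the input record separator.
sub extract_records {
    my ($text, $header, $isotope) = @_;
    my @found;

    # Read from a string as if it were a file (in-memory filehandle).
    open(my $in, '<', \$text) or die "Can't open in-memory file: $!";

    local $/ = $header;    # restored automatically when the sub returns
    while (my $record = <$in>) {
        # Each read ends with the NEXT block's header; chomp strips it.
        chomp $record;
        next unless $record =~ /EID:\s+\Q$isotope\E\s/;
        # Put this block's own header back in front.
        push @found, $header . $record;
    }
    close $in;
    return @found;
}

# Made-up miniature report, not the real file layout.
my $report = join '',
    "HEADER\n0 EID: Cm-243 ND : 8.7352E-08\n 0\n 0 3822\n",
    "HEADER\n0 EID: Pu-238 ND : 7.0913E-06\n 1\n 1 3667\n";

my @records = extract_records($report, 'HEADER', 'Pu-238');
print @records;
```

The very first read returns only the header itself (nothing precedes it), so after `chomp` it is an empty string and is skipped by the `EID:` test.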

    DWIM is Perl's answer to Gödel
      Possibly a minor point but, from what I remember of my Fortran programming days and line printer control characters, I suspect that the actual header string for each block is the 1GE 12 Bundle VOID=0%. If memory serves, "1" meant throw a page, "0" meant double-line spacing, " " meant single-line spacing and "+" meant over-print. So I think that would mean that the "Pu-238" data actually appeared on Page 668, just in case the stuff at the top of the page has relevance to the OP's problem.

      Cheers,

      JohnGG
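
The control characters described above can be written down as a small lookup table. This is a sketch based only on the recollection in this reply, not on a FORTRAN reference; `decode_line` and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Meaning of the first column of each print line, as recalled above.
my %carriage_control = (
    '1' => 'new page',
    '0' => 'double spacing',
    ' ' => 'single spacing',
    '+' => 'overprint',
);

# Hypothetical helper: split a print line into (action, remaining text).
sub decode_line {
    my ($line) = @_;
    return ('unknown', '') if $line eq '';

    my $control = substr($line, 0, 1);
    my $action  = exists $carriage_control{$control}
                ? $carriage_control{$control}
                : 'unknown';
    return ($action, substr($line, 1));
}

for my $line ('1GE 12 Bundle VOID=0%', '0 EID: Pu-238', ' 1 3667') {
    my ($action, $text) = decode_line($line);
    printf "%-15s|%s\n", $action, $text;
}
```

Read this way, a line starting with `1` opens a page, which is why the `1GE 12 Bundle` line is a candidate for the real record boundary.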

      Thank you sir, that was exactly what I needed. I attempted to incorporate what graff said below, but quickly found that what he said is above my level, at least at the moment. I am still quite new and learning things, so thank you so much for your suggestion and code. I'll attach what I ended up with here in case you see something "bad" in it. I simply added what I wanted before it and allowed a variable in the regex search. I understand what graff was saying well enough to see that what I'm doing is a bad idea, but in time I'll learn better ways, and for now I'll make do with my limited knowledge. Thanks to you and graff both!

      #!/usr/local/bin/perl -w
      use strict;
      use warnings;

      print "Enter the output file you would like to analyze: ";
      chomp (my $filename = <STDIN>);
      print "Enter the isotope you want to extract (ex: Am-241): ";
      chomp (my $iso= <STDIN>);

      open(IN, "<", $filename) or die("Can't read the source:$!");
      open(OUT, ">Comp_$filename");
      select (OUT);

      $/ = '0AVERAGE COMPOSITION IN PINS.';

      while (<IN>) {
          next if ! /$iso/;
          chomp;
          print "$/$_";
      }
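
On the earlier question of putting a variable into a regex: a variable interpolates directly, so `/$iso/` works, while the double quotes in `/"$iso"/` are matched as literal characters. The sketch below compares three forms. The U-235 and U-238 records are made up for illustration, and the stricter pattern (with `\Q...\E` and the `EID:` prefix) is a suggestion, not something the thread settled on.

```perl
use strict;
use warnings;

# Made-up records; only the Pu-238 value comes from the thread.
my @records = (
    "0 EID: U-235 ND : 1.0000E-04\n",
    "0 EID: U-238 ND : 2.0000E-02\n",
    "0 EID: Pu-238 ND : 7.0913E-06\n",
);

my %counts;
for my $iso ('U-23', 'U-238') {
    # Quotes inside a pattern are literal, so this looks for "U-23" with quotes.
    my $quoted = grep { /"$iso"/ } @records;

    # Plain interpolation: matches the text anywhere in the record.
    my $loose = grep { /$iso/ } @records;

    # \Q...\E escapes regex metacharacters in the input; tying the match to
    # the EID field and requiring whitespace after it rejects partial names.
    my $strict = grep { /EID:\s+\Q$iso\E\s/ } @records;

    $counts{$iso} = [$quoted, $loose, $strict];
    print "$iso: quoted=$quoted loose=$loose strict=$strict\n";
}
```

With these sample records, the partial name `U-23` matches two records under the loose pattern and none under the stricter one, while the full name `U-238` matches exactly one record under both.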

        Mostly what Graff was suggesting was that you should get your input from the command line parameters rather than prompt the user for them. That makes it easier to use the script in an automated context where you may wish to do several runs with different parameters perhaps. In essence it means replacing:

        print "Enter the output file you would like to analyze: ";
        chomp (my $filename = <STDIN>);
        print "Enter the isotope you want to extract (ex: Am-241): ";
        chomp (my $iso= <STDIN>);

        with something like:

        # Validate parameters
        @ARGV == 2 or error ("Expected two parameters");
        $ARGV[0] =~ /^[A-Z][a-z]?-\d{1,3}$/ or error ("Expected an isotope first");
        -f $ARGV[1] or error ("<$ARGV[1]> is not a file");

        # Extract parameters
        my ($iso, $filename) = @ARGV;

        ...

        sub error {
            my $msg = shift;

            # The heredoc terminator (USAGE) must start in column 1 in the real script
            print <<"USAGE";
        Error: $msg.

        ExtractIso parses a given dibbly report file and extracts the record for a
        given isotope. Use as:

            ExtractIso <iso> <filename>

        <iso> is the isotope to be extracted. For example Am-241.
        <filename> is the filename (with path as required) of the record file.

        For example:

            ExtractIso Am-241 dibbly.dat

        would extract and print the record for Am-241 from the dibbly.dat file in
        the current directory.
        USAGE
            exit -1;
        }

        The error sub uses a HEREDOC to provide an error diagnostic and usage information.


        DWIM is Perl's answer to Gödel
Re: REGEX on multiple lines
by graff (Chancellor) on Jan 31, 2007 at 03:40 UTC
    In addition to what the other replies above have said (which is the way I would do it), I would strongly recommend using @ARGV to get your input parameters (input file name and pattern to look for). Something like this:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $Usage = "Usage: $0 search-pattern data.file\n";
    ( @ARGV == 2 and -f $ARGV[1] ) or die $Usage;

    my ( $pattern, $ifile ) = @ARGV;
    open( IN, "<", $ifile ) or die "$ifile: $!\n$Usage";
    $/ = "\n1";   # use FORTRAN page break as input record separator

    my $opattern = $pattern;
    $opattern =~ s/[^\w.-]/_/g;       # limit range of characters to be
    my $ofile = "$ifile.$opattern";   # used in output file name
    open( OUT, ">", $ofile ) or die "$ofile: $!\n";

    while (<IN>) {
        print OUT if ( /\n0 EID:\s+$pattern\s+/ );
    }
    close OUT;

    When you do it that way, it's easier to automate multiple runs (extract a given pattern from several files, or several patterns from one file, or several patterns from several files, etc...), as well as being easier, quicker and less error-prone to run the script manually.
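
One way the "several files" case might look is sketched below. It is an assumption-laden variation on the script above, not graff's code: `matching_pages` is a hypothetical helper, the usage line is invented, the isotope is treated as a literal name (via `\Q...\E`) rather than a pattern, and results go to standard output instead of a per-file output name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the pages read from $fh whose EID line
# names $isotope. Pages are split on the FORTRAN page break, as above.
sub matching_pages {
    my ($fh, $isotope) = @_;
    local $/ = "\n1";    # only changed for the duration of this call
    return grep { /\n0 EID:\s+\Q$isotope\E\s/ } <$fh>;
}

# Assumed usage: script.pl ISOTOPE file1 [file2 ...]
if (@ARGV >= 2) {
    my ($isotope, @files) = @ARGV;
    for my $file (@files) {
        open(my $in, '<', $file) or die "$file: $!\n";
        print "== $file ==\n", matching_pages($in, $isotope);
        close $in;
    }
}
```

Running several isotopes over several files is then a matter of calling the script in a shell loop, which is the kind of automation that prompting on STDIN makes awkward.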
