PerlMonks  

regex help!

by igotlongestname (Acolyte)
on Sep 15, 2005 at 14:06 UTC ( [id://492223]=perlquestion )

igotlongestname has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a regular expression to extract data from large output files. The data I need is actually on the line right after the one I'm searching for. For example, there is an NP with a number below it, a U with a number below it, and a Pu with a number below it. The numbers change, but the element names above them stay the same. How can I use a regex to find the nth record on the NEXT line following my match?

Replies are listed 'Best First'.
Re: regex help!
by halley (Prior) on Sep 15, 2005 at 14:21 UTC
    Regular expressions find stuff you express. They don't find stuff that you don't express.

    You should probably use some sort of scripting language that "wraps around" the regular expression engine, so that you can add some follow-up logic which is inconvenient for the regex engine to perform. Let's call that language Perl.

    my $marker  = qr/^ NP \s+ U \s+ Pu $/x;
    my $columns = qr/^ (\d+) \s+ (\d+) \s+ (\d+) $/x;

    while (<>) {
        # If we find our line with the column names,
        if (m/$marker/) {
            # Read the following line to look for their numbers.
            $_ = <>;
            if (m/$columns/) {
                print "NP = $1, U = $2, Pu = $3\n";
            }
            else {
                print "Line after NP/U/Pu doesn't give numbers.\n";
            }
        }
    }

    You could just slurp the whole file and try to scan it for multiple-line patterns at once with a single regular expression, but you said "large output files" so I opted for the iterative solution so it wouldn't be limited by memory.

    --
    [ e d @ h a l l e y . c c ]

      <nitpick>
      Just pointing out that the end of file may occur trying to do this:
      $_ = <>;
      I've done this before, and had the ubiquitous forehead-slapping-moment. Really need to check for this, because if the file ends at the wrong place, the <> will try to read from STDIN, which causes an annoying script/human deadlock.
      </nitpick>

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of
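A minimal sketch of the end-of-file check described in the nitpick above, assuming the same header/number layout as halley's example. The sub name `read_pairs` and the in-memory filehandle are illustrative only, and an explicit filehandle is used instead of `<>` so nothing can fall through to STDIN:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [NP, U, Pu] triples read from $fh.
sub read_pairs {
    my ($fh) = @_;
    my $marker  = qr/^ \s* NP \s+ U \s+ Pu \s* $/x;
    my $columns = qr/^ \s* (\d+) \s+ (\d+) \s+ (\d+) \s* $/x;
    my @rows;
    while (my $line = <$fh>) {
        next unless $line =~ $marker;
        # Guard the second read: stop cleanly if the header was the last line.
        my $next = <$fh>;
        last unless defined $next;
        push @rows, [$1, $2, $3] if $next =~ $columns;
    }
    return @rows;
}

# Demo on an in-memory file whose last line is a dangling header.
open my $fh, '<', \"NP U Pu\n1 2 3\nNP U Pu\n" or die "open: $!";
my @rows = read_pairs($fh);
print scalar(@rows), " row(s): @{ $rows[0] }\n";
```

The dangling header at the end of the demo input is simply dropped rather than triggering another read.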

Re: regex help!
by prasadbabu (Prior) on Sep 15, 2005 at 14:24 UTC

    I think this is your first post, and the question is not clear. A clear question is much easier to answer correctly.

    Also, before posting, you should try something yourself. If you post the code you tried, others will be able to help you.

    If I understood your question correctly,

    undef $/;
    $a = <DATA>;
    $search = 56;
    if ($a =~ /(\w+)\n$search/) {
        print "matched: $1";
    }
    __DATA__
    NP
    32
    U
    56

    Prasad

Re: regex help!
by ChrisR (Hermit) on Sep 15, 2005 at 14:31 UTC
    Since I have seen no code or data, my response may be of no use at all. That being said, if your data looks anything like what is below and you are looking for the same nth field in each line, this may work for you.
    use strict;
    use warnings;

    my $data = join "", <DATA>;
    # Match a header line (NP, U or Pu), then capture the second
    # comma-separated field on the line that follows it.
    my @values = $data =~ /(?:NP|U|Pu)\n\d+,(\d+)/g;
    print join '-', @values;
    exit;
    __DATA__
    NP
    1,2,3,4
    U
    5,6,7,8
    Pu
    9,10,11,12
Re: regex help!
by GrandFather (Saint) on Sep 15, 2005 at 16:25 UTC

    Let me write a hypothetical question for you that may or may not be what you were trying to ask:

    Most wise monks, I am very new to Perl but have been given a large data file to read that was generated by an old Fortran program. The data are in pairs of lines with a header line and a data line like this:

    000 NP       U      Pu
    001 1.270000 000001 3.141000
    002 Lev      N      Pu
    003 0.13     000001 3.277118
    004 NP       U      Pu
    005 1.000220 000002 3.098761
    006 Yac      S      Yb
    007 10.33000 000001 90000000

    I need to extract the NP U Pu lines of data. I have worked out how to read the file, but I can't figure out how to find the data. My code so far looks like this:

    open I,"data.dat";
    for($I=0;$I<1000;++$I) {
        $l1=<I>; chop $L1;
        $L2=<I>; chop $L2;
        #find the data here
        printf ("%d, %d, %d\n", $N1, $N2, $n3);
    }

    Can someone help me with the code I need to replace the comment please?


    Perl is Huffman encoded by design.
      You are all right. I am new; I tried crap, but none of it seemed remotely close. What GrandFather asked was my exact question. Thank you for the help. Yeah, I'm new at this and just need help; the books haven't helped me much on this subject.

        It is important to show us the "crap" because that shows that you have at least made an effort. It is also important to show some of the data because a description may not be very clear. As you will have noticed from the earlier replies to your original message, we are inclined to grab an idea and run with it - even if it is hopelessly wrong.

        After all that lecturing, here is a solution for you (I suggest you examine this carefully, then reply explaining how you think it works):

        use warnings;
        use strict;

        while (<DATA>) {
            # Does this line hold the NP / U / Pu column headers?
            my $match = /(NP\s+)(U\s+)(Pu\s*)/i;
            last if ! ($_ = <DATA>);    # always consume the paired data line
            next if ! $match;
            chomp;
            # @- and @+ still hold the start and end offsets of each header
            # group, so use them to cut the same columns out of the data line.
            my $NP = substr $_, $-[1], $+[1] - $-[1];
            my $U  = substr $_, $-[2], $+[2] - $-[2];
            (my $Pu = substr $_, $-[3]) =~ s/\s//g;
            $NP =~ s/\s//g;
            $U  =~ s/\s//g;
            print "NP $NP, U $U, Pu $Pu\n";
        }
        __DATA__
        000 NP       U      Pu
        001 1.270000 000001 3.141000
        002 Lev      N      Pu
        003 0.13     000001 3.277118
        004 NP       U      Pu
        005 1.000220 000002 3.098761
        006 Yac      S      Yb
        007 10.33000 000001 90000000

        Note that the sample data is given as part of the script so that other monks can simply download the entire thing and run it to see that it works. The sample given prints:

        NP 1.270000, U 000001, Pu 3.141000
        NP 1.000220, U 000002, Pu 3.098761
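For readers unfamiliar with `@-` and `@+`: after a successful match, `$-[n]` holds the offset at which capture group n starts and `$+[n]` the offset just past its end, so the width of the group is `$+[n] - $-[n]`. The sketch below uses an invented header and data row (not the poster's data) to show those offsets being reused to cut the same columns out of a different line:

```perl
use strict;
use warnings;

# Invented fixed-width sample: the columns line up between the two strings.
my $header = "ID  NAME    QTY";
my $row    = "07  widget  12";

my @fields;
if ($header =~ /(NAME\s+)(QTY\s*)/) {
    # Copy the offsets at once; any later successful match overwrites @- and @+.
    my @start = @-;
    my @end   = @+;
    for my $n (1 .. 2) {
        my $field = substr $row, $start[$n], $end[$n] - $start[$n];
        $field =~ s/\s+//g;    # strip column padding
        push @fields, $field;
    }
}
print join(', ', @fields), "\n";
```

Copying the offsets into ordinary arrays first is a defensive choice: the `s///` inside the loop is itself a match and would otherwise clobber them.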

        Perl is Huffman encoded by design.
        so, with trivial variants on method above:
        #!C:/Perl/bin
        use strict;    # no warnings because using uninit values below
        use Data::Dumper::Simple;
        use vars qw ( @nomatch $I1 $I2 $I3 $L1 $L2 @data $i $j );

        while (<DATA>) {
            push @data,$_ ;
        }

        {
            while (@data) {
                $L2 = pop @data;
                chomp $L2;
                #print "\$L2 is: $L2\n";
                $L1 = pop @data;
                chomp $L1;
                #print "\$L1 is: $L1\n";

                #find the data here
                if ( $L1 =~ /
                        \d\d\d          # three digits
                        \s+             # one or more whitespace
                        NP              # exact string, NP
                        \s+             # one or more whitespace
                        U               # exact string, U
                        \s+             # one or more whitespace
                        Pu              # exact string, Pu
                     /x                 # end match, extended
                     &&
                     $L2 =~ /
                        (\d\d\d)        # three digits
                        \s+             # one or more whitespace
                        (\d\.\d{6})     # digit, period, six digits
                        \s+             # one or more whitespace
                        (\d{6})         # six digits
                        \s+             # one or more whitespace
                        (\d\.\d{6})     # digit, period, six digits
                     /x )
                {
                    my $n1 = $1; $I1 = $2; $I2=$3; $I3=$4;
                    print "\n\tIn linepair ENDING with $n1, NP: $I1, U: $I2, Pu: $I3\n";
                }
                else {
                    push @nomatch,"\n\tNo match on lines $L1\n\t\t\t and $L2\n";
                }
            }
            print "\n\n\t No Match pairs follow\n";
            warn Dumper (@nomatch);
        }
        __DATA__
        000 NP       U      Pu
        001 1.270000 000001 3.141000
        002 Lev      N      Pu
        003 0.13     000001 3.277118
        004 NP       U      Pu
        005 1.000220 000002 3.098761
        006 Yac      S      Yb
        007 10.33000 000001 90000000
        008 NP       U      Pu
        009 2.130000 000140 5.797712
Re: regex help!
by radiantmatrix (Parson) on Sep 15, 2005 at 21:40 UTC

    Large data files mean slurping is probably bad. So, process a line, and if you got a match, process the next one differently. Here's one way, off the top of my head:

    Assume a file where your columns are space-separated, and that looks like:

    This is a nifty file, eh?
    NP Some U and some other Pu
    32 40 1 30 20 123.1 -120
    And some other stuff

    You'll want the 32, 1, and -120. Since you have essentially columns, you'll use a regex and a split. So (untested):

    use IO::File;

    my $file = IO::File->new;
    $file->open('< data.dat') or die("Can't read the source:$!");

    until ($file->eof) {
        my $line = $file->getline();
        # the regex below will find lines that start with 'NP '
        # and have the other fields you want somewhere, surrounded
        # with spaces. YMMV.
        if ($line =~ /^NP \s .* \s U \s .* \s Pu \s/sx ) {
            # we want to get the values from the next line
            # first, we find the column indexes we want...
            my @col = split(qr/\s/s, $line);  #split on whitespace
            my %index;
            for (0..@col-1) {
                $index{$1} = $_ if $col[$_] =~ /^(NP|U|Pu)$/;
            }

            # now we get the next line and split it into columns
            $line = $file->getline();
            last unless defined $line;  # header was the last line of the file
            chomp($line);
            @col = split(qr/\s/s, $line);  # we can safely reuse @col

            # now print the appropriate values using the indexes we captured.
            foreach (keys %index) {
                printf "%3s = '%s'\n", $_, $col[$index{$_}];
            }
        } # end of if
    } # end of until

    I suggest that your file is probably not as ugly; if you post a sample of the file with a clearer description, I bet I (or someone) could come up with more elegant code.

    <-radiant.matrix->
    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: regex help!
by svenXY (Deacon) on Sep 15, 2005 at 14:55 UTC
    Hi,
    as far as I understood this, the OP searched for this (although it is not a regex)
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $key;
    my %data;
    while ($key = <DATA>) {
        chomp $key;
        my $val = <DATA>;
        chomp $val;
        push (@{$data{$key}}, $val);
    }
    print "First occurrence of U: " . $data{'U'}[0] . "\n";
    print "Second occurrence of Pu: " . $data{'Pu'}[1] . "\n";
    __DATA__
    NP
    111
    U
    222
    Pu
    333
    NP
    2-111
    U
    2-222
    Pu
    2-333

    Regards,
    svenXY
Re: regex help!
by QM (Parson) on Sep 15, 2005 at 15:19 UTC
    I generally handle the toy cases like this (stealing halley's code above, with some generalization):
    my $marker  = qr/^\s* NP \s+ U \s+ Pu \s* $/x;
    my $columns = qr/^\s* (\d+) \s+ (\d+) \s+ (\d+) \s* $/x;
    my $found;

    while (<>) {
        # looking for markers
        if (not $found) {
            # match the current line against the marker pattern
            $found = 1 if /$marker/;
        }
        # found markers, get columns
        else {
            my ($NP, $U, $Pu);
            if ( ($NP, $U, $Pu) = /$columns/ ) {
                do_something_with( $NP, $U, $Pu);
            }
            else {
                warn "Didn't see columns, ";
            }
            # reset to look for more markers
            $found = 0;
        }
    }
    I prefer this as there's only one while(<>), so it's harder to screw up the end of file issue.

    If you need to do this only once, comment out

    $found = 0;

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: regex help!
by ambrus (Abbot) on Sep 15, 2005 at 21:58 UTC

    I usually parse such text files (text files with column headers) like this:

    use warnings;
    @input = (
        'U Np Pu',
        '238 237 244',
    );
    @name = $input[0] =~ /\S+/g;
    @number{@name} = $input[1] =~ /\S+/g;
    print "The weight of Pu is ", $number{"Pu"}, "\n";
    __END__
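The same hash-slice idea can be stretched over a whole file by pairing each header line with the line after it. This is only a sketch under stated assumptions: headers and values are whitespace-separated, a header line is taken to be any line containing the token `Pu`, and the sub name `parse_records` is made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: turn header/value line pairs into a list of hashes.
sub parse_records {
    my @lines = @_;
    my @records;
    while (defined(my $head = shift @lines)) {
        next unless $head =~ /\bPu\b/;      # assumed header test
        my $vals = shift @lines;
        last unless defined $vals;          # header with no data line
        my %rec;
        # Hash slice: column names become keys, the next line's fields the values.
        @rec{ $head =~ /\S+/g } = $vals =~ /\S+/g;
        push @records, \%rec;
    }
    return @records;
}

my @records = parse_records(
    'U    Np   Pu',
    '238  237  244',
    'some other line',
    'NP   U    Pu',
    '1.27 1    3.141',
);
print "Pu in record 2: $records[1]{Pu}\n";
```

Because the keys come from the header itself, the column order can differ between records without any change to the code.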
