igotlongestname has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to write a regular expression to get data off of large output files. The data that I need is actually on the line right after what I'm searching. For example, an NP with a number below it, a U with a number below it and a Pu with a number below it.
The numbers change, but the elements stay the same above them. How can I use regex to find the nth record on the NEXT line following my search?
Re: regex help!
by halley (Prior) on Sep 15, 2005 at 14:21 UTC
Regular expressions find stuff you express. They don't find stuff that you don't express.
You should probably use some sort of scripting language that "wraps around" the regular expression engine, so that you can add some follow-up logic which is inconvenient for the regex engine to perform. Let's call that language Perl.
my $marker = qr/^ NP \s+ U \s+ Pu $/x;
my $columns = qr/^ (\d+) \s+ (\d+) \s+ (\d+) $/x;
while (<>)
{
    # If we find our line with the column names,
    if (m/$marker/)
    {
        # Read the following line to look for their numbers.
        $_ = <>;
        if (m/$columns/)
        {
            print "NP = $1, U = $2, Pu = $3\n";
        }
        else
        {
            print "Line after NP/U/Pu doesn't give numbers.\n";
        }
    }
}
You could just slurp the whole file and try to scan it for multiple-line patterns at once with a single regular expression, but you said "large output files" so I opted for the iterative solution so it wouldn't be limited by memory.
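For files that do fit in memory, the slurp-and-scan alternative might look roughly like the sketch below. The sub name extract_triples and the sample text are made up for illustration, and it assumes the header and the numbers sit on adjacent lines separated by spaces or tabs.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [NP, U, Pu] triple for every
# header line that is immediately followed by a line of three integers.
sub extract_triples {
    my ($text) = @_;
    my $h = qr/[^\S\n]/;    # horizontal whitespace only, never a newline
    my @triples;
    # /m makes ^ and $ match at every line boundary; /g walks the whole text.
    while ($text =~ /^ $h* NP $h+ U $h+ Pu $h* \n
                       $h* (\d+) $h+ (\d+) $h+ (\d+) $h* $/mgx) {
        push @triples, [$1, $2, $3];
    }
    return @triples;
}

# Made-up sample standing in for a slurped file.
my $text = "junk\nNP U Pu\n32 56 7\nmore junk\nNP U Pu\n1 2 3\n";
printf "NP = %s, U = %s, Pu = %s\n", @$_ for extract_triples($text);
```

Restricting the whitespace to [^\S\n] keeps a match from spilling across more than the two lines it is meant to cover, which plain \s+ would allow.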
-- [ e d @ h a l l e y . c c ]
<nitpick>
Just pointing out that the end of file may occur trying to do this:
$_ = <>;
I've done this before, and had the ubiquitous forehead-slapping-moment. Really need to check for this, because if the file ends at the wrong place, the <> will try to read from STDIN, which causes an annoying script/human deadlock.
</nitpick>
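One way to guard that second read, as a rough sketch: use an explicit filehandle (which simply returns undef at end of file instead of falling back to STDIN) and test the result with defined. The sub name lines_after_marker and the in-memory sample are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the line that follows each marker line,
# reading from an already-open filehandle.
sub lines_after_marker {
    my ($fh, $marker) = @_;
    my @found;
    while (defined(my $line = <$fh>)) {
        next unless $line =~ $marker;
        my $next = <$fh>;
        # undef here means the file ended right after a marker line.
        last unless defined $next;
        chomp $next;
        push @found, $next;
    }
    return @found;
}

# An in-memory file stands in for real data; its last marker has no line after it.
my $sample = "NP U Pu\n1 2 3\nNP U Pu\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
print "$_\n" for lines_after_marker($fh, qr/^NP\s+U\s+Pu\s*$/);
```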
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: regex help!
by prasadbabu (Prior) on Sep 15, 2005 at 14:24 UTC
I think this is your first post, but the question is not clear. A clearly stated question is much easier to answer correctly.
Also, before posting, you should try something yourself. If you post the code you tried, others will help you.
If I understood your question correctly,
undef $/;
$a = <DATA>;
$search = 56;
if ($a =~ /(\w+)\n$search/)
{
print "matched: $1";
}
__DATA__
NP
32
U
56
Re: regex help!
by ChrisR (Hermit) on Sep 15, 2005 at 14:31 UTC
Since I have seen no code or data, my response may be of no use at all. That being said, if your data looks anything like what is below and you are looking for the same nth field in each line, this may work for you.
use strict;
use warnings;
my $data = join "", <DATA>;
# Grab the second comma-separated field on the line after each NP, U, or Pu header.
my @values = $data =~ /^(?:NP|U|Pu)\n\d+,(\d+)/gm;
print join '-', @values;
exit;
__DATA__
NP
1,2,3,4
U
5,6,7,8
Pu
9,10,11,12
Re: regex help!
by GrandFather (Saint) on Sep 15, 2005 at 16:25 UTC
Let me write a hypothetical question for you that may or may not be what you were trying to ask:
Most wise monks, I am very new to Perl but have been given a large data file to read that was generated by an old Fortran program. The data are in pairs of lines with a header line and a data line like this:
000 NP U Pu
001 1.270000 000001 3.141000
002 Lev N Pu
003 0.13 000001 3.277118
004 NP U Pu
005 1.000220 000002 3.098761
006 Yac S Yb
007 10.33000 000001 90000000
I need to extract the data that follows the NP U Pu lines. I have worked out how to read the file. But I can't figure out how to find the data. My code so far looks like this:
open I,"data.dat";
for($I=0;$I<1000;++$I)
{
$l1=<I>;
chop $L1;
$L2=<I>;
chop $L2;
#find the data here
printf ("%d, %d, %d\n", $N1, $N2, $n3);
}
Can someone help me with the code I need to replace the comment please?
Perl is Huffman encoded by design.
You are all right. I am new, I tried crap but none of it seemed remotely close. What grandfather asked was my exact question. Thank you for the help.
Yeah I'm new at this and just need help, the books haven't helped me too much on this subject.
It is important to show us the "crap" because that shows that you have at least made an effort. It is also important to show some of the data because a description may not be very clear. As you will have noticed from the earlier replies to your original message, we are inclined to grab an idea and run with it - even if it is hopelessly wrong.
After all that lecturing, here is a solution for you (I suggest you examine this carefully, then reply explaining how you think it works):
use warnings;
use strict;
while (<DATA>)
{
    # Remember where each column header starts and ends (@- and @+).
    my $match = /(NP\s+)(U\s+)(Pu\s*)/i;
    last if ! ($_ = <DATA>);
    next if ! $match;
    chomp;
    # Cut the same character ranges out of the data line.
    my $NP = substr $_, $-[1], $+[1] - $-[1];
    my $N = substr $_, $-[2], $+[2] - $-[2];
    (my $Pu = substr $_, $-[3]) =~ s/(\s)//g;
    $NP =~ s/(\s)//g;
    $N =~ s/(\s)//g;
    print "NP $NP, N $N, Pu $Pu\n";
}
__DATA__
000 NP       U      Pu
001 1.270000 000001 3.141000
002 Lev      N      Pu
003 0.13     000001 3.277118
004 NP       U      Pu
005 1.000220 000002 3.098761
006 Yac      S      Yb
007 10.33000 000001 90000000
Note that the sample data is given as part of the script so that other monks can simply download the entire thing and run it to see that it works. The code relies on each header being aligned with the column of data below it. The sample given prints:
NP 1.270000, N 000001, Pu 3.141000
NP 1.000220, N 000002, Pu 3.098761
Perl is Huffman encoded by design.
so, with trivial variants on method above:
#!C:/Perl/bin
use strict; # no warnings because using uninit values below
use Data::Dumper::Simple;
use vars qw ( @nomatch $I1 $I2 $I3 $L1 $L2 @data $i $j );
while (<DATA>)
{
push @data,$_ ;
}
{
while (@data)
{
$L2 = pop @data;
chomp $L2;
#print "\$L2 is: $L2\n";
$L1 = pop @data;
chomp $L1;
#print "\$L1 is: $L1\n";
#find the data here
if ( $L1 =~ /
\d\d\d # three digits
\s+ # one or more whitespace
NP # exact string, NP
\s+ # one or more whitespace
U # exact string, U
\s+ # one or more whitespace
Pu # exact string, Pu
/x # end match, extended
&& $L2 =~ /
(\d\d\d) # three digits
\s+ # one or more whitespace
(\d\.\d{6}) # digit, period, six digits
\s+ # one or more whitespace
(\d{6}) # six digits
\s+ # one or more whitespace
(\d\.\d{6}) # digit, period, six digits
/x )
{
my $n1 = $1; $I1 = $2; $I2=$3; $I3=$4;
print "\n\tIn linepair ENDING with $n1, NP: $I1, U: $I2, Pu: $I3\n";
}
else
{
push @nomatch,"\n\tNo match on lines $L1\n\t\t\t and $L2\n";
}
}
print "\n\n\t No Match pairs follow\n";
warn Dumper (@nomatch);
}
__DATA__
000 NP U Pu
001 1.270000 000001 3.141000
002 Lev N Pu
003 0.13 000001 3.277118
004 NP U Pu
005 1.000220 000002 3.098761
006 Yac S Yb
007 10.33000 000001 90000000
008 NP U Pu
009 2.130000 000140 5.797712
Re: regex help!
by radiantmatrix (Parson) on Sep 15, 2005 at 21:40 UTC
Large data files mean slurping is probably bad. So, process a line, and if you got a match, process the next one differently. Here's one way, off the top of my head:
Assume a file where your columns are space-separated, and that looks like:
This is a nifty file, eh?
NP Some U and some other Pu
32 40 1 30 20 123.1 -120
And some other stuff
You'll want the 32, 1, and -120. Since you have essentially columns, you'll use a regex and a split. So (untested):
use strict;
use warnings;
use IO::File;
my $file = IO::File->new;
$file->open('< data.dat') or die("Can't read the source: $!");
until ($file->eof) {
    my $line = $file->getline();
    # the regex below will find lines that start with 'NP '
    # and have the other fields you want somewhere, surrounded
    # with spaces. YMMV.
    if ($line =~ /^NP \s .* \s U \s .* \s Pu \s/sx ) {
        # we want to get the values from the next line
        # first, we find the column indexes we want...
        my @col = split(qr/\s/s, $line); # split on whitespace
        my %index;
        for (0..@col-1) {
            $index{$1} = $_ if $col[$_] =~ /^(NP|U|Pu)$/;
        }
        # now we get the next line and split it into columns
        $line = $file->getline();
        last unless defined $line; # the header was the last line of the file
        chomp($line);
        @col = split(qr/\s/s, $line); # we can safely reuse @col
        # now print the appropriate values using the indexes we captured.
        foreach (keys %index) {
            printf "%3s = '%s'\n", $_, $col[$index{$_}];
        }
    } # end of if
} # end of until
I suggest that your file is probably not as ugly; if you post a sample of the file with a clearer description, I bet I (or someone) could come up with more elegant code.
<-radiant.matrix->
Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: regex help!
by svenXY (Deacon) on Sep 15, 2005 at 14:55 UTC
Hi,
as far as I understood this, the OP searched for this (although it is not a regex)
#!/usr/bin/perl
use strict;
use warnings;
my $key;
my %data;
while ($key = <DATA>) {
    chomp $key;
    my $val = <DATA>; chomp $val;
    push (@{$data{$key}}, $val);
}
print "First occurrence of U: " . $data{'U'}[0] . "\n";
print "Second occurrence of Pu: " . $data{'Pu'}[1] . "\n";
__DATA__
NP
111
U
222
Pu
333
NP
2-111
U
2-222
Pu
2-333
Regards,
svenXY
Re: regex help!
by QM (Parson) on Sep 15, 2005 at 15:19 UTC
I generally handle the toy cases like this (stealing halley's code above, with some generalization):
my $marker =
    qr/^\s* NP \s+ U \s+ Pu \s* $/x;
my $columns =
    qr/^\s* (\d+) \s+ (\d+) \s+ (\d+) \s* $/x;
my $found;
while (<>)
{
    # looking for markers
    if (not $found)
    {
        # match the current line against the marker pattern
        $found = 1 if /$marker/;
    }
    # found markers, get columns
    else
    {
        my ($NP, $U, $Pu);
        if ( ($NP, $U, $Pu) = /$columns/ )
        {
            do_something_with( $NP, $U, $Pu);
        }
        else
        {
            warn "Didn't see columns, ";
        }
        # reset to look for more markers
        $found = 0;
    }
}
I prefer this as there's only one while(<>), so it's harder to screw up the end of file issue.
If you need to do this only once, comment out
$found = 0;
-QM
--
Quantum Mechanics: The dreams stuff is made of
Re: regex help!
by ambrus (Abbot) on Sep 15, 2005 at 21:58 UTC
use warnings;
@input = (
'U Np Pu',
'238 237 244',
);
@name = $input[0] =~ /\S+/g;
@number{@name} = $input[1] =~ /\S+/g;
print "The weight of Pu is ", $number{"Pu"}, "\n";
__END__