PerlMonks  

Skipping data on file read

by igotlongestname (Acolyte)
on Jun 12, 2008 at 23:30 UTC

igotlongestname has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a script to pull a variable amount of data, depending on REGEX matches, off an input file. I'm about 90% there, but having trouble since I'm losing a small amount of the data due to how I go about navigating the file. Here is a sample of the input file:
m8    92234.30c 0.0071 92235.30c 0.9300 92238.30c 0.06289
      8016.30c 2.0 42000.30c 2.5
c
c     BeO(2.86) Axial Reflector TD=3.01 / 95%=2.86
m9    4009.30c 0.5 8016.30c 0.5
mt9   beo.01t
c
c     BeO(?AllgenCalc) Radial Reflectpr TD=3.01 / 95%=2.86
m10   4009.30c 0.5 8016.30c 0.5
mt10  beo.01t
c
c     He/Xe(.0218) (72/28) ~.55 mol/L at 300K,1.38MPa, 39.6 g/mol
m11   2004.30c 0.7
      54124.30c 0.00027 54126.30c 0.00027 54128.30c 0.00576
      54129.30c 0.07932 54130.30c 0.01224 54131.30c 0.06354
      54132.30c 0.08067 54134.30c 0.03132 54136.30c 0.02661
c
c     Sodium(0.929) RoomTemp = .97 g/cc, at melt = .929 g/cc
c     Liquid = .929 - .000244*(t-371) (t in K) Handbook Ch&Ph
m12   11023.30c 1.0 $ Na (.929 g/cc) frozen/void
c
c     Lithium(.515) RoomTemp = .534 g/cc, at melt = .515 g/cc
c     Liquid = .515 - .000101*(t-454) (t in K) Handbook Ch&Ph
The script should pull out every identifier of the form "####.30c" (such as 54132.30c or 11023.30c), plus entries like "beo.01t" on the mt## cards, and nothing else: it needs to ditch comment cards (lines beginning with "c") and anything following a "$". The script does all this currently, but it skips data. For instance, all the mt cards get skipped. I believe the reason is how I constructed the until loop, with the "$line = <$FILE>" line both preceding it and inside it. I just haven't been able to figure out how to get around that (I tried Tie::File to move around in the file, but quickly got lost). Any thoughts or suggestions? Here is the code I have now (which runs as is, but skips some data). Thanks!
#!/usr/local/bin/perl
use strict;
use warnings;

print "Enter the filename to analyze (we can hardwire this later): ";
chomp ( my $filename = <STDIN> );
open my $FILE, '<', $filename or die "Can't read the source: $!";
open my $OUT, '>', "Space_Nukes_Rule_$filename" or die "Can't open output file: $!";

my $count=0;
my ($i, $j, $k, $popindex, $array, $arraytemp);
my (@array, @subarray, @arraytemp, @data);

while ( my $line = <$FILE> ) {
    if ( $line =~ /^m\d+/ ) {
        @arraytemp = ( split qr/\$/s, $line );
        #print "@arraytemp";
        @array = ( split qr/\s+/s, $arraytemp[0] );
        #print "@array\n";
        $array=@array;
        for ( $i=1; $i<$array; $i=$i+2) {
            push @data, "$array[$i]\n";
        }
        $line = <$FILE>;
        until ( $line =~ /^c/ or $line =~ /^mt?\d+/ ) {
            @arraytemp = ( split qr/\$/s, $line );
            @array = ( split qr/\s+/s, $arraytemp[0] );
            $array=@array;
            for ( $i=1; $i<$array; $i=$i+2) {
                push @data, "$array[$i]\n";
            }
            $line = <$FILE>;
        }
    }
}
print "@data\n";
It should be noted that sometimes a card (m8 for example) will have multiple lines of data required, where the continued lines have no continuation character but are just typed. Other times, such as m9, the data is all on one line. I believe the problem occurs when two m## or mt## lines occur without any comment cards in between, then every other card is skipped.

Replies are listed 'Best First'.
Re: Skipping data on file read
by jethro (Monsignor) on Jun 13, 2008 at 00:24 UTC
    Probably someone will tell you the bug, but fixing it will still leave this code hard to read, understand and maintain. The multiple reads from $FILE are especially error-prone. And it really gets dark when your input file has an error in it.

    My suggestion would be to rewrite this code as a state machine. A state machine has one variable with possible values of 0,1,2,3,.... That is the state. Whenever you parse something, for example a line with 'beo.01t' in it, you change the state to reflect where you are in the file. So state 1 might mean "I've just parsed a line with ####.30c, I'm expecting more of them now or blank lines or comments".

    You might even draw a simple diagram now with arrows connecting two states, where one can lead to the other and note the condition on the arrow.

    The new code would look somewhat like this:

    my $state=0;   # state 0: you expect a card id or comments
    while ( my $line = <$FILE> ) {
        if ($state==0) {
            if ($line=~/.../) {
                do_something();
                $state=1;
            }
            elsif ($line=~/.../) {
                do_something_else();
                $state=4;
            }
            else {
                print "error in datafile";
                $state=0;
            }
        }
        elsif ($state==1) {
            if ($line=~/.../) {
                ...
    The code will be wordier, but it will be a lot more maintainable, and you will be more confident that your program can read whatever is coming at it.
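
For the sample data in the question, the skeleton could be filled in roughly as follows. This is only a sketch: the two state names, the regexes, the helper first_fields, and the trimmed in-memory sample are all assumptions made for illustration, not something from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample, trimmed from the question's input file.
my $sample = <<'END';
m8    92234.30c 0.0071 92235.30c 0.9300
      8016.30c 2.0
c     a comment card
m9    4009.30c 0.5 8016.30c 0.5
mt9   beo.01t
m12   11023.30c 1.0 $ Na (.929 g/cc)
END

# Two states (names are made up for this sketch).
use constant { OUTSIDE => 0, IN_CARD => 1 };

my $state = OUTSIDE;
my @data;

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";
while ( my $line = <$fh> ) {          # the only place a line is read
    chomp $line;
    $line =~ s/\$.*//;                # drop trailing "$ ..." comments

    if ( $line =~ /^c(?:\s|$)/ ) {    # a comment card ends any open card
        $state = OUTSIDE;
    }
    elsif ( $line =~ /^mt?\d+\s/ ) {  # an m## or mt## card starts here
        $state = IN_CARD;
        push @data, first_fields($line);
    }
    elsif ( $state == IN_CARD && $line =~ /^\s+\S/ ) {   # continuation line
        push @data, first_fields($line);
    }
    else {
        warn "unexpected line $.: $line\n";
        $state = OUTSIDE;
    }
}
close $fh;

# Return every other whitespace-separated field, skipping field 0
# (the card name, or the empty leading field of a continuation line).
sub first_fields {
    my @f = split /\s+/, shift;
    return @f[ grep { $_ % 2 } 1 .. $#f ];
}

print "$_\n" for @data;
```

Because each line is read exactly once and then classified, two cards in a row need no special handling: the second card simply takes the same branch as the first.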

      There are several state machine frameworks on CPAN, the most notable being POE.

      This is also how parsers and lexers are generally written, if you've had experience with those.


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Skipping data on file read
by throop (Chaplain) on Jun 13, 2008 at 05:08 UTC
    State machines are good, but I don't think you need one here, because:
    • All comment lines start with 'c'
    • You don't need to look at a previous line to see if you're in the 'm' state. Unless the line starts with 'c', you are.
    Try this, after you've opened the input file. (Nice job on using the 3-arg form of open, and using warnings and strict, btw.)
    while (<$FILE>){              # Don't need to use a $line variable; leave it in $_
        next if /^c\s/;           # Skip comments
        $_ = (split /\$/)[0];     # Throw away everything to the right of $ (if any)
        # I'm not sure from your description if this is the pattern you want; I'm guessing
        foreach my $datum (/ (\w+\.\w\wc) /gx ){
            push(@data, $datum);
        }
    }   # This could be made terser
    I'm assuming that all comment lines start with a 'c' and aren't continued.

    BTW, having both an array and a scalar named 'array' is confusing and unnecessary. In your code

    $array=@array;
    for ( $i=1; $i<$array; $i=$i+2)
    Better to just say
    for ( $i=1; $i < @array; $i=$i+2)
    The '<' puts @array in a scalar context, so $i is compared to the length of @array.
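
A quick way to check this behaviour is a throwaway script like the one below; the array contents are made up for the illustration.

```perl
use strict;
use warnings;

my @array = qw(m8 92234.30c 0.0071 92235.30c 0.9300);   # made-up fields

my $count = @array;     # scalar assignment: the number of elements
my $last  = $#array;    # index of the last element

my @picked;
for ( my $i = 1; $i < @array; $i += 2 ) {   # '<' imposes scalar context on @array
    push @picked, $array[$i];
}

print "count=$count last=$last picked=@picked\n";
```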

    Tell us how it goes
    throop

    Code not tested
      Throop, thank you so much for your suggestions and tips (I have a lot to learn in the Perl thing, but it's sure fun). I actually didn't look back here after the first two posts about state machines (didn't have the foggiest idea of what to do) so I went after the Tie::File routine and got it to do what I wanted it to do, and will attach the code here.

      I think in my original code, I am in the 'm' state as you coined it (I like that name) until a comment occurs, but I also screw up if there are two 'm' cards in a row (for instance an 'm1 94235 1.0' followed by 'm2 94235 1.0') without comments in between, since comments aren't obligatory. What happened with two 'm' cards in a row is this: the first match occurs (m1 in the preceding example) and its data is parsed correctly, but then a "$line = <$FILE>" call INSIDE the loop advances one line (now landing on the m2 line). The until condition sees the new card and stops, the next iteration of the outer while loop reads again, moving to the line after 'm2', and so the m2 data is skipped.
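
The effect can be reproduced in a few lines. This is a stripped-down sketch, not the original script: the three-card input is made up, and only the "read ahead inside the loop" step is kept.

```perl
use strict;
use warnings;

# Made-up input: three cards with no comment cards in between.
my $sample = "m1 94235 1.0\nm2 94235 1.0\nm3 94235 1.0\n";

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";

my @seen;
while ( my $line = <$fh> ) {
    next unless $line =~ /^m\d+/;
    push @seen, ( split ' ', $line )[0];   # record the card name

    # Read ahead, as the original script does before its until loop.
    # Whatever is read here is gone by the time the while() reads again.
    $line = <$fh>;
}
close $fh;

print "cards processed: @seen\n";
```

Only every other card ends up in @seen, which matches the symptom described in the question.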

      What I didn't include, or show is that the file is some thousands of lines long, and the only information I wanted was the data described, but that it also isn't always in the same format, or the same values (i.e. in one file m1 may be 92235.30c, but in another file it may be m1 94241.90c ... all depending upon user input), but it has prescribed formatting.

      Anyways thanks for the help and encouragement, can feel pretty dumb in these forums :-)

      If you get a chance, take a look at the code that works the way I want. I feel that I should be able to make it more concise using the "$." variable, but couldn't see how. Also, I don't see how the "%hash", mapping each element to 1, condenses the array down to only the unique keys ... is this a built-in feature of the map function in relation to hashes? Thanks!

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use Tie::File;

      print "Enter the filename to analyze (we can hardwire this later): ";
      chomp ( my $filename = <STDIN> );
      open my $FILE, '<', $filename or die "Can't read the source: $!";
      open my $CHECK, '>', "Space_Nukes_Rule_$filename" or die "Can't open output file: $!";
      open my $OUT, '>', "Out_Space_Nukes_Rule_$filename" or die "Can't open output file: $!";

      my ($i, $j, $popindex, $array, $arraytemp);
      my (@array, @subarray, @arraytemp, @data, @line, @INFILE);

      tie @INFILE, 'Tie::File', $FILE or die "dieeeee";

      for ( $i=0; $i<@INFILE; $i++ ) {
          if ( $INFILE[$i] =~ /^mt?\d+/ ) {
              @arraytemp = ( split qr/\$/s, $INFILE[$i] );
              @array = ( split qr/\s+/s, $arraytemp[0] );
              $array=@array;
              for ( $j=1; $j<$array; $j=$j+2) {
                  push @data, "$array[$j]\n";
              }
              $i++;
              until ( $INFILE[$i] =~ /^c/ or $INFILE[$i] =~ /^mt?\d+/ ) {
                  @arraytemp = ( split qr/\$/s, $INFILE[$i] );
                  @array = ( split qr/\s+/s, $arraytemp[0] );
                  $array=@array;
                  for ( $j=1; $j<$array; $j=$j+2) {
                      push @data, "$array[$j]\n";
                  }
                  $i++;
              }
              $i--;
          }
      }
      print $CHECK "@data\n";

      my %hash = map { $_, 1 } @data;
      my @unique_data = keys %hash;
      print $OUT "@unique_data";
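
On the %hash question: nothing special happens inside map. It just returns a flat list of pairs (value, 1, value, 1, ...), and assigning that list to a hash keeps one entry per distinct key, because hash keys are unique. A small sketch with made-up values, plus an order-preserving variant that is an addition here and not part of the script above:

```perl
use strict;
use warnings;

my @data = qw(8016.30c 4009.30c 8016.30c beo.01t 4009.30c);   # made-up values

# map only builds the list (key, 1, key, 1, ...);
# the hash assignment is what collapses repeated keys.
my %hash        = map { $_ => 1 } @data;
my @unique_data = sort keys %hash;    # keys come back in no fixed order, so sort

# Variant that keeps first-seen order, using a "seen" hash.
my %seen;
my @in_order = grep { !$seen{$_}++ } @data;

print "@unique_data\n";
print "@in_order\n";
```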
Re: Skipping data on file read
by pc88mxer (Vicar) on Jun 13, 2008 at 15:39 UTC
    Just want to point out that this will cause problems (i.e. an infinite loop) if you hit EOF (end of file) here:
    $line = <$FILE>;
    until ( $line =~ /^c/ or $line =~ /^mt?\d+/ ) {
        ...
        $line = <$FILE>;
    }
    I like to write my parsing loops so that I only read lines in one place - like in the outer while statement. Then I only have to check for EOF in one place.

    Also, here's a simpler way to look at the problem. You have three kinds of lines:

    • Comments (m/^c/)
    • Initial m-block lines (m/^m/)
    • Continuation lines (m/^\s/)
    You can write a simple "event processing" loop as follows:
    while (<$FILE>) {
        if (m/^m/) {
            ...
        } elsif (m/^\s/) {
            ...
        } elsif (m/^c/) {
            ...
        } else {
            die "whoops - didn't expect: $_"
        }
    }
    Now you just have to work out what to do in each of the three cases. My suggestion is to create a variable to represent the current m-block being recognized. The actions for each type of line would go something like this:
    • Continuation line: add the line to the current block.
    • Initial m-block line: process the current block, reset it and then add the line to the current block
    • Comments: process the current block and reset it
    Also, at the end of the loop you'll have to see if there's a block that needs to be processed.
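
One possible way to fill in those three actions is sketched below. The helper name process_block, the in-memory sample, and the choice to keep every other field after the card name are assumptions for illustration.

```perl
use strict;
use warnings;

# Made-up sample in the same layout as the question's input file.
my $sample = <<'END';
m8    92234.30c 0.0071 92235.30c 0.9300
      8016.30c 2.0 42000.30c 2.5
c
c     BeO Axial Reflector
m9    4009.30c 0.5 8016.30c 0.5
mt9   beo.01t
END

my @data;     # collected identifiers
my @block;    # lines of the m-block currently being recognized

# Hypothetical helper: extract identifiers from the finished block, then reset it.
sub process_block {
    return unless @block;
    my @fields = split ' ', join ' ', @block;
    shift @fields;                                  # drop the card name (m8, mt9, ...)
    push @data, @fields[ grep { $_ % 2 == 0 } 0 .. $#fields ];
    @block = ();
}

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";
while (<$fh>) {                                     # lines are read in one place only
    s/\$.*//;                                       # strip "$ ..." comments
    if    (m/^m/)  { process_block(); push @block, $_ }
    elsif (m/^\s/) { push @block, $_ if @block }    # continuation of an open block
    elsif (m/^c/)  { process_block() }
    else           { die "whoops - didn't expect: $_" }
}
process_block();                                    # flush the last block at EOF
close $fh;

print "$_\n" for @data;
```

The final process_block() call after the loop covers the case where the file ends in the middle of a block, as with the mt9 card here.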
Re: Skipping data on file read
by injunjoel (Priest) on Jun 13, 2008 at 21:56 UTC
    Greetings. Though this does not solve your problem, think of it as a different approach to it. I would suggest setting the input record separator to a different value so that your file is read in differently. Assuming your example file is representative of your input...
    #!/usr/bin/perl -w
    use strict;

    # Of course this assumes the example file is in the exact
    # format expected. The following line is basically the
    # line containing a lone 'c' at the beginning and a whole
    # bunch of spaces (your comment cards).
    # Remember that $/ can't handle regexp... though it sure
    # would be cooler if it did.
    # (Assumption: the trailing padding spaces are not shown here,
    # so the separator is written as a lone 'c' line, "\nc\n".)
    local $/ = "\nc\n";

    while (<DATA>){
        print "\n=====chunk start=======\n";
        print $_;
        print "\n=====chunk stop=======\n";
    }

    # From the example data you gave.
    __DATA__
    m8    92234.30c 0.0071 92235.30c 0.9300 92238.30c 0.06289
          8016.30c 2.0 42000.30c 2.5
    c
    c     BeO(2.86) Axial Reflector TD=3.01 / 95%=2.86
    m9    4009.30c 0.5 8016.30c 0.5
    mt9   beo.01t
    c
    c     BeO(?AllgenCalc) Radial Reflectpr TD=3.01 / 95%=2.86
    m10   4009.30c 0.5 8016.30c 0.5
    mt10  beo.01t
    c
    c     He/Xe(.0218) (72/28) ~.55 mol/L at 300K,1.38MPa, 39.6 g/mol
    m11   2004.30c 0.7
          54124.30c 0.00027 54126.30c 0.00027 54128.30c 0.00576
          54129.30c 0.07932 54130.30c 0.01224 54131.30c 0.06354
          54132.30c 0.08067 54134.30c 0.03132 54136.30c 0.02661
    c
    c     Sodium(0.929) RoomTemp = .97 g/cc, at melt = .929 g/cc
    c     Liquid = .929 - .000244*(t-371) (t in K) Handbook Ch&Ph
    m12   11023.30c 1.0 $ Na (.929 g/cc) frozen/void
    c
    c     Lithium(.515) RoomTemp = .534 g/cc, at melt = .515 g/cc
    c     Liquid = .515 - .000101*(t-454) (t in K) Handbook Ch&Ph
    So essentially you can read the file in chunks and deal with each chunk with your regexp. Oh, and here is the output from above.

    =====chunk start=======
    m8    92234.30c 0.0071 92235.30c 0.9300 92238.30c 0.06289
          8016.30c 2.0 42000.30c 2.5
    c

    =====chunk stop=======

    =====chunk start=======
    c     BeO(2.86) Axial Reflector TD=3.01 / 95%=2.86
    m9    4009.30c 0.5 8016.30c 0.5
    mt9   beo.01t
    c

    =====chunk stop=======

    =====chunk start=======
    c     BeO(?AllgenCalc) Radial Reflectpr TD=3.01 / 95%=2.86
    m10   4009.30c 0.5 8016.30c 0.5
    mt10  beo.01t
    c

    =====chunk stop=======

    =====chunk start=======
    c     He/Xe(.0218) (72/28) ~.55 mol/L at 300K,1.38MPa, 39.6 g/mol
    m11   2004.30c 0.7
          54124.30c 0.00027 54126.30c 0.00027 54128.30c 0.00576
          54129.30c 0.07932 54130.30c 0.01224 54131.30c 0.06354
          54132.30c 0.08067 54134.30c 0.03132 54136.30c 0.02661
    c

    =====chunk stop=======

    =====chunk start=======
    c     Sodium(0.929) RoomTemp = .97 g/cc, at melt = .929 g/cc
    c     Liquid = .929 - .000244*(t-371) (t in K) Handbook Ch&Ph
    m12   11023.30c 1.0 $ Na (.929 g/cc) frozen/void
    c

    =====chunk stop=======

    =====chunk start=======
    c     Lithium(.515) RoomTemp = .534 g/cc, at melt = .515 g/cc
    c     Liquid = .515 - .000101*(t-454) (t in K) Handbook Ch&Ph

    =====chunk stop=======
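
Dealing with each chunk could then look something like the sketch below. It assumes the padding after the lone 'c' has been stripped, so "\nc\n" is used as the separator, and the identifier pattern is a guess at the format rather than anything stated in the thread.

```perl
use strict;
use warnings;

# Made-up sample; lone "c" lines separate the groups of cards.
my $sample = <<'END';
m8    92234.30c 0.0071 92235.30c 0.9300
      8016.30c 2.0
c
c     BeO(2.86) Axial Reflector
m9    4009.30c 0.5 8016.30c 0.5
mt9   beo.01t
c
c     Sodium(0.929) at melt = .929 g/cc
m12   11023.30c 1.0 $ Na (.929 g/cc)
END

open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";

my @chunks;    # one array ref of identifiers per chunk
{
    local $/ = "\nc\n";               # assumed separator: a lone 'c' line
    while ( my $chunk = <$fh> ) {
        chomp $chunk;                 # chomp removes the current value of $/
        my @ids;
        for my $line ( split /\n/, $chunk ) {
            next if $line =~ /^c/;    # comment lines inside the chunk
            $line =~ s/\$.*//;        # "$ ..." comments
            # Assumed pattern: word characters, a dot, two digits, then 'c' or 't'.
            push @ids, $line =~ /\b\w+\.\d\d[ct]\b/g;
        }
        push @chunks, \@ids;
    }
}
close $fh;

print join( ' ', @$_ ), "\n" for @chunks;
```

Keeping the change to $/ inside a bare block limits it to the chunked read, so any later line-by-line reads in the same script are unaffected.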
    Does that make sense?

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: Skipping data on file read
by hangon (Deacon) on Jun 14, 2008 at 01:00 UTC

    If I've correctly interpreted what you're looking for, and there are no odd cases not shown in your sample data, this can be greatly simplified. Just grab each line, throw out the junk, then pull off your data with regexes. For example:

    my @data;
    while (my $line = <$FILE>){
        next if $line =~ /^c/;
        ($line, undef) = split /\$/, $line, 2;
        push @data, $line =~ /\d+\.30c/g;
        push @data, $line =~ /\w+\.01t/g;
    }
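
Since the thread mentions that other files may carry different library suffixes (for example 94241.90c), the two patterns could be merged into one looser pattern. The sketch below assumes the suffix is always two digits followed by 'c' or 't'; that rule and the sample lines are guesses, not something the original poster confirmed.

```perl
use strict;
use warnings;

# Made-up lines; the ".90c" entry follows the variation described in the thread.
my @lines = (
    "m1    94241.90c 1.0 92235.30c 0.5\n",
    "mt9   beo.01t\n",
    "c     Sodium(0.929) RoomTemp = .97 g/cc\n",
    "m12   11023.30c 1.0 \$ Na (.929 g/cc) frozen/void\n",
);

my @data;
for my $line (@lines) {
    next if $line =~ /^c/;                     # skip comment cards
    ( $line, undef ) = split /\$/, $line, 2;   # drop "$ ..." comments
    # Assumed pattern: word characters, a dot, two digits, then 'c' or 't'.
    push @data, $line =~ /\b\w+\.\d\d[ct]\b/g;
}

print "$_\n" for @data;
```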
