PerlMonks  

Search file for certain lines

by Jalcock501 (Sexton)
on Sep 23, 2013 at 09:28 UTC ( [id://1055252]=perlquestion )

Jalcock501 has asked for the wisdom of the Perl Monks concerning the following question:

Hey guys, I am in need of the monks once more. I am trying to write a script that searches a file and reads from a line beginning with h (lowercase) up to the next line beginning with h (again lowercase). There are several instances of this (at least 200). I need to read each individual case, check for any lines between the h's that start with j (lowercase), E (uppercase), or G (again uppercase), and then output those lines. I want to do this myself, but if someone could point me in the right direction on where and how to start, I would greatly appreciate any advice or assistance. Thanks

Replies are listed 'Best First'.
Re: Search file for certain lines
by jethro (Monsignor) on Sep 23, 2013 at 09:44 UTC

    Use a state machine. A state machine is a loop and an integer variable that holds the state. In the loop you read one line and do things depending on the state, one of which might be changing the state.

    In your relatively simple case it seems you would have just two states: not between 'h's (let's call it state 0) and between 'h's (state 1). If you encounter a line beginning with 'h', simply flip the state variable.
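    The two-state idea above could be sketched roughly as follows. This is only an illustration: the helper name lines_between_h and the sample input are made up, and the h lines themselves are assumed not to be wanted in the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the lines read while the state is 1.
# State 0 = not between 'h' lines, state 1 = between 'h' lines.
sub lines_between_h {
    my ($fh) = @_;
    my $state = 0;
    my @kept;
    while ( my $line = <$fh> ) {
        if ( $line =~ /^h/ ) {
            $state = $state ? 0 : 1;    # flip the state on every 'h' line
            next;                       # the marker line itself is not kept
        }
        push @kept, $line if $state == 1;
    }
    return @kept;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = "a1\nh1\nj1\nx1\nh2\nj2\nh3\nE3\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print lines_between_h($in);
close $in;
```

    With this sample, j2 is dropped because it sits after a closing h line and before the next opening one.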

Re: Search file for certain lines
by Eily (Monsignor) on Sep 23, 2013 at 09:49 UTC

    It would be easier to give you advice if you told us what you already know and thought of. Right now I'll just try and guess what I should tell you and what you already know.

    Your conditions on the lines sound like a job for regular expressions. Perl also has a kind of "from .. until" operator, the flip-flop operator, which would let you do something like this (look up the next keyword, by the way):

    while (<INPUT>) {
        # Skip the line unless we are between two lines starting with h.
        # Three dots, so the closing /^h/ is not tested on the line that opened the range.
        next unless /^h/ ... /^h/;
        somecode();
    }

    Which is nice and short, but not well known or well understood, so if it looks too hard to understand, or if a lot of people will probably be reading your code, you could use classic control structures: if, else, unless.
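    One detail worth checking with the flip-flop: when both ends use the same pattern, the two-dot form tests the end condition on the very line that opened the range, so an h-to-h range needs the three-dot form. A minimal sketch, assuming a hypothetical helper flip_flop_filter and made-up input:

```perl
use strict;
use warnings;

# Hypothetical helper: keep j/E/G lines that fall inside an h ... h range.
# The flip-flop keeps its state between calls, so the input is assumed to
# close every range it opens (an even number of h lines).
sub flip_flop_filter {
    my ($fh) = @_;
    my @kept;
    local $_;
    while (<$fh>) {
        # Three dots: the end test is first tried on the line after the
        # one that opened the range.
        next unless /^h/ ... /^h/;
        push @kept, $_ if /^[jEG]/;
    }
    return @kept;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = "a\nh1\njA\nxA\nEA\nh2\njB\nh3\nGC\nh4\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print flip_flop_filter($in);
close $in;
```

    Here jB is skipped because it lies between the h line that closed one range and the h line that opens the next.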

Re: Search file for certain lines
by mtmcc (Hermit) on Sep 23, 2013 at 09:48 UTC
    I'm not entirely sure what you're trying to do. Surely, all lines in your file will be between two lines beginning with h, aside from those before the first line beginning with h, and after the last line beginning with h?

    It might make it easier to understand the problem if you give some sample data, and the result you expect to get. Have a look through this: How do I post a question effectively?

    Finally, based on your question, my best guess would be something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = $ARGV[0];
    my $check = 0;
    my @line;

    while (<DATA>) {
        @line = split('', $_);
        $check += 1 if $line[0] eq 'h';
        if ($check%2 == 1) {
            if (($line[0] eq 'j') || ($line[0] eq 'E') || ($line[0] eq 'G')) {
                print STDOUT "$_";
            }
        }
    }
    __DATA__
    aaaaa 1
    bbbbb 2
    hhhhh 3
    fffff 4
    rrrrr 5
    lllll 6
    jjjjj 7
    HHHHH 8
    EEEEE 9
    GGGGG 10
    hhhhh 11
    jjjjj 12
    EEEEE 13
    GGGGG 14

    I hope that helps.
Re: Search file for certain lines
by hdb (Monsignor) on Sep 23, 2013 at 11:46 UTC
    use strict;
    use warnings;

    my $state = 0;
    while( <DATA> ) {
        $state = 1-$state if /^h/;
        print if $state && /^[jEG]/;
    }
    __DATA__
    h132BIK2
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    j1000010017 6790194100109201301092013Test Data N PW09-3PY248018BIK20
    k10 2R 1 0045.1011N01010215.820012.220006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000215.82000012.22000006.00
    q0215.820215.820215.820215.820215.820000000000000000000002500250025002500250YY00000 01000215.82000215.82000215.82000215.82000215.82
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0043.160019.0110018.9909M0000.000010.000233.08N0017.270023.50000043.16000019.01000018.99000000.00000010.00000233.08000017.27
    h216BIK0
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    pMU76 Nov 2010 A B C D E F G H I J L 0000000000
    j1000010017 6790194100109201301092013Test Data M PW09-3PY248005BIK00
    k10 2R 1 0045.1011N01010217.190012.290006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000217.19000012.29000006.00
    q0217.190217.190217.190217.190217.190000000000000000000002500250025002500250YY00000 01000217.19000217.19000217.19000217.19000217.19
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0043.440019.1410019.1109M0000.000010.000234.57N0017.380023.50000043.44000019.14000019.11000000.00000010.00000234.57000017.38
    h217BIK1
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    pMU76 Nov 2010 A B C D E F G H I J L 0000000000
    j1000010017 6790194100109201301092013Test Data L PW09-3PY248006BIK10
    k10 2R 1 0045.1011N01010222.940012.620006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000222.94000012.62000006.00
    q0222.940222.940222.940222.940222.940000000000000000000002500250025002500250YY00000 01000222.94000222.94000222.94000222.94000222.94
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0044.590019.6110019.6209M0000.000010.000240.78N0017.840023.50000044.59000019.61000019.62000000.00000010.00000240.78000017.84
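    Because the state is toggled on every h line, this approach appears to select the first and third records of the sample data and skip the second. If each h line is instead taken as the start of a new record, as the original poster's clarification suggests, a variant that only remembers whether the first h line has been seen might look like this. It is a sketch under that assumption; the helper name and the shortened sample data are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: once the first 'h' line has been seen, keep every
# line that starts with j, E or G, whichever record it belongs to.
sub jeg_after_first_h {
    my ($fh) = @_;
    my $seen_h = 0;
    my @kept;
    while ( my $line = <$fh> ) {
        $seen_h = 1 if $line =~ /^h/;
        push @kept, $line if $seen_h && $line =~ /^[jEG]/;
    }
    return @kept;
}

# Shortened, made-up stand-in for the sample data.
my $sample = join "\n",
    'h132BIK2', 'u3*** TEST DATA ***', 'j1000010017', 'E99HEADER|004|001|',
    'h216BIK0', 'j1000010017', 'G3301R7435459:LNI10708',
    'h217BIK1', 'E99INSSCH|248|', '';
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print jeg_after_first_h($in);
close $in;
```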
Re: Search file for certain lines
by Anonymous Monk on Sep 23, 2013 at 09:51 UTC
Re: Search file for certain lines
by wjw (Priest) on Sep 23, 2013 at 09:51 UTC
    ..so if I read this right, you are not interested in lines that begin with h(lower case). You are only interested in outputting lines that start with j, E, or G. If that is in fact the case, then the problem should be pretty simple. Read the file into an array, loop through the array line by line using a regex to check for lines beginning with j, E or G and print them out.
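    The read-into-an-array idea could be sketched as below. The helper name jeg_lines and the sample input are hypothetical, and a real script would open the data file rather than an in-memory string.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the lines that start with j, E or G.
sub jeg_lines {
    my (@lines) = @_;
    return grep { /^[jEG]/ } @lines;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = "h1\nu3 comment\nj100\nE99HEADER|004|001|\nG3301\nh2\nk10\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my @lines = <$in>;    # list context reads every line into the array
close $in;

print jeg_lines(@lines);
```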

    I get the impression I am missing something here... What importance do the lines beginning with h(lower case) have to you?

    • ...the majority is always wrong, and always the last to know about it...
    • The Spice must flow...
    • ..by my will, and by will alone.. I set my mind in motion

      Read the file into an array,

      Why?

        ..it generally works for me. I like iterating through an array. The question I face rarely is "why not", which boils down to file size in those rare cases. It is easy to see what I am working with using the debugger when I have an array available with everything in it. I am used to working with arrays.... Mostly this is just personal preference, but it works for me, so I suggest it...
Re: Search file for certain lines
by Jalcock501 (Sexton) on Sep 23, 2013 at 10:08 UTC
    Hey Guys Sorry it's so ambiguous. Here is some example data.
    h132BIK2
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    j1000010017 6790194100109201301092013Test Data N PW09-3PY248018BIK20
    k10 2R 1 0045.1011N01010215.820012.220006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000215.82000012.22000006.00
    q0215.820215.820215.820215.820215.820000000000000000000002500250025002500250YY00000 01000215.82000215.82000215.82000215.82000215.82
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0043.160019.0110018.9909M0000.000010.000233.08N0017.270023.50000043.16000019.01000018.99000000.00000010.00000233.08000017.27
    h216BIK0
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    pMU76 Nov 2010 A B C D E F G H I J L 0000000000
    j1000010017 6790194100109201301092013Test Data M PW09-3PY248005BIK00
    k10 2R 1 0045.1011N01010217.190012.290006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000217.19000012.29000006.00
    q0217.190217.190217.190217.190217.190000000000000000000002500250025002500250YY00000 01000217.19000217.19000217.19000217.19000217.19
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0043.440019.1410019.1109M0000.000010.000234.57N0017.380023.50000043.44000019.14000019.11000000.00000010.00000234.57000017.38
    h217BIK1
    u3*** TEST DATA ***
    u3*** COMMENT AREA FOR TEST DATA ***
    pMU76 Nov 2010 A B C D E F G H I J L 0000000000
    j1000010017 6790194100109201301092013Test Data L PW09-3PY248006BIK10
    k10 2R 1 0045.1011N01010222.940012.620006.0000000 0250M 1Insured Only NYY01N00000.00N00000.00Y00000.00 000222.94000012.62000006.00
    q0222.940222.940222.940222.940222.940000000000000000000002500250025002500250YY00000 01000222.94000222.94000222.94000222.94000222.94
    l02001 04000000000000000000000000000000000000000000000000
    a000.00000.00000.0000
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    yIIDD0044.590019.6110019.6209M0000.000010.000240.78N0017.840023.50000044.59000019.61000019.62000000.00000010.00000240.78000017.84
    This is a small portion but it should give you the idea. As you can see there is some other information between lines that I need. To clarify further I do not need the lines beginning with h, they are just markers to search between.
      ahhh... So there are no h-starts and h-stops; h (lowercase) simply indicates the beginning of a record. My approach would be to find an h line and use it as the key in a hash, then iterate over the following lines, adding each j, E, or G line to an array under that key (a hash of arrays, HoA). When you find another h line, you make a new key and wash, rinse, repeat until the end of the data. You end up with a structure which is pretty easy to manipulate further and clean up if you want to.

      I am sure there are a bunch of other ways to do this, many better. But what I like about this approach is that it uses well documented data structures which are simple to manipulate. Good luck! Should be fun... :-)
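      A rough sketch of the hash-of-arrays idea, assuming every h line is unique (duplicate h lines would be merged under one key). The helper name build_records, the separate @order array, and the shortened sample data are all illustrative choices, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper: build { h line => [ j/E/G lines ] } plus the h lines
# in file order, since hash keys come back in no particular order.
sub build_records {
    my ($fh) = @_;
    my %records;
    my @order;
    my $current;    # the h line of the record being read, if any
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^h/ ) {
            $current = $line;
            push @order, $current unless exists $records{$current};
            $records{$current} ||= [];
        }
        elsif ( defined $current && $line =~ /^[jEG]/ ) {
            push @{ $records{$current} }, $line;
        }
    }
    return ( \%records, \@order );
}

# Shortened, made-up stand-in for the sample data.
my $sample = join "\n",
    'h132BIK2', 'u3*** TEST DATA ***', 'j1000010017', 'E99HEADER|004|001|',
    'G3301R7435459:LNI10708', 'h216BIK0', 'j1000010017', 'E99INSSCH|248|', '';
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my ( $records, $order ) = build_records($in);
close $in;

for my $key (@$order) {
    print "$key\n";
    print "    $_\n" for @{ $records->{$key} };
}
```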

      What output do you expect to get from that data?

      mmmh... change the input record separator $/?

      or slurp the entire file and split by ^h?
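      The slurp-and-split option might look something like this. It is a sketch only: the helper names split_records and jeg_in_chunk and the sample input are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split slurped text into chunks that each start at an h line.
sub split_records {
    my ($text) = @_;
    # The lookahead splits before each line-initial 'h' without consuming it.
    my @chunks = split /^(?=h)/m, $text;
    # Anything before the first h line is not a record.
    return grep { /^h/ } @chunks;
}

# Hypothetical helper: the j/E/G lines inside one chunk (call in list context).
sub jeg_in_chunk {
    my ($chunk) = @_;
    return $chunk =~ /^([jEG].*)$/mg;
}

# Made-up sample input, read through an in-memory filehandle.
my $sample = "preamble\nh1\nj100\nk10\nE99\nh2\nG3301\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
my $text = do { local $/; <$in> };    # with $/ undefined, one read returns the whole file
close $in;

for my $chunk ( split_records($text) ) {
    my ($header) = $chunk =~ /^(h.*)$/m;    # first line of the chunk
    print "$header\n";
    print "  $_\n" for jeg_in_chunk($chunk);
}
```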

Re: Search file for certain lines
by kcott (Archbishop) on Sep 24, 2013 at 05:33 UTC

    G'day Jalcock501,

    You can read your input as multiline blocks: see '$/' in perlvar. Each of the individual lines in those blocks can be matched for the starting characters you want: see the 'g' and 'm' modifiers, the '^' and '$' anchors and '[...]' character classes in perlre.

    With the sample input you provided, this code:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    use autodie;

    my $re = qr{^([jEG].*)$}m;

    {
        local $/ = "\nh";
        open my $fh, '<', 'pm_1055252_data.txt';
        while (<$fh>) {
            print "*** h-block #$.";
            print $1 while /$re/g;
        }
        close $fh;
    }

    produces this output:

    *** h-block #1
    j1000010017 6790194100109201301092013Test Data N PW09-3PY248018BIK20
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    *** h-block #2
    j1000010017 6790194100109201301092013Test Data M PW09-3PY248005BIK00
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708
    *** h-block #3
    j1000010017 6790194100109201301092013Test Data L PW09-3PY248006BIK10
    E99HEADER|004|001|
    E99INSSCH|248|
    E99POLCOM|3||CAP01|66|3301R7435459|||||
    E99INSFAC2|MSRA01_1||||||"LNI10708"|
    G3301R7435459:LNI10708

    -- Ken
