PerlMonks  

Variable number of words/fields in a line/record

by Tuna (Friar)
on Jun 16, 2001 at 06:42 UTC

Tuna has asked for the wisdom of the Perl Monks concerning the following question:

Someone I know is writing a little script to parse a PIX firewall log, grab elements from each line, and insert said elements into a database. Here's a sample of the data:
conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)
conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)
What he wants to end up with is:
conduit permit tcp|host|192.168.1.1|eq|www|any|(hitcnt=57476)
conduit permit tcp|host|192.168.1.1|eq|139|host|192.168.2.1|(hitcnt=2)
Forget about writing to the database, for now. The more immediate problem is getting the program to "create" variables on the fly, depending upon how many "elements" a line contains, assuming that this is even the right way to approach the problem; I don't know of any other way. Hence, this post. =)
Here's the code he/we have come up with. Please contain the laughter, for now.
#!/usr/bin/perl -w
use strict;

my $in  = "/home/trix/test";
my $out = "/home/trix/out";
use vars qw(@array @list);
use vars qw ($host1 $host2 $conduit $permit $tcpudp $ip1 $ip2 $eq $port $hitcnt $line);

open IN, "$in" || die "$!\n";
open OUT, ">$out" || die "$!\n";

while ($line = <IN>) {
    chomp $line;
    ($conduit, $permit, $tcpudp, $host1, $ip1, $eq, $port, $host2, $ip2, $hitcnt) = split( /\s/, $line);
    print OUT "$conduit $permit $tcpudp $host1\|$ip1\|$eq\|$port\|$host2\|$ip2\|$hitcnt\n";
}
This prints:
conduit permit tcp|host|192.168.1.1|eq|www|any|(hitcnt=57476)
conduit permit tcp|host|192.168.1.1|eq|139|host|192.168.2.1
I thought that this was going to be a pretty easy thing to accomplish (for me), but I must say that I'm stumped. And it's not even my work! But I'm posting this, because I really want to learn how to do this. It seems like this could be pretty useful stuff.

Thanks,
Tuna

Replies are listed 'Best First'.
Re: Variable number of words/fields in a line/record
by Arguile (Hermit) on Jun 16, 2001 at 09:12 UTC
    I think you should consider how you're modelling your data, as there is not a variable number of fields there (at least to my eyes). Let's look at just the part after "conduit permit". Upon examining the data, we can break it down into five distinct pieces.
    conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)
                   |1|      |    2    |    |3| |4| |     5      |
    conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)
                   |1|      |    2    |    |3| |      4       | |   5    |
    So you don't have a variable number of fields; all cases can be represented as:
    $protocol, $server, $port, $client, $hits

    Note that 'www' and 139 are no different; 'www' is just a label for port 80. Likewise, 'any' is just a special case of host aaa.bbb.ccc.ddd, as it represents all the valid IPs (or host *).
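
A minimal sketch of the five-field view described above, assuming every line looks like the two samples in the question. The sub name `parse_conduit` and the pattern itself are illustrative assumptions, not something the thread or the PIX documentation defines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one conduit line into the five fields
# ($protocol, $server, $port, $client, $hits). The pattern is an
# assumption based only on the two sample lines in the question.
sub parse_conduit {
    my ($line) = @_;
    my ($protocol, $server, $port, $client, $hits) = $line =~ m{
        ^conduit \s+ permit \s+
        (\S+)                 \s+   # protocol, e.g. tcp
        host \s+ (\S+)        \s+   # server IP; the word host is filler
        eq   \s+ (\S+)        \s+   # port name or number
        (any | host \s+ \S+)  \s+   # client: any, or host plus an IP
        \(hitcnt=(\d+)\)            # hit count
    }x or return;
    $client =~ s/^host\s+//;        # drop the filler word
    return ($protocol, $server, $port, $client, $hits);
}

for my $line (
    'conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)',
    'conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)',
) {
    my @fields = parse_conduit($line) or next;   # skip lines that do not match
    print join('|', @fields), "\n";
}
```

Both sample lines come out with exactly five fields, which is the point being made: the longer line only differs in grammar, not in the amount of data.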

    You might want to consider representing the IP as an ip/mask (decimal mask) in the database, so the special case of 'any' can be represented without NULLs. This will also help if your firewall allows designation by named IP groups and ranges for rulesets. If no field will ever be null (NOT NULL specified at table creation), many more indexing and relation options open up. You can then easily create lookup tables so that 'www' maps to '80', or an IP is mapped to a named person (i.e. an admin or employee), or a whole IP range is named, provided your firewall supports named groups as mentioned above. (If you want more info on db normalisation, the various relationship types and constraints, feel free to /msg me and I'll bore you to death about them.)
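
As a rough illustration of that normalisation step, here is a sketch with an in-memory lookup table standing in for a database table. Only the 'www' to 80 mapping comes from this thread; the other entries, the sub names, and the choice of "0.0.0.0/0.0.0.0" for 'any' are assumptions made for the example:

```perl
use strict;
use warnings;

# Hypothetical lookup table. Only www => 80 is taken from the thread;
# smtp and ftp are the usual well-known port assignments.
my %port_for = (www => 80, smtp => 25, ftp => 21);

# Return a numeric port for either a service name or a number.
sub normalise_port {
    my ($port) = @_;
    return $port if $port =~ /^\d+$/;
    return $port_for{$port};            # undef if the name is unknown
}

# Return an "ip/mask" string so the column never has to be NULL.
sub normalise_host {
    my ($host) = @_;
    return '0.0.0.0/0.0.0.0' if $host eq 'any';   # every address
    return "$host/255.255.255.255";               # exactly one host
}

print normalise_port('www'), "\n";
print normalise_port(139), "\n";
print normalise_host('any'), "\n";
print normalise_host('192.168.2.1'), "\n";
```

With both helpers in place, 'www' and 139 end up in the same numeric column, and 'any' is stored in the same shape as a single host.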

      conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)
                     |1| |2 | |    3    | |4||5| |6| |     7      |
      conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)
                     |1| |2 | |    3    | |4||5| |6 | |    7    | |   8    |

        I don't think you're seeing what I mean. "host" is just a filler word; it has absolutely no bearing on the real data. "any" carries with it an implicit host: you could just as easily say "host any" or "host all". The root of its meaning is "which host do I allow?" (based on the "permit" earlier).

        Think of it another way. Let's say you had a theme park, and different rides had different height requirements. We could express this as:

        lane permit waterslide "The Slayer" eq 5 person 280cm (pplcnt=30)
        lane permit waterslide "The Trickle" eq 2 anyone (pplcnt=532)

        So we permit the use of "The Slayer" waterslide if they're a person over 280cm (and a little insane). In the second we permit the use of "The Trickle" to anyone who wants to. What I'm getting at here is that what you're taking as two pieces of information is in fact only one piece. If you go back and think through what the firewall is actually telling you, you realise that it's just different grammar that makes one longer.

        So to go back to the primary problem:

        conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)
                | 0  | |1|      |    2    |    |3| |4| |     5      |
        conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)
                | 0  | |1|      |    2    |    |3| |      4       | |   5    |
        I've revised the diagram to highlight just the important data; all the rest is packing material.
        • $permit is a boolean value (bit), either you permit or deny.
        • $protocol can be represented many ways: if you only deal in TCP and UDP -- and are tight on space -- you can represent it as a bit; a nicer datatype might be char(3) or char(4), depending on what other protocols you use, as it maps better for human understanding.
        • $server is an IP or possibly IP range. If your database supports internal ip/masks as a datatype then use that; if you plan to be doing a lot of indexing and/or matching/lookups on it, use an integer representation; or if all you really do is display it back, a varchar() would work as well.
        • $port has a range of [0..2^16) which is an unsigned small int (or int(2) to many dbs).
        • $client is an IP/mask again, as discussed for $server. Here, though, the mask comes more into play, as you can represent groups of computers (e.g. "any") quite easily in that notation.
        • $hitcnt would be some form of integer unless you have insane daily traffic :)
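
To make the integer representation and the ip/mask idea from the list above concrete, here is a small sketch using only pack and unpack. The sub names are hypothetical, and it assumes plain dotted-quad IPv4 addresses with no validation:

```perl
use strict;
use warnings;

# Dotted quad -> unsigned 32-bit integer (big-endian, "network order").
sub ip_to_int {
    my ($ip) = @_;
    return unpack('N', pack('C4', split(/\./, $ip)));
}

# Unsigned 32-bit integer -> dotted quad.
sub int_to_ip {
    my ($n) = @_;
    return join('.', unpack('C4', pack('N', $n)));
}

# True if $ip falls inside the network described by $net and $mask.
sub in_network {
    my ($ip, $net, $mask) = @_;
    my $m = ip_to_int($mask);
    return (ip_to_int($ip) & $m) == (ip_to_int($net) & $m);
}

print ip_to_int('192.168.1.1'), "\n";
print int_to_ip(3232235777), "\n";
print in_network('192.168.2.1', '192.168.0.0', '255.255.0.0') ? "in\n" : "out\n";
print in_network('10.0.0.1', '0.0.0.0', '0.0.0.0') ? "in\n" : "out\n";   # "any"
```

An all-zero mask matches every address, which is how "any" fits the same column as a single host without needing a NULL.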

        If I come across as heavy handed, please don't take it as such. I just think you're setting yourself up for way too much work and less robust reporting than you could achieve with a good foundation (data structure).

Re: Variable number of words/fields in a line/record
by premchai21 (Curate) on Jun 16, 2001 at 08:03 UTC
    You don't need all those separate variables. Try using an array or a hash instead, for instance:
    while (<IN>)    # Note: using $_ instead
    {
        my (@two, @rest);
        chomp;
        @rest = split;    # Note that this is the same as (split ' ', $_) .
                          # ' ' is generally a better choice than /\s/ for splitting on
                          # whitespace. If you need to count multiple spaces as
                          # multiple fields, use /\s/ or / / instead.
        @two = splice @rest, 0, 2;
        print join(' ',@two,undef),join('|',@rest),"\n";    # The undef is to put another space in before @rest
    }
    See also splice, split, join, and perlop.
      Or (after chomp and before print):
      (@two[0..1], @rest) = split;
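
The difference between splitting on ' ' and on /\s/ is easy to see with a line that has a doubled space or leading whitespace. This is a small standalone sketch; the sample strings are made up for the demonstration:

```perl
use strict;
use warnings;

my $doubled = "conduit  permit tcp";    # two spaces after "conduit"
my $leading = " conduit permit";        # one leading space

# split ' ' skips leading whitespace and treats runs of whitespace as one separator.
my @awk_doubled = split ' ', $doubled;
my @awk_leading = split ' ', $leading;

# split /\s/ treats every single whitespace character as a separator,
# so empty fields appear between and before the words.
my @re_doubled = split /\s/, $doubled;
my @re_leading = split /\s/, $leading;

print scalar(@awk_doubled), " vs ", scalar(@re_doubled), "\n";
print scalar(@awk_leading), " vs ", scalar(@re_leading), "\n";
```

For whitespace-delimited log lines like the ones in the question, the ' ' form is usually the safer default, since stray extra spaces do not shift the field positions.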
Re: Variable number of words/fields in a line/record
by I0 (Priest) on Jun 16, 2001 at 08:10 UTC
    ($conduit, $permit, $tcpudp, @host) = split( /\s/, $line);
    print OUT "$conduit $permit ",(join"|",$tcpudp,@host),"\n";
Re: Variable number of words/fields in a line/record
by virtualsue (Vicar) on Jun 16, 2001 at 18:21 UTC
    You seem to have received some useful help for the main part of your script, the bit that actually does the useful stuff. ++ to Arguile for his clear analysis of your data & its effect on your program. So I'll just be pedantic and mention my pet peeve:
    open IN, "$in" || die "$!\n";       # Incorrect
    open OUT, ">$out" or die "$!\n";    # This is OK
    You must use or rather than || in short-circuit error checks after calls to open, close etc. unless you put parens around their parameters. If you don't, the die (or whatever is after the ||) will never be executed. See Re: (boo) debug-fu! for a slightly longer explanation.
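
The precedence point can be checked with a short experiment. This sketch uses the three-argument form of open with a lexical filehandle rather than the bareword handles in the original script, and it assumes the path below does not exist on your machine:

```perl
use strict;
use warnings;

my $missing = '/no/such/dir/no-such-file';   # hypothetical path, assumed absent

# || binds tighter than the comma, so this parses as
#   open(my $fh, '<', ($missing || die ...))
# and die is never reached because $missing is a true value.
my $died_with_pipes = !eval {
    open my $fh, '<', $missing || die "open failed: $!\n";
    1;
};

# "or" has lower precedence than the list operator, so this parses as
#   open(my $fh, '<', $missing) or die ...
my $died_with_or = !eval {
    open my $fh, '<', $missing or die "open failed: $!\n";
    1;
};

print "|| version died: ", ($died_with_pipes ? "yes" : "no"), "\n";
print "or version died: ", ($died_with_or    ? "yes" : "no"), "\n";
```

The first open fails silently, which is exactly the failure mode being warned about; writing open(...) || die with explicit parentheses behaves like the second form.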
Re: Variable number of words/fields in a line/record
by eejack (Hermit) on Jun 16, 2001 at 07:57 UTC
    You might want to try something like....
    #!/usr/bin/perl -w
    use strict;

    my $in  = "/home/trix/test";
    my $out = "/home/trix/out";

    open IN, "$in" || die "$!\n";
    open OUT, ">$out" || die "$!\n";

    while (<IN>){
        chop;
        my (@temp_array) = split( /\s/, $_);
        if ($#temp_array == 9){    # 10 elements
            my @keeper_array = @temp_array[3 .. 9];
        } else {                   # 9 elements
            my @keeper_array = (@temp_array[3 .. 7], "whatever your filler is", $temp_array[8]);
        }
    }

    Update: Misunderstood the basic question...

    But let's assume you are using DBI for a moment... You might have something like:

    $sql_statement = qq|insert into log_table
        (field1, field2, field3, field4, field5, field6)
        VALUES (?,?,?,?,?,?)|;
    $sth{'6'} = $dbh->prepare($sql_statement);

    $sql_statement = qq|insert into log_table
        (field1, field2, field3, field4, field5, field6, field7)
        VALUES (?,?,?,?,?,?,?)|;
    $sth{'7'} = $dbh->prepare($sql_statement);

    $sql_statement = qq|insert into log_table
        (field1, field2, field3, field4, field5, field6, field7, field8)
        VALUES (?,?,?,?,?,?,?,?)|;
    $sth{'8'} = $dbh->prepare($sql_statement);
    Then your while loop would do something like:
    while (<IN>){
        chop;
        my ($waste, $wastea, @temp_array) = split;
        my $rv = $sth{$#temp_array+1}->execute(@temp_array);
    }

    Regardless of how you do it, you would need to be able to predict *something* about the incoming data either based on the number of elements or on something you could match in the data.

    EEjack
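
The three near-identical statements above could also be generated from the field count. This is only a sketch that builds the SQL text; the table and column names (log_table, field1 .. fieldN) follow the example above, the sub name `insert_sql` is hypothetical, and no database connection is made:

```perl
use strict;
use warnings;

# Build an INSERT statement with one column and one placeholder per value.
sub insert_sql {
    my ($count) = @_;
    my $columns      = join ', ', map { "field$_" } 1 .. $count;
    my $placeholders = join ', ', ('?') x $count;
    return "insert into log_table ($columns) VALUES ($placeholders)";
}

print insert_sql($_), "\n" for 6 .. 8;
```

With DBI one would presumably pass each generated string to prepare once and keep the statement handle in a hash keyed by the count, as in the loop above.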

UPDATE: Variable number of words/fields in a line/record
by Tuna (Friar) on Jun 16, 2001 at 07:46 UTC
    Sorry to reply to my own post, but I think that I could state my problem a bit more clearly.

    How can I capture each "element" of a line, and store it in a variable, regardless of how many elements are in the line? So, if LINE_1 contains 9 elements, I need to create 9 variables. If it contains 10, I need to create 10...and so on.

    I know that if someone reads this, they're gonna ask what I'm really trying to do:

    The program will grab each element as described, ie:
    conduit permit tcp|host|192.168.1.1|eq|139|host|192.168.2.1
    then, it will ignore "conduit" and "permit", and insert the remaining variables into a database.
      Don't use separate variables; use an array or a hash. I would do something like:
      while( <> ){
          @_ = split;                       # assign explicitly; perl 5.12+ no longer splits into @_ implicitly
          splice @_, 0, 2;                  # removes first two elements
          store_elements_function( @_ );    # sub to store the remainder
      }
      If you really want to name the items, use a hash. Define the names first:
      my @std_keys = qw/proto .../;
      my @ext_keys = qw/name9 name10 name11/;
      ..
      # and in loop:
      my %hash;
      if( @_ < 9 ){
          @hash{@std_keys} = @_;
      } else {
          @hash{(@std_keys, @ext_keys)} = @_;
      }
      print "My proto is: ",$hash{'proto'},"\n";
      ...
      Hope this helps. You can read up on it in perldata.

      Jeroen
      "We are not alone"(FZ)
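
For the two sample lines in the question, the hash-slice idea could look like the sketch below. The key names and the sub name `to_record` are invented for the illustration (the reply above leaves the real key list elided), and the choice between the two key lists is made purely on the number of fields:

```perl
use strict;
use warnings;

# Hypothetical key names for the 7-field ("any") and 8-field ("host a.b.c.d") forms.
my @any_keys  = qw(proto src_kw server op port client hits);
my @host_keys = qw(proto src_kw server op port dst_kw client hits);

# Turn one log line into a hash reference keyed by field name.
sub to_record {
    my ($line) = @_;
    my @fields = split ' ', $line;
    splice @fields, 0, 2;                 # drop "conduit" and "permit"
    my @keys = @fields == @host_keys ? @host_keys : @any_keys;
    my %rec;
    @rec{@keys} = @fields;                # hash slice: names on the left, values on the right
    return \%rec;
}

for my $line (
    'conduit permit tcp host 192.168.1.1 eq www any (hitcnt=57476)',
    'conduit permit tcp host 192.168.1.1 eq 139 host 192.168.2.1 (hitcnt=2)',
) {
    my $rec = to_record($line);
    print join(' ', @{$rec}{qw(proto server port client hits)}), "\n";
}
```

No variables are created on the fly; the number of keys in the hash simply follows the number of fields in the line.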
