Not only does using /x make things a lot more readable, it also helps with debugging. I commented out everything except the first element of the final regex and adjusted that until it matched all (both :) test lines. Then I uncommented the next element, adjusted that, and so on until the whole thing matched.
Using named sub-elements lets you re-use those bits where necessary, and would simplify adding in predefined elements such as a better IP definition from Regexp::Common or a datetime pattern from somewhere.
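As a small illustration of that build-up, here is roughly what an intermediate stage might look like. This is only a sketch: the name $re_partial and the @seen array are mine, and the trailing $ anchor is left off so the partial pattern can match a prefix of each line.

```perl
#! perl
use strict;
use warnings;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;
my $re_msgid    = qr[ \d{6} : ]x;

# Intermediate stage: the first two elements are live, the third is
# still commented out (a '#' under /x comments out the rest of the line).
my $re_partial = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB )      \s+
  # ( $re_msgid )    \s+
]x;

my @lines = (
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet',
    'Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet',
);

# Record what the live elements captured for every line that matches.
my @seen;
for my $line ( @lines ) {
    push @seen, join '|', $1, $2 if $line =~ $re_partial;
}
print "$_\n" for @seen;
```

Once both lines show up with sensible captures, uncomment the next element and repeat.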
#! perl -slw
use strict;

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;          # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                         # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                      # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;                   # list 101
my $re_action   = qr[ [a-z]+ ]x;                          # denied
my $re_protocol = qr[ [a-z]+ ]x;                          # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;             # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;       # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;        # , 1 packet

my $re_log = qr[
    ^
    ( $re_datetime ) \s+
    ( $re_MIB )      \s+
    ( $re_msgid )    \s+
    ( $re_datetime ) \s+
    ( $re_TZ )       \s+
    $re_type         \s+
    $re_listid       \s+
    ( $re_action )   \s+
    ( $re_protocol ) \s+
    ( $re_ip )       \s*
    $re_port?        \s+
    ->               \s+
    ( $re_ip )       \s*
    $re_port?
    $re_packets      \s*
    $
]x;

while( <DATA> ) {
    print join '|', $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13
        if $_ =~ m[$re_log];
}

=pod output

P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Use of uninitialized value in join or string at P:\test\285616.pl8 line 37, <DATA> line 2.
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150||211.95.79.233|8/0|1

=cut

__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet
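Note that the second line produces an "uninitialised value" warning. That line has no port number after the first IP address, so the optional capture never takes part in the match and $10 is left undefined. The later capture numbers stay where they are (the output shows an empty tenth field), but every optional capture then has to be checked before it is used, and you have to keep count of which number belongs to which field, which is a pain.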
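To make the undefined capture concrete, here is a cut-down sketch that matches only the address part of each line and guards every capture with defined. The names $re_pair and @rows are mine, not part of the script above.

```perl
#! perl
use strict;
use warnings;

my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# Same shape as the address part of $re_log: ip, optional port, ->, ip, optional port.
my $re_pair = qr[
    ^
    ( $re_ip ) \s*
    $re_port?  \s+
    ->         \s+
    ( $re_ip ) \s*
    $re_port?
    $
]x;

my @rows;
for my $text ( '10.161.24.153(3988) -> 10.158.24.10(135)',
               '10.165.4.150 -> 211.95.79.233 (8/0)' ) {
    next unless $text =~ $re_pair;
    # A capture group that took no part in the match is undef, so map
    # each one to an empty string before joining to avoid the warning.
    push @rows, join '|', map { defined($_) ? $_ : '' } $1, $2, $3, $4;
}
print "$_\n" for @rows;
```

The second row should come out with an empty second field and no warning, but the guard has to be repeated wherever the numbered variables are used.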
The best way I know of to avoid all the conditionals needed to deal with regexes that contain optional captures is to capture to named variables using the (?{ }) extended regex feature.
#! perl -slw
use strict;
use re 'eval';

# Package variables, declared before the pattern so that the (?{ })
# blocks can see them under strict.
our( $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
     $action, $protocol, $ip1, $port, $ip2, $port2, $packets );

my $re_datetime = qr[ [A-Z] [a-z]{2} \s \d{2} \s \d{2} : \d{2} : \d{2} ]x;  # Aug 21 19:00:36
my $re_MIB      = qr/ \[ \d (?: \. \d+ )+ \] /x;          # [1.1.1.3.200.125]
my $re_msgid    = qr[ \d{6} : ]x;                         # 410381:
my $re_TZ       = qr[ [A-Z]{3} : ]x;                      # UTC:
my $re_type     = qr[ %SEC-6- [A-Z]+ : ]x;                # %SEC-6-IPACCESSLOGP:
my $re_listid   = qr[ list \s (\d+) ]x;                   # list 101
my $re_action   = qr[ [a-z]+ ]x;                          # denied
my $re_protocol = qr[ [a-z]+ ]x;                          # tcp
my $re_ip       = qr[ \d+ (?: \. \d+ ){3} ]x;             # 10.161.24.153
my $re_port     = qr[ \( (\d+ (?: / \d+ )? ) \) ]x;       # (3988) or (8/0)
my $re_packets  = qr[ , \s+ ( \d+ ) \s+ packet ]x;        # , 1 packet

# $^N holds the text of the most recently closed capture group.
my $re_log = qr[
    ^
    ( $re_datetime ) \s+ (?{ $first_date  = $^N||'' })
    ( $re_MIB )      \s+ (?{ $MIB         = $^N||'' })
    ( $re_msgid )    \s+ (?{ $msgID       = $^N||'' })
    ( $re_datetime ) \s+ (?{ $second_date = $^N||'' })
    ( $re_TZ )       \s+ (?{ $TZ          = $^N||'' })
    $re_type         \s+
    $re_listid       \s+ (?{ $listID      = $^N||'' })
    ( $re_action )   \s+ (?{ $action      = $^N||'' })
    ( $re_protocol ) \s+ (?{ $protocol    = $^N||'' })
    ( $re_ip )       \s* (?{ $ip1         = $^N||'' })
    $re_port?        \s+ (?{ $port        = $^N||'' })
    ->               \s+
    ( $re_ip )       \s* (?{ $ip2         = $^N||'' })
    $re_port?            (?{ $port2       = $^N||'' })
    $re_packets      \s* (?{ $packets     = $^N||'' })
    $
]x;

while( <DATA> ) {
    print join '|', $first_date, $MIB, $msgID, $second_date, $TZ, $listID,
                    $action, $protocol, $ip1, $port, $ip2, $port2, $packets
        if $_ =~ m[$re_log];
}

=pod output

P:\test>285616
Aug 21 19:00:36|[1.1.1.3.200.125]|410381:|Aug 21 23:00:35|UTC:|101|denied|tcp|10.161.24.153|3988|10.158.24.10|135|1
Aug 21 19:00:36|[1.1.1.3.200.125]|410382:|Aug 21 23:00:35|UTC:|101|denied|icmp|10.165.4.150|10.165.4.150|211.95.79.233|8/0|1

=cut

__DATA__
Aug 21 19:00:36 [1.1.1.3.200.125] 410381: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGP: list 101 denied tcp 10.161.24.153(3988) -> 10.158.24.10(135), 1 packet
Aug 21 19:00:36 [1.1.1.3.200.125] 410382: Aug 21 23:00:35 UTC: %SEC-6-IPACCESSLOGDP: list 101 denied icmp 10.165.4.150 -> 211.95.79.233 (8/0), 1 packet
I like this because it avoids juggling the numbered capture variables, and if you start using this approach consistently, it becomes pretty much second nature to build regexes this way. The downsides are the "experimental" status of the "zero-width evaluation assertion" (Phew! What a handle :) and the need for use re 'eval';, both of which are frowned upon in some circles.
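One wrinkle shows in the second output line above: the first IP address appears twice, once in the port slot. $^N is the text of the most recently closed group, and when the optional port matches nothing, that is still the IP group. A possible way round it, sketched here on just the address part of the line, is to put the code block inside an optional non-capturing group so it only runs when a port was actually matched, and to reset the variables before each match. The names $re_pair, $port1 and @rows are mine, and I have only reasoned this through for the two sample lines.

```perl
#! perl
use strict;
use warnings;
use re 'eval';

# Declared up front so the (?{ }) blocks can see them under strict.
our ( $ip1, $port1, $ip2, $port2 );

my $re_ip   = qr[ \d+ (?: \. \d+ ){3} ]x;
my $re_port = qr[ \( ( \d+ (?: / \d+ )? ) \) ]x;

# Each port assignment lives inside its own optional group, so it is
# skipped (and the variable keeps its reset value) when no port is present.
my $re_pair = qr[
    ^
    ( $re_ip )           (?{ $ip1   = $^N })
    (?: \s* $re_port     (?{ $port1 = $^N }) )?
    \s+ -> \s+
    ( $re_ip )           (?{ $ip2   = $^N })
    (?: \s* $re_port     (?{ $port2 = $^N }) )?
    $
]x;

my @rows;
for my $text ( '10.161.24.153(3988) -> 10.158.24.10(135)',
               '10.165.4.150 -> 211.95.79.233 (8/0)' ) {
    # Clear values left over from the previous line.
    ( $ip1, $port1, $ip2, $port2 ) = ('') x 4;
    push @rows, join '|', $ip1, $port1, $ip2, $port2 if $text =~ $re_pair;
}
print "$_\n" for @rows;
```

Dropping the ||'' also means a captured "0" is no longer turned into an empty string, since the reset already supplies the default.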