PerlMonks  

Need advice for perl use as awk replacement

by Anonymous Monk
on Sep 21, 2020 at 15:58 UTC ( #11122007=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Having programmed for a very long time and knowing quite a number of languages (C/C++, Python, Java, JavaScript, ...), I am nevertheless very new to Perl. The need for Perl came from a quite complex bash script which also needs some data mangling. I started with awk (as usual) but came to a point where awk became too heavy to handle. So I wanted to give Perl a try.

The script (in this part) needs to analyze SYSLOG logs. With the help of Google and some tutorials I came up with:

perl -n -e " /^(([a-zA-Z]+\s+[0-9]+\s+[0-9]+[:][0-9]+[:][0-9]+)\s+([0-9a-zA-Z-]+)\s+([^:[]+)(\[.*?\])?\s*[:]\s*(?:([^:]*)[:])?\s*(.*))\$/ && ( \$m=(\$1,\$2,\$3,\$4,\$5,\$6,\$7) ) && print \$3 " "$SYSLOG"

which works fine for testing and prints the system name of each syslog record. But the script is to become more complex (actually it is a bash library function), so I need to filter. As a further test I then tried:

perl -n -e " /^(([a-zA-Z]+\s+[0-9]+\s+[0-9]+[:][0-9]+[:][0-9]+)\s+([0-9a-zA-Z-]+)\s+([^:[]+)(\[.*?\])?\s*[:]\s*(?:([^:]*)[:])?\s*(.*))\$/ && ( $3 =~ /^systemname$/) && print \$3 " "$SYSLOG"

But this does not work. I understand that this is because $3 is overwritten by this new regex test. So I tried to assign to a variable like:

perl -n -e " /^(([a-zA-Z]+\s+[0-9]+\s+[0-9]+[:][0-9]+[:][0-9]+)\s+([0-9a-zA-Z-]+)\s+([^:[]+)(\[.*?\])?\s*[:]\s*(?:([^:]*)[:])?\s*(.*))\$/ && ( \$m=(\$1,\$2,\$3,\$4,\$5,\$6,\$7) ) && ( $3 =~ /^systemname$/) && print \$m[3]" "$SYSLOG"

but I get only empty lines as well.

Questions:

  1. Am I understanding correctly that this script implicitly does $_ =~ before the initial regex ?
  2. Do I really need the first capturing group for the full line if I also need to access the full line ($_ seems not to work) ?
  3. How do I access the groups of the first regex at the final print ?

Even if you think there is a better way, I do also want to understand why this way does not work as expected. In the final script there will be dynamic testing (so there can be additional && ( $x =~....) && ...)

Many thanks,
Gaston

Replies are listed 'Best First'.
Re: Need advice for perl use as awk replacement
by hippo (Chancellor) on Sep 21, 2020 at 17:33 UTC

    Each successful match sets the numbered capture groups so you are correct in thinking that the second match you perform clobbers the $3 from the first. You are also correct in that storing the capture groups from the first match in some other structure would help. Where you have gone wrong is that you are storing references (see perlreftut) to the capture groups so you are back at square one. Instead you should store the values, eg:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $string = 'foo bar baz quux';
    $string =~ /(\w+) (\w+) (\w+) (\w+)/;
    print "Numbered groups: $1 $2 $3 $4\n";
    my @matches = ($1, $2, $3, $4);
    print "\@matches has @matches\n";
    $3 =~ /^baz/;
    print "Numbered groups after match 2: $1 $2 $3 $4\n";
    print "\@matches after match 2: @matches\n";

    🦛

      Thanks, me storing references is very interesting. At first I did not see exactly where your code differs from mine on that point, simply because I did not know the Perl reference/value syntax. I was expecting a Python-like approach, but was wrong.

      The interesting thing is that my script has \$1 ..., which if I understand correctly would be references in Perl, but actually I use double quotes in the bash script, which requires me to put \$1 to avoid $1 being replaced by the shell. Thus I expect perl to see "$1,$2,$3..." without backslashes, i.e. values. Am I wrong here ?

      The reason I use double quotes is that later the actual additional filtering conditions will be injected via a shell variable depending on the bash function call arguments.

      But I realize that actually my variable name for the array, "$m", was wrong and should be "@m". I cannot try it now, but will give it a try tomorrow.

        actually I use double quotes in the bash script, which requires me to put \$1 to avoid $1 being replaced by the shell

        Ah yes, you are quite right. I have to agree with brother jcb that using a one-liner for this, especially with double quotes, is just asking for trouble and will be a maintenance nightmare. Just put it all in a script and then you won't have to worry about escaping everything. Eventually, when you become more familiar with Perl and what it can do you may see the benefit in replacing all of the bash parts with Perl.

        There's a place for one-liners but this doesn't look like that place to me.
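
One way the "put it all in a script" advice could accommodate the dynamic filter conditions is sketched below. This is only an illustration under assumptions: the "field=pattern" calling convention, the helper names build_filters and passes, and the record layout are all made up here and do not come from the thread. Because the patterns arrive as ordinary strings (for example from @ARGV), no shell escaping of $1, $3 and friends is needed inside the Perl code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "field=pattern" strings (e.g. taken from @ARGV)
# into a hash of compiled regexes keyed by field name.
sub build_filters {
    my @specs = @_;
    my %filters;
    for my $spec (@specs) {
        my ($field, $pattern) = split /=/, $spec, 2;
        $filters{$field} = qr/$pattern/;
    }
    return \%filters;
}

# Hypothetical helper: true only if every filtered field of the record
# is present and matches its pattern.
sub passes {
    my ($record, $filters) = @_;
    for my $field (keys %$filters) {
        my $value = $record->{$field};
        return 0 unless defined $value && $value =~ $filters->{$field};
    }
    return 1;
}

# Assumed record layout; in a real script it would be filled from the parsed line.
my $filters = build_filters('host=^tic$', 'source=^crm');
my %record  = (host => 'tic', source => 'crmd', message => 'notice: State transition');

print "$record{host} $record{source}\n" if passes(\%record, $filters);   # prints "tic crmd"
```

The bash function would then only pass its arguments through to the script instead of splicing them into a double-quoted one-liner.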


        🦛

Re: Need advice for perl use as awk replacement
by jcb (Vicar) on Sep 21, 2020 at 21:47 UTC

    This is an example where you should rewrite the entire shell script in Perl. Using Perl one-liners in shell scripts is asking for performance problems — perl's startup and shutdown overhead is negligible in an interactive context, but it can add up fast if you are running perl repeatedly for every line of input. Even Awk has a similar issue, where efficient programming requires sending a stream into Awk and reading either a brief summary or a stream of output back.

    To better understand Perl's implicit loops, try B::Deparse, used via O:

    $ perl -MO=Deparse -n -e 'print $_'
    LINE: while (defined($_ = <ARGV>)) {
        print $_;
    }
    -e syntax OK

    You should have the full line in $_ unless your code changes $_. As other monks have mentioned, you will need to copy the regex capture variables somewhere to preserve them, or take advantage of the fact that regex matches in list context return the subexpressions: my @fields = m/$line_regex/; note that matching against $_ is implicit in Perl if you do not use the =~ operator.
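
A minimal sketch of the list-context approach mentioned above. The pattern is a simplified stand-in for the one in the question, and the helper name parse_syslog_line is hypothetical. Once the captures are copied into ordinary variables, later matches can no longer clobber them.

```perl
use strict;
use warnings;

# Simplified, assumed pattern: date/time, host, source, optional [pid], message.
my $line_regex = qr/^(\w+\s+\d+\s+[\d:]+)\s+([\w-]+)\s+([^:\[]+)(\[.*?\])?\s*:\s*(.*)$/;

# Hypothetical helper: returns the captured groups, or an empty list
# when the line does not match.
sub parse_syslog_line {
    my ($line) = @_;
    my @fields = $line =~ $line_regex;   # match in list context returns the captures
    return @fields;
}

my $sample = 'Sep 21 19:23:10 tic crmd[1427]: notice: State transition';
my ($when, $host, $source, $pid, $message) = parse_syslog_line($sample);

# These further matches reset $1, $2, ... but not the copies above.
if ($host =~ /^tic$/ && $source =~ /^crm/) {
    print "$host $source: $message\n";   # prints "tic crmd: notice: State transition"
}
```

Additional conditions can be chained with && in the same way, each one testing a named copy rather than a numbered capture variable.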

      This is an example where you should rewrite the entire shell script in Perl.
      Agreed. After being scarred by having to maintain an ugly 5000-line shell script (that started life as a quick ten line script) I put forward a case that Perl should almost always be preferred to Unix shell: Unix shell versus Perl.

      The key reason is that Perl can comfortably scale to much larger scripts than shell. And shell scripts have a way of growing ... and growing ... and growing ... until they become maintenance nightmares. But by then how do you justify a rewrite? The cost of rewriting, the opportunity cost of not working on something else, and the risk of breaking previously working code in the rewrite. So write it in Perl to begin with.

Re: Need advice for perl use as awk replacement
by BillKSmith (Prior) on Sep 21, 2020 at 22:35 UTC
    The -n flag in your code creates an implied while loop which reads each line of the files named on the command line (or STDIN if none are given). For the purpose of this post, I am using an explicit while loop. I have redirected STDIN to a memory file which contains the example data from SyslogScan::SyslogEntry. I saved your regex in a variable to separate the matching and processing issues. You had the right idea about saving the match values, but you should use an array. It is both easier and clearer to use 'eq' rather than a regex to match a constant pattern. Note that at the end of the loop, $_ still contains the original line and @m contains all the original matches. I recommend that you write and debug your complete logic in this form. If necessary, you can rewrite it as the terse one-liner.

    use strict;
    use warnings;
    use feature 'state';

    my $syslog_names = \"Jun 13 02:32:27 satellife in.identd[25994]: connect from mail.missouri.edu\n";
    close STDIN;
    open STDIN, '<', $syslog_names or die "$!";

    while (<>) {
        state $systemname = 'in.identd';
        state $regex = qr/^
            (
                ([a-zA-Z]+\s+[0-9]+ \s+ [0-9]+[:][0-9]+[:][0-9]+)
                \s+ ([0-9a-zA-Z-]+)
                \s ([^:[]+)
                (\[.*?\])?
                \s* [:] \s*
                (?: ([^:]*)[:])?
                \s* (.*)
            )
            $
        /x;
        if (m/$regex/) {
            my @m = ($1, $2, $3, $4, $5, $6, $7);
            print "$m[3]\n$_\n" if $m[3] eq "$systemname";
        }
    }

    OUTPUT:

    in.identd
    Jun 13 02:32:27 satellife in.identd[25994]: connect from mail.missouri.edu
    Bill
Re: Need advice for perl use as awk replacement
by Marshall (Canon) on Sep 21, 2020 at 17:21 UTC
    It would be very helpful if you could post some input data and then explain the desired result.

      Hi, thanks for your reply. The input data is, as I said, the syslog message log file, but that is actually not important. The regex is not the problem here (I am comfortable with that part); it filters as expected. The problem is the following sequence of boolean operations. However, it looks like the site formatting added some spurious "+" characters (at line breaks) to it.

      There is also currently no meaningful result, as this script will later be dynamic: depending on the caller's parameters, the syslog data is filtered and the complete line or parts of it are returned.

      My first try was to filter data and just output the system name (group $3; $1=complete line, $2=date/time, $3=system name, $4=source, $5=optional source PID, ...).

      The problem is how to correctly filter on some of the groups and only print on a positive match, and why my examples are not working (to get a better understanding of Perl).

      But here is some sample data:

      Sep 21 19:18:25 tic kernel: hv_balloon: Balloon request will be partially fulfilled. Balloon floor reached.
      Sep 21 19:23:10 tic crmd[1427]: notice: State transition S_IDLE -> S_POLICY_ENGINE
      Sep 21 19:23:10 tic pengine[1426]: warning: Fencing and resource management disabled due to lack of quorum
Re: Need advice for perl use as awk replacement
by perlfan (Vicar) on Sep 21, 2020 at 18:46 UTC
    > need for Perl came from a quite complex bash script which needs also some data mangling

    good reason :) - this is generally my arc since I actually prefer using shell scripts for most simple things.

    As tempting as it is, I would not try to use perl like awk. Doing this in a more readable and expanded way will greatly assist you in your efforts. The /x flag will allow you to even break the regex up over multiple lines and with comments. Your future self and certainly others will appreciate this. Check out perlre for this (and more).
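
To illustrate the /x suggestion, here is a rough sketch of how the syslog pattern might be laid out with comments. It is a simplified pattern, not the exact regex from the question; the named captures (which need Perl 5.10 or later) and the helper name parse_named are assumptions added for readability.

```perl
use strict;
use warnings;

# With the x modifier, whitespace and "#" comments inside the pattern are ignored.
my $syslog_re = qr/
    ^
    (?<when>    [a-zA-Z]+ \s+ [0-9]+ \s+ [0-9:]+ )   # date and time
    \s+
    (?<host>    [0-9a-zA-Z-]+ )                      # system name
    \s+
    (?<source>  [^:\[]+ )                            # program name
    (?: \[ (?<pid> [0-9]+ ) \] )?                    # optional PID in brackets
    \s* : \s*
    (?<message> .* )                                 # rest of the line
    $
/x;

# Hypothetical helper: returns a hash reference of the named captures,
# or nothing if the line does not match.
sub parse_named {
    my ($line) = @_;
    return unless $line =~ $syslog_re;
    my %record = %+;    # copy, so later matches cannot overwrite the values
    return \%record;
}

my $rec = parse_named('Sep 21 19:23:10 tic pengine[1426]: warning: Fencing disabled');
print "$rec->{host} $rec->{source} $rec->{pid}\n";   # prints "tic pengine 1426"
```

One caveat with this layout: a comment inside the pattern must not contain the regex delimiter, otherwise the pattern ends early.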

    You may also want to check out Regexp::Common. It may have your use case; if not it's still good to study.

      I would not try to use perl like awk.

      Well, I do this quite frequently, but then (a) I am very fluent in Perl, and (b) I can't hardly write a line of awk without consulting the fm. (I used to be fluent in awk, but those parts of my brain have been taken over by Perl, like wasp maggots in a caterpillar.)

Re: Need advice for perl use as awk replacement
by Anonymous Monk on Sep 22, 2020 at 06:01 UTC

    Thank you to all monks for the help provided!

    With your help I have identified the problem, which actually was me trying to store the captured groups in an array $m instead of @m. The only thing I do not get yet is why perl does not complain with an error about a statement like "$m=($1,$2,$3,$4,$5,$6,$7)". What does this mean to perl if $ is used instead of @ for m ?

    Many thanks also for all the indirect answers and informational hints like the deparser and others. These extended my horizons and pointed me to what to investigate next (like regex modifiers, etc...)

    One other question, is perldoc officially available somewhere in PDF ? Wikipedia claims that the perldoc site would have links for PDF download but I could not find these. All I could find (right now) is a PDF with the complete doc of perl 5.8.5 on perl.mines-albi.fr

    Many thanks,
    Gaston

      ... why perl is not complaining with an error regarding a statement like "$m=($1,$2,$3,$4,$5,$6,$7)",

      The following set of statements are syntactically correct:

      Win8 Strawberry 5.8.9.5 (32) Mon 09/21/2020 19:49:29
      C:\@Work\Perl\monks
      >perl
      use strict;
      use warnings;

      my $foo = 42;
      my $bar = 137;

      my $m;
      $m = (1, 2, 3, $foo, 5, 6, $bar);
      Useless use of a constant in void context at - line 8.
      Useless use of a constant in void context at - line 8.
      Useless use of private variable in void context at - line 8.
      Useless use of a constant in void context at - line 8.
      Useless use of a constant in void context at - line 8.
      print $m;
      __END__
      137
      (1, 2, 3, $foo, 5, 6, $bar) is a list. Evaluating a list in scalar context (which is imposed by assignment to the scalar $m) causes the evaluation of every expression in the list, with only the result of the evaluation of the last item ($bar in this case) being returned. All other results are thrown away, which is why you see all the "Useless use of..." warnings — not errors! See the article Context tutorial in the Monastery's Tutorials section.
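
For comparison, a small sketch of how the left-hand side of the assignment decides what is kept. The variable names are made up for illustration, and the warnings are silenced only so the demo runs quietly.

```perl
use strict;
use warnings;

my ($one, $two, $three) = ('host', 'source', 'pid');

# Scalar on the left: the comma operator keeps only the last value.
my $m;
{
    no warnings 'void';   # silence the "Useless use ..." warnings for this demo
    $m = ($one, $two, $three);
}

# Array on the left: every value is kept.
my @m = ($one, $two, $three);

# Parentheses on the left also give list assignment: here the first value only.
my ($first) = ($one, $two, $three);

print "scalar: $m\n";                 # pid
print "array:  @m\n";                 # host source pid
print "first:  $first\n";             # host
print "count:  ", scalar(@m), "\n";   # 3
```

Note that $m and @m are two unrelated variables, which is why $m[3] reads from the (empty) array @m rather than from the scalar $m.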


      Give a man a fish:  <%-{-{-{-<

      «...perldoc officially available somewhere in PDF...»

      You may take a look at pod2pdf and pod2latex.

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
