PerlMonks  

Splitting squid log lines with perl

by blm (Hermit)
on Sep 16, 2002 at 01:12 UTC

blm has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a perl program to accumulate certain information from a squid access log. When I run my script the total bytes downloaded comes to about 600 MB, but when I run calamaris I get 2 GB! While looking at calamaris I noticed that the split() it uses to split log lines is different. I use

@line_elements = split(' ');

In calamaris it appears in vim to be

@cache = split(u)

where u appears to be the greek letter for micro (WEIRD!). If I open it in emacs it appears like:

@cache = split '?';

Can someone enlighten me as to what is going on here? What would others use in split?

Replies are listed 'Best First'.
Re: Splitting squid log lines with perl
by BrowserUk (Patriarch) on Sep 16, 2002 at 08:04 UTC

    My best guess would be that there is a UTF multibyte char between the ''s, and that vim is displaying the glyph associated with the first byte (value 128+), while emacs recognises that it has a UTF char but the installed font doesn't have a glyph for the value, so it's displaying a ? instead.

    Try opening the script file with a hex editor (or piping it through a bin-to-hex filter) and see what value you have between the ''s at that point in the code.

    Short odds that it is actually 2 or 3 bytes rather than a single byte char.
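
One way to check this guess without a hex editor is a few lines of Perl. The sketch below is only an illustration and is not taken from calamaris: hex_bytes is a hypothetical helper, and it assumes the mystery character is the micro sign (U+00B5). It prints the byte sequence that character would occupy in a single-byte encoding (ISO-8859-1) and in UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

# Assumption: the character in question is the micro sign, U+00B5.
my $micro = "\x{b5}";

my $latin1 = hex_bytes(encode('ISO-8859-1', $micro));
my $utf8   = hex_bytes(encode('UTF-8',      $micro));

print "ISO-8859-1: $latin1\n";   # a single byte
print "UTF-8:      $utf8\n";     # two bytes
```

To inspect the real script, read the suspicious line from a filehandle opened with binmode and pass it to hex_bytes. A lone b5 between the quotes would point to a single-byte encoding, while c2 b5 would point to UTF-8.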


    Well It's better than the Abottoire, but Yorkshire!
Re: Splitting squid log lines with perl
by blm (Hermit) on Sep 16, 2002 at 02:00 UTC

    At the risk of getting people to -- this node :-( I must post this.

    Contrary to what I previously thought and said in this post, when calamaris does

    @cache = split(u)

    it is for some other purpose, not (as I had thought) to split individual log lines on /\s+/

    Nevertheless this line is there in calamaris and it looks different depending on what editor I use.

Re: Splitting squid log lines with perl
by kabel (Chaplain) on Sep 16, 2002 at 05:16 UTC
    from a similar script:
    my ($timestamp, $ip, $result, $bytes) = (split) [0,2..4];
Re: Splitting squid log lines with perl
by Aristotle (Chancellor) on Sep 16, 2002 at 09:34 UTC

    Put your cursor on the u in vim and type ga (mnemonic: "get ascii") to have the character code displayed in the status line in a number of formats.

    Can you also post a short(!) sample of Squid log lines? What's important is to pay attention to whether any of the fields can have embedded whitespace - in that case you have to do more precise work than just simply splitting.

    Makeshifts last the longest.

      By typing ga while the cursor was positioned over the micro in

      @cache = split 'µ';

      I get  <µ>  <|5>  <M-5>  181,  Hex b5,  Octal 265 at the bottom (in the ruler?). So that makes it a byte of value 0xb5?

      Anyway my squid logs look like this:

      1031902298.709 609 10.0.14.117 TCP_MISS/302 376 GET http://ad.doubleclick.net/ad/max.starwarskids/ros;sz=468x60;num=443509536434963200 fred DIRECT/204.253.104.95 -

      There are one or more spaces between fields (OT: cut -f2 -d' ' doesn't work :-( ). I was using:

      while (<LOG>) { @line_elements = split(' '); ... }
      but it seems to work better with
      @line_elements = split(/\s+/);

      Is this bad? \s is whitespace (tabs as well)? I am actually reading the Friedl book (Mastering Regular Expressions) atm.
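
As a rough sketch of how the byte total could be accumulated, the following combines the field slice suggested earlier in this thread with a running sum. It assumes the fifth field (index 4) is the reply size in bytes, which matches the 376 in the sample line, and that no field contains embedded whitespace. The helper name total_bytes and the second log line are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: sum the size field (index 4) over a list of log lines.
# Assumes no field contains embedded whitespace.
sub total_bytes {
    my $total = 0;
    for my $line (@_) {
        # split ' ' skips leading whitespace and splits on runs of whitespace.
        my ($timestamp, $ip, $result, $bytes) = (split ' ', $line)[0, 2 .. 4];
        next unless defined $bytes && $bytes =~ /^\d+$/;   # skip malformed lines
        $total += $bytes;
    }
    return $total;
}

my @lines = (
    # Sample line from this thread, with the URL unwrapped.
    '1031902298.709 609 10.0.14.117 TCP_MISS/302 376 GET http://ad.doubleclick.net/ad/max.starwarskids/ros;sz=468x60;num=443509536434963200 fred DIRECT/204.253.104.95 -',
    # Made-up line with irregular spacing, for illustration only.
    '1031902299.101    12 10.0.14.118 TCP_HIT/200 1024 GET http://example.com/ - NONE/- text/html',
);

print total_bytes(@lines), "\n";
```

Skipping lines whose size field is not a plain number keeps a truncated or corrupt line from silently adding zero with a warning. Counting how many lines get skipped would be a reasonable next step when two tools disagree on the total.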

        That would be 0xB5, yes. I have no idea how one arrives at using that as a separator though..

        If there are one or more spaces between fields, but none inside fields, then /\s+/ is indeed what you want to use and probably better than ' ' which is a special case. It means almost the same as /\s+/ - with a subtle difference.

        #!/usr/bin/perl -wl
        use strict;
        sub joinprint { print join " ", map qq/"$_"/, @_ }
        $_ = " blah blah";
        joinprint split ' ';
        joinprint split /\s+/;
        __END__
        "blah" "blah"
        "" "blah" "blah"
        The split " " will omit an empty initial field. perldoc -f split carefully points this out. I recommend you write split /$char/ in the future, since that's what really happens to all literal strings other than the single blank. If you don't, you can easily confuse yourself with something like split "." which is the same as split /./ and as such most certainly not what you wanted.

        Makeshifts last the longest.
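
The split "." pitfall mentioned above can be seen in a short sketch. The variable names are illustrative only. Because a string pattern is treated as a regex, the unescaped dot matches every character, so every field is empty and split drops them all.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'a.b.c';

# A string pattern is compiled as a regex, so "." matches every character.
# Every field is empty, and split drops trailing empty fields.
my @unescaped = split ".", $text;

# Escaping the dot makes it a literal separator.
my @escaped = split /\./, $text;

# \Q...\E escapes a separator held in a variable.
my $sep    = '.';
my @quoted = split /\Q$sep\E/, $text;

print scalar(@unescaped), "\n";     # number of fields from split "."
print join('|', @escaped), "\n";
print join('|', @quoted), "\n";
```

Writing the pattern as a regex, with \Q...\E around any separator that comes from a variable, makes the intent visible at the call site.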

Re: Splitting squid log lines with perl
by fsn (Friar) on Sep 16, 2002 at 10:41 UTC
    I have little belief that different environments represent variable names differently, i.e. @line_elements on the command line and @cache in vim. I also don't think emacs will rewrite statements and remove parentheses, so I guess you are just giving examples rather than cutting and pasting, no?

    However, the mystery of the vanishing micro symbol is easy to explain, and has a lot to do with history and old terminals. The micro symbol has a character code above 127, outside 7-bit ASCII. Historically, shells would only present ASCII values from 32 to 127 (0-31 have special meanings, like LineFeed and stuff) because old terminals only used 7 bits when communicating over a serial line, so the 8th bit was stripped anyway. This was also dependent on how you opened the serial device, for example. Now, people like me, who stuff their national characters, like åäö, in high places over 127, didn't like this and made new terminals with support for the full 8 bits.

    But 8 bits are not enough for representing all national characters around the world, so there are mechanisms for telling the shell what character mapping to use, and also which characters to print and not to print. Apparently, the shells you and I are using are configured not to print the micro symbol and therefore swallow it.

    Emacs, being what it is, is rather picky about these issues, much in the same way as the shell itself. So, it too swallows "unprintable" characters, but it seems to replace them with a '?' instead of just leaving them out.

    Vi(m), on the other hand, is much more lenient in these issues: you can even load a binary file, patch the strings in it (as long as you keep the file size and don't go over the old strings' bounds), save it and expect it to run. And, in most fonts, there is a graphic representation for each and every charcode, even the "unprintable" ones. So vi(m) happily sends each and every charcode (at least charcodes over 32) to the terminal, without the filtering of the shell.

    Conclusion: different environments filter or replace some charcodes over 127, due to historical and/or internationalization issues. By changing LOCALE settings and stuff, you could probably force the shell to print the micro symbol as well. But then, you are deep in the localisation tar pits of hell.

    Now, finally, on to your real problem. It's been a while since I looked at squid logs, but I seem to remember that it actually used some strange field separator, possibly the micro symbol. In that case, the split is actually meant to split lines on that symbol. But as I said, I never really hacked the log files myself.
