Splitting squid log lines with perl

blm has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Splitting squid log lines with perl by BrowserUk (Patriarch) on Sep 16, 2002 at 08:04 UTC
My best guess would be that there is a UTF multibyte character between the quotes, and that vim is displaying the glyph associated with the first byte (a value of 128 or above), while emacs recognises that it has a UTF character but the installed font doesn't have a glyph for the value, so it's displaying a ? instead. Try opening the script file with a hex editor (or piping it through a bin-to-hex filter) and see what value you have between the quotes at that point in the code. Short odds that it is actually 2 or 3 bytes rather than a single-byte character. Well It's better than the Abottoire, but Yorkshire!
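The byte check suggested above can also be done from Perl itself. This is only a sketch: `hex_bytes` is a hypothetical helper name, and the two sample strings are assumptions standing in for whatever actually sits between the quotes in the script. It shows how a single Latin-1 byte and a two-byte UTF-8 sequence for the micro sign would differ in a hex dump.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hex value of every byte in a string,
# so a multibyte sequence shows up as more than one value.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# Assumed test data: the micro sign as one Latin-1 byte,
# and the same character as a two-byte UTF-8 sequence.
my $latin1 = "\xb5";
my $utf8   = "\xc2\xb5";

print hex_bytes($latin1), "\n";   # b5
print hex_bytes($utf8), "\n";     # c2 b5
```

One value between the quotes would mean a single-byte character; two or three would point to a multibyte encoding.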
Re: Splitting squid log lines with perl by blm (Hermit) on Sep 16, 2002 at 02:00 UTC
At the risk of getting people to -- this node :-( I must post this. Contrary to what I previously thought and said in this post, when calamaris does `@cache = split(u)` it is for some purpose other than the one I assumed, which was to split individual lines on `/\s+/`. Nevertheless this line is there in calamaris, and it looks different depending on what editor I use.
Re: Splitting squid log lines with perl by kabel (Chaplain) on Sep 16, 2002 at 05:16 UTC
From a similar script: `my ($timestamp, $ip, $result, $bytes) = (split) [0,2..4];`
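A minimal sketch of that list-slice idiom, run against the sample log line blm posts elsewhere in this thread (with the forum's line-wrap `+` markers removed). `$line` is an assumed variable name; the original one-liner relies on `$_` instead.

```perl
use strict;
use warnings;

# Sample Squid access.log line as posted in this thread.
my $line = '1031902298.709 609 10.0.14.117 TCP_MISS/302 376 GET '
         . 'http://ad.doubleclick.net/ad/max.starwarskids/ros;sz=468x60;num=443509536434963200 '
         . 'fred DIRECT/204.253.104.95 -';

# Split on whitespace, then keep only fields 0, 2, 3 and 4 with a list slice.
my ($timestamp, $ip, $result, $bytes) = (split ' ', $line)[0, 2 .. 4];

print "$timestamp $ip $result $bytes\n";
# 1031902298.709 10.0.14.117 TCP_MISS/302 376
```

Field 1 (the elapsed time, `609` here) is skipped by the slice; everything after the byte count is simply ignored.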
Re: Splitting squid log lines with perl by Aristotle (Chancellor) on Sep 16, 2002 at 09:34 UTC
Put your cursor on the `u` in vim and type `ga` (mnemonic: "get ascii") to have the character code displayed in the status line in a number of formats. Can you also post a short(!) sample of Squid log lines? What's important is to pay attention to whether any of the fields can have embedded whitespace - in that case you have to do more precise work than just simply splitting. Makeshifts last the longest.
Re: Re: Splitting squid log lines with perl by blm (Hermit) on Sep 16, 2002 at 11:13 UTC
By typing `ga` while the cursor was positioned over the micro in `@cache = split 'µ';` I get `<µ> <|5> <M-5> 181, Hex b5, Octal 265` down the bottom (in the ruler?) So that makes it a byte of value 0xb5? Anyway my squid logs look like this: `1031902298.709 609 10.0.14.117 TCP_MISS/302 376 GET http://ad.doubleclick.net/ad/max.starwarskids/ros;sz=468x60;num=443509536434963200 fred DIRECT/204.253.104.95 -` There are one or more spaces between fields (OT: `cut -f2 -d' '` doesn't work :-( ). I was using: `while (<LOG>) { @line_elements = split(' '); ... }` but it seems to work better with `@line_elements = split(/\s+/);` Is this bad? `\s` is whitespace (tabs as well)? I am actually reading the Friedl book (Mastering Regular Expressions) at the moment.
Re^3: Splitting squid log lines with perl by Aristotle (Chancellor) on Sep 16, 2002 at 11:37 UTC
That would be `0xB5`, yes. I have no idea how one arrives at using that as a separator though. If there are one or more spaces between fields, but none inside fields, then `/\s+/` is indeed what you want to use and probably better than `' '`, which is a special case. It means almost the same as `/\s+/` - with a subtle difference. `#!/usr/bin/perl -wl use strict; sub joinprint { print join " ", map qq/"$_"/, @_ } $_ = " blah blah"; joinprint split ' '; joinprint split /\s+/; __END__ "blah" "blah" "" "blah" "blah"` The `split " "` will omit an empty initial field. `perldoc -f split` carefully points this out. I recommend you write `split /$char/` in the future, since that's what really happens to all literal strings other than the single blank. If you don't, you can easily confuse yourself with something like `split "."` which is the same as `split /./` and as such most certainly not what you wanted. Makeshifts last the longest.
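Both points in the post above can be checked by counting fields. This is just a sketch; the variable names and the `"a.b.c"` test string are assumptions made up for illustration.

```perl
use strict;
use warnings;

my $str = " blah blah";

my @special = split ' ', $str;     # special case: leading whitespace is skipped
my @regex   = split /\s+/, $str;   # plain regex: an empty leading field is kept

print scalar(@special), "\n";   # 2
print scalar(@regex), "\n";     # 3

# Any string pattern other than a single space is treated as a regex,
# so "." matches every character. Only empty fields are produced, and
# split drops trailing empty fields, leaving nothing at all.
my @dot     = split ".", "a.b.c";
my @literal = split /\./, "a.b.c";

print scalar(@dot), "\n";       # 0
print scalar(@literal), "\n";   # 3
```

For log lines that never start with whitespace, the two whitespace forms give the same fields; the difference only shows up with leading blanks.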
Re: Splitting squid log lines with perl by fsn (Friar) on Sep 16, 2002 at 10:41 UTC
I have little belief that different environments represent variable names differently, i.e. `@line_elements` on the command line and `@cache` in vim. I also don't think emacs will rewrite statements and remove parentheses, so I guess you are just giving examples rather than cutting and pasting, no?
However, the mystery of the vanishing micro symbol is easy to explain, and has a lot to do with history and old terminals. The micro symbol has a character code above 127, so it lies outside 7-bit ASCII. Historically, shells would only present ASCII values from 32 to 126 (0-31 and 127 are control codes, like LineFeed and stuff) because old terminals only used 7 bits when communicating over a serial line, so the 8th bit was stripped anyway. This was also dependent on how you opened the serial device, for example. Now, people like me, who stuff their national characters, like åäö, in high places over 127, didn't like this and made new terminals with support for the full 8 bits. But 8 bits are not enough to represent all national characters around the world, so there are mechanisms for telling the shell what character mapping to use, and also which characters to print and not to print. Apparently, the shells you and I are using are configured not to print the micro symbol and therefore swallow it.
Emacs, being what it is, is rather picky about these issues, much in the same way as the shell itself. So it too swallows "unprintable" characters, but it seems to replace them with a '?' instead of just leaving them out. Vi(m), on the other hand, is much more lenient in these issues: you can even load a binary file, patch the strings in it (as long as you keep the file size and don't go over the old strings' bounds), save it and expect it to run. And, in most fonts, there is a graphic representation for each and every charcode, even the "unprintable" ones. So vi(m) happily sends each and every charcode (at least charcodes over 32) to the terminal, without the filtering of the shell.
Conclusion: different environments filter or replace some charcodes over 127, due to historical and/or internationalization issues. By changing locale settings and stuff, you could probably force the shell to print the micro symbol too. But then you are deep in the localisation tar pits of hell.
Now, finally, on to your real problem. It's been a while since I looked at squid logs, but I seem to remember that it actually used some strange field separator, possibly the micro symbol. In that case, the split is actually meant to split lines on that symbol. But as I said, I never really hacked the log files myself.

