Untangling Log Files

by loris (Hermit)
on Feb 08, 2007 at 12:01 UTC

loris has asked for the wisdom of the Perl Monks concerning the following question:

Hello Knowledgeable Ones,

I have around 40 logfiles of about 15 MB each. Around 30 processes write willy-nilly into these files, and each line contains text which identifies the process that wrote it. I would like to untangle the log files to produce a single file for each process.

Naively, I could slurp all the logfiles and then just use grep or split to get the process ID and write each line to the appropriate new log file. However, I suspect that I might have memory issues slurping all the data, and in any case I would like to know what a more scalable approach would be.

Any ideas?

Thanks,

loris


"It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)

Replies are listed 'Best First'.
Re: Untangling Log Files
by davorg (Chancellor) on Feb 08, 2007 at 12:14 UTC

    You shouldn't slurp files unless you actually need all of the file in memory at the same time. Your normal approach should be to process the file a record at a time.

    I'd do something like this:

    my %fh; # store handles of new log files

    open OLDLOG, $path_to_old_log or die $!;

    while (<OLDLOG>) {
        my $proc = extract_process_id_from_old_log_record($_);

        unless (exists $fh{$proc}) {
            open $fh{$proc}, '>', "$proc.log" or die $!;
        }

        print { $fh{$proc} } $_;
    }

    Update: Reread the first line. It made no sense, so I fixed it.

      open $fh{$proc}, '>', "$proc.log" or die $!;

      Uhm... I don't care for that. It's very likely that there will be more pids in each file than open descriptors permitted by resource limits... and if you fail to open you die, probably half-done and with no way to pick up where you left off.

      I don't have an immediate fix, though someone else suggested closing one handle randomly, which would, I guess, work. (So long as you changed your open to '>>' and remembered to delete it from your hash.)

      Personally, I'd probably take a less elegant—call it more braindead—approach as this seems to be a one-off thing anyway, and just close the filehandle and open a new one whenever the pid changed from the previous record.
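
      Roughly something like this (untested; it reads the old logfiles via <>, opens with '>>' so earlier chunks for the same pid aren't clobbered, and extract_process_id_from_old_log_record is just a stand-in for whatever parsing identifies the process):

      my ($current, $out) = ('');

      while (my $line = <>) {
          my $pid = extract_process_id_from_old_log_record($line);   # stand-in parser

          if ($pid ne $current) {
              close $out if $out;
              # append, so a pid that comes back later doesn't wipe its earlier lines
              open $out, '>>', "$pid.log" or die "Can't append to $pid.log: $!";
              $current = $pid;
          }

          print $out $line;
      }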

      Update: Well, I just re-read the OP and now I think I may have misinterpreted the bit about "30 processes" the first time around. If there are only 30 pids in the log files, then I like your approach just fine and my criticisms are all moot.

      -sauoq
      "My two cents aren't worth a dime.";
Re: Untangling Log Files
by jettero (Monsignor) on Feb 08, 2007 at 12:11 UTC
    I like this problem.

    Personally, I would keep a hash of filehandles, but close one at random if there are too many open (hoping the pids are kinda grouped together in the logfile).
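
    Something along these lines, perhaps (untested sketch; the cap of 200 handles and extract_process_id_from_old_log_record are just placeholders, and opening with '>>' plus deleting the evicted entry from the hash means a randomly closed pid can be re-opened later without losing anything):

    use strict;
    use warnings;

    my $max_open = 200;   # stay safely below the per-process descriptor limit
    my %fh;               # pid => open filehandle

    while (my $line = <>) {          # old logfiles given on the command line
        my $pid = extract_process_id_from_old_log_record($line);   # stand-in parser

        unless ($fh{$pid}) {
            if (keys %fh >= $max_open) {
                # evict one handle at random; '>>' below lets us re-open it later
                my $victim = (keys %fh)[int rand keys %fh];
                close delete $fh{$victim};
            }
            open $fh{$pid}, '>>', "$pid.log" or die "Can't append to $pid.log: $!";
        }

        print { $fh{$pid} } $line;
    }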

    -Paul

      ++jettero

      FYI, the core module FileCache will automate the close/re-open of filehandles.
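
      Something like this, I believe (untested; FileCache re-opens evicted files in append mode behind the scenes, and extract_process_id_from_old_log_record is again just a stand-in):

      use strict;
      use warnings;
      use FileCache maxopen => 16;   # keep at most ~16 real descriptors open at once

      while (my $line = <>) {        # old logfiles given on the command line
          my $proc = extract_process_id_from_old_log_record($line);   # stand-in parser
          my $fh   = cacheout "$proc.log";   # opens, or transparently re-opens, the file
          print $fh $line;
      }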

      I like the problem, too, and I certainly like your solution. Randomly closing file handles seems like a nice way of making it scalable.

      Thanks,

      loris


      "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)
Re: Untangling Log Files
by Moron (Curate) on Feb 08, 2007 at 12:47 UTC
    If the requirement is continuous, you'll need some kind of daemon (perhaps invoked at system startup) to pick up new appendages to the logfiles shortly after they arrive. Let's also assume that messages have a timestamp, otherwise duplicate events separated only in time would be indistinguishable.

    To allow for reboot of the system, the daemon will need to keep track of the timestamp of the last message it collated for each machine writing messages to the logfiles (in case their clocks are out of synch.)

    There also needs to be a structure of regular expressions that enables identification not just of the originating process but also of the timestamp, which needs to be converted into a delta time for comparison. In a dynamic environment this might best be achieved using a CSV configuration file, e.g.:

    PROCESS,HOST,LOGFILE,FORMAT,$1,$2
    foo,host99,/var/adm/foo.log,\s+\S+\s+(\S+)\s+(\d+\-\d+\-\d+\s\d+:\d+:\d+:\s\w{2}),PROC,TIMESTAMP
    Once all that is sorted out, there still remains the routine work for the daemon: reading in the config file, reading in the timestamp tracker file (one line per host), matching the lines of each logfile (only one filehandle needed!) against the configured regexps while ignoring entries prior to the timestamp recorded for the host, and updating the per-process file and the journal file with the latest timestamp (plus originating host) of each message just transferred to the per-process file.

    It also needs to sleep perhaps five minutes between cycles through all the log files to free system resources for other processes.
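
    A very rough sketch of one such cycle (the file names, the naive CSV handling and to_epoch(), a hypothetical helper that turns a matched timestamp into epoch seconds, are all just placeholders for illustration):

    use strict;
    use warnings;

    # Read the journal of last-collated timestamps: one "host,epoch" line per host.
    my %last_seen;
    if (open my $j, '<', 'last_seen.journal') {
        while (<$j>) { chomp; my ($host, $t) = split /,/; $last_seen{$host} = $t; }
    }

    open my $cfg, '<', 'collate.cfg' or die "Can't read collate.cfg: $!";
    <$cfg>;   # skip the PROCESS,HOST,LOGFILE,FORMAT,$1,$2 header

    while (my $rule = <$cfg>) {
        chomp $rule;
        # naive split; fine as long as the regexp itself contains no commas,
        # and assuming the $1/$2 columns are always PROC,TIMESTAMP
        my ($process, $host, $logfile, $format) = split /,/, $rule, 6;

        open my $in,  '<',  $logfile       or next;   # logfile may not exist yet
        open my $out, '>>', "$process.log" or die "Can't append to $process.log: $!";

        while (my $line = <$in>) {
            next unless $line =~ /$format/;
            my ($proc, $stamp) = ($1, $2);
            next unless $proc eq $process;

            my $when = to_epoch($stamp);                # hypothetical timestamp conversion
            next if $when <= ($last_seen{$host} || 0);  # already collated in a previous cycle

            print $out $line;
            $last_seen{$host} = $when;
        }
    }

    # Rewrite the journal with the newest timestamp seen per host.
    open my $j, '>', 'last_seen.journal' or die "Can't write last_seen.journal: $!";
    print $j "$_,$last_seen{$_}\n" for keys %last_seen;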

    Update: a common practice is also to routinely archive and delete logfiles (yet another logfile management daemon!) so that such reprocessing doesn't have to start from the beginning of a very large logfile, and then have to read but ignore millions of entries occurring before the last recorded timestamp. One system I work with regularly archives logfiles when they hit 5 MB instead of by time or line count. It might be convenient for your requirement if the message-collating daemon could also (per cycle) check the size and conditionally do or invoke that archiving itself.

    -M

    Free your mind

Re: Untangling Log Files
by kwaping (Priest) on Feb 08, 2007 at 17:36 UTC
    Non-perl answer: Personally, I would use the unix grep command to extract the desired lines out of the log files, then redirect that output into another file, possibly with a sort wedged in the middle. This assumes a unix box, though; I'm not sure what your OS is or whether there's an equivalent set of commands for it.

    Something like this:
    grep -h process_identifier *.log | sort -options > process_identifier.newlog

    ---
    It's all fine and dandy until someone has to look at the code.

      Your suggestion is pretty much what I am doing already, but I want to automate the process a bit more so that I can generate all the process-specific log files at once. So I think I shall try something like jettero's solution.

      Thanks anyway,

      loris


      "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)
        I see. Well, in "quick and dirty" style, you could wrap that system call in a foreach my $process ('this','that','the other') { ... } loop.
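
        For example (the process names and the sort options are placeholders, of course):

        foreach my $process ('this', 'that', 'the other') {
            system("grep -h '$process' *.log | sort > '$process.newlog'") == 0
                or warn "collating $process failed: $?";
        }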

        ---
        It's all fine and dandy until someone has to look at the code.
