Untangling Log Files

by loris (Hermit)
on Feb 08, 2007 at 12:01 UTC

loris has asked for the wisdom of the Perl Monks concerning the following question:

Hello Knowledgeable Ones,

I have around 40 logfiles of about 15 MB each. Around 30 processes write willy-nilly into these files, and each line contains text which identifies the process that wrote it. I would like to untangle the log files to produce a single file for each process.

Naively, I could slurp all the logfiles and then just use grep or split to get the process ID and write each line to the appropriate new log file. However, I suspect that I might have memory issues slurping all the data, and in any case I would like to know what a more scalable approach would be.

Any ideas?

Thanks,

loris


"It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)

Replies are listed 'Best First'.
Re: Untangling Log Files
by davorg (Chancellor) on Feb 08, 2007 at 12:14 UTC

    You shouldn't slurp files unless you actually need all of the file in memory at the same time. Your normal approach should be to process the file a record at a time.

    I'd do something like this:

    my %fh; # store handles of new log files

    open OLDLOG, $path_to_old_log or die $!;

    while (<OLDLOG>) {
        my $proc = extract_process_id_from_old_log_record($_);

        unless (exists $fh{$proc}) {
            open $fh{$proc}, '>', "$proc.log" or die $!;
        }

        print { $fh{$proc} } $_;
    }

    Update: Reread the first line. It made no sense, so I fixed it.

      open $fh{$proc}, '>', "$proc.log" or die $!;

      Uhm... I don't care for that. It's very likely that there will be more pids in each file than open descriptors permitted by resource limits... and if you fail to open you die, probably half-done and with no way to pick up where you left off.

      I don't have an immediate fix, though someone else suggested closing one handle randomly, which would, I guess, work. (So long as you changed your open to '>>' and remembered to delete it from your hash.)

      Personally, I'd probably take a less elegant—call it more braindead—approach as this seems to be a one-off thing anyway, and just close the filehandle and open a new one whenever the pid changed from the previous record.
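
      Roughly something like this (untested; it reads the old logfiles via <>, opens with '>>' so earlier chunks for the same pid aren't clobbered, and extract_process_id_from_old_log_record is just a stand-in for whatever parsing identifies the process):

      my ($current, $out) = ('');

      while (my $line = <>) {
          my $pid = extract_process_id_from_old_log_record($line);   # stand-in parser

          if ($pid ne $current) {
              close $out if $out;
              # append, so a pid that comes back later doesn't wipe its earlier lines
              open $out, '>>', "$pid.log" or die "Can't append to $pid.log: $!";
              $current = $pid;
          }

          print $out $line;
      }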

      Update: Well, I just re-read the OP and now I think I may have misinterpreted the bit about "30 processes" the first time around. If there are only 30 pids in the log files, then I like your approach just fine and my criticisms are all moot.

      -sauoq
      "My two cents aren't worth a dime.";
Re: Untangling Log Files
by jettero (Monsignor) on Feb 08, 2007 at 12:11 UTC
    I like this problem.

    Personally, I would keep a hash of filehandles, but close one at random if there are too many open (hoping the pids are kinda grouped together in the logfile).
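
    Something along these lines, perhaps (untested sketch; the cap of 200 handles and extract_process_id_from_old_log_record are just placeholders, and opening with '>>' plus deleting the evicted entry from the hash means a randomly closed pid can be re-opened later without losing anything):

    use strict;
    use warnings;

    my $max_open = 200;   # stay safely below the per-process descriptor limit
    my %fh;               # pid => open filehandle

    while (my $line = <>) {          # old logfiles given on the command line
        my $pid = extract_process_id_from_old_log_record($line);   # stand-in parser

        unless ($fh{$pid}) {
            if (keys %fh >= $max_open) {
                # evict one handle at random; '>>' below lets us re-open it later
                my $victim = (keys %fh)[int rand keys %fh];
                close delete $fh{$victim};
            }
            open $fh{$pid}, '>>', "$pid.log" or die "Can't append to $pid.log: $!";
        }

        print { $fh{$pid} } $line;
    }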

    -Paul

      ++jettero

      FYI, the core module FileCache will automate the close/re-open of filehandles.
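
      Something like this, I believe (untested; FileCache re-opens evicted files in append mode behind the scenes, and extract_process_id_from_old_log_record is again just a stand-in):

      use strict;
      use warnings;
      use FileCache maxopen => 16;   # keep at most ~16 real descriptors open at once

      while (my $line = <>) {        # old logfiles given on the command line
          my $proc = extract_process_id_from_old_log_record($line);   # stand-in parser
          my $fh   = cacheout "$proc.log";   # opens, or transparently re-opens, the file
          print $fh $line;
      }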

      I like the problem, too, and I certainly like your solution. Randomly closing file handles seems like a nice way of making it scalable.

      Thanks,

      loris


      "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)
Re: Untangling Log Files
by Moron (Curate) on Feb 08, 2007 at 12:47 UTC
    If the requirement is continuous, you'll need some kind of daemon (perhaps invoked at system startup) to pick up new appendages to the logfiles shortly after they arrive. Let's also assume that messages have a timestamp, otherwise duplicate events separated only in time would be indistinguishable.

    To allow for reboot of the system, the daemon will need to keep track of the timestamp of the last message it collated for each machine writing messages to the logfiles (in case their clocks are out of synch.)

    There also needs to be a structure of regular expressions that enables identification not just of the originating process but also of the timestamp, which needs to be converted into a delta time for comparison. In a dynamic environment this might best be achieved using a CSV configuration file, e.g.:

    PROCESS,HOST,LOGFILE,FORMAT,$1,$2
    foo,host99,/var/adm/foo.log,\s+\S+\s+(\S+)\s+(\d+\-\d+\-\d+\s\d+:\d+:\d+:\s\w{2}),PROC,TIMESTAMP
    Once all that is sorted out, there still remains the routine work for the daemon: reading in the config file, reading in the timestamp tracker file (one line per host), matching the lines of each logfile (only one filehandle needed!) against the configured regexps while ignoring entries prior to the timestamp recorded for the host, and updating the per-process file and the journal file with the latest timestamp (plus originating host) of each message just transferred to the per-process file.

    It also needs to sleep perhaps five minutes between cycles through all the log files to free system resources for other processes.
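
    A very rough sketch of one such cycle (the file names, the naive CSV handling and to_epoch(), a hypothetical helper that turns a matched timestamp into epoch seconds, are all just placeholders for illustration):

    use strict;
    use warnings;

    # Read the journal of last-collated timestamps: one "host,epoch" line per host.
    my %last_seen;
    if (open my $j, '<', 'last_seen.journal') {
        while (<$j>) { chomp; my ($host, $t) = split /,/; $last_seen{$host} = $t; }
    }

    open my $cfg, '<', 'collate.cfg' or die "Can't read collate.cfg: $!";
    <$cfg>;   # skip the PROCESS,HOST,LOGFILE,FORMAT,$1,$2 header

    while (my $rule = <$cfg>) {
        chomp $rule;
        # naive split; fine as long as the regexp itself contains no commas,
        # and assuming the $1/$2 columns are always PROC,TIMESTAMP
        my ($process, $host, $logfile, $format) = split /,/, $rule, 6;

        open my $in,  '<',  $logfile       or next;   # logfile may not exist yet
        open my $out, '>>', "$process.log" or die "Can't append to $process.log: $!";

        while (my $line = <$in>) {
            next unless $line =~ /$format/;
            my ($proc, $stamp) = ($1, $2);
            next unless $proc eq $process;

            my $when = to_epoch($stamp);                # hypothetical timestamp conversion
            next if $when <= ($last_seen{$host} || 0);  # already collated in a previous cycle

            print $out $line;
            $last_seen{$host} = $when;
        }
    }

    # Rewrite the journal with the newest timestamp seen per host.
    open my $j, '>', 'last_seen.journal' or die "Can't write last_seen.journal: $!";
    print $j "$_,$last_seen{$_}\n" for keys %last_seen;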

    Update: a common practice is also to routinely archive and delete logfiles (yet another logfile management daemon!) so that such reprocessing doesn't have to start from the beginning of a very large logfile, and then have to read but ignore millions of entries occurring before the last recorded timestamp. One system I work with regularly archives logfiles when they hit 5 MB instead of by time or line count. It might be convenient for your requirement if the message-collating daemon could also (per cycle) check the size and conditionally do or invoke that archiving itself.

    -M

    Free your mind

Re: Untangling Log Files
by kwaping (Priest) on Feb 08, 2007 at 17:36 UTC
    Non-perl answer: Personally, I would use the unix grep command to extract the desired lines out of the log files, then redirect that output into another file, possibly with a sort wedged in the middle. This assumes a unix box, though; I'm not sure what your OS is or whether there's an equivalent set of commands for it.

    Something like this:
    grep -h process_identifier *.log | sort -options > process_identifier.newlog

    ---
    It's all fine and dandy until someone has to look at the code.

      Your suggestion is pretty much what I am doing already, but I want to automate the process a bit more so that I can generate all the process-specific log files at once. So I think I shall try something like jettero's solution.

      Thanks anyway,

      loris


      "It took Loris ten minutes to eat a satsuma . . . twenty minutes to get from one end of his branch to the other . . . and an hour to scratch his bottom. But Slow Loris didn't care. He had a secret . . ." (from "Slow Loris" by Alexis Deacon)
        I see. Well, in "quick and dirty" style, you could wrap that system call in a foreach my $process ('this','that','the other') { ... } loop.
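
        For example (the process names and the sort options are placeholders, of course):

        foreach my $process ('this', 'that', 'the other') {
            system("grep -h '$process' *.log | sort > '$process.newlog'") == 0
                or warn "collating $process failed: $?";
        }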

        ---
        It's all fine and dandy until someone has to look at the code.
