PerlMonks  

Okay, the node "How can I create an array of filehandles?" was a lot of help, since it shows how I can keep a bunch of files open at once (combined with a hash using filenames as keys, very cool).

Too bad it's not what I need. Let's back up, shall we?

I have a log file. The entries travel from UDP port 161 (SNMP traps) to snmptrapd to syslog-ng and into a file. The file looks roughly like this:

    Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip1>: Trap msga.
    Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip3>: Trap msgg.
    Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip4>: Trap msgd.
    Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip1>: Trap msge.
    Sep 28 19:45:10 logsrvr snmptrapd[<pid>]: <ip2>: Trap msga.

I have seven input files, some gzipped, some not. Since they're log files, I can use (stat $filename)[9] to get the last modified time of each. Sorting the files on that value keeps the log entries in order without having to mess with the timestamps in the log. Matching /\]:\s(\S+):/ against each line gives the IP address of the original trap sender.
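A minimal sketch of that ordering and matching step. The sub names `files_by_mtime` and `trap_sender` and the sample line are made up for illustration; the sketch assumes every file exists (a file that cannot be stat'ed sorts first with an mtime of 0).

```perl
use strict;
use warnings;

# Return the given files ordered oldest-first by last-modified time.
# Each file is stat'ed only once (Schwartzian transform).
sub files_by_mtime {
    my @files = @_;
    return map  { $_->[1] }
           sort { $a->[0] <=> $b->[0] }
           map  { [ (stat $_)[9] // 0, $_ ] } @files;
}

# Extract the original trap sender from one log line, or undef if the
# line does not look like a trap entry.
sub trap_sender {
    my ($line) = @_;
    return $line =~ /\]:\s(\S+):/ ? $1 : undef;
}

# Hypothetical sample line in the format shown above.
my $sample = 'Sep 28 19:45:10 logsrvr snmptrapd[1234]: 10.0.0.1: Trap msga.';
print trap_sender($sample), "\n";    # prints 10.0.0.1
```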

Sounds easy, right? Here's the hard part: For each trap sender, I want to write an HTML file with only the traps for that sender. If there were only a few senders, I could just open the file, write the HTML 'top', add <pre>, then put the filehandle into a hash, and just write to the appropriate filehandle as the lines are parsed.
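For a small number of senders, the hash-of-filehandles idea could be sketched as below. The names `%fh_for`, `handle_for`, `route_line` and `finish_all`, the `<ip>.html` file naming, and the exact HTML 'top' are assumptions for illustration. Lines are written as-is, without HTML escaping.

```perl
use strict;
use warnings;

my %fh_for;    # sender IP => open output filehandle

# Return the output handle for a sender, opening the file and writing the
# HTML 'top' plus <pre> the first time that sender is seen.
sub handle_for {
    my ($dir, $sender) = @_;
    return $fh_for{$sender} //= do {
        open my $fh, '>', "$dir/$sender.html"
            or die "Cannot open $dir/$sender.html: $!";
        print {$fh} "<html><body><h1>Traps from $sender</h1>\n<pre>\n";
        $fh;
    };
}

# Write one parsed log line to the file of the sender it came from.
# Lines without a recognisable sender are skipped.
sub route_line {
    my ($dir, $line) = @_;
    my ($sender) = $line =~ /\]:\s(\S+):/ or return;
    print { handle_for($dir, $sender) } $line;
}

# After all input is parsed: write the closing tags and close every handle.
sub finish_all {
    for my $sender (keys %fh_for) {
        print { $fh_for{$sender} } "</pre>\n</body></html>\n";
        close $fh_for{$sender} or die "Cannot close file for $sender: $!";
    }
    %fh_for = ();
}
```

Every sender seen so far keeps one handle open until `finish_all` runs, which is exactly what stops scaling once the sender count gets large.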

The problem is that there can be hundreds of original senders. Keeping that many filehandles open at once risks running into the per-process limit on open files. The input data is about 100MB, so I'd rather not parse it more than once if I can avoid it (although I wouldn't mind going through it twice if a first pass would generate some useful meta-information).

SO... What's a good way to deal with this? As it is, I may be faced with just opening the correct output file based on the sender IP, perhaps writing the HTML 'top', writing a line, closing it, and on to the next line. All that opening and closing files seems bad somehow, so I'm seeking the wisdom of the Monastery.
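The open, write, close approach might look roughly like this. The names `append_trap` and `finish_html` and the `<ip>.html` file naming are assumptions for illustration; it also assumes the output directory starts out empty, since an existing file is taken to mean the HTML 'top' was already written.

```perl
use strict;
use warnings;

# Append one line to the sender's file, writing the HTML 'top' first if the
# file does not exist yet. The handle is closed again before returning, so
# only one output file is ever open at a time.
sub append_trap {
    my ($dir, $sender, $line) = @_;
    my $path   = "$dir/$sender.html";
    my $is_new = !-e $path;
    open my $fh, '>>', $path or die "Cannot open $path: $!";
    print {$fh} "<html><body>\n<pre>\n" if $is_new;
    print {$fh} $line;
    close $fh or die "Cannot close $path: $!";
}

# Once all input has been parsed, append the closing tags to every file.
sub finish_html {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot read $dir: $!";
    my @names = grep { /\.html\z/ } readdir $dh;
    closedir $dh;
    for my $name (@names) {
        open my $fh, '>>', "$dir/$name" or die "Cannot open $dir/$name: $!";
        print {$fh} "</pre>\n</body></html>\n";
        close $fh or die "Cannot close $dir/$name: $!";
    }
}
```

This trades one open and one close per log line for a guarantee that the filehandle limit is never an issue.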

A second possibility - if they won't be used often - is to pull a list of IPs from the log files and dynamically write CGI scripts as the links instead of HTML files. The CGIs, when accessed, would `zcat logs.gz | grep <ip>`, basically generating the list of traps for a given IP at runtime. Quick to make, slow (and expensive) to use very often.
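As a rough sketch of what such a generated CGI would do at request time, here is the same filtering done inside Perl with the core IO::Uncompress::Gunzip module instead of shelling out to zcat and grep. The sub name `traps_for` and the file name in the usage note are hypothetical.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Return the lines of a gzipped log that came from one sender.
# The whole log is decompressed and scanned on every call, which is why
# this is cheap to set up but expensive to use often.
sub traps_for {
    my ($log, $ip) = @_;
    my $in = IO::Uncompress::Gunzip->new($log)
        or die "Cannot read $log: $GunzipError";
    my @hits;
    while (defined(my $line = $in->getline)) {
        # \Q...\E keeps the dots in the IP literal; the trailing colon
        # stops 10.0.0.1 from also matching 10.0.0.11.
        push @hits, $line if $line =~ /\]:\s\Q$ip\E:/;
    }
    $in->close;
    return @hits;
}
```

A CGI script would then print something like `traps_for('logs.gz', $ip)` between `<pre>` tags. Doing the match in Perl rather than in a shell pipeline also avoids passing a request parameter to the shell.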

So what do you think? Easy way out of this? Should I just risk opening a zillion filehandles? Should I just open them and close them one at a time? Suggestions are welcome.

--J


In reply to Parse data into large number of output files. by Rhys
