Re: Parse data into large number of output files.

by Rhys (Pilgrim)
on Sep 29, 2004 at 15:48 UTC ( #395025=note )

in reply to Parse data into large number of output files.

I have read all of the suggestions and have the following comments:

  1. RDBMS: I will probably look into this in the near future. Slurping the data out of the files into an RDBMS would solve many problems. I should even be able to convince syslog-ng to stuff the messages into the DB as they arrive and do away with the flat files altogether (or use them merely as a fall-back archive). However, all of that is future development, so set that aside for now.

  2. Read into hash of lists, then print: This was actually my first thought, but nothing controls the size of the input files. If there is a nasty network event, multiple GB of trap data could very easily be written, which would definitely chew up all of my available memory (and it's a network management server, so that's not acceptable). Basically, the maximum input size is unbounded, so I can't trust any approach that holds it all in memory.

    I suppose I could write out chunks of N*1024 messages, though, which would limit the open/close to every (interval) instead of every message. But since the number of senders can still cause huge memory usage (where each sender chews up N*1024-1 messages - which won't trigger a write - multiplied by M senders...), this probably still isn't the best solution for my case.

  3. Just open a zillion files: Again, the number of trap senders can vary wildly, so I have no way to trust that this number won't grow beyond any arbitrary limit I set. The system is Linux, and the last count on the number of senders is 584, so that's well below the current limit, but that can change in a hurry, and I'd rather just write this once. 64K would probably be big enough, but then what are the consequences of having that many FHs open at once?

    Basically, I'm loath to go this route because I don't have any hard controls or expectations for the number of senders, and I don't want the thing to crash the first time the number of senders eclipses the FH limit.

  4. Cache of open files: This is another immediately-viable option. Basically, it puts a hard limit on something that otherwise has none. The dark side of this one is all the code required to maintain the cache itself. Shouldn't be too evil, though, and should chew up significantly less memory than any solution that involves buffering the messages in memory.

  5. Just do it one line at a time: It could be argued that I should just open the file, write a line, close the file, and see what the performance is like. If it doesn't suck, stop worrying about it. I have no argument against this (yet). The first version of the code will probably do this, since the FH caching algorithm can easily be added later, and it'll allow me both to gauge the performance boost and to test the REST of the code independently of this issue.
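
To make options 4 and 5 concrete, here is a minimal sketch of a filehandle cache with least-recently-used eviction, written in Python rather than Perl purely for illustration. The class name `FileHandleCache`, the `max_open` parameter, the one-file-per-sender `.log` naming, and the sample senders are all assumptions of mine, not anything taken from the thread or from a real module. Setting `max_open=1` degenerates to something close to the open/write/close baseline of option 5, so the two can probably be compared with the same code.

```python
import os
import tempfile
from collections import OrderedDict


class FileHandleCache:
    """Keep at most max_open append-mode handles; evict the least recently used."""

    def __init__(self, directory, max_open=256):
        self.directory = directory
        self.max_open = max_open
        self.handles = OrderedDict()  # sender -> open file, least recently used first
        self.opens = 0                # number of open() calls, for measuring churn

    def write(self, sender, line):
        fh = self.handles.get(sender)
        if fh is None:
            if len(self.handles) >= self.max_open:
                # Cache is full: close the handle that has gone longest unused.
                _, oldest = self.handles.popitem(last=False)
                oldest.close()
            # Assumption: the sender string is already safe to use as a file name.
            path = os.path.join(self.directory, sender + ".log")
            fh = open(path, "a")  # append mode, so a reopened file keeps its data
            self.handles[sender] = fh
            self.opens += 1
        else:
            # Cache hit: mark this sender as most recently used.
            self.handles.move_to_end(sender)
        fh.write(line + "\n")

    def close(self):
        for fh in self.handles.values():
            fh.close()
        self.handles.clear()


def demo():
    """Route five made-up messages from three senders through a 2-slot cache."""
    messages = [("10.0.0.1", "linkDown"), ("10.0.0.2", "coldStart"),
                ("10.0.0.1", "linkUp"), ("10.0.0.3", "authFail"),
                ("10.0.0.2", "warmStart")]
    with tempfile.TemporaryDirectory() as tmp:
        cache = FileHandleCache(tmp, max_open=2)
        for sender, text in messages:
            cache.write(sender, text)
        cache.close()
        result = {}
        for name in sorted(os.listdir(tmp)):
            with open(os.path.join(tmp, name)) as fh:
                result[name] = fh.read().splitlines()
        return cache.opens, result
```

The point of the design is that the number of open handles never exceeds `max_open` no matter how many senders show up, while memory use stays at one handle (plus its I/O buffer) per cached sender rather than growing with the message volume.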

I still have to follow some of the links provided (such as the file caching one), so I haven't finished my analysis of this, but the suggestions have all been helpful. I'm trying to code this in a fairly paranoid way, just because I've had to re-write most of the code I didn't write that way, so I'm just trying to save time. :-)
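
In the same paranoid spirit, the worst case for the chunked variant of option 2 can be worked out directly: every sender sits one message short of the flush threshold, so nothing has been written yet. The sketch below (Python again, for illustration only) buffers per sender and flushes at a threshold, and includes a helper for that bound. The names `ChunkedBuffer`, `flush_at`, `sink`, and `worst_case_bytes`, and the average message size used in the note afterwards, are assumptions rather than measurements.

```python
from collections import defaultdict


class ChunkedBuffer:
    """Buffer lines per sender and hand them to sink() once flush_at lines pile up."""

    def __init__(self, flush_at, sink):
        self.flush_at = flush_at          # the N*1024 threshold from option 2
        self.sink = sink                  # callable(sender, list_of_lines)
        self.pending = defaultdict(list)  # sender -> lines not yet written

    def add(self, sender, line):
        bucket = self.pending[sender]
        bucket.append(line)
        if len(bucket) >= self.flush_at:
            # Threshold reached: write this sender's chunk and free its memory.
            self.sink(sender, bucket)
            del self.pending[sender]

    def flush_all(self):
        # Write out whatever is left, e.g. at end of input.
        for sender, bucket in self.pending.items():
            if bucket:
                self.sink(sender, bucket)
        self.pending.clear()


def worst_case_bytes(senders, flush_at, avg_message_bytes):
    """Upper bound on buffered data: every sender one message short of a flush."""
    return senders * (flush_at - 1) * avg_message_bytes
```

With the current 584 senders, a threshold of 1024 messages, and an assumed average of 200 bytes per message, `worst_case_bytes(584, 1024, 200)` comes to 119,486,400 bytes, roughly 119 MB, and it scales linearly with the number of senders, which is exactly the part I can't control.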

Thanks for the help, all. Much appreciated!

