
Saving and Loading of Variables

by madbombX (Hermit)
on Jul 18, 2006 at 03:31 UTC

madbombX has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I am running a script that parses a logfile and perpetually tails it, so I can continually accrue data and maintain information over an extended period of time. However, since it is a logfile, the data can get VERY large, and the data structures inside Perl end up holding a lot of information.

Just for general information, the data structures are a small hash, an extremely large hash (with thousands of values), and a fairly large array of arrays.

The problem is that I don't have a way of saving state (or reloading state if the script terminates or is terminated), so I ask: what is the most efficient way of accomplishing this? I am not a fan of simply using Data::Dumper to drop everything out to a text file every X minutes and then reading it back in on script load. Suggestions are welcomed and appreciated. Thanks.

Eric

Update: I think I need to clarify my objective here with specifics. I thoroughly appreciate all the answers thus far, but here is a better idea of what I am doing and hopefully that will aid the answers.

I am parsing the maillog, which contains postfix, spamassassin, and amavisd information. I am using the spam hit scores to create a graph with GD::Graph (that is where the LARGE array of arrays comes in). I am then using the SPAM tests that each message fails to build a hash of which tests fail more frequently than others and which tests fail in combination with each other (this builds an extremely large and slightly complex hash: it has all the SPAM tests as hash keys at a minimum, and then hashes of hashes of combinations).

I am doing alright with visualizing the structure, although there may be better ways to accomplish what I am doing (if there are, I am all ears). I am just not sure how I would store these structures in a DB. I use DBI with DBD::mysql a lot and I am familiar with it, but I am having trouble conceptualizing how I would stick this in a MySQL DB. It just doesn't seem like it would apply here.

Replies are listed 'Best First'.
Re: Saving and Loading of Variables
by graff (Chancellor) on Jul 18, 2006 at 03:52 UTC
    It might work to make your hashes into dbm files (though for the large one, you might encounter intermittent slow-downs, if the particular flavor of dbm file you use has to rewrite its index table as the hash grows). It appears that DB_File (the Berkeley DB) supports not only hash structures but also a storage method that would work well for an array ("$DB_RECNO"), but you'd probably have to "serialize" each sub-array into some sort of single scalar value in order to store it into the DB file.
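    A minimal sketch of what that tying might look like, assuming Berkeley DB via DB_File (the file names and the tab-separated serialization are just illustrations):

        use DB_File;
        use Fcntl qw(O_CREAT O_RDWR);

        # Tie the large hash to an on-disk Berkeley DB hash file.
        tie my %tests, 'DB_File', 'tests.db', O_CREAT | O_RDWR, 0644, $DB_HASH
            or die "Cannot tie tests.db: $!";

        # Tie the array to a record-oriented DB_RECNO file; each sub-array
        # has to be flattened to a single scalar before it can be stored.
        tie my @graph_rows, 'DB_File', 'rows.db', O_CREAT | O_RDWR, 0644, $DB_RECNO
            or die "Cannot tie rows.db: $!";

        my @scores = ( 17.675, 5.4 );            # a hypothetical sub-array
        $graph_rows[0] = join "\t", @scores;     # serialize into one scalar
        my @row = split /\t/, $graph_rows[0];    # ...and recover it later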

    That will keep all your derived structural data on disk as the process runs and grows. Then all you need is a check-pointing strategy that stores the current byte offset into the input log file at regular intervals. On restarting after a shutdown, you should be able to open your DB files, seek to the last known offset in the log file, read and process, and check for matching values in the DB files; skip log records until you find novel data. (Or something to that effect.)
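    For the byte-offset bookkeeping, the built-in tell and seek are enough. A sketch, with the checkpoint file name made up for illustration:

        open my $log, '<', '/var/log/maillog' or die "Cannot open log: $!";

        # On restart, resume where the previous run left off.
        if ( open my $ckpt, '<', 'offset.ckpt' ) {
            chomp( my $offset = <$ckpt> );
            seek $log, $offset, 0;
            close $ckpt;
        }

        my $count = 0;
        while ( my $line = <$log> ) {
            # ... parse the line ...

            next if ++$count % 1000;    # checkpoint every 1000 lines
            open my $out, '>', 'offset.ckpt' or die "Cannot checkpoint: $!";
            print $out tell($log), "\n";
            close $out;
        }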

    (update: Oh yeah, and you should actually consider using a real database to keep track of this derived structural stuff -- it'll be much safer, more stable and accountable, easier and quicker to search and fetch back old information, and so on. With the right table schema, there will be a lot less coding to do, and the code you do write will be a lot more powerful.)

Re: Saving and Loading of Variables
by bobf (Monsignor) on Jul 18, 2006 at 03:41 UTC

    I can't quite tell if you're trying to store the parsed data structure or simply persist and analyze the data efficiently. If the former, Storable may do the trick. If the latter, a database might be preferable - a CSV file and DBD::CSV might be all it takes.
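    If the CSV route fits, a minimal DBD::CSV sketch might look like this (the table and columns are invented for illustration):

        use DBI;

        my $dbh = DBI->connect( 'dbi:CSV:', undef, undef,
            { f_dir => '.', RaiseError => 1 } );

        $dbh->do('CREATE TABLE tests (name CHAR(64), total INTEGER, value REAL)');

        my $sth = $dbh->prepare('INSERT INTO tests VALUES (?, ?, ?)');
        $sth->execute( 'BAYES_99', 540, 3.5 );

        # Query it like any other DBI handle; the table is a plain CSV file.
        my $rows = $dbh->selectall_arrayref('SELECT name, total FROM tests');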

    HTH

Re: Saving and Loading of Variables
by planetscape (Chancellor) on Jul 18, 2006 at 05:52 UTC
Re: Saving and Loading of Variables
by Sidhekin (Priest) on Jul 18, 2006 at 03:35 UTC

    Have you considered Storable?
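    A minimal save/load sketch with Storable, assuming the three structures from the question (the file name and top-level layout are made up):

        use Storable qw(nstore retrieve);

        my ( %small_hash, %tests, @graph_data );    # the structures in question

        # ... populate them from the log ...

        # Save: bundle everything under one top-level reference.
        nstore( { small => \%small_hash, tests => \%tests, graph => \@graph_data },
            'state.sto' );

        # Load on the next run.
        my $state   = retrieve('state.sto');
        %small_hash = %{ $state->{small} };
        %tests      = %{ $state->{tests} };
        @graph_data = @{ $state->{graph} };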

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

Re: Saving and Loading of Variables
by HuckinFappy (Pilgrim) on Jul 18, 2006 at 04:49 UTC
    Since you're already getting suggestions to head down the database route, I'll chime in my agreement.

    I've always avoided databases, because I didn't want to be tied to a machine/server/etc. But recently I've written code using DBD::SQLite and Class::DBI. The combination of those two allows me to just write Perl code and not worry about SQL statements, and since SQLite is simply a file, I don't need to worry about keeping a server up and running.

    For me, they provided a nice entry into the world of databases.
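    For anyone following the same path, a minimal DBD::SQLite sketch (the schema is invented for illustration):

        use DBI;

        # SQLite needs no server; the whole database lives in one file.
        my $dbh = DBI->connect( 'dbi:SQLite:dbname=spamstats.db', '', '',
            { RaiseError => 1, AutoCommit => 1 } );

        $dbh->do(q{
            CREATE TABLE IF NOT EXISTS tests (
                name  TEXT PRIMARY KEY,
                total INTEGER,
                value REAL
            )
        });

        my $sth = $dbh->prepare(
            'INSERT OR REPLACE INTO tests (name, total, value) VALUES (?, ?, ?)');
        $sth->execute( 'BAYES_99', 540, 3.5 );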

      I've been contemplating redesigning a particular system we use here that is an amalgam of flat files and custom access routines. Guess what I've been thinking about replacing it with? :-) SQLite++ DBI++

Re: Saving and Loading of Variables
by zigdon (Deacon) on Jul 18, 2006 at 14:50 UTC
    FreezeThaw, Storable, and Data::Dumper could all be used here - just write the serialized data to a file when the program ends, and read it back when it starts up:
    use FreezeThaw qw/freeze thaw/;

    my $dataStruct;
    if ( open( my $in, '<', 'frozen.data' ) ) {
        local $/;                   # slurp mode: frozen data may span lines
        my $ice = <$in>;
        ($dataStruct) = thaw($ice);
        close $in;
    }

    # .... do some stuff with $dataStruct

    open( my $out, '>', 'frozen.data' ) or die "Failed to write: $!";
    print $out freeze($dataStruct);
    close $out;

    -- zigdon

Re: Saving and Loading of Variables
by idsfa (Vicar) on Jul 18, 2006 at 16:00 UTC

    I'm not sure I understand your phrase "hashes of hashes of combinations". If the order of testing matters, a hash would not retain that information; if it does not, this structure seems to generate redundant information.

    It seems to me that you have a list of one or more rules, which a message either passes or does not:

    +---------------+-----------+
    | Combination   | Failures  |
    +---------------+-----------+

    Where Combination is a (string-y) list of one or more rules which, when combined, cause a message to fail. A corresponding Perl structure is:

    %data{"@combination"} = $failure_rate;

    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
      When a message fails, I get the following line in my log file:

      Jul 18 00:36:32 mail amavis[26338]: (26338-01) SPAM, <beachmiro@autoxray.com> -> <postmaster@example.com>, Yes, score=17.675 tag=2 tag2=5.4 kill=13.5 tests=[BAYES_99=3.5, HTML_50_60=0.134, HTML_MESSAGE=0.001, SPF_HELO_SOFTFAIL=2.432, SPF_SOFTFAIL=1.384, URIBL_JP_SURBL=4.087, URIBL_SBL=1.639, URIBL_SC_SURBL=4.498], autolearn=no, quarantine A5QG0LjkvtcT (spam-quarantine)

      I pull out the tests, which start at [ and end at ]. The first thing I want to see is how many times each test was failed (meaning that URIBL_SBL was failed once here, so increment that count, and so on). BAYES_XX (99 in this case) is tagged on EVERY message that gets tagged (BAYES_XX is a special tag in SpamAssassin). So my other combinations will be {BAYES_XX + URIBL_SBL}++ and the same for all other test failures.
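      A sketch of how that extraction might look (the regex and @log_lines are assumptions based on the sample line above):

          my %tests;
          for my $line (@log_lines) {    # @log_lines: assumed, read from the maillog
              next unless $line =~ /tests=\[([^\]]+)\]/;
              for my $pair ( split /,\s*/, $1 ) {
                  my ( $name, $score ) = split /=/, $pair;
                  $tests{$name}{"Total"}++;           # how often this test fired
                  $tests{$name}{"Value"} = $score;    # its score for this message
              }
          }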

      Eventually I will be moving on to other combinations of tests that I see fail very frequently. This means that if, say, URIBL_SBL failed on 3 out of every 5 messages marked as SPAM, I would use it in place of BAYES_XX for a while to test that. Now part of my data structure would look like this:

      Note: In case you've already read this message, I changed the data structure to look like what is below:

      %tests{"BAYES_99"}{"Total"} = 540; %tests{"BAYES_99"}{"Value"} = 3.5; %tests{"URIBL_SBL"}{"Total"} = 24; %tests{"URIBL_SBL"}{"Value"} = 1.639; %tests{"SPF_HELO_SOFTFAIL"}{"Total"} = 3; %tests{"SPF_HELO_SOFTFAIL"}{"Value"} = 2.439; %tests{"BAYES_99+URIBL_SBL"}{"Total"} = 18; %tests{"BAYES_99+URIBL_SBL"}{"Value"} = 5.139; %tests{"URIBL_SBL+SPF_HELO_SOFTFAIL"}{"Total"} = 1; %tests{"URIBL_SBL+SPF_HELO_SOFTFAIL"}{"Value"} = 4.078;
      It may be that my concept of a proper data structure for managing this information is wrong. But I am not sure whether, if this were serialized with something other than Data::Dumper, it would still be functionally correct when reloaded.
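      A quick round-trip sanity check with a serializer such as Storable would answer that; a sketch, using the %tests structure above:

          use Storable qw(freeze thaw);

          my $copy = thaw( freeze( \%tests ) );    # serialize, then restore

          # The restored structure is functionally identical to the original:
          print $copy->{"BAYES_99"}{"Total"};      # 540, as in %tests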

        me elides a long post which was made totally wrong by your clarification of the data

        The updated data is much more helpful, but still leaves some significant questions. For instance, the order in the hash keys of your combined tests does not seem to be related to the order in which they appear in the log file (compare "BAYES_99+URIBL_SBL" v. "URIBL_SBL+SPF_HELO_SOFTFAIL"). The "Totals" seem to imply that they are gathered over multiple runs, while "Value" is obviously the sum of the conditions matched by the current message only. And to be fair, this is the first time you mentioned wanting to record the combined score. I think your requirements need better definition: exactly what are you trying to measure?

        There is no way to guarantee that all of your processed data will be saved (think power failure). You need to decide what an acceptable level of data loss is. You will need to write out the current state to a checkpoint file (whether through a database, Storable, Data::Dumper or whatever) at least once before you exit. The more often you checkpoint, the less data you risk losing. The truly paranoid will note that you need some sort of transactional locking to protect against interruption in mid-update.

        You could install a signal handler to catch most of the things that could kill your program and have it checkpoint your current status. It won't work for non-maskable signals (or power cuts), but might help with your stated aversion.
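        A sketch of such a handler; save_state() here is a stand-in for whatever checkpoint routine you use, not a real function from any module:

            # Checkpoint on the common catchable signals. SIGKILL and power
            # loss can never be caught, so this only narrows the window.
            for my $sig (qw(INT TERM HUP QUIT)) {
                $SIG{$sig} = sub { save_state(); exit 0 };
            }

            END { save_state() }    # also runs on normal exit and on die

            sub save_state {
                # ... write the current structures to disk (Storable, DB, etc.) ...
            }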

        For the task you describe, you might want to consider just rotating your logs nightly (or hourly, depending upon your volume) and processing them "offline" rather than tailing the live log file. It avoids most of these concerns.


