PerlMonks  

Caching process sets

by billyak (Friar)
on Feb 19, 2003 at 19:45 UTC ( [id://236784]=perlquestion )

billyak has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a log parser. (See my other question, Compiling Regular Expressions.) Each log line has the standard timestamp followed by the event. I've found that if I separate the timestamps from each line, approx 55% of the event portions of my sample log are repeats. The obvious solution would be to cache the results of the event parsing so as to avoid repeatedly parsing the same thing.
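A figure like that 55% can be checked with a few lines of Perl. The sketch below is only an illustration: the timestamp layout (two whitespace-separated fields before the event), the sample lines, and the name repeat_ratio are all assumptions, since the post does not show the real log format.

```perl
use strict;
use warnings;

# Hypothetical log format: "YYYY-MM-DD HH:MM:SS <event text>".
# The real timestamp layout is not shown in the post, so adjust the regex.
my @sample = (
    "2003-02-19 19:45:01 a access b via c (d) <e> <f>",
    "2003-02-19 19:45:02 a access b via c (d) <e> <f>",
    "2003-02-19 19:45:03 x access y via z (d) <e> <f>",
    "2003-02-19 19:45:04 a access b via c (d) <e> <f>",
);

# Return the fraction of lines whose event portion was seen before.
sub repeat_ratio {
    my @lines = @_;
    my %seen;
    my ($total, $repeats) = (0, 0);
    for my $line (@lines) {
        # Strip the timestamp; skip lines that do not fit the assumed format.
        my ($event) = $line =~ /^\S+ \S+ (.*)$/ or next;
        $total++;
        $repeats++ if $seen{$event}++;
    }
    return $total ? $repeats / $total : 0;
}

printf "%.0f%% of events are repeats\n", 100 * repeat_ratio(@sample);
```

The %seen hash used here has the same keys that a result cache would have, so its size also gives a rough idea of how much memory caching would need.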

In the case of each event, I do some simple math to a variety of data types. Example:

In the event of
$event = "a access b via c (d) <e> <f>";
then

$access{a}{b}++;
push(@list,[$e,$f]); # if possible
$count{b}++;
$accessed{b}++;
etc;

My question is, how would I cache such a set of simple operations and have them called should $event already have been parsed once? My only ideas include eval and from what I understand from reading and using it, eval is not the most speedy of options.

Any insight would be appreciated. Thanks,

-billyak

Replies are listed 'Best First'.
Re: Caching process sets
by Thelonius (Priest) on Feb 19, 2003 at 20:18 UTC
    I rather doubt it's worth the effort. I've written Perl programs to parse files of 500,000+ lines and they run in 20 seconds or so. Are you actually experiencing long run times?

    If you can easily sort the file on the non-date fields, then identical items will be adjacent, so you won't have to use a large amount of memory for a hash cache. But I still question whether it's necessary.
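A rough sketch of the sorted approach, assuming the timestamps are already stripped and the actions do not depend on the original order (plain counters, for instance). parse_event is a hypothetical stand-in for the real parser, and it counts its own calls so the saving is visible.

```perl
use strict;
use warnings;

# Hypothetical parser standing in for the expensive step.
my $parse_calls = 0;
sub parse_event {
    my ($event) = @_;
    $parse_calls++;
    my ($who, $what) = $event =~ /^(\S+) access (\S+)/ or return;
    return { who => $who, what => $what };
}

# Events with the timestamp already removed.
my @unsorted = (
    "a access b via c",
    "x access y via z",
    "a access b via c",
    "a access b via c",
);
# After sorting, equal events sit next to each other.
my @events = sort @unsorted;

my %access;
my ($prev_event, $prev_parsed);
for my $event (@events) {
    # Re-parse only when the event differs from the one just before it.
    unless (defined $prev_event && $event eq $prev_event) {
        $prev_parsed = parse_event($event);
        $prev_event  = $event;
    }
    next unless $prev_parsed;
    $access{ $prev_parsed->{who} }{ $prev_parsed->{what} }++;
}

print "parsed $parse_calls of ", scalar(@events), " events\n";
```

Only one previous result is kept in memory at a time, at the cost of the sort and of losing the original line order.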

Re: Caching process sets
by dragonchild (Archbishop) on Feb 19, 2003 at 20:33 UTC
    It all depends on what you're doing. An obvious solution would be to use a hash (or HoHoHo..oH) that would keep track of what you've already worked on. You would parse the line, check the cache, and do the actions only if the line wasn't in the cache.

    As the other poster said, this is only useful if your actions per line are very expensive. "some simple math" doesn't sound like it would be expensive enough. However, 55% does sound like a potentially significant savings.

    Without Benchmarking, it's impossible to know for certain, but the parsing is often the most expensive part of working with logfiles, not the actions one takes at each point.
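One way to settle it is the core Benchmark module. The sketch below compares re-parsing every event against looking up previously captured fields in a hash. The event shape, the regex, and the names run_reparse and run_cached are assumptions made for illustration, and the synthetic data is built so that half of the events are repeats; real numbers depend on the real regexes and data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic events (timestamps already removed): 500 distinct events,
# each appearing twice. The shape only loosely follows the post.
my @events = map {
    my $n = $_ % 500;
    "u$n access f$n via c (d) <e> <f>"
} 1 .. 1000;

my $re = qr/^(\S+) access (\S+) via (\S+) \((\S+)\) <(\S+)> <(\S+)>$/;

# Parse every event from scratch.
sub run_reparse {
    my %access;
    for my $event (@events) {
        my ($who, $what) = $event =~ $re or next;
        $access{$who}{$what}++;
    }
    return \%access;
}

# Parse each distinct event once and reuse the captured fields.
sub run_cached {
    my (%access, %cache);
    for my $event (@events) {
        my $fields = $cache{$event} ||= [ $event =~ $re ];
        next unless @$fields;    # empty list means the event did not match
        $access{ $fields->[0] }{ $fields->[1] }++;
    }
    return \%access;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { reparse => \&run_reparse, cached => \&run_cached });
```

With a regex this simple the hash lookup may not win at all, which is the point of measuring before building a cache.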

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Sorry, I guess I was not clear. I want to cache the set of actions. If line "aaa moo didley" appears thirty times, I want to establish a set of actions based on the first parsing, and follow through with this set of actions for each subsequent occurrence of this line. The idea is to avoid the extra parsing by first looking up the event in a hash to see if there has already been a set of actions determined for it.

      -billyak
        You want closures. How you want closures ... that's going to be based on what you're doing. If you want more help, you're going to need to give a few examples of data and the actions you'd want to take on them. Then, one of us might be able to point you in the right direction.
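For the single example event given in the question, a closure-based cache might look roughly like this. It is a sketch under assumptions: the regex is guessed from the "a access b via c (d) <e> <f>" sample, and build_action, handle_event and %action_for are names invented here.

```perl
use strict;
use warnings;

my (%access, %count, %accessed, @list);
my %action_for;    # event text => closure that applies its actions

# Parse one event and return a closure that repeats its actions.
# The closure captures the parsed fields, so no parsing happens when it runs.
sub build_action {
    my ($event) = @_;
    my ($who, $what, $via, $paren, $e, $f) =
        $event =~ /^(\S+) access (\S+) via (\S+) \((\S+)\) <(\S+)> <(\S+)>$/
        or return sub { };    # unknown event: cache a no-op
    return sub {
        $access{$who}{$what}++;
        push @list, [ $e, $f ];
        $count{$what}++;
        $accessed{$what}++;
    };
}

# Look the event up, build its action on first sight, then run it.
sub handle_event {
    my ($event) = @_;
    my $action = $action_for{$event} ||= build_action($event);
    $action->();
}

handle_event("a access b via c (d) <e> <f>") for 1 .. 3;
print "a->b accessed $access{a}{b} times\n";
```

The regex runs once per distinct event; every later occurrence costs one hash lookup plus the sub call.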


Re: Caching process sets
by demerphq (Chancellor) on Feb 19, 2003 at 20:39 UTC
      A specific event line does more than what a return is made to do. I would need a separate subroutine for each of my cached lines. Hence, the eval I mentioned in the original post.

      -billyak
        It's still not entirely clear to me what you're doing, but it sounds like what you want are (anonymous) sub references.

        This code would check if an action corresponding to $key is known, determine the action if it's not known, and in either case execute the action:

        ($actions{$key} ||= determine_action($key))->($key, $data);
        The sub determine_action would have to find out what the correct action is, and return a sub reference to it. If you don't want to make explicit subs, use anonymous ones:
        sub determine_action {
            my ($key) = @_;
            if (it's first action) {
                return sub { my ($key, $data) = @_; do stuff; };
            }
            elsif (it's second action) {
                return sub { my ($key, $data) = @_; do stuff; };
            }
            ... etc...
        }

        Instead of just evaling the code, wrap the code in an anonymous sub, thus capturing it so you can reuse it. So we have a routine called parse_to_actions that builds a bunch of lines of perl statements that need to be executed. Then we do this:

        my %code_cache;
        while (<>) {
            my $code = $code_cache{$_};
            unless ($code) {
                my @actions = parse_to_actions($_);
                $code = eval "sub { @actions }"
                    or die "$@ while evaling actions @actions ";
                $code_cache{$_} = $code;
            }
            $code->();
        }
        Similar to what xmath posted, but building the subs dynamically.

        However, when you consider that we can define parse_to_actions to return a sub, then we could write:

        use Memoize;

        sub parse_to_actions {
            my $sub = eval "sub { @lines_of_code }" or die $@;
            return $sub;
        }
        memoize("parse_to_actions");

        while (<>) {
            # Parse and generate. Memoize caches.
            parse_to_actions($_)->($_);   # pass the line to the generated sub
                                          # just in case it gets smart
        }
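One caveat when caching on the input line: if the timestamp is still attached, nearly every line is unique and the cache rarely hits. Below is a minimal runnable sketch of the Memoize idea that keys on the event portion only. It swaps the string eval for a plain closure, and it assumes a two-field timestamp and an "access" event shape, neither of which is confirmed by the thread. $builds is there only to show how often a sub is actually generated.

```perl
use strict;
use warnings;
use Memoize;

my %access;
my $builds = 0;    # how many times a sub was actually generated

# Build an action sub for one event (timestamp already removed).
sub parse_to_actions {
    my ($event) = @_;
    $builds++;
    my ($who, $what) = $event =~ /^(\S+) access (\S+)/
        or return sub { };    # unknown event: no-op
    return sub { $access{$who}{$what}++ };
}
memoize('parse_to_actions');

# Hypothetical format: two timestamp fields, then the event.
my @lines = (
    "2003-02-19 20:39:01 a access b via c (d) <e> <f>",
    "2003-02-19 20:39:02 a access b via c (d) <e> <f>",
);

for my $line (@lines) {
    # Strip the timestamp so that repeated events share one cache entry.
    my ($event) = $line =~ /^\S+ \S+ (.*)$/ or next;
    parse_to_actions($event)->();
}

print "built $builds sub(s) for ", scalar(@lines), " lines\n";
```

Memoize keeps separate caches for scalar and list context, so it helps to call the memoized function the same way everywhere.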

        ---
        demerphq


Node Type: perlquestion [id://236784]
Approved by Enlil