http://qs321.pair.com?node_id=861678

pileofrogs has asked for the wisdom of the Perl Monks concerning the following question:

Ahoy, ye Monks.

I've recently had a few needs to take a text file full of data, usually a server log file, and analyse it in some way. It's pretty easy, if I know what I'm looking for, to write a script to tell me, say, how many times page X was loaded during the month of July. What's less obvious is what to do with the data when I don't know what I'm looking for yet. I'm looking for patterns, but I don't know what they are.

Right now, I've got two theoretically identical DHCP servers, except one of them is getting 1/2 the traffic of the other, which doesn't make sense. I want to analyse my logs and see if I can figure out a pattern. Maybe the one with 1/2 the traffic is getting no requests from computers in a particular subnet? Maybe it's only getting a certain type of request? What time of day has the most requests?

Basically, I'm trying to figure out what form to put my data into in order to ask any question I want.

I'm thinking the best way to handle this is to load all the data into a SQL DB and then run SQL queries at it to ask it the questions I come up with.

So, the question I'm really trying to get to is: what's a good strategy when you know you want to analyze some data, but you don't know specifically what you're going to look for? If I'm right that the first step should involve stuffing the data into a SQL database, are there generic modules to help me do this? Or am I totally missing the boat and there are better ways to handle this? Or maybe I'm trying to be too sophisticated, and the most efficient thing to do is to change the code to ask and answer a different question each time?
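For what it's worth, here's a rough sketch of the kind of first step I'm imagining, using DBI with DBD::SQLite. The regex, table layout and column names are all made up for the example and assume ISC-dhcpd-style syslog lines, so they'd need adjusting to whatever the real logs contain:

    #!/usr/bin/perl
    # Rough sketch: parse dhcpd syslog lines and load them into SQLite so
    # ad-hoc SQL queries can be run against them later. The regex assumes
    # ISC dhcpd messages along the lines of
    #   Sep 24 03:22:01 dhcp1 dhcpd: DHCPREQUEST for 10.1.2.3 from aa:bb:cc:dd:ee:ff via eth0
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:SQLite:dbname=dhcp_logs.db", "", "",
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS events (
            ts      TEXT,   -- raw syslog timestamp
            host    TEXT,   -- which DHCP server logged it
            msgtype TEXT,   -- DHCPDISCOVER, DHCPREQUEST, DHCPACK, ...
            ip      TEXT,
            mac     TEXT,
            line    TEXT    -- the whole line, in case the parse missed something
        )
    });

    my $ins = $dbh->prepare(
        'INSERT INTO events (ts, host, msgtype, ip, mac, line) VALUES (?,?,?,?,?,?)');

    while (my $line = <>) {
        chomp $line;
        next unless $line =~
            /^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+dhcpd(?:\[\d+\])?:\s+(DHCP\w+)\b(.*)/;
        my ($ts, $host, $msgtype, $rest) = ($1, $2, $3, $4);
        my ($ip)  = $rest =~ /\b(\d{1,3}(?:\.\d{1,3}){3})\b/;
        my ($mac) = $rest =~ /\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b/i;
        $ins->execute($ts, $host, $msgtype, $ip, $mac, $line);
    }
    $dbh->commit;

Then something like "perl load_dhcp.pl /var/log/dhcpd.log" (hypothetical path) would build the database, and every new question is just another SELECT.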

I hope that made some sense....

--Pileofrogs

Replies are listed 'Best First'.
Re: Arbitrary Analysis?
by BrowserUk (Patriarch) on Sep 23, 2010 at 22:53 UTC

    Sounds like a really good reason for loading a couple of logs into a SQLDB to me.

    Put them in different, but identically structured tables, and then you can run the same queries against both tables until you zero in on the differences.
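    For instance, a minimal sketch of that, assuming each server's log has already been loaded into its own table (the table and column names here are placeholders, not anything from the original post):

        #!/usr/bin/perl
        # Run the same aggregate query against both (identically structured)
        # tables and print the results side by side for comparison.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect("dbi:SQLite:dbname=dhcp_logs.db", "", "",
                               { RaiseError => 1 });

        my $sql = 'SELECT msgtype, COUNT(*) FROM %s GROUP BY msgtype';

        for my $table (qw(events_dhcp1 events_dhcp2)) {
            print "-- $table --\n";
            my $rows = $dbh->selectall_arrayref(sprintf $sql, $table);
            printf "%-15s %8d\n", @$_ for @$rows;
        }

    Swap in grouping by subnet, hour of day, message type, and so on until the two tables stop looking alike.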


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Arbitrary Analysis?
by Argel (Prior) on Sep 24, 2010 at 00:30 UTC
    You might want to give a centralized logging application like Splunk a try. Or just mine/munge the logs directly, since it's only two servers. That's assuming you have ruled out network or OS-related issues; if not, I would investigate there first, because if, for example, the second server is slower to respond, then more clients will end up being handled by the first server. Compare the routing tables, etc. too. Heck, at half the traffic, just running Wireshark on the two might give you some clues.

    Update: You might also want to check out David Cross's book Data Munging with Perl.

    Elda Taluta; Sarks Sark; Ark Arks

Re: Arbitrary Analysis?
by sundialsvc4 (Abbot) on Sep 24, 2010 at 13:15 UTC

    I find that it is “a very good thing” to capture historical data like this in SQL databases ... and I usually use good ol’ SQLite for this. (It is a great way to produce “a flat disk file” that you can query.)

    However, in your case, I agree that you should look at existing log-analysis programs first. “Do Not Do A Thing Already Done.”

    My “gut instinct” with regard to this particular case is: do you have a load-balancer? Is there anything, hardware-wise, sitting in front of these servers and apportioning the traffic between them? If so, it is very likely that it is dispatching the traffic in an unfair way.

Re: Arbitrary Analysis?
by core_dumped (Acolyte) on Sep 24, 2010 at 15:39 UTC

    Sounds to me like what you need is a Data Warehouse. If you build one, the methodology will suggest the form your data should be in to make queries flexible and easy.

    There are several sites explaining this kind of analysis; reading them will probably take less time than you would lose trying to figure out complex queries on a normal DB. I have long since lost the URLs, but just google 'data warehouse'.
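    As a very rough illustration, a small star schema for this data might look something like the following (all table and column names are invented for the example, using SQLite via DBI):

        #!/usr/bin/perl
        # Minimal star-schema sketch: one fact table of DHCP events plus
        # dimension tables for time and client, so different questions become
        # simple joins instead of new parsing code.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect("dbi:SQLite:dbname=dhcp_dw.db", "", "",
                               { RaiseError => 1 });

        $dbh->do($_) for (
            q{CREATE TABLE IF NOT EXISTS dim_time (
                  time_id INTEGER PRIMARY KEY,
                  day     TEXT,      -- e.g. 2010-09-24
                  hour    INTEGER    -- 0..23
              )},
            q{CREATE TABLE IF NOT EXISTS dim_client (
                  client_id INTEGER PRIMARY KEY,
                  mac       TEXT,
                  subnet    TEXT
              )},
            q{CREATE TABLE IF NOT EXISTS fact_dhcp_event (
                  server    TEXT,    -- dhcp1 or dhcp2
                  msgtype   TEXT,    -- DISCOVER, REQUEST, ACK, ...
                  time_id   INTEGER REFERENCES dim_time(time_id),
                  client_id INTEGER REFERENCES dim_client(client_id)
              )},
        );

        # "Requests per hour per server" is then one join:
        #   SELECT f.server, t.hour, COUNT(*)
        #   FROM fact_dhcp_event f JOIN dim_time t USING (time_id)
        #   GROUP BY f.server, t.hour;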

Re: Arbitrary Analysis?
by TomDLux (Vicar) on Sep 24, 2010 at 19:51 UTC

    Considering only the open source web log analysis software listed by Wikipedia, 2 are in C, 4 are in PHP, and 2 are in Perl. Provide us with the top 5 things wrong with these, and we'll assist you in developing an alternative</snarkiness>.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem. (Entities must not be multiplied beyond necessity.)

Re: Arbitrary Analysis?
by tod222 (Pilgrim) on Sep 26, 2010 at 15:16 UTC
    ...load all the data into a SQL DB and then run SQL queries at it...

    Using what table schema? What will your queries look like?

    Never underestimate the speed and power of simple Perl scripts using regexps to filter and parse raw log files.
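    For example, a quick-and-dirty counter along these lines (assuming ISC-dhcpd-style syslog lines) answers the "which subnets hit which server" kind of question without any database at all:

        #!/usr/bin/perl
        # Count dhcpd message types per /24 subnet straight from the raw logs.
        use strict;
        use warnings;

        my %count;
        while (<>) {
            next unless /dhcpd(?:\[\d+\])?:\s+(DHCP\w+)\b.*?\b(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\b/;
            $count{"$2.0/24"}{$1}++;
        }

        for my $subnet (sort keys %count) {
            for my $type (sort keys %{ $count{$subnet} }) {
                printf "%-18s %-14s %6d\n", $subnet, $type, $count{$subnet}{$type};
            }
        }

    Run it once against each server's log and diff the output.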

Re: Arbitrary Analysis?
by ig (Vicar) on Sep 27, 2010 at 23:58 UTC

    This doesn't really respond to your question, but I am reminded of http://www.crypt.gen.nz/papers/logsurfer.html which can do some analysis that might be difficult with SQL queries: for example, detecting particular sequences of events with time constraints. If nothing else, looking at its capabilities might give you ideas for analyses you could try, and those ideas might better inform your decisions about how to store the data.
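    A toy example of that kind of check in plain Perl, assuming ISC-dhcpd-style lines and an arbitrary 10-second threshold, might look like this:

        #!/usr/bin/perl
        # Flag any DHCPDISCOVER that is not followed by a DHCPOFFER to the
        # same MAC within 10 seconds. Timestamps are reduced to seconds since
        # midnight, so this is only good enough within a single day of logs.
        use strict;
        use warnings;

        my %pending;   # mac => time of its unanswered DISCOVER

        sub to_secs {
            my ($h, $m, $s) = split /:/, shift;   # "03:22:01" -> 12121
            return $h * 3600 + $m * 60 + $s;
        }

        while (<>) {
            next unless /\s(\d\d:\d\d:\d\d)\s.*dhcpd(?:\[\d+\])?:\s+(DHCPDISCOVER|DHCPOFFER)\b.*?\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b/i;
            my ($secs, $type, $mac) = (to_secs($1), uc $2, lc $3);

            if ($type eq 'DHCPDISCOVER') {
                $pending{$mac} = $secs unless exists $pending{$mac};
            }
            elsif (exists $pending{$mac}) {
                printf "%s: OFFER took %d seconds\n", $mac, $secs - $pending{$mac}
                    if $secs - $pending{$mac} > 10;
                delete $pending{$mac};
            }
        }

        print "DISCOVER never answered: $_\n" for sort keys %pending;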

    As for a strategy, I would begin by capturing the data in its native format. Then I would think about what analysis I wanted to perform and look for tools that did that sort of analysis. Then I would put the data into whatever format those tools required.