http://qs321.pair.com?node_id=509072

fizbin has asked for the wisdom of the Perl Monks concerning the following question:

I'm seeking design advice both general and specific for the following system. If you really want to say "Oh, Lord, no!" to the whole idea, I suppose that's okay too, but I'd like to know the reasons.

Here's the deal: my employer is in the business of shipping financial data (e.g. stock prices) from point A to point B and doing some mangling along the way. We have dozens of different data products that clients subscribe to. Now, this mangling/data processing system produces log reports. Sometimes, it produces just two or three a day (depending on the data product in question), sometimes it produces hundreds of logs per day.

Here's the thing - Data Integrity needs to go through these log reports. Now, DI people are nice and relatively bright about the data - and have tons of domain-specific knowledge - but most of them find ssh'ing to a production system and running "less" on the log reports a strange and alien idea. Therefore it has been decreed that there shall be built a web-accessible system for DI to use to look at logs.

An aspect of this system - the web-based log access mechanism - is what I'm seeking design advice for.

Some further details: this new system is not (at least initially) for the whole of our product line, but only for a few products that we're just now starting up on a new architecture.

We already have a module that summarizes each job's log into a report for DI that is usually about 10-20 lines long, but occasionally can be over 100 lines (depending on how much stuff went wrong during processing). This module is fairly fast, but still takes as long as a second or two on some of the larger log files. Also, occasionally DI will need to be able to view the raw log file, such as when they need to email a problem to the operations staff or the developers. So the report is good, but not always enough, depending on what went wrong.

Here's basically what I'm thinking:

  • We'll have two cgi scripts on each of the machines doing these data processing jobs:
    • One will take a product name and produce a listing of job names - something like what an ls -lt on the logs directory would produce, but htmlized, with links to the other script.
    • The other will take a product name, a job name, and a "type" parameter and, depending on the type, will display the log report, display the raw log, or initiate a download of the raw log. (A rough sketch of what this second script might look like appears just after this list.)
  • We'll have one central web page that sets up a frameset with a list of data products in a column at the left; each of those data products, when selected, will change the right-hand side of the frameset to the current list of logs for that product.
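
To make that second script concrete, here's roughly the shape I have in mind - just a sketch, with made-up paths and a made-up interface to our existing summarizer module (Our::LogSummary here is a placeholder name):

    #!/usr/bin/perl
    # viewlog.cgi - sketch only; $log_root and Our::LogSummary are placeholders
    use strict;
    use warnings;
    use CGI qw(param header);

    my $log_root = '/var/logs/dataproducts';    # wherever the logs really live

    my $product = param('product');
    my $job     = param('job');
    my $type    = param('type') || 'report';

    # refuse anything that could wander outside the logs directory
    for ($product, $job) {
        die "bad parameter\n" unless defined $_ && /^[\w.-]+$/;
    }
    my $logfile = "$log_root/$product/$job.log";

    if ($type eq 'download') {
        print header(-type => 'application/octet-stream', -attachment => "$job.log");
        open my $fh, '<', $logfile or die "can't open $logfile: $!";
        print while <$fh>;
    }
    elsif ($type eq 'raw') {
        print header(-type => 'text/html'), '<pre>';
        open my $fh, '<', $logfile or die "can't open $logfile: $!";
        print CGI::escapeHTML($_) while <$fh>;
        print '</pre>';
    }
    else {
        # 'report': run the existing summarizer (hypothetical interface)
        require Our::LogSummary;
        my $report = Our::LogSummary::summarize($logfile);
        print header(-type => 'text/html'),
              '<h1>', CGI::escapeHTML("$product / $job"), '</h1><pre>',
              CGI::escapeHTML($report), '</pre>';
    }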

So far, that would describe a system something like the DI setup for other existing products, except for the on-the-fly report generation. (Traditionally, reports are batch-generated in the middle of the night.) One thing I'm thinking of adding, though, is an RSS or Atom syndication feed of new reports for each of the products - I've found an RSS reader a wonderful tool for wasting time, and it occurs to me that it might be possible to harness it for good too.

So this gets tricky - what do I put in the syndication feed? Just the job name? The log report? How do I keep from overloading the system with requests to regenerate the feed - some kind of cache directory? What about how far back the feeds should go? And what about format? RSS vs. Atom? I'm inclined more towards Atom, because of the multiple timestamps (and Atom just seems saner to me).
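
As for what each item might contain, my rough thought is: job name as the title, a link back to the report CGI, and the first few lines of the report as the description. Here's a sketch of that idea (XML::RSS purely for concreteness - the same shape would carry over to Atom; the URLs, paths, and summarize_log() are all made up):

    # sketch of building one feed item per job (hypothetical names throughout)
    use strict;
    use warnings;
    use XML::RSS;
    use POSIX qw(strftime);

    my $product = 'someproduct';                       # from a CGI param in practice
    my $log_dir = "/var/logs/dataproducts/$product";

    # stand-in for our existing report module
    sub summarize_log { return "summary of $_[0]"; }

    # newest jobs first, by log mtime
    my @recent_jobs = map { m{([^/]+)\.log$} }
                      sort { -M $a <=> -M $b } glob "$log_dir/*.log";

    my $rss = XML::RSS->new(version => '2.0');
    $rss->channel(
        title       => "DI reports: $product",
        link        => "http://di-host/cgi-bin/loglist.cgi?product=$product",
        description => "Recent log reports for $product",
    );

    for my $job (@recent_jobs) {
        my $logfile = "$log_dir/$job.log";
        my $report  = summarize_log($logfile);
        $rss->add_item(
            title       => $job,
            link        => "http://di-host/cgi-bin/viewlog.cgi?product=$product;job=$job;type=report",
            # first few lines of the report so DI can triage from the reader
            description => join("\n", grep { defined } (split /\n/, $report)[0 .. 4]),
            # item date from the log's mtime (assuming the installed XML::RSS
            # passes pubDate through for 2.0 output)
            pubDate     => strftime("%a, %d %b %Y %H:%M:%S GMT", gmtime((stat $logfile)[9])),
        );
    }
    print $rss->as_string;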

Most of the data products in this new setup will generate fewer than five logs per day, at reasonably fixed time intervals. One will generate hundreds, throughout the day, at unpredictable intervals.

--
@/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

Replies are listed 'Best First'.
Re: Seeking advice on generating a syndication feed
by hakkr (Chaplain) on Nov 16, 2005 at 16:23 UTC

    Hi,
    I have a wrapper module for XML::RSS which should serve as an example; it's based on the docs for XML::RSS. You could replace the SQL with another data source.

    package News::Feed;

    # Wrapper for XML::RSS
    use strict;
    use warnings;
    use XML::RSS;
    use Date::Manip;
    use LWP::Simple qw(get);
    use Encode;

    sub new {
        my $class = shift;
        my $att   = shift;
        my $self  = {};
        bless $self, $class;
        $self->_init($att);
        return $self;
    }

    sub _init {
        my $self = shift;
        my $att  = shift;
        # hardcode 2.0 rss for now
        $self->{rss}  = XML::RSS->new(version => '2.0');
        $self->{_dbh} = $att->{_dbh};
    }

    sub getHeading {
        # todo
    }

    # For example, print out the titles and links of each RSS item
    sub getItems {
        my $self = shift;
        foreach my $item (@{ $self->{rss}{'items'} }) {
            print "title: $item->{'title'}\n";
            print "link: $item->{'link'}\n\n";
        }
    }

    sub createFeedXml {
        my $self    = shift;
        my $details = shift;
        # Channel information is required in RSS. The title cannot be more than
        # 40 characters, the link 500, and the description 500 when outputting
        # RSS 0.9. title, link and description are required for RSS 1.0;
        # language is required for RSS 0.91.
        # $self->feedParams();   # todo: validate input
        $self->{rss}->channel(
            title       => $details->{title},
            link        => $details->{link},
            description => $details->{description},
            dc => {
                date          => $details->{date},
                lastBuildDate => $details->{date},
                subject       => $details->{subject},
                creator       => $details->{creator},
                publisher     => $details->{publisher},
                rights        => $details->{rights},
                language      => 'en-us',
            },
            syn => {
                updatePeriod    => "$details->{updatePeriod}",
                updateFrequency => "1",
                updateBase      => "1901-01-01T00:00+00:00",
            },
            taxo => [
                'http://dmoz.org/Computers/Internet',
                'http://dmoz.org/Computers/PC',
            ],
        );
    }

    sub addItem {
        my $self    = shift;
        my $details = shift;
        $self->{rss}->add_item(
            title       => "$details->{title}",
            link        => "$details->{link}",
            description => "$details->{description}",
        );
    }

    # set the image for the feed
    sub getImage {
        my $self    = shift;
        my $details = shift;
        $self->{rss}->image(
            title       => $details->{title},
            url         => $details->{imgurl},
            link        => $details->{link},
            width       => $details->{width},
            height      => $details->{height},
            description => $details->{description},
        );
    }

    # Parse raw xml into the XML::RSS object
    sub parse_sourcexml {
        my $self      = shift;
        my $sourcexml = shift;
        $self->{rss}->parse($sourcexml);
        return $self->{rss};
    }

    # Get the source xml with LWP
    sub getFeed {
        my $self       = shift;
        my $sourcelink = shift;
        $self->{sourcedata} = get($sourcelink);
        $self->{parsed}     = $self->parse_sourcexml($self->{sourcedata});
        return $self->{parsed};
    }

    sub writeFeedToDisk {
        my $self     = shift;
        my $filename = shift;
        $self->{rss}->save($filename);
    }

    # feeds that will be generated from our db
    sub getOutputFeeds {
        my $self = shift;
        my $sql  = "SELECT * FROM news_output_feeds WHERE status='active'";
        my $sth  = $self->{_dbh}->prepare($sql);
        $sth->execute();
        return $sth->fetchall_arrayref({});
    }

    # feeds we pull in from elsewhere
    sub getInputFeeds {
        my $self = shift;
        my $sql  = "SELECT * FROM news_feeds WHERE status='active'";
        my $sth  = $self->{_dbh}->prepare($sql);
        $sth->execute();
        return $sth->fetchall_arrayref({});
    }

    sub generateAllOutFeeds {
        my $self     = shift;
        my $outfeeds = $self->getOutputFeeds();
        foreach my $feeddata (@$outfeeds) {
            $self->generateOutFeed($feeddata);
            # clear out the rss obj before the next feed
            $self->{rss} = XML::RSS->new(version => '2.0');
        }
    }

    # strip/replace the usual Windows-1252 "smart" punctuation
    sub sanitise {
        my $string = shift;
        $string =~ tr/\x91\x92\x93\x94\x96\x97\x19/''""\-\-/;
        $string =~ s/\x85/.../sg;
        $string =~ s/\x13//sg;
        $string =~ tr/\x80-\x9F//d;
        return $string;
    }

    sub generateOutFeed {
        my $self     = shift;
        my $feeddata = shift;

        # run the sql, get the data from the db
        my $sth = $self->{_dbh}->prepare($feeddata->{source_sql_query});
        $sth->execute();
        my $data = $sth->fetchall_arrayref({});

        $feeddata->{$_} = encode('utf8', decode('ascii', $feeddata->{$_})) for keys %$feeddata;
        $feeddata->{$_} = sanitise($feeddata->{$_})                        for keys %$feeddata;

        my $date = UnixDate('today', '%a, %d %b %Y %H:%M:%S %Z');

        # create channel details
        my $channeldetails = {
            title       => $feeddata->{title},
            link        => $feeddata->{link},
            description => $feeddata->{description},
            date        => $date,
            subject     => $feeddata->{subject},
            creator     => $feeddata->{creator},
            publisher   => $feeddata->{publisher},
            rights      => $feeddata->{rights},
        };
        $self->createFeedXml($channeldetails);

        # add data to feed
        foreach my $row (@$data) {
            $row->{$_} = sanitise($row->{$_})                        for keys %$row;
            $row->{$_} = encode('utf8', decode('ascii', $row->{$_})) for keys %$row;
            $self->addItem($row);
        }
        $self->writeFeedToDisk($feeddata->{filename});
    }

    1;

      Am I correct in assuming then that you don't generate the XML feed on the fly, but rather save it off in a separate file?

      That might be the solution to my performance worries, but I'll have to think about how to trigger an update of the RSS feed when appropriate...

      --
      @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

        Yes, it's created every day by a cron job and written to disk with the writeFeedToDisk() sub. You could change it so it only writes the file when the source data changes, perhaps with a database trigger.
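
        If your source is plain log files rather than a database, a cheap stand-in for a trigger is to have the cron job compare timestamps and skip the rewrite when nothing has changed. A rough sketch (the paths are made up, and $feed/$feeddata are a wrapper object and config row like the ones above):

            # cheap alternative to a database trigger when the source is log files:
            # only rewrite the feed when some log is newer than the feed file
            use File::stat;

            my $feed_file = '/var/www/feeds/product.xml';            # made-up paths
            my ($newest)  = sort { $b <=> $a }
                            map  { stat($_)->mtime } glob '/var/logs/product/*.log';
            my $feed_age  = -e $feed_file ? stat($feed_file)->mtime : 0;

            if ($newest && $newest > $feed_age) {
                $feed->generateOutFeed($feeddata);    # or whatever rebuild call you use
            }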

        I had the same worries about aggregators and news readers hitting my feed too often for it to be created dynamically.

Re: Seeking advice on generating a syndication feed
by ptum (Priest) on Nov 16, 2005 at 16:30 UTC
    Why not create a simple relational database and redundantly log to that database whenever you log to a file? Then you can simply query your database from the webserver and get a customizable and real-time view of log activity without needing to re-parse your log files every time for every user. This approach would provide the ability to easily perform longer-term analysis of the log material as well as redundant backup of log information -- it also allows you to distribute the webserver and database on different platforms from your applications. Of course I know nothing about RSS feeds and this doesn't really answer your question, but it is perhaps an alternative you haven't considered. :)
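
    Something quite small would do it - one table, an insert wherever you currently write a log line, and a query on the web side. This is only illustrative (made-up table, columns, and DSN; the SQL is Postgres-flavoured):

        # Illustrative only: a minimal log table plus both sides of it.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:Pg:dbname=dilogs', 'di', 'secret',
                               { RaiseError => 1, AutoCommit => 1 });

        # one-time setup (made-up schema):
        # CREATE TABLE log_lines (
        #     product   varchar(64),
        #     job       varchar(128),
        #     logged_at timestamp,
        #     severity  varchar(16),
        #     message   text
        # );

        # writer side: call this wherever you currently write a log line
        sub log_line {
            my ($product, $job, $severity, $message) = @_;
            $dbh->do(
                'INSERT INTO log_lines (product, job, logged_at, severity, message)
                 VALUES (?, ?, now(), ?, ?)',
                undef, $product, $job, $severity, $message,
            );
        }

        # reader side: what the webserver would run for a real-time view
        my $rows = $dbh->selectall_arrayref(
            "SELECT job, logged_at, severity, message
               FROM log_lines
              WHERE product = ? AND logged_at > now() - interval '1 day'
              ORDER BY logged_at DESC",
            { Slice => {} }, 'some_product',
        );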

      In the past, we have found that adding an additional relational database to a system has had a negative impact on the reliability and maintainability of the system as a whole. I really don't want to add yet another thing for operations to have to monitor.

      Although we are investigating database-based logging for other purposes, I also don't understand why it's better to turn data from a database query into html than it is to turn plain text sitting in files on the file system into html. In both cases, I query a module, get back some text, and turn it into html, except that in the case of using a relational database there's more setup work.

      Am I missing something?

      --
      @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/

        I'm not real fond of duplication of data, either. Ideally, you'd log directly to the database, as you mention you're investigating. The advantages are incredibly numerous:

        • You leverage (warning! Buzzword!) huge amounts of expertise that are likely already in your corporation on concepts such as reliability, security, scalability. That is, your DBAs can schedule backups and configure failover nodes, and even design for clustered servers. Most of this should be transparent to both the programs doing the logging and the programs that consume the logs (web app, whatever).
        • YOU don't need to worry about reliability, etc. Your RDBMS vendor has done that for you, and your DBA has been trained in how to do these things. Without the RDBMS, you now have to concern yourself with details like concurrency (writing to and reading from the same logfile) and transactional integrity (same thing - but imagine that the write is only partly finished when you try to read that record - the RDBMS is supposed to prevent that from happening). Or even power failures - if you're in the middle of a write when the power goes down, you end up with damaged data. An RDBMS is supposed to be able either to recover the damaged data or to remove it (lost data - but you lose the whole transaction or none of it).
        • You aren't nearly as stuck on a single technology (e.g., perl and CGI). You could, for example, give your DI folks a java application using JDBC that could do different fancy things. This is a great thing if they get really finicky - you can swap out front ends without worrying about the back end since the back end is completely standard. It's always nice to have choice in your tools - it allows you to select the best tool for the job. (If they're all on Windows, you could even use Visual Basic with ODBC, if that's what you have more skill in. Again with the choice of tools thing.)
        Imagine, for a minute, that this system becomes business critical. That means, no unscheduled outages are acceptable. Are you prepared to go business critical with it? Phone calls at 3AM? Given your description of this service, I could see that this type of service could be not only business critical, but a form of revenue. You don't want to be at the end of the "we're losing money without this working!" train. Being able to point fingers at the DBAs who point at the DB vendor, that's much more comforting. ;-)

        To a guy with a hammer, everything looks like a nail -- I am guilty of that sometimes in applying databases to problems. :) I guess I would ask a few questions about your data, positive answers to which might lead me to prefer a database over plain text files:
        • Is there a lot of data?
        • Is the data scattered across multiple systems/platforms?
        • Does the data lend itself to aggregation or categorization?
        • Is the data more transient than I would prefer?
        • Are the rules to parse the data complex?
        • Is the server on which the data resides used for other mission-critical operations?
        • ... and so on
        If the answers to these questions are all 'no', then there may be no benefit to adding yet another database. But if any of the answers are 'yes', then you may see considerable benefit in terms of performance, reliability, maintainability, etc. Just a few thoughts. I've done this a couple of times and I have never (yet) said to myself, "Dang, I wish I had just done this with files." Maybe that just demonstrates that I am stubborn. :)
Re: Seeking advice on generating a syndication feed
by Moron (Curate) on Nov 16, 2005 at 16:59 UTC
    So why not just configure the webserver to alias an accessible, user-friendly URL to the path of each relevant logs directory, so that no coding is required at all?

    -M

    Free your mind