http://qs321.pair.com?node_id=869648

An anonymous large-business client has, over the years, developed a very substantial reporting application which today consists of many thousands of SAS®, DB2® and Oracle® source files ... all stitched together with an amazing collection of Korn Shell scripts and driven, across a small cluster of servers, by IBM’s Tivoli® Workload Scheduler (TWS™). The application is successful, and therefore growing rapidly: more than 850 database tables today, with dozens more being added each month. (These databases are extremely large, with billions of rows in many of the tables.)

The application is well-designed and it works well. Nevertheless, when an application grows to this size, it becomes difficult for even the highly-skilled programmers who work on it to fully anticipate the consequences of a change. Even the process of reloading the tables with fresh data (which occurs each month) tends to rely heavily upon the institutional knowledge and abilities of a few “key men and women,” especially as new types of information are added. Such a scenario begs for computerized assistance. This “Cool Use for Perl” briefly describes how that is now being done.

The key technology being used here is parsing, done with Parse::RecDescent. In some cases the target grammar is well-known and (fairly) rigid, as with Tivoli schedules and job-files. But in other cases, such as the myriad SAS, Shell, and various SQL source-files, there is little rigid structure, and the process consists of “gleaning” whatever useful portion of a particular file’s structure and content can be known ... capturing as much as possible of the information that is useful, while discarding the rest ... without “falling down” (parse failure) or overlooking something that is useful.
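
To give a flavor of the approach, here is a minimal “gleaning” grammar sketch. It is not the client’s actual grammar: the rule names, the two statement forms it recognizes, and the catch-all unknown rule that skips anything it cannot understand are all illustrative.

    use strict;
    use warnings;
    use Parse::RecDescent;

    # Recognize a couple of SQL-ish statements; silently skip everything else,
    # so a partial parse never turns into a total failure.
    my $grammar = q{
        script     : statement(s)
                     { [ grep { defined } @{ $item[1] } ] }
        statement  : create | select | unknown
        create     : /CREATE\s+TABLE/i table_name
                     { { type => 'create', table => $item{table_name} } }
        select     : /SELECT\b.*?\bFROM\b/is table_name
                     { { type => 'use', table => $item{table_name} } }
        table_name : /[A-Za-z_][\w.]*/
        unknown    : /[^\n]+\n?|\n/   { undef }   # consume one line, glean nothing
    };

    my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";
    my $tree   = $parser->script(
        "CREATE TABLE sales_2010\nwho knows what this line is\nSELECT * FROM sales_2010\n"
    );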

The information is initially captured, for convenience, in SQLite (http://www.sqlite.org) databases; the DBD::SQLite driver for DBI is excellent for this purpose. The software uses randomly generated primary-key values (“monikers”) that can be relied upon to be unique even across systems, so the databases of information gleaned from different systems can be combined with a simple rsync and SQL INSERTs.
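
A sketch of the idea, assuming a hypothetical source_file table and a digest-based moniker generator (the original’s exact key scheme is not shown here):

    use strict;
    use warnings;
    use DBI;
    use Digest::SHA qw(sha1_hex);
    use Sys::Hostname;

    # Hypothetical moniker: a SHA-1 over host, pid, time and a random value is,
    # for practical purposes, unique across systems.
    sub new_moniker {
        return sha1_hex( join '|', hostname(), $$, time(), rand() );
    }

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=gleaned.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS source_file (
            moniker TEXT PRIMARY KEY,   -- random key: safe to merge across systems
            path    TEXT NOT NULL,
            host    TEXT NOT NULL
        )
    });

    $dbh->do( 'INSERT INTO source_file (moniker, path, host) VALUES (?,?,?)',
              undef, new_moniker(), '/some/path/report.sas', hostname() );

Because the keys never collide, rows INSERTed from a database rsync’d off another server drop straight in.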

The direct output of the parsing process is an abstracted parse-tree in the form of a Perl hash/array structure. These trees are encoded using YAML::Old and stored in TEXT fields (which, in SQLite, can be arbitrarily large).
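
For example (the table and column names here are illustrative, not the application’s actual schema):

    use strict;
    use warnings;
    use DBI;
    use YAML::Old qw(Dump Load);

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=gleaned.db', '', '',
                            { RaiseError => 1 } );
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS parse_result (
            moniker   TEXT PRIMARY KEY,
            file_path TEXT NOT NULL,
            tree_yaml TEXT              -- a TEXT field holds the whole tree
        )
    });

    my $tree = { type     => 'sas_program',
                 datasets => [ { name => 'WORK.SALES', role => 'use' } ] };

    $dbh->do( 'INSERT OR REPLACE INTO parse_result VALUES (?,?,?)',
              undef, 'abc123', '/path/to/report.sas', Dump($tree) );

    # Round-trip: the stored YAML is both machine- and human-readable.
    my ($yaml) = $dbh->selectrow_array(
        'SELECT tree_yaml FROM parse_result WHERE moniker = ?', undef, 'abc123' );
    my $restored = Load($yaml);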

(The Storable module was used at first, but it proved to be unreliable on these machines. Furthermore, YAML has the very practical advantage of being readable by humans: you can capture a field as text, cut-and-paste it into a text editor, and understand it yourself.)

The necessity of actually parsing these files, vs. simply grepping them for the presence of useful-looking regular expressions, might not be immediately apparent, but it allows considerably more useful information to be recovered. It lets us know when a particular dataset is merely used (without being changed thereby); when rows and values in it are being defined (that is to say, changed in some way); and when status and metadata updates are being performed, such as CREATE, RENAME, and DROP. Even when an SQL query uses cursors, views, sub-SELECTs, and so forth, much of this relationship information can, in fact, be gleaned.
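
Very roughly, the roles might be summarized like this (a hypothetical mapping; the real application’s classification is richer):

    # Hypothetical role classification per table reference.
    my %role_of = (
        SELECT => 'use',        # dataset read, not changed
        INSERT => 'define',     # rows and values defined or changed
        UPDATE => 'define',
        DELETE => 'define',
        CREATE => 'metadata',   # status / metadata operations
        RENAME => 'metadata',
        DROP   => 'metadata',
    );

    sub role_for { my ($verb) = @_; return $role_of{ uc $verb } // 'unknown' }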

You must, however, resign yourself to the practical notion that not every file can be parsed and not every fragment of information can be gleaned. Both SAS and Shell scripts are programs, which means they may contain macro-expansions and other features that vary at runtime. If you try to capture everything, you are wandering down the primrose, weed-choked path of diminishing returns. Your database must capture not only success (parse trees), but also failure (error messages) and status (done / not done / attempted / failed).
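
A sketch of that bookkeeping, with illustrative table, column, and status names:

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=gleaned.db', '', '',
                            { RaiseError => 1 } );
    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS parse_status (
            moniker   TEXT PRIMARY KEY,
            status    TEXT NOT NULL,     -- 'done', 'attempted', 'failed', 'skipped'
            error_msg TEXT,              -- captured when the parse fails
            run_at    INTEGER NOT NULL   -- epoch seconds of the attempt
        )
    });

    # Record the outcome of each parse attempt so the scan can be re-run
    # and the still-unresolved files can be reported.
    sub record_status {
        my ( $moniker, $status, $error ) = @_;
        $dbh->do( 'INSERT OR REPLACE INTO parse_status
                   (moniker, status, error_msg, run_at) VALUES (?,?,?,?)',
                  undef, $moniker, $status, $error, time() );
    }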

Your processing steps must be re-runnable and re-startable. You need to capture modification-time data (and SHA-1 digests) to recognize when files have and have not been changed. Your code needs to be runnable by different users, since some users have access to the necessary source-files while others do not.
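
Change detection can be as simple as the following sketch (Digest::SHA is used here; the previously recorded values would come out of the SQLite database):

    use strict;
    use warnings;
    use Digest::SHA;

    # Fingerprint a file by modification time and SHA-1 digest.
    sub fingerprint {
        my ($path) = @_;
        my $mtime  = ( stat $path )[9];
        my $sha1   = Digest::SHA->new(1)->addfile($path)->hexdigest;
        return ( $mtime, $sha1 );
    }

    # A file needs re-parsing only if its fingerprint has changed since the
    # values last recorded for it.
    sub is_unchanged {
        my ( $path, $old_mtime, $old_sha1 ) = @_;
        my ( $mtime, $sha1 ) = fingerprint($path);
        return defined $old_sha1 && $sha1 eq $old_sha1 && $mtime == $old_mtime;
    }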

Once files have been “discovered,” and the discovered files parsed, you are left with many megabytes of information that, although it now consists of Perl data structures in YAML format, must still be combed through for the useful bits. This is being done, somewhat less than ideally, using custom classes based on Data::Leaf::Walker. As the walker touches each leaf of the tree, the handler is given an arrayref containing the path taken to get there ... and the code looks upward along that path. The Parse::RecDescent grammars have placed specific markers within the data-structure, which can now be found and used.
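
Stripped of the custom classes, the walk looks roughly like this; the tree shape and the _marker key are hypothetical stand-ins for the markers the grammars actually plant:

    use strict;
    use warnings;
    use Data::Leaf::Walker;

    my $tree = {
        statements => [
            { _marker => 'sql_create', table => 'SALES_2010' },
            { _marker => 'sql_select', table => 'SALES_2010' },
        ],
    };

    # Collect the key paths of the leaves we care about ...
    my $walker = Data::Leaf::Walker->new($tree);
    my @table_paths;
    while ( my ( $key_path, $value ) = $walker->each ) {
        # $key_path is an arrayref such as [ 'statements', 0, 'table' ]
        push @table_paths, [ @$key_path ] if $key_path->[-1] eq 'table';
    }

    # ... then look upward along each path for the marker planted there.
    for my $path (@table_paths) {
        my @parent = @{$path}[ 0 .. $#{$path} - 1 ];
        printf "%-12s %s\n",
            $walker->fetch( [ @parent, '_marker' ] ),
            $walker->fetch( $path );
    }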

Additional steps are needed after the data has been extracted and assimilated, in order to make it more consistent and useful. Then, tools must be built and placed into the hands of the programmers and analysts. It must be emphasized that this is not a “silver bullet.” However, it does provide a source of fairly-reliable information which is an important tool both for information workers and for senior management. If a script can, say, generate a Tivoli job-schedule (to STDOUT) in a few seconds that is even approximately correct, along with the supporting information (to STDERR) that was used to derive it ... quite frankly, that is “a huge win.” If the same tool can answer questions about existing data and script dependencies, even to an accuracy of, say, 75%, once again “that’s huge,” because the situation you are in without any sort of computerized help is “zero percent.”

And the really “cool” aspect of this whole thing is that it didn’t take long for me to do. Only a few months. (Of course, I’ve been developing software for 30 years, and have specific expertise in these areas, so “your mileage may vary.”) Nevertheless: the concept is viable and it actually works. A complete file-scan can be done in about ten minutes; the results database can be rebuilt in about ten minutes more. (Admittedly, these are honkin’ fast AIX® machines.) Although we are still looking “through a glass, darkly,” that glass is becoming clearer and clearer all the time. Perl has done everything that I have asked it to do ... with power, grace, and style.