bsb has asked for the wisdom of the Perl Monks concerning the following question:

I'll outline the situation, what I'd like to achieve, how I think it could be attacked. I'm writing in the hope that this will help me find code out there that I can use, or at least to learn from people who have experience with related problems.

Existing Framework:

We provide a validation tool for a few simple fixed width data formats, it checks basic structure, domains and inter-record relationships. There's also a HTTP::Server::Simple utility to view the validation results, from a high-level or line by line. These are bundled together with PAR and deployed on Windows (although I develop on Linux). The database used is SQLite with Class::DBI::Loader generated classes and CDBI::Sweet for joining (yes, these were cutting edge at the time and no I don't want to change unless it's going to buy me a lot of time).

The formats are defined in custom YAML data dictionaries that specify the names, labels, types, lengths, domains, record layouts and relationships. The rest of the utilities configure themselves from this metadata.

See for an example ERD-thingamibob

Extension Goals:

Another stage of checking is to be added, which will be run after the existing checks. It will check logical consistency within and between records (not correcting just marking errors for display), checking against pre-supplied historical information, checking aggregates, counts and calculations against thresholds that are either predetermined or are themselves based on the file's data.

These rules will have be of varying levels of complexity and will probably have reflect the ad-hoc knowledge of the current distributed grey matter implementations. I envision the automatic validation nibbling its way up the complexity chain as more rules can be coded within it.

Some example rules:

Iff there Full-time Equivalent data has been provided for Xs there should also be Salary information for Xs

The average Salary (derived from other data) should be within a certain range.

There's plenty more.


I'd like a set of declarative rules, that have a machine understandable format which can be translated to SQL to checks that may find and log invalid records, but may also simply log the violation (eg. if the aggregate information is unusual). The same declarative format (maybe something like an AST) could be used to document the rules as text, images or useful error message (templated, filling in line numbers, field names, values and information links).

The rules should be able to calculate and name derived values that can be shared (and hopefully not recalculated, note that the processing stages are one-way and one at a time).

Better yet, the mini-language declaring the constraints would be intelligible by the current domain experts (non-programmers) and writable by them.

In my dreams...

Semi-Realistic Goal:

A framework that with rules that comprise of a name, documentation, a priority, maybe an error template and the constraint. The rules will probably be written in the highest level of abstraction that can express them and that exists. Until the high levels are written I'll be condemned to SQL, perhaps some rules will remain as SQL snippets due to there complexity, or to avoid performance issues.

The abstraction hierarchy may look like:

 SQL => SQL snippets => AST => Domain Specific Language

For the moment, this is all wishful thinking.

Over the next month I'll work on the lowest levels but would like to have a sketch of the higher-levels to guide me.


I'm looking at Tangram::Expr, QDRDBMS, SQL::Routine and Rosetta for hints There seems to be some alignment between what I'm after and the SQL AST field, but the alignment is probably not perfect.

I see a SQL AST as a good target for my to-be-invented-or-stolen DSL. The SQL AST would effectively define formulas for derived values, criteria for flagging violations, and constraints which need to be checked (different from db constraints).

The rules -> ast -> sql conversion could potentially be done at build time, therefore even Pugs and other languages can be considered.

Other ideas still floating around are Prolog, SQL::Abstract, S-expressions, DSL->SQL done using Perl or non-Perl tool chains, an Object::Declare-like Perl 5 DSL hack, Lua as a SQLite procedural language, Meta-Lua + Lua as the DSL, human-kind lives in peace and harmony forever, etc, etc, etc.

My current plans is write a few examples of varying complexity in SQL, which I'll probably have to do any way, and see what can be refactored. I'll post that when done to solicit further guidance.

Other Considerations:

The Windows deployment with PAR can be tricky to wrangle so I'm wary of big frameworks, platform dependency and C/XS, especially since I use pseudo-forking. I still have flashbacks...

Also, I should make clear that the database isn't being updated or more data inserted, so caching of intermediates is ok.

I hope that someone will be able to point me to useful code or warn me when there be dragons. Advice welcome,