mattr has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I would like some advice on loading a daily downloaded file of XML data into a database in a simple but robust manner. FWIW, it is Mergent's Standardized Data Feed, called Mergent Global Company Data, which includes for each company its history, summaries of its quarterly financial statements, lists of officers, text sections, etc., and it's a bit complex. For example, the feed will change over time: it can have different sections indicating different classes of executives it lists, and so on. Text sections are delivered as <![CDATA[ …html here… ]]> too. I can get the full file, or just the new bits, delivered by FTP. (No, this is not RSS.)

I am planning on using this in a Catalyst app which would normally use DBIx::Class, and it will also have some manually entered data for fields not in the feed. Does anyone have experience with this kind of database updated by an XML data feed? I've skimmed some likely-sounding modules on CPAN like DBIx::XML::DataLoader (in beta for 6 years now), DBIx::DBStag and its cookbook, etc.

The last guy who tried to write a schema for the feed quit the job, and I want to be a good lazy JAPH, so now I'm even thinking about some way to just load the XML once on startup and search that. Otherwise, I need to rebuild the database from the whole feed automatically.

Use of the data will involve simple display of a company's data and also printing that to a report: something that can be edited and sent to PDF, like OpenOffice or maybe LyX.

Mainly I want to keep it simple and not require lots of maintenance, so rather than making and maintaining huge DB schemas and creating DBIx::Class relations (a company has many statements, a section has many officers, etc.), I wonder if a simpler answer is possible. Otherwise I could just model the existing data and then merge in new data daily in a form the model understands.

Thanks for your help.

Matt R.

UPDATED: Looks like maybe using XML::Parser to build my table is the answer. Also found DBIx::XML::DataLoader; has anyone used it, and is it being maintained? The docs are a little opaque to me, but not as scary, I think, as XML::RDB.

Replies are listed 'Best First'.
Re: Building a database from XML data feed
by snopal (Pilgrim) on Jan 12, 2008 at 16:23 UTC

    My suggestion? Start simple. I mean as basic as you can. Something like, "always store the original XML in a dedicated dated reference table". Some people might balk at that, but from my frame of reference, you don't want to throw anything out in case you need to restart. It also creates a common DB-centric basis for schema building, even if you have to throw away attempts. Since it is expected that you already have tools capable of producing most of what you want from the XML files, you don't lose any ground here.
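A minimal sketch of what such a dated reference table could look like; the table and column names here are illustrative, not from the Mergent spec:

```sql
-- One row per downloaded feed file, keyed by date, with the
-- original XML kept untouched so any later schema attempt can
-- be rebuilt from scratch.
CREATE TABLE feed_archive (
    id           INTEGER PRIMARY KEY,
    feed_date    DATE      NOT NULL,
    is_full_feed BOOLEAN   NOT NULL,  -- full download vs. incremental delivery
    raw_xml      TEXT      NOT NULL,  -- the original file, verbatim
    loaded_at    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```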

    Follow-up with some basic schema strategies. I, personally, would consider creating tables with the smallest common unique-key pattern while containing the largest field set. Plan on creating new timestamped rows, processed from each XML file and stored separately. That way you can back-trace what changed in each record and when it occurred. No matter what you do, you can't do it wrong because you have retained the original basis and the reports you came in the door with. If some XML field groups or components are expected to change, you can sub-group or pack them into a container field for later processing. My favorite container format is JSON, but YMMV.
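As a sketch of the container-field idea: the variable sub-structures (here, a hypothetical officers list; field names are invented, not from the Mergent spec) can be packed into a JSON string with the core JSON::PP module and stored in a single text column:

```perl
use strict;
use warnings;
use JSON::PP qw(encode_json decode_json);

# Illustrative record; the officers list is the part expected to
# vary between feed versions, so it goes into a container field.
my $company = {
    company_id => 'C1',
    officers   => [
        { name => 'J. Smith', title => 'CEO' },
        { name => 'A. Jones', title => 'CFO' },
    ],
};

# This string is what would be stored in, e.g., an officers_json column.
my $packed = encode_json( $company->{officers} );

# Later, unpack it for display or report generation.
my $officers = decode_json($packed);
print scalar(@$officers), " officers; first is $officers->[0]{name}\n";
```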

    Once you have found a schema building strategy that seems to work, you can extend your basis to this foundation, and continue your work into sub-table building. Certain patterns emerge as you carve out bigger chunks where smaller chunks become apparent. You can then diversify your storage tables for those features.

    At some point, you will have a lot of not-normalized data in your database. That really is not an issue because what you are really targeting is the retention of the original XML file, and the resultant table system from it. If you plan your stairstep approach properly, all the data you need is redeployed into your current stable construction, which should allow you to drop some of those earlier exploratory tables.

    Obviously, in this plan, you need to be fairly flexible on the reporting side. Every stable stairstep will need to have the schema changes taken into account. If you plan this carefully, your work acquiring the stored data should be extremely straightforward. Simple object-oriented setters/getters based upon revision should be all you need to get the DB fields into the report generation system.

    I am not familiar with Mergent Global Company Data, but being XML, there should be a ruleset with which the documents comply. It may be readable by you, or it may be white line noise. Even if it is the latter, it may provide you with some insight into the deviations your data might take.

    Given that the previous developer is gone, I would plan to avoid any previous work that deviates from the plan I have suggested. My plan gives you a basis, a development path (from which you can withdraw and retry), and a methodology that permits the project as a whole to mature, produce intermediate stable results, and eventually produce a completely integrated XML file to DB storage solution.

      Yes I'm throwing out the old work. It requires building a new table for every company every year, is undocumented, ...I'll stop there.

      The spec is understandable. There is the question of building a facility to merge specific partial feeds together manually, or just rebuilding the whole thing daily from a full feed. I'd also want rollback, and maybe a way to lock fields from being updated.

      I've also been pondering a model built around the full XML feed loaded right into memory at server startup, which might make it more robust and configurable. Also, yes, they say the schema will change but not how; I figure the most important parts won't change, but I would like to make it configurable by the admin so I do not have to support it forever. Certainly an update will be issued when an executive of a company is hired or retires, new types of officers could be added, etc.
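A minimal sketch of the load-into-memory-and-search idea, assuming XML::LibXML and an invented toy feed structure (the real Mergent layout will differ): parse the whole file once at startup, then index and query it with XPath instead of a database.

```perl
use strict;
use warnings;
use XML::LibXML;

# Toy stand-in for the feed; element and attribute names are invented.
my $xml = <<'XML';
<companies>
  <company id="C1"><name>Acme Corp</name></company>
  <company id="C2"><name>Globex</name></company>
</companies>
XML

# At server startup: parse once and keep the DOM in memory.
my $doc = XML::LibXML->load_xml( string => $xml );

# Build a simple in-memory index: company id => DOM node.
my %by_id = map { $_->getAttribute('id') => $_ }
            $doc->findnodes('//company');

# Per-request lookups are then hash lookups plus XPath on the node.
my $name = $by_id{C1}->findvalue('name');
print "$name\n";
```

For a real feed you would read the file from disk (or the FTP drop) instead of a string, but the startup-parse-then-index shape stays the same.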

      So yes, I can see a way to model the largest features of the XML structure in DBIx::Class, but I am intrigued by the possibility of greatly minimizing that. Somewhere, though, I'll have to do some degree of linking feed data to manually entered data, or importing them into the same database. It can all just be string data. Maybe JSON or YAML could be useful.

      The data looks like this. Probably thousands of companies; here's just one. I think storing 500 companies is more like what we need to do for now, though.

        Replying to myself here: I just found that I'll have to allow companies to be added manually, not just from the feed, so it seems I will have to use a database after all. It will also hold probably 1,000 companies, and maybe grow to 10,000 over some years. Thanks for your help.