http://qs321.pair.com?node_id=662076


in reply to Building a database from XML data feed

My suggestion? Start simple. I mean as basic as you can. Something like, "always store the original XML in a dedicated dated reference table". Some people might balk at that, but from my frame of reference, you don't want to throw anything out in case you need to restart. It also creates a common DB-centric basis for schema building, even if you have to throw away attempts. Since it is expected that you already have tools capable of producing most of what you want from the XML files, you don't lose any ground here.

Follow-up with some basic schema strategies. I, personally, would consider creating tables with the smallest common unique-key pattern while containing the largest field set. Plan on creating new timestamped rows, processed from each XML file and stored separately. That way you can back-trace what changed in each record and when it occurred. No matter what you do, you can't do it wrong because you have retained the original basis and the reports you came in the door with. If some XML field groups or components are expected to change, you can sub-group or pack them into a container field for later processing. My favorite container format is JSON, but YMMV.

Once you have found a schema building strategy that seems to work, you can extend your basis to this foundation, and continue your work into sub-table building. Certain patterns emerge as you carve out bigger chunks where smaller chunks become apparent. You can then diversify your storage tables for those features.

At some point, you will have a lot of not-normalized data in your database. That really is not an issue because what you are really targeting is the retention of the original XML file, and the resultant table system from it. If you plan your stairstep approach properly, all the data you need is redeployed into your current stable construction, which should allow you to drop some of those earlier exploratory tables.

Obviously, in this plan, you need to be fairly flexible on the reporting side. Every stable stairstep will need to have the schema changes taken into account. If you plan this carefully, your work acquiring the stored data should be extremely straighforward. Simple object-oriented setters/getters based upon revision should be all that you need to get the DB fields in the report generation system.

I am not familiar with Mergent Global Company Data, but being XML, there should be a ruleset with which the documents comply. It may be readable by you, or it may be white line noise. Even if it is the latter, it may provide you with some insight into the deviations your data might take.

Given that the previous developer is gone, I would plan to avoid any previous work that deviates from the plan I have suggested. My plan gives you a basis, a development path (from which you can withdraw and retry), and a methodology that permits the project as a whole to mature, produce intermediate stable results, and eventually produce a completely integrated XML file to DB storage solution.

  • Comment on Re: Building a database from XML data feed

Replies are listed 'Best First'.
Re^2: Building a database from XML data feed
by mattr (Curate) on Jan 13, 2008 at 09:22 UTC
    Yes I'm throwing out the old work. It requires building a new table for every company every year, is undocumented, ...I'll stop there.

    The spec is understandable. There is the question of building a facility to merge specific partial feeds together manually, or just rebuild the whole thing daily from a full feed. A rollback and maybe a way to lock fields from being updated.

    I've also been pondering a model built around the full XML feed loaded right into memory at server startup, which might make it more robust and configurable.. Also yes they say the schema will change but not how, I figure the most important parts won't change but would like to make it configurable by the admin so I do not have to support it forever. Certainly an update will be issued when an executive of a company is hired or retires, also new types of officers could be added, etc.

    So yes I can see a way to model the largest features of the XML structure in DBIx but am intrigued by the possibility of not greatly minimizing that. Somewhere though I'll have to do some degree of linking feed data to manually entered data, or importing them into the same database. It can all just be string data. Maybe json and yml could be useful.

    The data looks like this. Probably thousands of companies, here's just one. I think storing 500 companies is more what we need to do for now though.

      Replying to myself here.. I just found that I'll have to allow companies to be added manually, not just from the feed. So I will have to use a database it seems. Also will have probably 1000 companies and maybe grow up to 10,000 over some years. Thanks for your help.