Re: Parse PHP or Perl and Reconstruct MySQL Schema

I apologize in advance for the length of my response, but I've had to do this sort of task a couple of times...

Depending on the coding style of the original programmer, the task could range from straightforward to absolute chaos. If the programmer uses a consistent coding style, e.g., using $sql for SQL text, explicitly naming columns to be retrieved, etc., then it can be a pretty easy job. However, you'll still need to figure out the proper indices for the tables, as hesco mentions, and as samtregar mentions, type information could be difficult to infer.

On the absolute chaos side of things, the coder(s) could use inconsistent styles, making the SQL difficult to locate. They could just use SELECT * FROM TableName, as described by CountZero and UnderMine. Even worse, the programmer(s) could put all sorts of database manipulation in poorly-named stored procedures, and rely on triggers to do things behind your back.

So, before estimating the time it would take to reproduce the database structure, be sure to do a quick review of the code to see how it looks. After doing the review, if you still decide to bid on the job (and win it), here are a couple of tips that may help.

Preparation

Scan the code and pull out all SQL that you can find. For each statement, sketch out the tables, columns, and relationship(s) between them. In the immediately surrounding code, try to infer the meaning, use, type and range of values for each column. Jot down notes on what each column is used for, because you'll be wondering that later! Once you're done, build a new diagram showing all known tables, columns and relationships. (Use a convenient tool, as you'll have to revise this relatively frequently.)

Instrument the code so you can verify coverage of all functions, statements, and control-flow paths in the application.

Instrument the code to log every SQL interaction with the database: Show the SQL sent to the database (to decipher dynamically-generated SQL), and show the results returned by the database.

Write the DDL to create your database, and use it to create an initial database. Also write a DML script to insert any data that you know must exist from reading the code.

Now the grunt work...

Now you've got an iterative process before you. Create your database with the DDL script, populate the data with your DML script, and run your application.

For each run, try to exercise as many code paths as possible, and see what happens. In some runs, be "kind" to the application (i.e., give it "reasonable" data values) to help you get further into the application. On other runs, be "mean", giving the app oddball values and try to flush out any bugs, inconsistencies, etc., to help shed light on the applications interaction with the database.

Document each "interesting" test case that exercises some feature of the database and application that you don't already have a test case for. Also, try tests that may help expose "hidden" functionality in the database (those aforementioned stored procedures and triggers.) For example, after inserting a new record, check all the reports offered by the application to verify that the appropriate data shows in all locations. If not, it may give you a clue as to something new you'll have to add. (Here, interviewing the app's owner can be helpful. For example, they may tell you that the data normally doesn't show up until the next day, indicating a database maintenance task that runs once a day, or they could indicate that it should show up immediately, indicating that a trigger might be needed.)

Each time you learn something new about the database, update your initial diagrams with notes, e.g., "the contents of table.column always seem to be 'Y' or 'N'", and "Whenever table2.columnFoo holds 'M', there should be a record in table3 where the key is table2.columnBar+table2.columnBaz. Then update your DDL to add the new knowledge. As required/desired, add to your DML script to help exercise code paths, generate 'what if' cases to test, etc.

Also, when the application runs correctly, dump your database tables as DML so you can add it to your original DML script to load more data into the database on future runs. (Many features may not be testable without a reasonable quantity of test data.)

Constrain your data

In a relatively short period of time, you should be able to have a pretty good picture of the database. Then you can start looking over the relative frequency of various queries. Analyze this usage pattern to determine what indices to create on which tables. Analyze your table diagrams and see what foreign keys to add to your tables, what constraints make sense, and start tying down your database. After all, once you let crap data get into your database, your application will have to work much harder to produce reasonable results. Whenever possible, use constraints, foreign keys, default values, etc.

Database Maintenance

Finally, once you've done all that, start thinking about how the database will grow, and what sorts of maintenance it will require. Are there tables that will need to be cleaned out? Will some data expire? If database maintenance code already exists, be sure to test it out. If not, write what you think you'll need--and be *certain* to test it out. (You wouldn't believe the amount of database maintenance code that fails when it finally triggers because the authors either don't test it, or don't test it after schema changes.)

Save your work

Finally, put *all* your work in a binder and tuck it away for the next time you need it. (Even if it's not particularly neat, put it in the binder. If you never need it, it won't matter that it's not neat. But if you do need it, you won't mind that it's a bit of a mess. And believe me, you'll need it. About six months from now, a case will come up that you never tested (no matter how hard you try!), and you'll have to fix it. That's when the binder will be useful! Be sure to add the funky breakage to your test cases, and add any required data to your DML script, so when you make changes in the future, you can exercise that new test case. A lot of work? Yeah, it can be. But it's less work than just hacking at it. Especially the second time, when it's time to fix it later or add a new feature. That documentation will nearly always pay for itself.

This is what works for me, YMMV.

--roboticus

Why, yes, I *do* deal with many large databases and crappily-written applications with no documentation. How ever did you guess?

Comment on Re: Parse PHP or Perl and Reconstruct MySQL Schema Select or Download Code

Replies are listed 'Best First'.
Re^2: Parse PHP or Perl and Reconstruct MySQL Schema by dimar (Curate) on Jun 04, 2006 at 17:16 UTC
Your approach seems so thorough that it makes me wonder how you handle the inevitable "Situation Theta" when it happens. (where "Situation Theta" is defined as the set of all 'features' in the code introduced for reasons unknown, even to the original developer(s) at the time of introduction, but left in place anyway, for whatever reason, or no reason at all) Get enough "Thetas" and you may wind up 'rewriting' features that were never really implemented to begin with. I hope you got the credit for those. ;) =oQDlNWYsBHI5JXZ2VGIulGIlJXYgQkUPxEIlhGdgY2bgMXZ5VGIlhGV	[reply]
Re^3: Parse PHP or Perl and Reconstruct MySQL Schema by UnderMine (Friar) on Jun 04, 2006 at 18:58 UTC
I currently have a similar situation in respect to transfering an application from one server to another. The client says X doesn't work and we go back to the old server and lo and behold X never worked. Now your account handlers have to earn their money by explaining X never worked. Even more fun when the client has already paid for X and been told it was working. Transfer of applications from one company to another or one group of developers to another is never fun. UnderMine	[reply]
Re^3: Parse PHP or Perl and Reconstruct MySQL Schema by roboticus (Chancellor) on Jun 05, 2006 at 11:53 UTC
dimar: I've never heard it called "Situation Theta" before, but I like the name. Unfortunately, I don't have a strategy for detecting those. So currently, I just waste my time trying to get the code path(s) executed and corrected, only to be told by the client that "Oh...that feature never worked, and we don't really need it anyway.". <sigh...> If you're a nine-to-fiver, it's a real bummer. If you're a contractor, at least they get to pay for the time! --roboticus	[reply]