Swallowing an elephant in 10 easy steps

Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. ... In the book it said: "Boa constrictors swallow their prey whole, without chewing it. After that they are not able to move, and they sleep through the six months that they need for digestion." I pondered deeply, then, over the adventures of the jungle. And after some work with a colored pencil I succeeded in making my first drawing. My Drawing Number One. It looked something like this:
xxxxxxxxx xxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
I showed my masterpiece to the grown-ups, and asked them whether the drawing frightened them. But they answered: "Frighten? Why should any one be frightened by a hat?" My drawing was not a picture of a hat. It was a picture of a boa constrictor digesting an elephant. But since the grown-ups were not able to understand it,...
The Little Prince, Chapter 1 by Antoine de Saint-Exupéry

I must be the snake the little prince was thinking of because I seem to be in the habit of swallowing elephants. I often find myself needing to learn a complex system inside and out very quickly. Sometimes, it is because I have a new client. Especially in my early days I would often get invited to lead the design on a half-finished project after a series of the big guys (Accenture, etc.) had successfully botched the project. Until then, companies don't really look beyond the obvious and safe sources of consultants. More recently, I have either needed to evaluate 3rd party software systems for in-house use or even our own code for refurbishment. And then there are volunteer projects.

Over the course of time I've evolved a strategy for working through complex systems rather quickly. I won't say it isn't a lot of work (it is), but it keeps me from going in endless circles, so that at least the work moves me forward.

I'd be interested in knowing more about how other monks go about learning a new system quickly. Not everyone learns the same way. Also if you see any non sequiturs or an obvious omission, I'd appreciate the feedback. Writing something like this up feels a bit like trying to explain how to tie a shoe. It is all too easy to take a crucial step for granted and leave it out. I also have an ulterior motive. pmdevils are also in the position of trying to swallow elephants when they join.

The ten steps can be summarized as follows (they are explained in more depth below):

Gather together whatever documentation there is
Experiment with the front end
Study permanent data streams and persistent data
Explore the class/type systems
Understand where the code lives
Scan and categorize the codebase
Study back end CRUD operations
Study front end CRUD operations
Study permissions/security infrastructure
Explore a well defined functional area to see how the system works in a specific case.

This is, of course, an iterative process and it often involves a lot of backtracking as well. Often the first pass through a step results in more questions than answers. Then I go on to another step, and find an answer to something I couldn't figure out in the previous step. If the answer is important enough, I may go back and redo all or part of the previous step.

Although the first step is "gather together documentation", the process described below will work even if documentation is sketchy and you have to manually go through code and database schemas. Documentation makes the process much easier (if it is correct), but it isn't necessary to the learning process.

Figuring out how much detail to do at any one step of the process is a bit of an art. Usually I try to be as superficial as possible. For each stage I try to learn enough to categorize things in some sensible way that tells a story about data, behavior or the interaction between them. Then I move on to the next step. When the fog to understanding ratio gets out of hand, I backtrack one or two steps and go into more detail until the fog clears.

As I work through the above list I usually keep copious notes and organize them as I go. Writing out my answers to the questions at each step helps me identify unanswered questions. I also find I'm accumulating knowledge much faster than I can absorb it so the notes act as a memory bank, especially if they are well organized. I frequently organize and reorganize the notes as I go. The act of organizing them also helps me remember more. Finally, if the system is poorly documented, the notes become a first cut at improved documentation.

As one skims through the list, the first thing that might stand out is that a third of the steps are data centric, including the first detailed analysis step (study data streams and persistent data). For many programmers starting with data may seem counter intuitive. Programming is about making data do something, and code is where the action is. But data puts an upper bound on what the code can do so it provides a way of focusing attention and understanding the scope of the system. It is also the easiest discrete nameable thing to get a handle on. It acts like the end of a thread used to unravel a ball of knots.

Another thing that might stand out to the OO fans reading the list above is that there is no mention of roles or behaviors or all the other jargon of the OO world. I find that interesting because I've been doing OO since the late 80's and it is virtually impossible for me to design software that is more than 500 lines without ending up having at least a few classes. Even before C++ became popular I was organizing C functions associated with specific data structures into dedicated files and designing dispatch handlers for the data structures. It just seemed like the right way to do things.

I'm guessing the reason for this is that data is a long hand that reaches throughout a system. Or maybe it would be better described as blood and oxygen. Every aspect of the system needs it regardless of its role front end or back end.

Classes and objects are marvelous ways to organize code. They can also be a great way to get a user to cough up requirements. Users often have a hard time talking about data apart from the things they do with the data. However, code and end-user requirement gathering are only two of the many ways a system needs to be categorized, sliced and diced in order to understand it. On the front end we need to understand workflows. On the database we need to understand normalization or we won't really get the benefit of our database's SQL engine. Somewhere in between we need to understand aspects: large swaths of functionality that are content independent.

To fully understand a system it is also important to get a handle on how all of these different ways of categorizing data, code, and end user functionality relate to one another. The learning strategy elaborated below tries to help in that process. Whether it succeeds is for you to decide. But if it seems counter intuitive at least consider trying it the next time you need to learn a system quickly. You might be surprised at the results.

I apologize in advance for the list like nature of this elaboration. Partly I don't have the time to expand it fully right now. Also, I fear turning it into a narrative with examples would likely stretch this node to book length. Hopefully though I will have at least raised some questions and pointed out things to look for that might be helpful for others.

Step 1: Gather together documentation

You may have documentation. You may not. It may be up to date. It may not.

Even out of date documentation or incorrect documentation can be helpful if it gives you a sense of the design philosophy or a road map through the code. My first step is to skim through the documentation and make some sort of assessment of what is there and how much I trust it. However, unless the documentation is amazingly clear and well written, when first learning a system I usually take the documentation with a grain of salt. I like to get down and dirty into the guts of things and see how it all works with my own eyes.

Even when I trust the documentation and it seems relatively complete I still use the remaining steps as a road map through the documentation and a checklist for my understanding of the system.

Step 2: Experiment with the front end

From the user's point of view, what does the system do? Do this as a brief survey to get a "feel" for the application or system. The goal for this step is to provide context for the more detailed study of how the system is implemented.

What are its main areas of functionality as evidenced by user manuals, help documentation, or help systems?
How does the user access these areas of functionality? Menus? Toolbars? Dedicated widgets? URLs?
For example, at Perl Monks all functionality is available via specially crafted URLs. However, one doesn't normally access features directly via URLs. Instead one either uses the link lists in the upper corner or one of the several nodelets found in the side bar. The upper right corner has links to the main functional areas (mostly), and the nodelets handle special purpose function areas.
What is the work flow like? How does one move from screen to screen or webpage to webpage?
Is there obvious security? How does it interact with the user? At login time? At other times?
How does it perform? Is it slow? Lightning fast? Something in between? Any obvious reasons?
What, if anything, is customizable? How does one do it? How much can an end user change? An administrator? What depends on level or security status?

Step 3: Study permanent data streams and persistent data

look for all the places data is stored or streamed
- databases
- disk files (data files and configuration files)
- code segments (see understand where code lives)
- real-time feeds of data stored/generated elsewhere
understand the structure of each (or lack thereof)
how is structure revealed?
- data dictionary
- language (data is stored in a linguistic sample and needs to be parsed before extraction)
try to cluster the tables and data stores into functional categories (document and explain the role each table plays in these)
identify "core" tables that are shared by several different functional areas and seem to transcend any particular functional area or topical organization.
look for mismatches between data structure and content, i.e. divergences from the assumption that (a) table = kind of thing (b) row = examples of a particular kind of thing (c) field = discrete atomic data.
- tables storing more than one kind of record: sometimes common data (ids, audit trails) are stored in a master table. All database rows have an entry but their real meat (the attributes that store actual content) is elsewhere, in a table that is joined one-to-one
- fields storing multiple attributes, repeating groups or even entire objects or sets of objects. Sometimes fields can function as mini databases in their own right.
keep refining until you feel the categories tell a cohesive story.
keep notes of what you've learned along with any unresolved questions
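
As a rough illustration of the inventory described in this step, here is a minimal Python sketch. It assumes the persistent data happens to live in SQLite (the steps above do not depend on any particular database) and that table names carry a meaningful prefix before the first underscore, which is only a first guess at functional categories. The function names are made up for this example.

```python
import sqlite3
from collections import defaultdict


def survey_schema(conn):
    """Return {table_name: [column_name, ...]} for every user table."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in cols]
    return schema


def cluster_by_prefix(schema):
    """First-pass guess at functional areas: group tables by the text before '_'."""
    clusters = defaultdict(list)
    for table in schema:
        prefix = table.split("_", 1)[0]
        clusters[prefix].append(table)
    return dict(clusters)
```

The prefix clustering will certainly misfile some tables. It is only meant to produce a draft that gets refined by hand until the categories tell a cohesive story.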

Step 4: Explore the type systems

look for all the different ways the system categorizes data and defines relationships between categories. Systems often have more than one of these:
- type fields on database records - field name often contains things like "type" or "kind". However, any field whose value determines which attribute sets or functionality is applicable, should be considered a type field. These fields aren't always conveniently named type or kind. The initial step of experimenting with the system as an end user should give you some hint of what sort of field to look for. For example, an expert system that searches for case history and legal precedent that can be used in a particular country might have functionality and attribute sets that are dependent on the value of a country's "legal_system_id".
- code structure - is it divided into classes? How do they inherit from one another? What is driving the division of code into classes? It may not be the same thing as what drives the database record.
  Some systems use classes primarily to organize the system around "aspects" - areas of functionality that can be defined independent of content. Content-based type systems are handled based on data. They have no corresponding code classes.
  
  For example, the Everything engine that runs the Perl Monks website uses classes to organize code into a database backend (inherits from DBI) and a front end request handler (inherits from CGI). It also has an elaborate system for assigning each design element and each node written by the user to a "nodetype". These node types can inherit from each other and they drive both the choice of applicable attributes and the available functionality. However, these node types have no corresponding classes in the code.
for each type system diagram (or create an outline) explaining the relationships between the types within the system.
explore and document how each type correlates to database tables and is implemented in code. 1-to-1 is often the exception more than the rule.
- is the type db only or does it have a corresponding code class?
- one table storing objects belonging to diverse classes (e.g. node table @ PM)
- multiple tables joined together 1-to-1 for each dbtable row (e.g. a single object table joins with one or more tables containing attributes specific to that object)
- subsets of records from a cross-reference or detail records (e.g. order object may include order lines, but db puts orders and order lines into separate tables for normalization reasons.)
- classes may store a private copy of data at the other end of a foreign key for convenience.
  Note: this is almost always a bug waiting to happen unless the duplicated data is immutable when the code was written and always will be from now until eternity. Or alternatively the object is very transient (e.g. one report run or edit transaction).
  
  It is *very* hard to keep that data in sync with the master copy. The code to do it often gets the edge cases wrong, creating subtle bugs. Also even if green field development got it right to start with, maintainers forget about this sort of denormalized data and usually fail to find all the places affected by a database change.
explore how the type systems interact with each other, if they do.
keep notes of what you've learned along with any unresolved questions
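
To make the idea of a type system that lives in data rather than in code classes more concrete, here is a small Python sketch. The type names, the `parent` and `fields` keys, and both functions are invented for illustration. This is not how the Everything engine actually stores its node types, only a picture of the general pattern of types inheriting attributes from one another without any corresponding classes.

```python
# Hypothetical type table: each entry names its parent type and the
# attributes it adds. In a real system these rows would come from the database.
NODETYPES = {
    "node":     {"parent": None,       "fields": ["id", "title", "author"]},
    "document": {"parent": "node",     "fields": ["body"]},
    "poll":     {"parent": "document", "fields": ["choices"]},
}


def ancestry(type_name, types=NODETYPES):
    """Return the chain of types from the root down to type_name."""
    chain = []
    seen = set()
    while type_name is not None:
        if type_name in seen:
            raise ValueError(f"inheritance cycle at {type_name!r}")
        seen.add(type_name)
        chain.append(type_name)
        type_name = types[type_name]["parent"]
    return list(reversed(chain))


def applicable_fields(type_name, types=NODETYPES):
    """Collect the attributes a type inherits plus those it defines itself."""
    fields = []
    for name in ancestry(type_name, types):
        fields.extend(types[name]["fields"])
    return fields
```

Printing `ancestry()` for every type is a cheap way to get the outline of relationships asked for above.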

Step 5: Get to know where the code lives and how it is processed

places to look for code
- disk files
- database rows - which ones? which tables/record types? which fields?
- embedded in fields containing non-code data
- configuration files
how is it processed?
- what compilers/interpreters are used?
- is it filtered in any way before processing?
- what restrictions are there on normal syntax?
- how are modules and libraries included in code stored in the database? Can any module be used? If so, how is the module or library installed in the system so that it knows where to find it?
with a better idea of where the code lives, you may want to go back and redo steps three and four looking for additional sources of data and/or type systems.
keep notes of what you've learned along with any unresolved questions

Step 6: Scan and categorize the codebase

If you are lucky a certain amount of categorization may have been done for you. If not, the only way to do this is to look at each file! Here are some things I do to speed up the process:

make a preliminary categorization based on file name. If the name is especially obscure, one might also very briefly skim the file to see what it is about.
Do not expect to get things right at this stage. Some files will be miscategorized.

When categorizing, focus on (a) files that belong to a specific aspect, e.g. retrieval of persistent data, security management, caching/performance, XML generation, HTML generation, generation of format XYZ, HTTP get request processing, customization management, mail interface, and (b) code to handle specific content areas, and so on.
Try to explain how the categories fit together to create the system you saw in step 2. Are there any areas of functionality that seem to be missing supporting code? If so, review the code categorization again. If one still can't see where the code that does X is coming from, keep notes.
Pick the code in one category (preferably something simple) and study it in depth to get a sense of coding style, architectural style and documentation level.
- what are the entry points to this area of functionality used by code outside of this functional area? If you had to write a public API what would it look like?
- what precisely are their parameters and return values? what happens when illegal values are passed? what counts as illegal values? read the code to figure this out if the documentation is insufficient.
- is the code easy to read? dense but well written? a mess? is there a consistent style? How would you describe it?
- how is data used? how is code grouped into subroutines?
- how good is the documentation? does it match the code?
- how well organized is the code? Is hardcoded data isolated and easy to change or scattered about? Is functionality broken down into sensible chunks? Are functions well named and self-documenting? Is data?
keep notes of what you've learned along with any unresolved questions
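
The preliminary categorization by file name can be partly automated. The following Python sketch assumes a handful of made-up category names and regular expressions. Real rules would have to be written after looking at the actual file names, and some files will still land in the wrong bucket.

```python
import re

# Hypothetical rules: (category, pattern matched against the file path).
# The first matching rule wins, so order them from most to least specific.
RULES = [
    ("database", re.compile(r"(db|sql|store)", re.IGNORECASE)),
    ("security", re.compile(r"(auth|perm|login)", re.IGNORECASE)),
    ("html",     re.compile(r"(html|template|page)", re.IGNORECASE)),
]


def categorize(paths, rules=RULES):
    """Return {category: [paths]}; anything unmatched goes under 'unsorted'."""
    result = {}
    for path in paths:
        for category, pattern in rules:
            if pattern.search(path):
                break
        else:
            # No rule matched: park the file for a manual skim.
            category = "unsorted"
        result.setdefault(category, []).append(path)
    return result
```

The "unsorted" bucket is the list of files whose names are too obscure to classify and that deserve a brief skim.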

Step 7: Study backend CRUD operations

How does the code create, update, delete, and display persistent data? Is it treated as an aspect with a generic engine? Is it left up to each class? A hybrid?
If there is a generic portion, identify all the code units that belong to it. If this is hard to do, go back to the previous step and review the code more carefully.
- Grepping for key words associated with database processing may help. For example, if the codebase is written in Perl, one would want to look for modules that use modules named 'DBI::', 'DBIx::' or 'DBD::'. Alternatively one could look for strings like 'sql' or even for specific database table or field names, especially if those fields come from "core" tables that aren't content specific.
- If code units are stored in the database, there may be a code search facility. Ask others more familiar with the system if there is such a thing and how to use it.
- once you have found a few starting modules you can either look at the modules they include or look for consumer modules. Sometimes these modules belong to other subsystems but sometimes they don't.
- the names of modules can also be a guide to finding other modules in the same subsystem. See if there are patterns in the way database associated modules are named and then look for other files that have similar naming patterns.
Study the generic database code in depth.
- what are the entry points to this area of functionality used by code outside of this functional area? If you had to write a public API what would it look like?
- what precisely are their parameters and return values? what happens when illegal values are passed? what counts as illegal values? read the code to figure this out if the documentation is insufficient.
Figure out how content-specific classes and type hierarchies use the database. One way to do this is to pick two or three types from each of the content related type hierarchies. Choose types and classes whose names are somewhat original. Searching code for a type named 'code' or 'HTML' or 'object' is likely to lead to a lot of false hits.
- take advantage of built-in systems for tracking data-code associations wherever possible: some systems use naming conventions or provide an infrastructure to track the associations between code and data structures or classes. For example, the Everything Engine used by Perl Monks has a page (visible to pmdev's only) that displays the database tables used by each node type along with all of the code units involved in validating data or generating HTML for pages that edit or display data from those tables.
- use grep and code search tools: look for the name of the type you are researching or the name of one of its superclasses if nothing is found. Even in systems that track code-data relationships, there can be important information elsewhere in the system, especially in code units that contain subroutines called by the code units that are explicitly tracked.
- for each code unit found, study the input parameters, output parameters and internal code calls. What information must be passed to retrieve or edit data? What gets sent back? Does the code call the generic engine subroutines? go directly to the database? call other subroutines?
- If it calls other subroutines, look up those subroutines and ask the same questions. Repeat this process until you know exactly how data moves from the front end to the persistence engine and back again.
how is security and user permissions enforced in the backend? is it handled in code? in the database? some combination of the two?
keep notes of what you've learned along with any unresolved questions
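
The grep for database-related modules mentioned above might look something like this in Python. It is a sketch that assumes a Perl codebase whose sources have already been read into a dictionary of file name to text. The regular expression only catches plain `use` and `require` lines for DBI, DBIx:: and DBD:: modules, so indirect loading (for example via `use base`) would be missed.

```python
import re

# Matches Perl lines such as "use DBI;", "use DBIx::Class;" or
# "require DBD::mysql;". Commented-out lines do not match because
# the statement must start the line (after optional whitespace).
DB_USE = re.compile(
    r"^\s*(?:use|require)\s+(DB(?:Ix|I|D)\b(?:::\w+)*)", re.MULTILINE)


def find_db_users(sources):
    """sources maps file name -> file text; return {file name: [DB modules loaded]}."""
    hits = {}
    for name, text in sources.items():
        modules = DB_USE.findall(text)
        if modules:
            hits[name] = modules
    return hits
```

To run it over a checkout, build `sources` by reading each `.pm` or `.pl` file, for instance with `pathlib.Path.rglob`. The files it reports are the starting modules from which to follow includes and consumers.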

Step 8: Study front end CRUD operations

Taking the same types as before, study how those types are displayed to the user.

can the user choose from a variety of ways to present the data: XML, PDF, a pretty browser page, other? If so, what are the display options available? Do they vary depending on type? Using which type system? Do they vary based on the user's security status? what other factors affect the views of data available to the user?
how does the user request the system to display a particular view?
- click on a link? If so, what is the URL syntax of that link? When the system receives an HTTP request, how does it process that URL?
- choose a menu option? If so, what event driven API is being used to process menu choices? What are the capabilities of that API? Which events are being used to handle the request?
- other?
what are the possible display types for the system?
how do the display types relate to CRUD operations? are there several different views for each operation?
how is each display type processed? Do any display types have generic handlers? If so, what code is involved in the generic handling process? Is there a way to customize the generic process for specific classes? Or is everything processed the same way?
for display types without generic handlers (or with customization of generic handlers), where does the type-specific code live? how is it triggered? How is it structured?
where does the look and feel of the front end come from? What data drives it? what code?
How is information about permissions used in the front end? Are certain user interface elements disabled based on permissions?
keep notes of what you've learned along with any unresolved questions
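
One common shape for "generic handler plus type-specific customization" is a lookup table keyed by type and display type. The Python sketch below is an assumption about what such a dispatch might look like, not a description of any particular system. The registry, the decorator and the node dictionaries are all invented, and a real handler would also escape HTML.

```python
# Registry keyed by (node_type, display_type). A node_type of None marks the
# generic handler for that display type. All names here are illustrative.
HANDLERS = {}


def handler(display_type, node_type=None):
    """Decorator that registers a function in HANDLERS."""
    def register(func):
        HANDLERS[(node_type, display_type)] = func
        return func
    return register


@handler("display")
def generic_display(node):
    return f"<h1>{node['title']}</h1>"


@handler("display", node_type="poll")
def poll_display(node):
    # Type-specific customization layered on top of the generic output.
    options = "".join(f"<li>{c}</li>" for c in node["choices"])
    return generic_display(node) + f"<ul>{options}</ul>"


def render(node, display_type):
    """Prefer a type-specific handler, fall back to the generic one."""
    func = (HANDLERS.get((node["type"], display_type))
            or HANDLERS.get((None, display_type)))
    if func is None:
        raise LookupError(f"no handler for display type {display_type!r}")
    return func(node)
```

When reading a real system, the questions above amount to finding its equivalent of `HANDLERS` and `render`: where the table lives, how entries get into it, and what happens when nothing matches.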

Step 9: Study permissions/security infrastructure

Where is permission information stored? In the database? In code? In disk files? Other?
Are security checks centralized in one place? Or is security logic scattered about the codebase and various database fields? If centralized, find all the code related to security and analyze it.
How do the front end and back end of the database work together to enforce security? If a user without permission somehow manages to trigger the front end code without a proper user interface, e.g. by sending a properly formatted URL, what happens?
keep notes of what you've learned along with any unresolved questions
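
The question about a crafted URL is easier to reason about with a picture of the centralized case in mind. In this Python sketch the permission rule lives in one function that both the front end and the back end call, so a request that bypasses the user interface is still refused. The roles, actions and function names are hypothetical.

```python
# Hypothetical permission table; a real system might keep this in the database.
PERMISSIONS = {
    "admin":  {"view", "edit", "delete"},
    "member": {"view", "edit"},
    "guest":  {"view"},
}


def can(role, action):
    """Single place where the permission rule lives."""
    return action in PERMISSIONS.get(role, set())


def visible_links(role):
    """Front end: only offer the actions this role may perform."""
    return [action for action in ("view", "edit", "delete") if can(role, action)]


def handle_request(role, action, node_id):
    """Back end: re-check permission even if the UI never showed the link."""
    if not can(role, action):
        raise PermissionError(f"{role} may not {action} node {node_id}")
    return f"{action} ok for node {node_id}"
```

If the system under study has nothing like `can`, and each screen or query carries its own checks, that is a sign the security logic is scattered and every code path needs to be reviewed separately.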

Step 10: Study a well defined data collection/functional area

identify features visible from the front end
study the functional area until you can account for all behavior visible from the front end:
- identify all data used by the functional area (see step 3)
- identify all classes used by the functional area (see step 4)
- identify and trace through all code used by the functional area (see steps 5-8 for tips)
- reexamine all code for security and permissions management (step 9)
- repeat or backtrack above steps until all functionality is explained.
assess code organization and quality
- does coding style and quality match that found in step 6?
- is the data that drives the look and feel of this code segment centralized? If not where are all the pieces found? Can you construct a data structure that would store all the information in one place? how much would the code have to change to use such a data structure?
- is security data centralized? If not, where is it located?
assess extensibility: if you wanted to subclass the entire functional area and add custom look and feel or functionality, what would be involved?
- could the system support two custom instances side by side? Or is this functional area singleton? do all customizations affect the one and only copy?
- what would one need to do to extend the functional area? Are these extensions supported by the system? would you need to create additional database tables or fields? add new database records? modify existing code? add new code units?
- if data or security is decentralized, how will you deal with it? can you design the subclass so that it is more centralized? Or is that impossible without great disruption of the code?
assess performance: How is the performance of this area? What looks like it might cause problems based on past experience with other systems or similar designs and technology?
keep notes of what you've learned along with any unresolved questions

Nuf said.

Best, beth
