The Joy of Legacy Code
I'm sure you've all got your own war stories of legacy code
that has grown and grown until it resembles the delicate and fragile Jenga tower
lightheartedly described by Cozens and Clark above.
Not even Perl Monks has been spared.
Netscape 6.0 is finally going into its first public beta. There never was a version 5.0.
The last major release, version 4.0, was released almost three years ago.
Three years is an awfully long time in the Internet world.
During this time, Netscape sat by, helplessly, as their market share plummeted.
It's a bit smarmy of me to criticize them for waiting so long between releases.
They didn't do it on purpose, now, did they?
Well, yes. They did.
They did it by making the single worst strategic mistake that any software company can make:
They decided to rewrite the code from scratch.
It's important to remember that when you start from scratch there is absolutely no reason
to believe that you are going to do a better job than you did the first time.
First of all, you probably don't even have the same programming team that worked
on version one, so you don't actually have "more experience".
You're just going to make most of the old mistakes again,
and introduce some new problems that weren't in the original version.
-- Joel Spolsky on not Rewriting
Now the two teams are in a race.
The tiger team must build a new system that does everything that the old system does.
Not only that, they have to keep up with the changes that are continuously being
made to the old system.
Management will not replace the old system until the new system can do everything
that the old system does.
This race can go on for a very long time. I've seen it take 10 years.
And by the time it's done, the original members of the tiger team
are long gone, and the current members are demanding that the
new system be redesigned because it's such a mess.
-- Robert C Martin in Clean Code (p.5)
"It's harder to read code than to write it" (Joel Spolsky) - writing something new is cognitively less demanding (and more fun) than the hard work of understanding an existing codebase ... which might explain the typical exchange below :)
Developer: The project I inherited has weak code, I need to rewrite it from scratch
Boss: Will there ever be an engineer who says, the last guy did a great job, let's keep all of it?
Developer: I'm hoping the idiot you hire to replace me says that
-- Green Vs Brown Programming Languages
As indicated above, a grand rewrite is not necessarily the answer.
Indeed, I've seen nothing but disaster whenever companies attempt
complete rewrites of large working systems.
Apart from the daunting technical difficulties of performing the large rewrite,
there's often substantial cultural resistance to replacing established software,
even when the rewrite goes smoothly and introduces significant improvements.
Examples that spring to mind here are: GNU Hurd replacing Linux; CPANPLUS replacing CPAN;
Module::Build replacing ExtUtils::MakeMaker; Python 3 replacing Python 2;
and Perl 6 replacing Perl 5 ...
though, admittedly, Subversion and git seem to have faced little resistance from diehard CVS users.
That's not to say it can't be done though.
The great Netscape rewrite (ridiculed by Spolsky above) -- though a commercial disaster --
metamorphosed into an open source success story.
Another example of a successful rewrite, pointed out by tilly below,
is the Perl 5 rewrite of Perl 4.
Well, if rewriting won't help, what are we supposed to do?
We surely need to provide a glimmer of hope for those poor souls condemned,
day after day, to anxiously poking at a terrifying Jenga tower.
Maintaining such a tangled mess is cruel, inefficient, and
ultimately unsustainable for any business.
The only humane option left that I can see is to relentlessly refactor
legacy code, subsystem by subsystem, continuously and forever.
To always keep it clean.
To prevent it becoming a tangled tower in the first place.
Though such an approach seems sensible to me, gaining funding for it
can be politically difficult.
Apart from the difficulty of justifying the return on investment of such
work, you further incur an opportunity cost in that time spent refactoring
old code is time not spent developing new products and new features.
Unit Testing Legacy Code
For many years, I've argued passionately for the many benefits of Test-Driven Development:
- Improved interfaces and design. Writing a test first forces you to focus on interface. Hard to test code is often hard to use. Simpler interfaces are easier to test. Functions that are encapsulated and easy to test are easy to reuse. Components that are easy to mock are usually more flexible/extensible. Testing components in isolation ensures they can be understood in isolation and promotes low coupling/high cohesion.
- Easier Maintenance. Regression tests are a safety net when making bug fixes. No tested component can break accidentally. No fixed bugs can recur. Essential when refactoring.
- Improved Technical Documentation. Well-written tests are a precise, up-to-date form of technical documentation.
- Debugging. Spend less time in crack-pipe debugging sessions.
- Automation. Easy to test code is easy to script.
- Improved Reliability and Security. How does the code handle bad input?
- Easier to verify the component with memory checking and other tools (e.g. valgrind).
- Improved Estimation. You've finished when all your tests pass. Your true rate of progress is more visible to others.
- Improved Bug Reports. When a bug comes in, write a new test for it and refer to the test from the bug report.
- Reduce time spent in System Testing.
- Improved test coverage. If tests aren't written early, they tend never to get written. Without the discipline of TDD, developers tend to move on to the next task before completing the tests for the current one.
- Psychological. Instant and positive feedback; especially important during long development projects.
So I was at first enthusiastic about the approach recommended by Michael Feathers in
Working Effectively with Legacy Code,
namely to (carefully) break dependencies and write a unit test each time you need
to change legacy code, thus gradually improving the code quality while
organically growing a valuable set of regression tests.
The Legacy Code Dilemma: When we change code, we should have tests in place.
To put tests in place, we often have to change code.
-- Michael Feathers (p.16)
Feathers further catalogues a variety of dependency breaking techniques to
minimize the risk of making the initial legacy code changes required to unit test.
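As a rough illustration of the kind of dependency breaking Feathers describes, the Python sketch below uses one of the simplest techniques, passing a dependency in as a parameter, and then pins down the existing behaviour with a characterization test. All the names (`LegacyInvoicer`, `send_mail`, `FakeMailer`) are hypothetical and invented for this example; they are not taken from Feathers' book.

```python
def send_mail(address, body):
    """Stand-in for a hard-wired legacy dependency (hypothetical)."""
    raise RuntimeError("would talk to a real mail server")


class LegacyInvoicer:
    """Hypothetical legacy class.

    Before the change, notify() called send_mail() directly, so it could
    not be run in a test. The only edit is the new optional `mailer`
    parameter, which defaults to the old behaviour.
    """

    def __init__(self, mailer=send_mail):
        self._mailer = mailer

    def total(self, prices):
        # Legacy quirk: negative prices are silently ignored.
        return sum(p for p in prices if p >= 0)

    def notify(self, address, prices):
        self._mailer(address, "Total due: %d" % self.total(prices))


class FakeMailer:
    """Test double that records calls instead of sending mail."""

    def __init__(self):
        self.sent = []

    def __call__(self, address, body):
        self.sent.append((address, body))


def test_notify_reports_total_ignoring_negatives():
    mailer = FakeMailer()
    LegacyInvoicer(mailer).notify("a@example.com", [10, -3, 5])
    # Records what the code does today, quirk included.
    assert mailer.sent == [("a@example.com", "Total due: 15")]


if __name__ == "__main__":
    test_notify_reports_total_ignoring_negatives()
    print("characterization test passed")
```

The test asserts what the code currently does, not what it ought to do, so it acts as a regression net while the real change is made. Existing callers are unaffected because the default argument preserves the old wiring.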
Though I've had modest success with this approach, there's one glaring
omission in Feathers' book:
how to deal with concurrency-related bugs in large,
complex event-driven or multi-threaded legacy systems.
Unit testing, by its nature, is not helpful in this all-too-common scenario.
Overcoming this well-known limitation of unit testing ain't easy.
Unit Testing Concurrent Code
Test-driven development, a practice enabling developers
to detect bugs early by incorporating unit testing into the development process, has become
wide-spread, but it has only been effective for programs with a single thread of control.
The order of operations in different threads is essentially non-deterministic, making it
more complicated to reason about program properties in concurrent programs than in
single-threaded programs.
-- from a recent PhD proposal to develop a concurrent testing framework
See the "Testing Concurrent Software References" section below for more references
in this active area of research.
Though I haven't used any of these tools yet, I'd be interested to hear from folks who have,
or who have general advice and tips on how to troubleshoot and fix complex concurrency-related bugs.
In particular, I'm not aware of any Perl-based concurrent testing frameworks.
In practice, the most effective, if crude, method I've found for dealing
with nasty concurrency bugs is good tracing code at just the right places combined
with understanding and reasoning about the legacy code, performing experiments,
and "thinking like a detective".
One especially useful experiment (mentioned in
Clean Code)
is to add "jiggle points" at critical places in your concurrent code and have
the jiggle point either do nothing, yield, or sleep for a short interval.
There are more sophisticated tools available, for example IBM's ConTest,
that use this approach to flush out bugs in concurrent code.
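To make the idea concrete, here is a minimal Python sketch of a jiggle point. The `jiggle` function, the `JIGGLE_ENABLED` switch and the deliberately racy `Counter` class are all invented for this illustration; they are not taken from Clean Code or from ConTest.

```python
import random
import threading
import time

JIGGLE_ENABLED = True  # hypothetical switch; turn off outside of experiments


def jiggle(rng=random):
    """Randomly do nothing, yield, or sleep for a short interval."""
    if not JIGGLE_ENABLED:
        return "off"
    choice = rng.choice(("nothing", "yield", "sleep"))
    if choice == "yield":
        time.sleep(0)  # gives other threads a chance to run
    elif choice == "sleep":
        time.sleep(rng.uniform(0.0, 0.002))
    return choice


class Counter:
    """Deliberately racy read-modify-write with a jiggle point inside."""

    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value
        jiggle()  # widen the window in which another thread can interfere
        self.value = current + 1


def run(threads=4, increments=50):
    """Increment one shared counter from several threads; return the result."""
    counter = Counter()

    def worker():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print("expected", 4 * 50, "got", run())
```

With the jiggle point in place, the final count will usually fall short of the expected 200, exposing the lost-update race. Without it, the same bug could stay hidden for a long time.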
Agile Design
In our ongoing "debate" on TDD, Bob and I have discovered
that we agree that software architecture has an important
place in development, though we likely have different visions
of exactly what that means. Such quibbles are relatively
unimportant, however, because we can accept for granted
that responsible professionals give some time to
thinking and planning at the outset of a project.
The late-1990s notions of design driven only by
the tests and the code are long gone.
-- James Coplien in foreword of Clean Code
While Kent Beck's four rules of simple design, namely:
- Runs all the tests.
- Contains no duplication.
- Expresses all the design ideas that are in the system.
- Minimizes the number of entities such as classes, methods, functions, and the like.
are helpful in crafting well-designed software, I don't agree with
some extremists who claim that's
all there is to design.
Software design is an art requiring experience, talent, good taste, and
deep domain, computer science and software usability knowledge.
I feel there's a bit more to it than the four simple rules above.
So, to supplement Beck's four simple rules, I present my
twenty-one tortuous rules of non-simple design. :-)
- Learn from prior art. Use models and design patterns. Most designs should not be done from scratch. It's usually better to find an existing working system and use it as a starting model for a new design.
- Define sound conceptual models and domain abstractions. Unearth the key concepts/classes and their most fundamental relationships.
- Aim for balance. Avoid over-simplistic, brittle and inflexible designs. Avoid over-complicated bloated designs with too much flexibility and unneeded features. Be sufficient, not complete; it is easier to add a new feature than to remove a mis-feature.
- Plan to evolve the design over time.
- Design iteratively. Some experimentation is essential. Look for ways to eliminate ungainly parts of the design.
- Use a combination of bottom-up and top-down approaches.
- Apply Separation of Concerns and the Law of Demeter.
- Systems should be designed as a set of cohesive modules as loosely coupled as is reasonably feasible.
- Minimize the exposure of implementation details; provide stable interfaces to protect the remainder of the program from the details of the implementation (which are likely to change). Don't just provide full access to the data used in the implementation. Minimize global data.
- Systems should be designed so that each component can be easily tested in isolation.
- When in doubt, or when the choice is arbitrary, follow the common standard practice or idiom.
- Avoid duplication (DRY).
- Declarative trumps imperative.
- Use descriptive, explanatory, consistent and regular names.
- Reflect the user mental model, not the implementation model.
- Reserve the best shortcuts for commonly used features (Huffman coding).
- Establish a rational error handling policy and follow it strictly. Document all errors in the user's dialect.
- Interfaces matter. Once an interface becomes widely used, changing it becomes practically impossible (just about anything else can be fixed in a later release).
- Design interfaces that are: consistent; easy to use correctly; hard to use incorrectly; easy to read, maintain and extend; clearly documented; appropriate to your audience.
- Apply the principle of least astonishment.
- Consider the design from the perspectives of: usability, simplicity, declarativeness, expressiveness, regularity, learnability, extensibility, customizability, testability, supportability, portability, efficiency, scalability, maintainability, interoperability, robustness, concurrency, error handling, security. Resolve any conflicts between perspectives based on requirements.
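A couple of the interface rules above ("easy to use correctly; hard to use incorrectly" and "minimize the exposure of implementation details") can be illustrated with a small Python sketch. The `RetryPolicy` class and its parameters are hypothetical, chosen only to show the idea.

```python
class RetryPolicy:
    """Hypothetical value object with a small, stable public surface."""

    def __init__(self, *, attempts, delay_seconds):
        # Keyword-only arguments: the two numbers cannot be swapped
        # silently by position.
        if attempts < 1:
            raise ValueError("attempts must be at least 1")
        if delay_seconds < 0:
            raise ValueError("delay_seconds must not be negative")
        # Implementation details stay private; callers see properties only.
        self._attempts = attempts
        self._delay_seconds = delay_seconds

    @property
    def attempts(self):
        return self._attempts

    @property
    def delay_seconds(self):
        return self._delay_seconds

    def total_wait(self):
        """Worst-case time spent waiting between attempts."""
        return (self._attempts - 1) * self._delay_seconds


if __name__ == "__main__":
    policy = RetryPolicy(attempts=3, delay_seconds=0.5)
    print(policy.total_wait())
```

Calling `RetryPolicy(3, 0.5)` raises a `TypeError`, and out-of-range values are rejected at construction time, so most misuse is caught at the boundary instead of deep inside the program.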
Agile Architecture
A project has many stakeholders, each making an investment (time, money, effort)
into the project. Each will have different goals for the solution, and they may
measure value differently. The Agile Architect's goal is to deliver a solution
which best meets the needs and aspirations of all the stakeholders, recognising
that this may sometimes mean a trade-off. The Agile Architect must work in a way
that makes the best use of the various resources invested in the project.
The solution must be seen as part of a whole, which includes other systems and
projects. It must be robust enough to be changed and extended over time.
You must support further work, whether it is to change the solution or
simply to operate it efficiently.
The cost of change is significant in any major real-world system,
so the Agile Architect must balance planning for change against other goals.
The Agile Architect must also seek to manage and minimise complexity,
which helps to maximise stakeholder value.
The aim is a solution which is neither simplistic and brittle,
nor over-complicated by over-building for flexibility.
-- from Principles for the Agile Architect
Schwaber's Legacy Core/Infrastructure Catastrophe
In a 2006 Google Tech Talk,
Ken Schwaber stated that a chronic legacy core or infrastructure problem
existed at every single organisation he had helped to implement Scrum.
Unfortunately as I've been helping organisations implement Scrum,
I've run into a very common problem with every organisation.
What these organizations have is a problem called
Core or Infrastructure software.
This core functionality has three characteristics:
- Fragile; if I changed one thing in that core piece of functionality, it tended to break other things.
- No good test harnesses around it. So if you went in and broke something, you tended not to know about it until it was up on all the servers and then your customers would let you know about it. That's not good.
- Only a few engineers know how to work on it. There were only a few suckers left in the entire company who still know how to and were willing to work on the infrastructure. Everyone else had fled to newer stuff.
-- Ken Schwaber, Google tech talk on Scrum, Sep 5, 2006 (35:50)
Ken continued with a specific anecdote highlighting the strain this
core architecture constraint puts on a Scrum cross-functional team:
I remember one company that has about 120 engineers, developers of all
kinds, of whom 10 are still able to work on the core functionality.
The other 110 are working on new stuff.
We brought all the engineers into the room.
We said, okay, the product manager for the first area and the
lead engineer for the first area come on up here.
Now select the people you need to do this work over the next
month, including, of course, the core engineers.
And they did and we said, okay, now leave, get out of here and start working.
... when we got to the fifth product manager and the lead engineer and
they said we can't do anything.
There's no core engineers left.
We looked around the room and there were 60 engineers left.
They were thoroughly constrained by the core piece of functionality.
If you have enough money, you rebuild your core.
If you don't have enough money and the competition is breathing down your neck
you shift into another market or you sell your company.
Venture capitalists are into this now, buying dead companies.
Design-dead software.
-- Ken Schwaber, Google tech talk on Scrum, Sep 5, 2006 (38:40)
This anecdote rings true with my experience; I've worked at many companies where
the original authors of critical core software had long left the company,
few folks understood it, and no one dared touch it.
How Does it Happen?
Say you've got a velocity of 20. But product management want more stuff.
And so, that's going to require, because that's more stuff, that's
going to require that you have a velocity of 22 to do it.
Well, gees, how are you going to get a velocity of 22?
Are you going to be smarter when you wake up?
Are you going to put in new engineering tools?
No, none of that will work.
So, what you'll actually do to get the increased velocity is of course
cut quality, because if you remove quality, you can do more crap, right?
Now if you do this and that release goes out on time,
some grumbles from the customers you know, whatever.
But customers always grumble and the product manager is promoted,
you know, drives a new BMW, parks in one of the fancy spots.
The next release that you start because you're working from a slightly
worse code base with clever tricks in it, unrefactored code, no tests --
the best velocity you can really do is 18.
Well, that's no good and no one's going to get promoted for that.
So the product management team comes down and says, guys you just gotta do it.
So you cut quality again but this time when you cut quality, the best you
can do is 20 because you're starting from a worse code base.
Now it takes about five years, release by release, for you
right here to build your own design dead product.
It's got two aspects to it.
One is, when we are told to do more, we cut quality without telling a soul.
It's just second nature.
I have trained over 5500 people and put them through an exercise like this,
but very subtle, very sneaky, where push comes to shove and they
have a choice of saying, well, we can't do it, or saying we'll do
it and cutting quality.
Only 120 of the 5500 said no.
All of the others just cut quality automatically.
It's in our bones.
The other part of this habit is product management, them believing
in magic, that all they have to do is tell us to do something and,
this is the illusion we support, by cutting quality, it'll get done.
And these are what's called good short-term tactics.
These are horrible long-term strategies because it's a back-your-company-into-a-corner strategy.
-- Ken Schwaber, Google tech talk on Scrum, Sep 5, 2006 (41:50)
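The arithmetic in Ken's story can be played forward with a toy model. The Python sketch below assumes, purely for illustration, that each round of quality cutting buys two extra points in the current release and costs two points of achievable velocity in the next. Those constants are chosen only to reproduce the 20, 22, 18, 20 figures in the quote; they are not a rule that Ken states.

```python
def simulate(releases, honest_velocity=20, boost=2, decay=2):
    """Return (honest, delivered) velocity pairs, one per release.

    honest:    best velocity achievable without cutting quality
    delivered: velocity reached after cutting quality
    Both `boost` and `decay` are assumed constants, not measured values.
    """
    history = []
    for _ in range(releases):
        delivered = honest_velocity + boost
        history.append((honest_velocity, delivered))
        # Cutting quality leaves a worse code base for the next release.
        honest_velocity = max(0, honest_velocity - decay)
    return history


if __name__ == "__main__":
    for number, (honest, delivered) in enumerate(simulate(5), start=1):
        print(f"release {number}: honest {honest}, delivered {delivered}")
```

In this model the delivered velocity is back down to the original 20 by the second release and drops below it from the third release onward, even though the team cuts quality every time.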
While Ken's plausible explanation of how this happens
spookily reminds me of some of my commercial experiences,
there are doubtless other ways it can happen.
After all, to the best of my knowledge, no Perl 5 pumpking has ever been
offered a BMW as an inducement to get a release out early.
A Mythical Perl-based Commercial Company
For fun, and to better understand why this sort of thing happens,
let's consider what might transpire if Perl 5 or Perl 6 formed the
crucial core software of a commercial closed-source company
writing customer-facing software in cross-functional Scrum teams.
In this scenario, Perl is an internal tool; the customer doesn't
know or care about it, they just want a system that satisfies
their needs.
I speculate that most developers and product managers in such a mythical Perl 5-based company
would go for the BMW by working on new pure Perl 5 products because their velocity would
likely be an order of magnitude higher when writing new Perl 5 components than when changing
the underlying Perl 5 C core.
Not only that, but hiring expert C programmers with sufficient skill, intelligence,
and tenacity to change the Perl core would likely prove to be a significant constraint.
So I predict that in such a mythical commercial company, development of the Perl 5
C core would slow down, with only critical bug fixes applied.
Despite Ken Schwaber's dire predictions of "design-dead companies" rapidly going out
of business, I see this company as commercially viable for quite a few years
(though not indefinitely) because the Perl 5 C core is stable and proven,
with very few critical bugs, and, most importantly, is well decoupled.
That is, you can write new Perl 5 code without needing to understand
anything of the Perl 5 implementation.
And teams writing in Perl 5 are likely to be very competitive in the commercial
marketplace when competing against companies writing in C, for instance.
Such an approach, however, is not sustainable in the long term:
sooner or later you'll need to untangle your legacy code or
rewrite it.
Because Perl 6 is less mature and still evolving, the velocity of teams
using it to deliver customer-focused software is likely to be much lower than
for Perl 5 teams. That is, the team may be happily and productively writing
new Perl 6 code ... then hit an impediment that requires them to switch
context and add a new feature or make a bug fix to the Perl 6 core.
Team context switches like this are very harmful to team velocity in my experience.
This Perl 6 scenario is much closer to most commercial organizations today
because their core software is typically incomplete and still evolving.
Indeed, with slogans like
Do the simplest thing that can possibly work
and YAGNI, agile proponents encourage you to avoid the waste of
writing customer software that is never used.
In summary then, to circumvent Spolsky's "Netscape Rewrite Disaster"
and sidestep Schwaber's "Legacy Core/Infrastructure Catastrophe",
companies must continuously refactor to keep their core software
in a clean and maintainable state.
Such unrelenting and diligent work requires formidable
discipline, however, and few companies have the long-term
perspective and the will to do it.
References
Agile Architecture References
Legacy Code References
Testing Concurrent Software References
Updated 23-jan-2011: Removed reference to Windows NT rewrite plus minor wording improvements.