Literate Perl? (was Re: Another commenting question)

I'd been thinking of doing a little blurb about Literate Programming and a thread on commenting seemed an appropriate place to bring it up.

Literate Perl ... who'da thunk it?

It's easy to be Perl literate---just invest a lot of time and hard work, read the camel start to finish (including the index), code 'till you bleed, and get some corrective lenses. Before you know it you'll be writing heavily punctuated, twisted loops and chains that solve the world's problems in 13 lines of code...and you'll know deep inside exactly why your particular solution is the best possible solution and you'll be justifiably proud of your accomplishments.

The next thing you'll find yourself experiencing is STCMDE. STCMDE (pronounced: ess tee see em dee eee) stands for Standard Time Cummulative Memory Dilution Effect and occurs with alarming frequency among programmers who actually find themselves revisiting code they wrote Some-Time-Ago (STA). STA is a relative value representing the amount of time-passed required to accumulate enough memory dilution to find yourself saying 'What the heck was I thinking when I wrote that!?". For many STA may be measured in weeks, months, or for a very few, even years. For myself, STA can often be measured in days or even hours.

The first thing you might notice when reviewing code you wrote STA is that your comments (if you used any) are somewhat inadequate---you realize that you wrote these comments while holding the overall design in your head in perfect clarity (you were writing the code at the time after all), but now that your mental model of the design has slipped into the abyss of STCMDE, you realize that your little comments are about as helpful as a string around your finger, or that unrecognizable name and phone number you found scribbled on the back of a matchbook cover from Fred's Bar and Dance Emporium where you vaguely recall spending some of last Saturday night.

Of course, being Perl Literate, you can almost certainly decipher and re-discover the brilliant algorithmic design you incoporated into your $world->solve_problems() method. However, this takes time and energy.

One simple solution to this problem is more verbose commenting. But verbose commenting can sometimes clutter up the code affecting readability, and often, even verbose comments for a particular subroutine don't express design ideas --- when the solution is clear in our mind we tend to think of the implementation and design as being obviously derivable from the code itself and that we'll only need reminders about the routine's interface and not its implementation.

Another problem with verbose commenting is that it is limited to text---expressing equations or diagrams is not always an easy ASCII option, but such extras can often be helpful in documenting algorithms, datastructures, business rules, and data flow.

Of course, there can and does exist stellar code which contains virtually no comments, but is readable---and grok-able---because attention was paid to code design, layout, structure, variable naming conventions and other details, and perhaps because the complexity was not overly taxing. But even with the best exemplars of such code, wouldn't you also like to read the author's own design notes, diagrams, and other such paraphenalia (even if, or especially if that author is you?). Just because algorithm X is perfectly clear and understandable doesn't tell you why X was chosen over the seemingly better algorithm Y (does the author know about Y at all, or is there some peculiar quirk that prevents its use in this instance?).

Literate Programming (LP) is one possible answer. But what, then, is Literate Programming? Well, Donald Knuth (who started it all) had this enlightening bit to say:

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do. (Donald E. Knuth, 1984).

And that sums up the point of LP in a nutshell -- we reverse the paradigm of embedding documentation within the source code, and instead embed the source code within the documentation.

Essentially, the practice of LP is to present your code in named 'chunks'. These chunks of code not only define the hierarchical structure of the program, but also contain the actual source code of the program. Think of it as named macros, or a templating system for presenting source code. For example, I could start describing the overall structure of a program and then define the "root" chunk (the chunk that defines the top-level structure of the program):

    Blah blah blah about the overall design, and the basic structure
    of the program is as follows:
    <<foo.pl>>=
    #!/usr/bin/perl -w
    use strict;
    <<load modules>>
    <<initialize data structures>>
    <<accept user input>>
    <<perform analysis>>
    <<error checking>>
    <<foo.pod>>
    @
[download]

Now, in the above, the <<foo.pl>>= token indicates the start of a chunk definition (the chunk is named foo.pl). Within that definition is a bit of code (shebang line, and strict) and then references to other named chunks (note, no trailing = sign means those are merely references to chunks, not chunk definitions, in other words, they are "includes" from elsewhere in the document).

Now I would begin describing each of those included chunks, in whatever order makes the most sense to present them in, and defining each chunk in the same way. Furthermore, any of those included chunks could themselves also include yet further chunks at a slightly lower level of detail. You can follow this approach from the top down, bottom up (but that's generally less easy to follow), or some combination of top-bottom-sideways presentation. So, the chunking scheme gives you stepwise refinement from pseudo-code (the chunk names) down to real code.

Another benefit is that this model isn't restricted to one program per file -- you could have two root chunks (each representing a separate extractable program) and present the programs in parallel. Why might we do this? We might develop our test program right along with our main program, and define chunks of test code in sequence with the main code being tested. Similarly, we could also include a complex test data file, and annotate it in chunks in sequence with the code chunks that parse it (chunk of header data, chunk of header parser, chunk of xyz data, chunk of xyz parser, etc.). In both of these situations, the tests and/or the test data lie near the relevant section of code in the same source file (surrounded by whatever documentation you deem necessary), and are easy to keep in sync (add to a given test/add to a given code chunk, add a new test/add a new code chunk).

When we've written our source, we "tangle" out the root chunks (a root chunk is simply any chunk that isn't referenced (or used by) another chunk). The tangling process assembles all the code chunks for a given root chunk and writes out the file (be it the main program, the test program, or a test data file). Weaving is the process of turning the original literate source into a typeset document. Generally you write the literate source in a typesetting markup language (such as LaTeX or HTML) allowing formatting control, tables, included diagrams, or whatever. When you weave it, any chunk indexing, identifier indexing, and cross-referencing is added (depending on options and the particular LP tool you are using) and then you either view the resulting source in a browser (HTML), or run latex on the resulting file (to create xdvi, postsript, or PDF output). Of course, you can also just use plain text for the documentation and not bother with weaving at all (but then you loose any cross-referencing and such that a woven document would give you). So, the basic processes are:

       --- tangle ----> foo.pl
      |
 foo.nw
      |
       --- weave -----> foo.tex ----> latex foo ----> foo.dvi
             |                    |
             |                     -> pdflatex foo -> foo.pdf
              --------> foo.html
[download]

What about debugging you ask? It would be a problem to track down problems in the tangled source and then have to locate where that code is situated in the literate source file. Fortunately, good LP tools help with this. For example, the noweb tool gives the -L option which inserts #line directives into the tangled source (Perl has line directives just like C) so that error messages will point to the relevant line in the literate source rather than the tangled source.

Literate Programming (LP) may not be the answer to your dreams of readable source code. However, it does offer improvements over verbose commenting. And while it certainly will not take the place of well thought out design (why bother documenting poor design anyway), it does at least encourage inclusion of the design process with the source code. Another area where LP can be a big win is in teaching -- when the goal is to teach and explain via examples, often an LP approach can be better than simple commenting or line-by-line post analysis.

I certainly don't use LP all the time (perhaps not even as often as I should) -- many scripts simply do not require such an approach, but is is worth looking into when developing larger projects. A few links to find out more about literate programming:

The noweb homepage. Noweb is a language independent LP tool. You can find links to projects and examples from there.
The LP FAQ
Mark-Jason Dominus has an article briefly highlighting LP as contrasted with POD.
PUBcrawl. Once upon a time I thought a literate perl newsletter would be cool. I twisted a few arms in my local PM group to try their hand at writing literate Perl articles -- we created one test issue and PUBcrawl has lain comatose ever since (contact me if you are interested in helping revive it).

Hey, my 100th post!

Comment on Literate Perl? (was Re: Another commenting question) Select or Download Code