The fine art of database programming

Database programming blues

Dear fellow monks,
Every now and then, somebody asks for advice about a problem involving database programming. Sometimes, it is a clean cut question, with the right balance between programming language (Perl) and database issues and a clearly defined example. In these cases, the question gets answered, and everybody is back home happy.

Sometimes, however, the poster doesn't have the foggiest idea of what is wrong, and submits an example with a mix of basic language, DBI, CGI and SQL problems. Then, many people answer these questions, giving specific advice on the part they found wrong, and the basic problem stays unsolved. Not only, when the right answer comes (the one that identifies the inner problem and gives the soundest piece of advice) it may happen that it is lost into a sea of comments that advise on how to correct some macroscopic but unrelated stylistic blunders.

In such cases, I realize that the basic problem afflicting the poster - and some of the willing helpers - is a misunderstanding of database principles.
This article is for both of them, for the ones who want to ask for advice and the ones who are offering their help. An outline of database programming, with particular emphasis on Perl programming, is necessary for both asking the right question and recognizing the right answer.
This article is neither a database nor a Perl tutorial. It's just a categorization of the problems we may face when developing database applications. Although some thoughts can be generalized to any high level language, most of this meditation is specific to Perl.

Before you go on

I assume that you know at least what the DBI is. If you don't, please have a look at the tutorials on database programming, whose contents I don't want to duplicate here.

Programming principles

The soundest approach to programming I have ever come across was the one that explained computer programs as being composed of three parts:

input
processing
output

No matter how complex your program is, it could be always reduced to these three components. Sometimes it's hard to identify one of them (in embedded systems, for example) but they are always there.

The best organization we could have in our programs is to keep those three elements separated as much as we can. This way, if our input device (or protocol, interface, or whatever) changes, we have to modify only one element of the program. Moreover, if something is wrong, we can more easily identify which piece is faulty.
We can follow this approach no matter which technology we are using. In either structural or Object Oriented programming, there are resources and idioms to isolate those elements.

When dealing with databases, each of these parts could be replaced by a database interaction. Sometimes even all three. Here's an example. The first input is from a database table with the official holidays for the organization. If today's holiday, then the second input is the list of employees working today, for whom we'll calculate the appropriate wages, depending of their job and seniority (which we also get from the database). The program's output is a database update of their payroll records.
The above example is not a standalone program, but it's part of a more complex application to deal with human resources management. Depending on the business rules that small program could be just a periodic job with no human intervention, or it could be part of a stricter work-flow, where each input and output requires a supervisor's approval. Either way, having the three component separated will help to integrate this program into the rest of the application.

Source code intricacy

One additional burden is due to different source codes that you might need to use at once for your program. A typical case is a Perl script getting information from a database using SQL queries, and sending output to a web server, thus generating markup language such as HTML or XML.
There are several ways to avoid cluttering your script with many types of code at once. Query generators can hide the SQL code. Templates and wrappers can generate markup language for you. In these cases, you trade code simplicity for debugging complexity.
This problem is especially felt when you are creating SQL statements through some wrappers that replace the SQL code with some other "simpler" instructions. The more you trust a tool (especially if you don't fully understand what it is doing), the more you are exposed to side effects that you might be unable to investigate properly when hell breaks loose.
About wrappers, there are two articles in www.perl.com, the first advocating the the virtues of a wrapper to simplify your coding even at the price of a huge efficiency hit; the other, brilliantly expressed by chromatic, shows how the DBI, assisted by some useful Perl idiom, can deal with any database problem.

In this subject, I am biased toward using the most powerful tool (the DBI), even at the price of some additional typing. It is true that, when you are a beginner, you feel intimidated by the DBI, and you'd like some simplifying solution that will save you some typing and will hide those hideous database details. However, you don't want to be a beginner forever. Or do you?

Some reason for complexity in database programming

The different languages in your script, which we have seen above, are only one symptom of the complexity involving database programming. The real complexity lies in some often overlooked properties of databases.

Client/server

Modern databases work with some sort of client/server protocol. The Perl DBI accepts this behaviour as a common ground for all databases and builds all its structure around such concept.
It is important to grasp the general idea of client/server - and even better would be to understand how your specific database is implementing it.
A database server is a program waiting for a request, which could arrive through the network, via an internal socket, via a file or whatever. The basic principle is its waiting. Then the client sends a request.
The DBI supports the idea of a request being prepared, i.e. analyzed for correctness and possibly optimized, then reused several times with either the same or different parameters. Anyway, if your database doesn't support the preparation, it is conveniently simulated by the DBI.
After a request has been successfully prepared, the client can ask for its execution. Notice that nothing is assumed by the database server about your intentions. Whatever you want, it's your call. It means that the server doesn't go on unless you ask.
So you have asked the server to execute the request. What does that mean? It means that the server will execute your query, and save the results into a temporary buffer of memory, waiting for more requests.
Yes. This is the part that's difficult to come to terms with if you are used to desktop database applications, where your request is the last word, meaning that you have triggered an irrevocable chain of events that will result into a list of records on your screen. A client/server database will wait for your request before doing anything, and after each step it will wait for your next request.
So, you have asked for the query to be executed. The server complies, and now it's waiting for your next wish. For how long it's going to wait? It depends on the database, but it's a safe assumption that its patience will not exceed the life of your connection thread. When you disconnect, or your thread is timed out due to your inactivity, all the temporary space allocated to your request is freed.
It's time for you to get the results from the server. You may ask to have all your results at once or one by one, depending on your estimate of their size and on the availability of such services in your database. The methods $sth->fetchrow_array, $sth->fetchrow_arrayref and $sth->fetchrow_hashref will get you the next available record from your temporary buffer in the server. When the last record has being claimed (or when you issue a $sth->finish method) the server considers your request satisfied and frees the memory occupied by your records. It will still retain information about your query preparation, until you disconnect.

You should bear in mind these steps when planning a database program, especially if you know that many clients could request the same service at the same time. Then, if you have a performance problem, you might guess where the problem lies.

I'll give you one example where a misunderstanding of the client/server principles creates a waste of memory and thus a reduction in general performance.

my $dbh = DBI->connect($DNS, {RaiseError=>1});
my $sth = $dbh->prepare(qq{SELECT product, price, quantity 
FROM products 
WHERE quantity < ? });        

$sth->execute(100);
sub ask_customer {
    # display product details
    # ask customer input 
    # do something smart with customer answer
}

while (my ($product, $price, $quantity) = $sth->fetchrow_array()) {
    ask_customer ($product, $price, $quantity);
}
$dbh->disconnect();
[download]

This script will keep the results locked in the database buffer until the customer has entered some input about all the records. If many clients use the same approach, it will result in the database draining out its memory very soon.
The right approach would be to save all records into an array and then asking for the customer's input. Even better, this situation calls for using one of the fetchall_* methods, asking the server to send you all the results at once.

Data structure

If I had to name a single item where database programming is creating the most confusion for the beginner, I would say that the data structure is the first candidate.
When dealing with a database, your program has an internal data representation that usually doesn't reflect the external data organization. Most databases store data according to the relational model, where information is organized by tables, without redundancy. The links between data in different tables are called relations and are implemented by using key values.
A typical - albeit simplified - database representation of a sales system includes information about the customers and the products. Each customer has a unique field, also known as "primary key", which identifies the record. The same is true for the products.
Thus the order records, instead of repeating all the values from customers and products, store just a reference to them, their identifier. When used in such context, a field referring to a primary key in another table, the identifiers are known as "foreign keys".

+----------------------+   +-----------------------+
|       customer       |   |          order        |
+-----+--------+-------+   +--------+--------+-----+
| ID  | name   | state |   | custID | prodID | qty |
+-----+--------+-------+   +--------+--------+-----+
| C01 | Joe    | NY    |   | C02    | P03    |   9 |
| C02 | Frank  | NY    |   | C02    | P01    |   2 |
| C03 | Bill   | TX    |   | C02    | P02    | 200 |
| C04 | Moe    | MA    |   | C05    | P01    |   3 |
| C05 | Sue    | CA    |   | C05    | P02    | 450 |
+-----+--------+-------+   | C01    | P04    |   5 |
                           | C01    | P03    |   1 |
+----------------------+   +--------+--------+-----+
|       product        |
+-----+-------+--------+
| ID  | price | name   |
+-----+-------+--------+
| P01 | 1.90  | hammer |
| P02 | 0.05  | nail   |
| P03 | 5.50  | pliers |
| P04 | 4.00  | cutter |
+-----+-------+--------+
[download]

This is a rather simplified view, which you should integrate with some more complete explanation of the relational model. (See some references at this address.) Here I just want to point to the shock effect of having this data structure in your program.
If I want to get the above data into my program, I have to decide how to represent the records. Should I blindly copy the same data structure?

my @customers = (
    [ 'C01', 'Joe',   'NY' ],
    [ 'C02', 'Frank', 'NY' ],
    [ 'C03', 'Bill',  'TX' ],
    [ 'C04', 'Moe',   'MA' ],
    [ 'C05', 'Sue',   'CA' ]
);

my @products = (
   [ 'P01', 'hammer', 1.90 ],
   [ 'P02', 'nail',   0.05 ],
   [ 'P03', 'pliers', 5.50 ],
   [ 'P04', 'cutter', 4.00 ]
);

my @orders = (
   [ 'C02', 'P03',   9 ],
   [ 'C02', 'P01',   2 ],
   [ 'C02', 'P02', 200 ],
   [ 'C05', 'P01',   3 ],
   [ 'C05', 'P02', 450 ],
   [ 'C01', 'P04',   4 ],
   [ 'C01', 'P03',   1 ]
 );
[download]

These arrays are a mirror image of what I can get from the database. Why it is not a good idea to do such a thing? For at least two reasons:

Because these records are in the right format for storage, but they are not suitable for being shown to a human being. When I look at a sales report, I want to see that Frank bought two hammers, not that 'C02' bought 2 'P01'. Therefore, I should get such data re-assembled to be human-readable. Every database engine can do that easily, while my program would make quite a lousy job at it.
Also because the number of raw records involved could be really high, and could needlessly occupy all the available memory of my client application, while the database could be asked to return just those pieces of information that I want.

The database engine can return results in a more friendly format, without any additional programming effort from my side. Just ask in a polite SQL statement and the database will combine the tables to return a list of orders or the total amount by each customer.

SELECT cust.name AS customer, prod.name AS product,
price, qty, qty*price AS total
FROM order
    INNER JOIN customer cust ON (cust.ID = custID)
    INNER JOIN product prod ON (prod.ID = prodID)
+------------------------------------------+
|                query results             |
+----------+---------+-------+-----+-------+
| customer | product | price | qty | total |
+----------+---------+-------+-----+-------+
| Frank    | pliers  |  5.50 |   9 | 49.50 |
| Frank    | hammer  |  1.90 |   2 |  3.80 |
| Frank    | nail    |  0.05 | 200 | 10.00 |
| Sue      | hammer  |  1.90 |   3 |  5.70 |
| Sue      | nail    |  0.05 | 450 | 22.50 |
| Joe      | cutter  |  4.00 |   5 | 20.00 |
| Joe      | pliers  |  5.50 |   1 |  5.50 |
+----------+---------+-------+-----+-------+

SELECT cust.name AS customer, SUM(qty*price) AS total
FROM order
    INNER JOIN customer cust ON (cust.ID = custID)
    INNER JOIN product prod ON (prod.ID = prodID)
GROUP BY customer
+------------------+
|  query results   |
+----------+-------+
| customer | total |
+----------+-------+
| Frank    | 63.30 |
| Sue      | 28.20 |
| Joe      | 25.50 |
+----------+-------+
[download]

The above data could be represented in your program by a hash of lists:

my %orders_by_customer = (
'Frank' => [ { product => 'pliers',  qty =>   9 },
             { product => 'hammer',  qty =>   2 },
             { product => 'nail',    qty => 200 }
           ] ,
'Sue'   => [ { product => 'hammer',  qty =>   3 },
             { product => 'nail',    qty => 450 }
           ] ,
'Joe'   => [ { product => 'cutter',  qty =>   5 },
             { product => 'pliers',  qty =>   1 }
           ] 
);
[download]

Transforming the data from the database into this common Perl structure is fairly simple, thanks to a well known idiom:

my %orders_by_customer = ();
while (my $href = $sth->fetchrow_hashref()) {
    my %order = (
        'product' => $href->{'product'}, 
        'qty' => $href->{qty}
   );
   push @{$orders_by_customer{$href->{'customer'}}}, \%order;
}
[download]

Another cause for distress is for the programmer who understands some of the problems coming from the relational model, but fails to understand that each result from the database is just one table, which we can see as a list of lists or a list of hashes, but it is a single entity. Whether it is a query from one table or the joined result of multiple tables, the result from the database is always one bi-dimensional array.
Who is used to desktop database applications, maybe the ones with a colorful GUI, has been exposed to the idea that tables can be "opened" and "edited", and then each record seen on the screen is the actual value in the database. Not so with client/server databases. Even if you are using an application that says "open table", what you get on your screen is the result of a query, just a copy of the actual data, and to modify it you have to issue an SQL command.
The beginner may suffer from the apparent dichotomy of having a permanent external data structure that often does not match with the one used for internal purposes. However, bearing in mind that the external data structure is not our business, since we can ask the database engine to return the most suitable result, we can concentrate on the internal part, considering the database interaction just a method to fill in our variables.

Number

Databases can hold huge amounts of records. It should not come as a surprise to us. However, we sometimes forget that also database queries can give back a large number of records.
This fact could be neglected when we build an application and we use a test database that is likely to be empty or to hold just a handful of records. What could seem to work flawlessly in a database with just 100 records and three users might become disastrously inefficient when loaded with ten million records and thirty thousand users.
Database engines can filter your requests and give you an answer surprisingly quickly, provided that you asked the right thing.
Unless you have a very good reason, asking for one million records at once is unlikely to be "the right thing".

More database trouble

SQL dialects

SQL is an abstract concept. You can't learn practical SQL without associating it with a specific database. And then you realize that what you are using is a database specific dialect, which looks like the mythical ANSI SQL, but it is very rarely implemented as such. It is something like learning English. There is no "standard English" anywhere, only dominant dialects, spread throughout the world by the force of Hollywood, the BBC and the CNN. Depending on which stream you were exposed, you will say "pavement" or "sidewalk" and you'll wear a "vest" under or over your shirt.
Similarly, if you have learned the Oracle syntax for a left join, you might look suspiciously at the MySQL equivalent.
The standard ANSI SQL provides many common chunks of the language, making you cope with most of the day-by-day cases. Knowing which parts of SQL implemented in your database of choice will comply with the standard is not only academic knowledge. It is a good investment towards portability and simplicity. If you use only standard constructs, you are more likely to find an answer when you get into trouble. If you are using only database-specific idioms, then you either are a wizard or you have to find one when things go wrong.

Wrapper modules

The DBI is the official interface between Perl and databases. It comes on the assumption that you know what you want, and you also know the underlying database, its design and how to deal with it. It may seem harsh, but such is life. Database programming does not have shortcuts. You may find some module that will help you to save some keystrokes. Some modules will handle many aspects of the database complexity for you. Some will ever promise to make database programming easier to understand, or to free you from the burden of database design. Don't fall into that trap. A DBI wrapper could be useful to speed up your development process or to help you handle the complexity of the database schemas. But none of them will replace your understanding of the subject. If you use a wrapper that promise you to deal with INNER, LEFT and RIGH JOINs, which you can't fully understand, then you are trusting an unknown piece of code to do something of which you have a limited command.
There are way too many database modules in the CPAN. Some of them are extremely valuable, provided that you understand what they are doing. Some are just a smart engine to save you some typing at the price of crippling the performance of your application and your flexibility in choosing the the most appropriate solution.
Like many other technological gadgets, DBI wrappers could be of help for the beginner, but they will keep the beginner in the lower end of the professional growth. Many people say that the DBI has a steep learning curve. That is not true. If you are not familiar with any relational client/server database, then the DBI will be as difficult as learning your first foreign language. However, if you know how a relational database works, especially if you know how a client/server engine reacts to your requests, then you can learn the DBI by evaluating similarities and differences with the general concepts that you already have in your bag of tricks.
The DBI is a tool. You can use it well if you know your trade. If you lack general database understanding, using the DBI will be like driving in the dark with your headlights off, and no cute module can protect you from an incoming disaster.

Note: Don't get me wrong. I am not against DBI wrappers. Actually, I appreciate some of them very much, like DBIx::DWIW. I just prefer to be in control. Others may feel differently. I don't impose my view on anybody and I keep an open mind about which wrappers could be useful for my needs.

Who's the client?

I said that client/server database applications have their specific annoyances, compared to stand-alone or desktop databases. One more source of confusion arises when we are dealing with a database server and a web server. Who is the client here?
Although the ultimate client of the web server is the user sitting in front of a browser, the client for the database is always a web server process. This is true even when the web client had to provide her username and password for database access. The web server gets the access information, does the talking with the database and returns an answer to the client.
It's important to remember this fact whenever there is a connection problem with the database. Unless username or password are wrong, the problem doesn't lie with the ultimate client (the web client), but with the web server. Testing the database connection with the appropriate user information outside the web server will usually tell you if you were providing your script the right parameters.

Final advice

It's said that the only good advice you can get comes from your own experience, especially your own mistakes. While it is true that I cannot magically transfer to you my database programming experience, I can at least give you my summary of what I feel is its positive outcome. You may decide if it is useful and eventually give it a try.

Enhance your skills

Know your database. First and foremost, you need to know what is behind the scenes. The database you are interacting with should not have any mysteries for you. At the very least, you must be sure that what you want to achieve through your script is exactly what you would do with a direct interaction with the database.
Know Perl idioms. Most of your dialogs with databases are achieved through intermediate data structures that you use to send data to the database and get results from it. You should feel confident in the usage of any idioms involved in your tasks. Avoid the cargo cult temptation of using something you have seen but you don't understand. Read the documentation about that particular idiom you need, and test it with a script that uses only that idiom and nothing else. Search the examples in The Monastery and read the explanations.
Know the intermediate tools (e.g.: DBI, CGI, HTML::Template, TT, Mason, etc). If these tools are your choice for developing your applications, you should reach a deep knowledge of their interface. You must be confident in using them, and be sure that you are either using the most recommended methods or have a very good reason if you don't.

About all of the above, you may safely ignore features that you are sure you'll never need, but you must not neglect any of the features you need to use. For those ones, you must know everything is available in the docs. The time you spend learning how those features work will result in less time debugging your scripts later.

Separation

If you consider all the elements involving database programming, the best piece of advice I can give you is to test such elements separately. When you face a problem, you should at least be able to attribute the problem to the appropriate layer: the script body, the web server interface, the DBI interaction, or whatever other module you may be using for your purpose.

Test your subs in a vacuum. It is a good practice to isolate each function of your scripts or modules, and test them with an isolated script, to be sure that the function is doing exactly what you want. See chromatic's article on testing and The Joy of Test for more on this point. Also a tour of Super Search on this subject would be a good investment. update See also this node (thanks, grinder) for a practical implementation of a testing strategy
Test your DB queries outside Perl. Before blaming your Perl scripts, be sure that the SQL query you want to run is correct. Test it from a stand-alone database client and see if you get the expected result. If your query is broken, there is no point in insisting with the script.
Test your web server without Perl scripts. As for database exchange, also the services of your web server could not be what you expect. Before trying your complex CGI/DBI application, be sure that your Apache can serve static HTML pages and simple CGI programs such as a trivial shell script. When all of the above works, you can try with a very basic Perl/CGI script. Once you are sure that the basic functionality is in order, it's time to try your application, again -whenever possible - testing the different subs separately.

Database related advice

The soundest advice I can give you is "Design before using". While this is true for all computer programming, it has a special meaning for database, because it implies some knowledge of the relational model.¹

Normalizing. It is the most important factor. It has been discussed here before (Migrating a 1NF table to multiple 2NF tables and Database normalization the easier way for instance) and there are places where you can look. I won't go into details here. But let me tell you that using a relational database without normalizing is like towing a truck with a Ferrari.
Primary key. Every table in your database must have a primary key. If you have a performance problem that you can't identify, check for primary keys first. If one or more of your tables lack a primary key, the problem is (at least partially) there.
Indexes on foreign keys. All foreign keys should be indexed. Your JOINs will work much faster if you remember and apply this simple rule. If you find that you have too many foreign keys in one table, then perhaps you should go back to the normalization process.

¹ Or the Object oriented model, if your database of choice supports it. Just don't throw everything into a table without careful analysis.

Common pitfalls

In database programming, you can shoot yourself in the foot in several ways. Even taking into consideration all the above wisdom and applying all the advice, your application can collapse by means of some silly overlooking. I will list here the most common ones, because I have seen them in too many SOPW requests.

List of hashes are a commonly needed structure, either for your own purposes or to accommodate the requests of other modules. Recent versions of the DBI can return these structures directly (fetchall_hashref, selectall_hashref). However, if for any reason you have to use an older release, or other applications needs force you to use another method, you can use the idiom mentioned above, in the paragraph about "structure".
The difficult part is to remember that, in order to assign another hash element to the array, the hash must be lexically scoped inside the loop block, while the array must be either global or lexically scoped outside the block. If we don't have a new hash for each iteration, we will create an array where all the records are the same and the values are the ones of the last hash.
Quoting. A more insidious problem comes from quoting. The typical pitfall goes like this. You have tested your SQL query with a stand-alone client, and it works.
```
 
    my $surname = 'Jones';
    my $query= 
        qq{SELECT name, surname FROM employees WHERE surname = $surnam
+e };
    my $sth = $dbh->prepare($query);
    $sth->execute();
[download]
```
When you execute that query through the DBI, you get an error, saying that the 'Jones' column was not found or a statement to that effect. What happened? You forgot that text column should be quoted. The value of your $surname variable was passed to the SQL query as a bareword.
You may be tempted to solve the problem by changing your query snippet into WHERE surname = '$surname', but it will break when your surname is "O'Reilly".
The DBI offers two solutions to this problem. One is the $dbh->quote($string) method, which will quote your string according to your database's rules. The second one, which I heartily recommend, is the binding mechanism through placeholders. Your query should become:
```
 
    my $query= qq{SELECT name, surname FROM employees WHERE surname = 
+? };
    my $sth = $dbh->prepare($query);
    $sth->execute('Jones');
[download]
```
The "?" is the placeholder, and the DBI knows that it should be replaced by the value passed to the execute method. The DBI will automatically take care of quoting the variables passed through placeholders, thus relieving you from an implementation headache.

That's all, folks!

That was all I could do without crossing the boundary between article and book.
I hope I managed to shed some light on a few dark, neglected aspects of database programming.
The bottom line is:
To make a better application, plan in advance, know your tools, and let the database engine do its job.
Any further contribution will be welcome.

update 4-May-2002. Fixed some typos.

 _  _ _  _  
(_|| | |(_|><
 _|

Back to Meditations