PerlMonks  

Avoiding compound data in software and system design

by metaperl (Curate)
on Apr 20, 2010 at 21:36 UTC (perlmeditation, id://835894)

That's like asking for the chocolate back after you've made chocolate milk!

This post concerns a tragedy in API design. While it is often convenient to have your API receive compound data, it is a bad idea.

What is compound data?

A compound datum is an apparently atomic data item that is really not atomic. Here are two examples of API tragedy:

DBI

DBI->connect("dbi:mysql:database=sakila;host=localhost;port=3306", $username, $password);
The first argument to connect is compound data. It is a single string argument, but it contains important subelements:
  1. dbi
  2. mysql
  3. database
  4. host
  5. port

so what?

Well, take a look at the Rose::DB register_db API. It expects you to supply things like "mysql" and "host" in separate parameters:
__PACKAGE__->register_db(
    driver   => 'pg',
    database => 'my_db',
    host     => 'localhost',
    username => 'joeuser',
    password => 'mysecret',
);
So, what happens if you've been using DBI and you have all your connect data in configuration files with DSN strings and then you decide you want to start using Rose::DB::Object? You have to find some way of getting the chocolate back from chocolate milk --- you have to break down the compound data in the dsn in order to get out the sub-elements.
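To make "getting the chocolate back" concrete, here is a minimal sketch of pulling the sub-elements out of a DSN string. The helper name split_dsn is made up for illustration. It assumes the simple scheme:driver:key=value;key=value shape of the example above and ignores driver-specific variations. (DBI itself offers DBI->parse_dsn, which splits off the scheme and driver but hands back the driver-specific remainder as a string.)

```perl
use strict;
use warnings;

# Hypothetical helper, not part of DBI: split a DSN such as
# "dbi:mysql:database=sakila;host=localhost;port=3306" into named parts.
sub split_dsn {
    my ($dsn) = @_;

    # scheme : driver : driver-specific remainder
    my ($scheme, $driver, $rest) = split /:/, $dsn, 3;
    die "not a DBI DSN: $dsn\n"
        unless defined $driver && lc($scheme) eq 'dbi';

    my %parts = (driver => $driver);
    for my $pair (split /;/, $rest // '') {
        my ($key, $value) = split /=/, $pair, 2;
        $parts{$key} = $value if defined $value;
    }
    return \%parts;
}

my $parts = split_dsn('dbi:mysql:database=sakila;host=localhost;port=3306');
print join(', ', map { "$_=$parts->{$_}" } sort keys %$parts), "\n";
```

It is not much code, but every application that wants the parts has to carry something like it, and it only works for DSNs that follow this one layout.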

DBIx::DBH

DBIx::DBH was written long ago to address this issue. You supply the sub-elements of the dsn and it forms the compound data for you.
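The opposite direction, going from separate sub-elements to a DSN, could be sketched like this. This is not the DBIx::DBH API; build_dsn and its parameter names are assumptions chosen to match the connect example above.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a DSN from named parts.
# 'driver' is required; every other key becomes a key=value attribute.
sub build_dsn {
    my (%arg) = @_;
    my $driver = delete $arg{driver}
        or die "driver is required\n";

    # Sort the keys so the result is stable from run to run.
    my $attrs = join ';', map { "$_=$arg{$_}" } sort keys %arg;
    return "dbi:${driver}:$attrs";
}

my $dsn = build_dsn(
    driver   => 'mysql',
    database => 'sakila',
    host     => 'localhost',
    port     => 3306,
);
print "$dsn\n";
```

Mixing the parts together is the easy direction, which is the point of the chocolate milk comparison.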

HTML::Zoom

HTML::Zoom is a promising push-style templating system developed by Matt Trout. Let's take a look at how you select an HTML tag with id equal to "hithere":
$zoom->select('#hithere');
So "hithere" is compound data. The octothorpe is shorthand for id and "hithere" is the value of an id... that's two things packed into a single scalar.
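As a rough illustration of the two things packed into that scalar, here is a sketch that unpacks the simplest selectors. The function unpack_selector is hypothetical and has nothing to do with how HTML::Zoom parses selectors internally; it only handles a leading '#' or '.' and treats anything else as a tag name.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Zoom: split a simple selector
# into the kind of lookup and the value being looked up.
sub unpack_selector {
    my ($selector) = @_;
    my %kind_for = ('#' => 'id', '.' => 'class');

    if ($selector =~ /\A([#.])(.+)\z/) {
        return ($kind_for{$1}, $2);
    }
    return (tag => $selector);
}

my ($kind, $value) = unpack_selector('#hithere');
print "$kind => $value\n";
```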

suggested non-compound top-level API call

$zoom->id('hithere');
and similarly for class lookdowns, etc.

The HTML::Element look_down() method had an excellent non-compound approach.

That being said, the current API meshes well with the jQuery API and it is also intuitive for web designers, so there are definitely some reasons why it is OK.

Relational Database Design

I don't have the time to get into all the terrifying ways that people overload single columns with compound data.

Directory Trees

If you see a directory structure like:
drwxrwxrwx ... bugs-old
drwxrwxrwx ... bugs-new
drwxrwxrwx ... bugs-closed
you have compound data, which you need to "normalize" via bugs/old, bugs/new, bugs/closed
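A small sketch of that normalization, working on names only and not touching the filesystem. The helper normalize_dir is hypothetical and assumes the first hyphen is the separator between the category and its state.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a flat "category-state" name into a nested path.
# Names without a hyphen are returned unchanged.
sub normalize_dir {
    my ($name) = @_;
    my ($category, $state) = split /-/, $name, 2;
    return defined $state ? "$category/$state" : $category;
}

my @normalized = map { normalize_dir($_) } qw(bugs-old bugs-new bugs-closed);
print "$_\n" for @normalized;
```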

conclusion

I hope this post helped someone. Typically people either know this and don't need to be told, or they don't know it and don't care :)



The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development.

-- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"

Replies are listed 'Best First'.
Re: Avoiding compound data in software and system design
by JavaFan (Canon) on Apr 21, 2010 at 00:11 UTC
    OTOH, if you call it "serialized data", it sounds less scary, and most people will realize they do this often. (YAML, Data::Dumper, etc, are all examples of creators of compound aka serialized data).

    None of your examples makes me worried, or makes me regard compound data as something evil. Your article just says "Yup, compound data has its uses. At other times, it's not the right thing." But that's true for almost anything.

Re: Avoiding compound data in software and system design
by BrowserUk (Patriarch) on Apr 21, 2010 at 00:20 UTC

    Sorry, but this smacks of: I just got bitten by something, so now I'm gonna demonise it.

    1. Are hashes evil? They consist of keys and values.
    2. Floats? Exponent and mantissa.
    3. Integers? Magnitude and sign.
    4. Bytes? Many bits.
    5. Strings? ...
    6. Objects? ...

    I can't use my new vacuum cleaner in its box, but I'm glad it came in one.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Howdy!

      Those are all red herrings.

      Hashes are not scalar data; they are a collection of values indexed by key. Hashes using the old sub-key thingy that predated references would be an example, but not because they are hashes.

      The scalar data types are usefully atomic. If you need to work with the sub-parts of the underlying representation, you get to disassemble them yourself. Strings, per se, are only compound insofar as you define the values to be so and need to work with individual parts. Objects, more or less by definition, *can* have numerous attributes, but the parts are explicit and individually addressable (for most sane implementations).

      I see the point; it needs to be applied judiciously.

      yours,
      Michael

        The DSN data type is usefully atomic. If you need to work with the sub-parts of the underlying representation, you get to disassemble them yourself.

        Seeing as you can swap in DSN for a scalar data type, what you said of scalar data types applies to DSNs as well.

        By your logic, the problem isn't the compoundness of DSNs, it's the lack or perceived lack of tools to manipulate DSNs.

      Sorry, but this smacks of: I just got bitten by something, so now I'm gonna demonise it
      Yes, it's called evolution. Intelligence is the ability to identify, formulate and resolve problems. So this post was made to identify and formulate a problem in hopes that it is not repeated. And yes, I did get bitten by the DBI API and now I have to go redo something so it works with Rose.

      Continuing, let me present the definition of compound data to you once again:

      A compound datum is an apparently atomic data item that is really not atomic.
      Are hashes evil? They consist of keys and values.
      Evil? You brought demons into the picture, not me. The point at hand is "apparently atomic". Hashes are not apparently atomic. You dissected hashes into their parts yourself.

      Now, if instead of this hash:

      %a = (a => 1, b => 2);
      you did this: my $vals = "a:1,b:2"; then you would have an apparently atomic data item that is really not atomic, because you would have to do string-twiddling to extract the relevant subparts.
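For illustration, the string-twiddling in question might look like the sketch below. The variable names are only for the example, and it assumes keys and values never contain a comma or a colon of their own.

```perl
use strict;
use warnings;

my $vals = "a:1,b:2";

# Split on commas to get the pairs, then on the first colon of each pair.
my %parsed;
for my $pair (split /,/, $vals) {
    my ($key, $value) = split /:/, $pair, 2;
    $parsed{$key} = $value;
}

print "$_ => $parsed{$_}\n" for sort keys %parsed;
```

With the hash, none of this is needed: $a{b} is already addressable.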
      Floats? Exponent and mantissa.
      Seems atomic to me. And the subparts you mention, can they be easily accessed/used?
      Integers? Magnitude and sign.
      or 32 bits (grin).
      my $int = Integer->new(magnitude => 12, sign => '+');
      ah, perfect decomposition!

      My post did not say it listed all examples of compound data. And if there are more, then fine. Besides, the focus was on software and system design, not language elements.

      Bytes? Many bits.
      Again, complex data is not 'compound data'. Compound is a specific term referring to a specific mistake in software and system design.
      Strings?
      Yes, they are complex, but only compound when mis-used.
      Objects? ...
      Yes, an object is atomic, not apparently atomic. It may have subparts, but each has a well-defined means of accessing/changing it.
      I can't use my new vacuum cleaner in its box, but I'm glad it came in one.
      You are confusing a complex of objects with compound data. The vacuum cleaner's relation to the box was meaningful and useful. Packing multiple datums into a string is counter-productive to flexible software and system design as was demonstrated.




        You are confusing a complex of objects with compound data.

        No I'm not. You are making an artificial separation where none exists.

        Take urls. These are both complex and compound. And simple.

        Whilst there are (many) modules like URI* that allow you to treat these as objects and access all their internal bits separately, the vast majority of modules that use urls as inputs (e.g. LWP*) take them in their simple string form. Why?

        Because they do not care what is inside, and do not want to have to deal with it. For most applications of those latter modules, the user will be supplying a 'simple string', picked out of a text file (log file; html; whatever), and all they need or want to know is, can I reach it?

        If they had to tease apart the myriad forms of url/uri/urn formats in order to populate a ur* object in order to pass it to LWP*--that would promptly just stick all the bits back together again--it would be an entirely unnecessary waste of time & resources. Complexity without merit or benefit.

        Same goes for file system entities. We pass open a string, not some kind of FileSystem::Object. Because for the most part, they are simply an opaque scalar entity we use. Not pick apart and fret over.

        And the same goes for your example of DBI data source names. At the DBI level, and below, they are simply opaque entities to be gathered and passed through uninspected. Requiring some kind of object be used for them would create unnecessary and useless complexity.

        They do not even have a consistent constitution. Your example breaks them down as

        dbi, mysql, database, host, port

        And then as

        __PACKAGE__->register_db( driver => 'pg', database => 'my_db', host => 'localhost', username => 'joeuser', password => 'mysecret', );

        but you've lost two parts (dbi/port) and gained two parts (user/pass).

        And then you get something like DBD::WMI, which doesn't need and cannot use most of those--either set of 5. And DBD::SQLite that also has no use for most of those fields. And these came into being long after the DBI/DBD interfaces were designed and implemented.

        Rather than something to be "avoided", DBI's use of a string for the data source name is the sign of a well-thought-through, flexible interface. One that recognises that you cannot fit the world into labelled boxes, and that in many situations, there is no purpose in trying.

        You should be celebrating the vision and skill of those authors for designing an interface so flexible it can accommodate future developments without requiring constant re-writes as time passes and uses evolve. Not decrying them.

        Consider: Will your interfaces survive so long, so well?



        You did this: my $vals = "a:1,b:2" then you would have an apparently atomic data item that is really not atomic, because you would have to do string-twiddling to extract relevant subparts.

        I don't see why searching through an associative array stored as "a:1,b:2" makes the type not atomic when the example you used for an atomic type ({a=>1,b=>2}) is an associative array that requires searching through a list of buckets then through a linked list.

Re: Avoiding compound data in software and system design
by Jenda (Abbot) on Apr 21, 2010 at 11:26 UTC

    It's the same thing with compound data and database normalization. What's atomic/normalized in one situation is compound/denormalized in another. Even if the data look exactly the same. And you can't tell without context.

    Sure, you should stop and think about the level of atomicity at which to store some data, but there is no hard rule.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Avoiding compound data in software and system design
by PeterPeiGuo (Hermit) on Apr 21, 2010 at 06:56 UTC

    Very good topic! In the database world, everyone knows that one should not store "compound data" in a single column - at least, everyone is supposed to know.

    But people don't often talk about the same thing (at least not as explicitly or as often) outside of the database world. Good reminder!

    Peter (Guo) Pei

Re: Avoiding compound data in software and system design
by JavaFan (Canon) on Apr 28, 2010 at 12:58 UTC
    So, what happens if you've been using DBI and you have all your connect data in configuration files with DSN strings and then you decide you want to start using Rose::DB::Object? You have to find some way of getting the chocolate back from chocolate milk --- you have to break down the compound data in the dsn in order to get out the sub-elements.
    First of all, it's not the fault of DBI that you decided to store DSN strings in your configuration file. If you had stored it as (for instance):
    driver = mysql
    host = localhost
    username = joe
    password = s3c41+
    port = 3306
    it would be trivial to construct a dsn, and a parameter list for Rose::DB::Object. You could also share the configuration file with applications written in a different language.
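A minimal sketch of that idea: read the key = value lines into a hash, then derive both forms from it. The parse_config helper and the exact file format are assumptions for the example; the Rose-style hash only uses parameter names that appear in the register_db call quoted in the original post, and nothing here calls DBI or Rose::DB.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key = value" lines into a hash.
# Lines that do not match that shape are skipped.
sub parse_config {
    my ($text) = @_;
    my %conf;
    for my $line (split /\n/, $text) {
        next unless $line =~ /\A\s*(\w+)\s*=\s*(.*?)\s*\z/;
        $conf{$1} = $2;
    }
    return %conf;
}

my %conf = parse_config(<<'END');
driver = mysql
host = localhost
username = joe
password = s3c41+
port = 3306
END

# Compound form: a DSN string of the kind passed to DBI->connect.
my $dsn = "dbi:$conf{driver}:host=$conf{host};port=$conf{port}";

# Separate form: named parameters in the style of register_db.
my %rose_args = map { $_ => $conf{$_} } qw(driver host username password);

print "$dsn\n";
print "$_ => $rose_args{$_}\n" for sort keys %rose_args;
```

Stored this way, the configuration is the single source and each API gets the shape it wants.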

    Second, if you're switching your API from DBI to Rose::DB::Object, I'd think you'll have to change a lot of your code anyway. Parsing out a dsn string shouldn't be that much more work.

Re: Avoiding compound data in software and system design
by dmlond (Acolyte) on Apr 22, 2010 at 19:10 UTC

    I have to agree that this is really not as tragic as you make it out to be. There are severe tradeoffs between the use of intermediate objects designed for inter-package compatibility (DBI, and Rose::DB::Object), and the use of simple serialized strings.

    I think the example of 'open' is particularly instructive as a counterargument to your critique. Try opening a file in Java. You first have to create a Buffer object around a Reader object which wraps a File object. You will write this code millions of times in your lifetime, and you will always wonder why, especially if you use languages like perl, ruby, or python which allow you to just open a string path.

    Getting the different parts of the DBI connect string is not nearly as energetically expensive as getting the chocolate back out of chocolate milk. Nor is it too much work for people who use both DBI and Rose::DB::Object to store their connect strings as serialized YAML, JSON, etc. to be used by code to construct the arguments suitable for the context that they are to be used in. Now, if it is the case that there are more than 10 people out there that wish there was a compatibility layer on DBI that allowed it to take the same arguments as Rose::DB::Object (and there may very well be, so speak up if you are reading this), they should decide which of them wants to get involved with the DBI codebase to provide this functionality. It would probably not be that hard to override DBI connect to take the same connect params as Rose::DB::Object, nor would it be hard to create a separate CPAN module that can take a central hash argument (such as might be retrieved from a YAML, JSON, etc. serialized configuration file), and provide methods to construct dbi_connect_string, or rose_db_object_connect_params, etc.

Re: Avoiding compound data in software and system design
by ruoso (Curate) on Apr 22, 2010 at 10:13 UTC

    You know, while I agree that sometimes compound data is problematic (and I refer, for instance, to the way options are stored in the WordPress database: they are PHP serialized data), sometimes you do want an atomic value in the API.

    The dsn parameter in the DBI API is a good example. It is a "connection URI" and it makes it simpler to use, the same way that typing a www URL is simpler than sending the individual parts by name... The problem is with modules that break the consistency by trying to split that data.

    daniel

Node Type: perlmeditation [id://835894]
Approved by Old_Gray_Bear
Front-paged by Arunbear