Re^3: Are you looking at XML processing the right way? (merge)

in reply to Re2: Are you looking at XML processing the right way? (merge)
in thread is XML too hard?

The point about "merge sort" is about callbacks versus iterators not about XML. Compare:

# Code using iterator:
    while(  <INPUT>  ) {
        process_line( $_ );
    }

# Code using call-back:
    File::ProcessLines(
        \*INPUT,
        sub { process_line($_) }
    );
[download]

and you see that the differences appear rather superficial and psychological.

but it would seem that merge-sorting would be very simple with N streams using callbacks

No, it is impossible. Using callbacks means that you have to completely process one stream before you get control back to process another stream. You can't process 2 streams at once with callbacks, much less N streams.

Consider this code:

    # A merge sort:
    my $r1= <$i1>;
    my $r2= <$i2>;
    while(  ! eof($i1)  &&  ! eof($i2)  ) {
        my $cmp= $r1 cmp $r2;
        print $cmp le 0 ? $r1 : $r2;
        $r1= <$i1>   if  $cmp le 0;
        $r2= <$i2>   if  0 le $cmp;
    }
    # ...
[download]

Now rewrite the above using File::ProcessLines and callbacks. You can't. It is impossible. To do it requires continuations which Perl doesn't have. Let's try:

    File::ProcessLines(
        $i1,
        sub {   # sub1
            my $r1= shift(@_);
            File::ProcessLines(
                $i2,
                sub {   # sub2
                    my $r2= shift(@_);
                    if(  $r1 lt $r2  ) {
                        return_from_sub1_but_not_from_sub2;
                    # ...
                }
            );
        }
    );
[download]

So we can go as far as getting the first two records of each stream. But to get the second record of the first stream requires us to return from the first callback which won't happen until the entire second stream is processed.

The point is that callbacks are not just harder to use, they are also fundamentally less flexible. They require processing be done in an extremely restricted linear order and make it very unnatural to even share state between the callbacks.

Iterators are more flexible. Iterators that can seek are even more flexible. A random access data structure is still more flexible.

Now, for an XML example. Assume there is some web site that discusses Perl. Assume also that this site has a chatterbox and you can get the last 10 lines of chatter from a XML ticker. You fetch chatter, wait a while and fetch it again. Now you want to combine those two. You could certainly use a merge sort for that. But that is impossible using callbacks.

There are other ways you could merge such data. In this case, the data is only 10 lines so not being able to use merge sort isn't a huge problem. Also, the data should only overlap in one chunk, so you could even use callbacks to do this merging but it would be much more difficult than if you used a more flexible interface (and it would be impossible to make it deal well with some exceptions).

Let's also assume that there are several people who have written their own chatter archiving systems. But each system has periods of down time for various reasons. Now you want to combine these archives to get as complete an archive as possible. They've each stored their data in different formats, of course. The obvious solution is to have each site send you their data in XML; it is nearly the canonical example for what XML is useful for. Now you have a case where a merge sort is important. But your XML parser only supports callbacks. So you are forced to convert each stream into something other than XML and then merge the new streams. What a waste.

Note that callbacks can also be used where the first call does not process the entire stream before returning. This gives you a bit of a combination between callbacks and iterators (you 'iterate' to the next chunk which causes one or more of your callbacks to be called). So you can iterate whatever the "chunks" are but are forced to process each chunk using callbacks.

In summary: Yes, callbacks are fundamentally one of the least flexible interfaces you can provide. They make it easy for the module writer to provide the interface and make it hard for the module user to use the interface. And it is not just a matter of "getting used to" using callbacks.

- tye

In Section Meditations