XML stream processing

dug has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monastics.

I've recently been working with XML::SAX::Machines (specifically XML::SAX::ByRecord), setting up a stream parser for large collections of XML documents.

Essentially, the stream looks like:

<Container>
<Doc>
  <content>stuff</content>
</Doc>
<Doc>
  <content>different stuff</content>
</Doc>
...
</Container>

I need to be able to grab everything between (and including) <Doc> and </Doc> as it comes through the stream, and treat it like its own "Document".

All of the examples for XML::SAX::ByRecord that I've looked at showed how to write *filters* that process as I've described. None that I have seen (probably a problem with my eyesight, not the documentation) have explained how to work with each of these "Documents" in the stream as its own isolated chunk of content so that one can process it independently of the filter.

Below is the code that I've come up with to handle the task described above. I can't help thinking that it's a bit of a kludge. What is a more elegant way to deal with this type of stream processing?

Thanks in advance,

dug

#!/usr/bin/perl

use warnings;
use strict;
$|++;

use XML::SAX::Machines qw( :all );

my $output_handle; # global stream output container

##
# callback for end_document event.
my $write_hook = sub {
  my $self = shift;
  my $current_doc = $output_handle; # get contents of output buffer
  $output_handle = '';              # clear buffer for next doc
  ## process current doc
  process_doc( $current_doc );
};

my $filter = EndDocumentAction->new(end_hook => $write_hook);

my $machine = Pipeline(
  ByRecord( $filter ),
  \$output_handle,
);

$machine->parse_file( \*DATA );

sub process_doc {
  my $content = shift;
  # do something interesting
  print $content, "\n";
}

package EndDocumentAction;
use base qw( XML::SAX::Base );

sub new {
  my ($class, %args) = @_;
  my $self = {};
  $self->{End_Hook} = $args{end_hook}; # install callback for end_document
  $self->{start_counter} = 0;
  bless $self, $class;
  return $self;
}

sub end_document {
  my $self = shift;
  my $callback = $self->{End_Hook};
  $self->$callback();
}

1;

__END__
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>

Re: XML stream processing
by PodMaster (Abbot) on Nov 15, 2002 at 14:45 UTC

XML::Twig

use XML::Twig;
my $t = XML::Twig->new( 
    twig_roots => {
        Doc => \&mutilate_doc,
    },
); 

$t->parse(\*DATA);

sub mutilate_doc {
    my( $t, $doc)= @_;
    $doc->print;
    print "\n",'x'x69,"\n";
}

__END__
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>
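If each <Doc> is wanted as a standalone string (as in the original question's process_doc), the same idea can be sketched with twig_handlers plus sprint. This is only a minimal sketch under a few assumptions: the sample XML is inlined in a heredoc instead of read from DATA, the @docs array exists purely so the result can be inspected, and process_doc is the placeholder name from the question. The purge call is there because the question mentions large collections; it releases what has already been handled, so the whole stream should not need to sit in memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Sample input, inlined here for the sketch (the thread reads it from DATA).
my $xml = <<'XML';
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>
XML

my @docs;    # hypothetical: kept only so the collected records can be checked

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <Doc> element has been parsed.
        Doc => sub {
            my ( $t, $doc ) = @_;
            process_doc( $doc->sprint );    # serialize this <Doc>, tags included
            $t->purge;                      # discard what has been parsed so far
        },
    },
);

$twig->parse($xml);

sub process_doc {
    my $content = shift;
    push @docs, $content;
    # do something interesting
    print $content, "\n";
}
```

For a real stream, passing a filehandle to parse (or using parsefile) in place of the heredoc string should behave the same way, one record at a time.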

mirod

http://www.xmltwig.com/

IEEE

Besides XML::Twig, my favorite ~~markup~~ sgml-type language processing module is HTML::TokeParser::Simple :)

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.
