PerlMonks  

XML stream processing

by dug (Chaplain)
on Nov 15, 2002 at 05:47 UTC ( [id://213078]=perlquestion )

dug has asked for the wisdom of the Perl Monks concerning the following question:

Hello, fellow monastics.

  I've recently been working with XML::SAX::Machines (specifically XML::SAX::ByRecord), setting up a stream parser for large collections of XML documents.

  Essentially, the stream looks like:

<Container> <Doc> <content>stuff</content> </Doc> <Doc> <content>different stuff</content> </Doc> ... </Container>

  I need to be able to grab everything between (and including) <Doc> and </Doc> as it comes through the stream, and treat it like its own "Document".


  All of the examples for XML::SAX::ByRecord that I've looked at showed how to write *filters* that process as I've described. None that I have seen (probably a problem with my eyesight, not the documentation) have explained how to work with each of these "Documents" in the stream as its own isolated chunk of content so that one can process it independently of the filter.

Below is the code that I've come up with to handle the task that I've explained above. I can't help thinking that it's a bit of a kludge. What is a more elegant way to deal with this type of stream processing?

Thanks in advance,

  dug

#!/usr/bin/perl
use warnings;
use strict;
$|++;

use XML::SAX::Machines qw( :all );

my $output_handle;    # global stream output container

##
# callback for end_document event.
my $write_hook = sub {
    my $self = shift;
    my $current_doc = $output_handle;    # get contents of output buffer
    $output_handle = '';                 # clear buffer for next doc

    ## process current doc
    process_doc( $current_doc );
};

my $filter = EndDocumentAction->new(end_hook => $write_hook);

my $machine = Pipeline(
    ByRecord( $filter ),
    \$output_handle,
);

$machine->parse_file( \*DATA );

sub process_doc {
    my $content = shift;
    # do something interesting
    print $content, "\n";
}

package EndDocumentAction;

use base qw( XML::SAX::Base );

sub new {
    my ($class, %args) = @_;
    my $self = {};
    $self->{End_Hook} = $args{end_hook};    # install callback for end_document
    $self->{start_counter} = 0;
    bless $self, $class;
    return $self;
}

sub end_document {
    my $self = shift;
    my $callback = $self->{End_Hook};
    $self->$callback();
}

1;

__END__
<Stream>
  <Doc>
    <foo>hey man</foo>
  </Doc>
  <Doc>
    <bar>hey man, how's it goin'?</bar>
  </Doc>
  <Doc>
    <baz>pretty right on.</baz>
  </Doc>
</Stream>

Replies are listed 'Best First'.
Re: XML stream processing
by PodMaster (Abbot) on Nov 15, 2002 at 14:45 UTC
    use XML::Twig, it's all powerful man!!!!

    use XML::Twig;

    my $t = XML::Twig->new(
        twig_roots => {
            Doc => \&mutilate_doc,
        },
    );

    $t->parse(\*DATA);

    sub mutilate_doc {
        my( $t, $doc)= @_;
        $doc->print;
        print "\n",'x'x69,"\n";
    }

    __END__
    <Stream>
      <Doc>
        <foo>hey man</foo>
      </Doc>
      <Doc>
        <bar>hey man, how's it goin'?</bar>
      </Doc>
      <Doc>
        <baz>pretty right on.</baz>
      </Doc>
    </Stream>

    Our very own mirod is the author of http://www.xmltwig.com/. He works for IEEE.
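
    Since the question is about treating each <Doc> as its own isolated chunk in a large stream, here is a rough sketch of one way that might look with XML::Twig (assuming it is installed). It uses sprint to get each record as a string and purge to let go of what has already been parsed. handle_record and @records are made-up names for illustration only; they stand in for whatever the real per-document processing would be.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @records;    # hypothetical: collected only so the result can be inspected

my $twig = XML::Twig->new(
    twig_roots => {
        Doc => sub {
            my ( $t, $doc ) = @_;
            handle_record( $doc->sprint );    # the record serialized as a string, tags included
            $t->purge;                        # free the memory used by what has been parsed so far
        },
    },
);

# handle_record() is a hypothetical stand-in for the real per-document work
sub handle_record {
    my ($xml) = @_;
    push @records, $xml;
}

$twig->parse( <<'XML' );
<Stream>
  <Doc><foo>hey man</foo></Doc>
  <Doc><bar>hey man, how's it goin'?</bar></Doc>
  <Doc><baz>pretty right on.</baz></Doc>
</Stream>
XML

print scalar(@records), " records handled\n";
print "$_\n" for @records;
```

    Because each record reaches handle_record as a plain string, it can be handed to another parser, written to a file, or queued for later without keeping the rest of the stream around.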

    Besides XML::Twig, my favorite module for processing SGML-type markup is HTML::TokeParser::Simple :)

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.
