http://qs321.pair.com?node_id=267261

lhoward has asked for the wisdom of the Perl Monks concerning the following question:

I am interfacing with an existing application that provides an XML event stream over a TCP socket. If I establish an IO::Socket connection without hooking it up to XML::Parser, I see the data flow by in real time. However, when I hook my socket up to XML::Parser, it seems to buffer the input and then process a whole bunch at once. Since I want to process the stream more or less in real time, this is undesirable behavior from my point of view. In my experiments it seems to buffer data until it has 32k, then process that chunk of the stream. I've checked the docs and the module source code but can't find any way to set this buffer size manually. I've included example code (with my error checking and other fluff not relevant to the problem at hand removed):
use strict;
use XML::Parser;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => 6537,
);

my $parser = XML::Parser->new(
    Style    => 'Stream',
    Handlers => { Start => \&handle_elem_start },
);
$parser->parse($sock);

# Start handler: receives the Expat object, the element name,
# and the attributes as a flat name/value list.
sub handle_elem_start {
    my ($expat, $name, %atts) = @_;
    print "in element \"$name\", at byte " . $expat->current_byte() . " in stream\n";
}
Is there any way to set the buffer size? Or to force XML::Parser to process its buffer? Might one of the other XML parsers on CPAN serve me better for the task at hand? Any ideas on how to avoid the unwanted buffering?

L

Re: XML::Parser Streams performing undesired buffering
by Thelonius (Priest) on Jun 19, 2003 at 21:37 UTC
    In Expat/Expat.xs, line 374 in my version reads:
    cnt = perl_call_method("read", G_SCALAR);
    Change this to:
    cnt = perl_call_method("sysread", G_SCALAR);
    Then "make install". That should do ya (on Unix-like systems, at least), although I would like to reiterate that a well-formed XML document has a single top-level element, so it can't really be considered parsed until you get to the end.
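
    To illustrate why swapping read for sysread helps, here is a minimal sketch using only core Perl, with a pipe standing in for the socket. As far as I understand it, a buffered read keeps waiting until it has the full requested length (or hits end-of-file), whereas sysread returns as soon as any data is available. The helper name first_chunk is made up for this illustration and is not part of XML::Parser.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helper: return whatever a single sysread call yields
# from $fh, asking for at most $max bytes.
sub first_chunk {
    my ($fh, $max) = @_;
    my $buf = '';
    my $n = $fh->sysread($buf, $max);
    die "sysread: $!" unless defined $n;
    return $buf;
}

# A pipe stands in for the TCP socket (assumption for the demo).
pipe(my $reader, my $writer) or die "pipe: $!";

# Write one small event; syswrite bypasses Perl's output buffering.
syswrite($writer, "<event/>") or die "syswrite: $!";

# Ask for up to 32768 bytes, the same size XML::Parser requests.
# sysread hands back the bytes that are already there instead of waiting.
my $chunk = first_chunk($reader, 32768);
print "sysread returned ", length($chunk), " bytes: $chunk\n";

# A buffered $reader->read($buf, 32768) at this point would instead wait
# for 32768 bytes or for the writer to close, which matches the symptom
# described in the question.
```

    The same reasoning is what makes overriding read in a socket subclass work: the parser still asks for a full buffer, but the handle answers with whatever has arrived.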

    Another way to do this is to subclass IO::Socket::INET to override the read method.

    Update: Here's how you would do the subclassing thing:

    use strict;
    use XML::Parser;
    use IO::Socket::INET;

    # Socket subclass whose read() goes straight to sysread(), so data is
    # handed to the parser as soon as it arrives instead of being buffered.
    {
        package DebufSocket;
        our @ISA = qw(IO::Socket::INET);
        sub read {
            my $self = shift;
            $self->sysread(@_);
        }
    }

    # Start handler: attributes arrive as a flat name/value list.
    sub handle_elem_start {
        my ($expat, $name, %atts) = @_;
        print "in element \"$name\", at byte " . $expat->current_byte() . " in stream\n";
    }

    my $sock = DebufSocket->new(
        PeerAddr => '162.134.173.177',
        PeerPort => 6537,
    ) or die "socket: $!\n";

    my $parser = XML::Parser->new(
        Style    => 'Stream',
        Handlers => { Start => \&handle_elem_start },
    );
    $parser->parse($sock);
      Thanks for the great suggestions. Thelonius's solution of subclassing IO::Socket::INET is the one I ended up going with, and it works beautifully.

      Thanks,

      L

Re: XML::Parser Streams performing undesired buffering
by BrowserUk (Patriarch) on Jun 19, 2003 at 19:43 UTC

    I think the significant value can be found in the file expat.xs (part of the XML::Parser package), which has this at line 35:

    #define BUFSIZE 32768

    This suggests that you might be able to rebuild just the XS part of the package to adjust the buffer size.

    That said, there is no guarantee that this will work or would fix your problem. I've never tried using XML::Parser in stream mode, but I wonder how effective it is for real-time parsing of streaming XML-like data. It appears to be geared more to parsing fully compliant XML documents than XML-like streams.

    If it's important for your application to decode XML-like markup in a timely manner, you may have to consider using something like XML::Parser::Lite instead. This would mean reading the socket yourself and passing the markup to the parser in chunks as you read it. The interface is (as the name suggests) like a cut-down version of the XML::Parser interface, but it uses pure Perl and regexes to do the parsing, which should mean that you would have a lot more control over the process.
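
    If you go the route of reading the socket yourself, note that XML::Parser also documents an incremental interface (parse_start, which returns an object with parse_more and parse_done) that accepts chunks as you read them. This is a different technique from XML::Parser::Lite. It is sketched here with an in-memory list of chunks standing in for sysread calls on the socket; the chunk contents and the @seen array are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my @seen;    # element names, in the order their start tags were parsed

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $name, %atts) = @_;
            push @seen, $name;
            print "start of <$name>\n";
        },
    },
);

# parse_start returns an incremental parser object (XML::Parser::ExpatNB).
my $nb = $parser->parse_start;

# Stand-in for data arriving on the socket (assumption for the demo).
# With a real socket you would loop on $sock->sysread($buf, 4096) and
# pass each $buf to parse_more. Chunks may split a tag in the middle.
my @chunks = ('<stream><ev', 'ent id="1"/>', '<event id="2"/>', '</stream>');

$nb->parse_more($_) for @chunks;

# parse_done expects a complete document, so call it only once the
# stream's top-level element has been closed.
$nb->parse_done;
```

    Handlers are called from inside parse_more as complete tokens become available, so how promptly events show up depends mostly on how you read the socket rather than on a fixed read size inside the parser.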


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller


Re: XML::Parser Streams performing undesired buffering
by bobn (Chaplain) on Jun 19, 2003 at 19:21 UTC
    My guess is that you wouldn't find it in the Perl module source because the heavy lifting is done in the expat code, which is C code.

    Taking a very cursory look at my install, I see a file:

    /usr/local/src/XML-Parser-2.31/expat-1.95.4/xmlwf/xmlfile.c

    which contains lines:
    #ifdef _DEBUG
    #define READ_SIZE 16
    #else
    #define READ_SIZE (1024*8)
    #endif
    I'm not certain that this is the issue, especially since the size given is smaller than what you've found, but this might be where you need to start.

    --Bob Niederman, http://bob-n.com