PerlMonks  

'ello folks,

I'm currently handling some XML documents that are too large to process in memory or to store on disk permanently. However, they fit well enough when gzip'ed (at about 15:1 compression). But when it comes time to process them, I would like to avoid first decompressing them fully.

I thought I could solve the problem using IO::Zlib, which provides an interface much like IO::Handle; this would allow me to keep only portions of the decompressed text in memory at a time. And of course, XML::Twig is great for managing big XML documents (thanks mirod!), but it doesn't natively handle gzip'ed XML. But since XML::Parser::Expat, and by extension XML::Twig, can take an IO::Handle as a document source, I thought I could string the two together.

However, IO::Zlib doesn't actually inherit from IO::Handle, and XML::Parser::Expat demands that UNIVERSAL::isa($arg, 'IO::Handle') be true before it will treat the argument as a handle. I figured a simple workaround like this would work:

    package IO::Handle::Zlib;
    use vars qw/ @ISA /;
    @ISA = qw/ IO::Zlib IO::Handle /;
which would allow me to replace my IO::Zlib objects with IO::Handle::Zlib's transparently. However, when I try this out, I come across the following error, courtesy of expat:
    not well-formed (invalid token) at line 7213, column 3, byte 780490 at /path/to/perl/lib/5.6.1/IP27-irix/XML/Parser.pm line 185
Now that's odd, since the decompressed file ends at line 7212 and is only 780487 bytes long. One might think the file is being decompressed past the original size of the document, but inserting print DUMP <$gz>; gives a file that is identical to the original (i.e., the angle-bracket read also yields 7212 lines and 780487 bytes). So clearly, whatever the XS part of XML::Parser::Expat is doing with the IO::Handle is not what the angle brackets are doing. And expat itself is working, since replacing
    my $reader = new IO::Handle::Zlib;
    $reader->open( $compressed_filename, "rb" )
        or croak "could not open $compressed_filename: $!";
with
    my $reader = new IO::File;
    $reader->open( $uncompressed_filename, "r" );
eliminates the error.
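
Since the angle-bracket read gives the right byte count, one way to narrow this down might be to compare it against a fixed-size chunked read through the handle's read method. This is only a sketch under assumptions: the post does not establish that the XS code pulls data via read, the 4096-byte chunk size is arbitrary, and the sample document and temporary file are hypothetical stand-ins for the real data.

```perl
use strict;
use warnings;
use IO::Zlib;
use File::Temp qw(tempfile);

# Hypothetical sample data standing in for the real XML document.
my $xml = "<doc>\n"
        . join('', map { "  <item id=\"$_\"/>\n" } 1 .. 1000)
        . "</doc>\n";

# Write it out gzip-compressed to a temporary file.
my (undef, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
my $out = IO::Zlib->new($path, 'wb')
    or die "could not open $path for writing";
$out->print($xml);
$out->close;

# Read 1: line-oriented, as the angle-bracket operator does.
my $in = IO::Zlib->new($path, 'rb')
    or die "could not open $path for reading";
my $by_lines = 0;
while (defined(my $line = <$in>)) {
    $by_lines += length($line);
}
$in->close;

# Read 2: fixed-size chunks through the read method
# (assumed here to resemble what a stream parser would do).
$in = IO::Zlib->new($path, 'rb')
    or die "could not open $path for reading";
my $by_chunks = 0;
my $buf = '';
while ((my $n = $in->read($buf, 4096)) > 0) {
    $by_chunks += $n;
}
$in->close;

printf "expected %d, readline %d, read %d\n",
    length($xml), $by_lines, $by_chunks;
```

If the two counts disagree on a given installation, that would point at the read path rather than at expat; if they agree, the difference must lie elsewhere in how the parser consumes the handle.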

Has anyone used IO::Zlib like this before? Is my IO::Handle::Zlib wrapper bogus? Anybody know how XS modules do IO::Handle reads, and why this doesn't work?
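
On the question of whether the wrapper is bogus, a small check of the inheritance test itself may help separate the two concerns. The sketch below only shows what UNIVERSAL::isa reports for each class; it uses `our` in place of `use vars` and says nothing about whether the inherited methods behave the way the parser expects.

```perl
use strict;
use warnings;
use IO::Handle;
use IO::Zlib;

# The wrapper from the post: methods resolve to IO::Zlib first;
# IO::Handle is listed so that the isa test succeeds.
package IO::Handle::Zlib;
our @ISA = qw(IO::Zlib IO::Handle);

package main;

# Report what the parser's isa check would see for each class.
for my $class (qw(IO::Zlib IO::Handle::Zlib)) {
    printf "%-18s isa IO::Handle: %s\n", $class,
        UNIVERSAL::isa($class, 'IO::Handle') ? 'yes' : 'no';
}
```

Passing this test only gets the object accepted as a handle. Because IO::Zlib comes first in @ISA, any method it provides is still the one that gets called.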

Thanks,

--athomason


In reply to IO::Zlib with XML::Twig/XML::Parser::Expat by athomason
