PerlMonks  

Re: Processing Two XML Files in Parallel

by Tanktalus (Canon)
on Jul 21, 2011 at 23:42 UTC ( [id://916017]=note )


in reply to Processing Two XML Files in Parallel

The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.

As I said in CB, I'd consider turning XML::Twig on its head with Coro, so that the callback-driven parser can be pulled from like an iterator. It looks like you should be able to do the same with XML::Parser. Either way, you'll likely have to turn the parser on its head. Warning: the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

    use Coro;
    use XML::Twig;

    sub twig_iterator {
        my $file = shift;
        my $cb   = Coro::rouse_cb;
        my $twig = XML::Twig->new(
            twig_handlers => {
                elem      => sub { $cb->(elem      => @_) },
                otherelem => sub { $cb->(otherelem => @_) },
            },
        );
        # Parse in a separate coro. $cb->() rouses with no parameters,
        # which signals that parsing is done.
        async { shift->parsefile($file); $cb->() } $twig;
        sub {
            Coro::rouse_wait($cb);  # will return the parameters received by $cb above
        }
    }

    # $fileA and $fileB are the paths of the two XML files.
    my $itA = twig_iterator($fileA);
    my $itB = twig_iterator($fileB);
    while (1) {
        # if array has no items, it's done parsing, otherwise:
        #   [0] == elem name (hardcoded in above)
        #   [1..$#array] == items passed in by XML::Twig to the callback
        my @A = $itA->();
        my @B = $itB->();
        last unless @A || @B;
        # compare?
    }

I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).
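For the flushing part, here is a minimal sketch of a handler that discards each record once it has been dealt with. It assumes the interesting elements are named elem, as in the code above, and the helper name elem_texts is made up for illustration. It collects the texts in an array only to keep the demo short; with GB-sized input you would do the comparison inside the handler instead, and call parsefile on a path rather than parse on a string.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the text of every <elem> in an XML string,
# purging the twig after each one so handled elements do not pile up in memory.
sub elem_texts {
    my ($xml) = @_;
    my @texts;
    my $twig = XML::Twig->new(
        twig_handlers => {
            elem => sub {
                my ($t, $elt) = @_;
                push @texts, $elt->text;
                $t->purge;    # free everything parsed so far, without printing it
            },
        },
    );
    $twig->parse($xml);
    return @texts;
}

print join('|', elem_texts('<root><elem>a</elem><elem>b</elem></root>')), "\n";
```

purge throws the parsed part away; flush does the same but prints it first, which is what you want when the twig is also your output.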

Re^2: Processing Two XML Files in Parallel
by Anonymous Monk on Jul 22, 2011 at 00:45 UTC
      use strict;
      use warnings;

      use XML::LibXML::Reader qw( :types );

      sub new {
          my $class = shift;
          return bless({
              reader     => XML::LibXML::Reader->new(@_),
              elem_depth => 0,
              buf        => '',
          }, $class);
      }

      sub get_next {
          my ($self) = @_;
          my $reader = $self->{reader};
          for (;;) {
              return () if $reader->read() != 1;

              if ($reader->nodeType() == XML_READER_TYPE_TEXT) {
                  if ($self->{elem_depth} && $reader->depth() == $self->{elem_depth} + 1) {
                      $self->{buf} .= $reader->value();
                  }
              }
              elsif ($reader->nodeType() == XML_READER_TYPE_ELEMENT) {
                  if ($reader->name() eq 'elem') {
                      $self->{elem_depth} = $reader->depth();
                  }
              }
              elsif ($reader->nodeType() == XML_READER_TYPE_END_ELEMENT) {
                  if ($reader->name() eq 'elem') {
                      return substr($self->{buf}, 0, length($self->{buf}), '');
                  }
              }
          }
      }

      {
          my $reader1 = __PACKAGE__->new(location => "file1.xml");
          my $reader2 = __PACKAGE__->new(location => "file2.xml");
          for (;;) {
              my $text1 = $reader1->get_next();
              my $text2 = $reader2->get_next();
              last if !defined($text1) && !defined($text2);
              die if !defined($text1);
              die if !defined($text2);
              process_data($text1, $text2);
          }
      }

      Assumes all elem elements are "interesting" ones, not just the ones found under the root. Easy to change, though.

      Output left to the user. May I suggest XML::Writer since it keeps next to nothing in memory.
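      A rough sketch of what that output side could look like with XML::Writer. The helper write_pairs and the element names results, pair, a and b are invented for illustration; nothing in the thread specifies an output format. In a real run you would open the root element once and write each pair as it comes out of the readers, rather than passing a list.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helper: write [$text1, $text2] pairs to $fh as XML.
# XML::Writer emits each tag as it is called, so nothing accumulates.
sub write_pairs {
    my ($fh, @pairs) = @_;
    my $writer = XML::Writer->new(OUTPUT => $fh);
    $writer->startTag('results');
    for my $pair (@pairs) {
        $writer->startTag('pair');
        $writer->dataElement(a => $pair->[0]);   # text is escaped for us
        $writer->dataElement(b => $pair->[1]);
        $writer->endTag('pair');
    }
    $writer->endTag('results');
    $writer->end();
}

write_pairs(\*STDOUT, ['x', 'y'], ['1 < 2', 'a & b']);
```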

Re^2: Processing Two XML Files in Parallel
by tedv (Pilgrim) on Jul 22, 2011 at 13:20 UTC
    I concede that the difficulty of processing these two files suggests to me that something has gone very wrong with the input specifications. And since XML is part of that, I naturally assume it's the fault of XML. I might try to get them to change the specification so that each line is a well-formed piece of XML that cannot contain any newlines. Then just do a standard double line reader.


    -Ted
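    A minimal sketch of such a double line reader, assuming the one-record-per-line format suggested above. The helper names paired_lines and differing_records are hypothetical, and "compare" here just means string equality; in-memory filehandles stand in for the two big files.

```perl
use strict;
use warnings;

# Hypothetical helper: iterator that reads one line from each filehandle
# per call and returns an empty list once both are exhausted.
sub paired_lines {
    my ($fh_a, $fh_b) = @_;
    return sub {
        my $left  = readline $fh_a;
        my $right = readline $fh_b;
        return if !defined $left && !defined $right;
        chomp $left  if defined $left;
        chomp $right if defined $right;
        return ($left, $right);
    };
}

# Hypothetical comparison: 1-based numbers of the records that differ.
# A record missing from one file counts as a difference.
sub differing_records {
    my ($fh_a, $fh_b) = @_;
    my $next = paired_lines($fh_a, $fh_b);
    my ($n, @diff) = (0);
    while (my ($left, $right) = $next->()) {
        $n++;
        push @diff, $n
            if !defined $left || !defined $right || $left ne $right;
    }
    return @diff;
}

# In-memory filehandles stand in for the two large files.
my $data_a = "<elem>1</elem>\n<elem>2</elem>\n<elem>3</elem>\n";
my $data_b = "<elem>1</elem>\n<elem>X</elem>\n";
open my $fh_a, '<', \$data_a or die "open: $!";
open my $fh_b, '<', \$data_b or die "open: $!";
print join(',', differing_records($fh_a, $fh_b)), "\n";   # prints 2,3
```

    Only one line from each file is held at a time, so memory use stays flat no matter how large the inputs are.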
