Re: Processing Two XML Files in Parallel

The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.

As I said in CB, I'd consider turning XML::Twig on its head with Coro: run the push-style parser in its own coroutine and pull its results out through an iterator. It looks like the same trick should work with XML::Parser, too. Warning: the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

use Coro;
use XML::Twig;

sub twig_iterator
{
  my $file = shift;
  my $cb   = Coro::rouse_cb;
  my $twig = XML::Twig->new(
    twig_handlers => {
      elem => sub { $cb->(elem => @_) },
      otherelem => sub { $cb->(otherelem => @_) }
    },
  );
  my $done;

  # $cb->() rouses with no parameters.
  async { shift->parsefile($file); $cb->() } $twig;

  sub {
    Coro::rouse_wait($cb); # will return the parameters received by $cb above
  }
}

my $itA = twig_iterator($fileA);
my $itB = twig_iterator($fileB);

while (1)
{
  # if array has no items, it's done parsing, otherwise:
  # [0] == elem name (hardcoded in above)
  # [1..$#array] == items passed in by XML::Twig to the callback
  my @A = $itA->();
  my @B = $itB->();

  # compare?
}

I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).
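If rouse callbacks turn out to be one-shot (my reading of the Coro docs is that a rouse callback remembers a single invocation), a Coro::Channel is the natural replacement. What follows is a minimal sketch under that assumption, not a tested implementation. The names `twig_channel_iterator` and `count_differences`, and the choice to pass `[name, text]` pairs through the channel, are made up for illustration. It also assumes that comparing element text is enough, and that a parse error in either file is acceptable as a hang or crash.

```perl
use strict;
use warnings;
use Coro;
use Coro::Channel;
use XML::Twig;

# Returns a closure that yields one [name, text] pair per matched element,
# or undef once the whole file has been parsed.
sub twig_channel_iterator {
    my ($file, @names) = @_;

    # A small channel keeps the parser from running far ahead of the consumer.
    my $chan = Coro::Channel->new(1);

    my %handlers = map {
        my $name = $_;
        ($name => sub {
            my ($twig, $elem) = @_;
            $chan->put([ $name, $elem->text ]);  # may block until the consumer catches up
            $twig->purge;                        # drop everything parsed so far
        });
    } @names;

    # The parser runs in its own coroutine and pushes records into the channel.
    async {
        XML::Twig->new(twig_handlers => \%handlers)->parsefile($file);
        $chan->put(undef);                       # end-of-file marker
    };

    return sub { $chan->get };
}

# Walks both files in lockstep and returns the number of differing records.
sub count_differences {
    my ($fileA, $fileB, @names) = @_;
    my $itA = twig_channel_iterator($fileA, @names);
    my $itB = twig_channel_iterator($fileB, @names);

    my $diffs = 0;
    while (1) {
        my $recA = $itA->();
        my $recB = $itB->();
        last if !defined $recA && !defined $recB;   # both files exhausted
        die "files have different record counts\n"
            if !defined $recA || !defined $recB;
        $diffs++ if $recA->[0] ne $recB->[0] || $recA->[1] ne $recB->[1];
    }
    return $diffs;
}
```

The explicit `undef` marker is what makes end-of-file unambiguous here, and extracting the text before calling `purge` means nothing from the twig has to outlive the handler.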


Re^2: Processing Two XML Files in Parallel by Anonymous Monk on Jul 22, 2011 at 00:45 UTC
XML::LibXML::Reader also provides a memory-conservative iterator interface.
Re^3: Processing Two XML Files in Parallel by ikegami (Patriarch) on Jul 25, 2011 at 23:16 UTC
use strict;
use warnings;

use XML::LibXML::Reader qw( :types );

sub new {
    my $class = shift;
    return bless({
        reader     => XML::LibXML::Reader->new(@_),
        elem_depth => 0,
        buf        => '',
    }, $class);
}

sub get_next {
    my ($self) = @_;
    my $reader = $self->{reader};
    for (;;) {
        return () if $reader->read() != 1;

        if ($reader->nodeType() == XML_READER_TYPE_TEXT) {
            if ($self->{elem_depth} && $reader->depth() == $self->{elem_depth} + 1) {
                $self->{buf} .= $reader->value();
            }
        }
        elsif ($reader->nodeType() == XML_READER_TYPE_ELEMENT) {
            if ($reader->name() eq 'elem') {
                $self->{elem_depth} = $reader->depth();
            }
        }
        elsif ($reader->nodeType() == XML_READER_TYPE_END_ELEMENT) {
            if ($reader->name() eq 'elem') {
                return substr($self->{buf}, 0, length($self->{buf}), '');
            }
        }
    }
}

{
    my $reader1 = __PACKAGE__->new(location => "file1.xml");
    my $reader2 = __PACKAGE__->new(location => "file2.xml");
    for (;;) {
        my $text1 = $reader1->get_next();
        my $text2 = $reader2->get_next();
        last if !defined($text1) && !defined($text2);
        die if !defined($text1);
        die if !defined($text2);
        process_data($text1, $text2);
    }
}

Assumes all `elem` elements are "interesting" ones, not just the ones found under the root. Easy to change, though. Output left to the user. May I suggest XML::Writer since it keeps next to nothing in memory.
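For the output side, the XML::Writer suggestion might look roughly like the sketch below. The helper names (`open_results`, `write_pair`, `close_results`) and the element names (`results`, `pair`, `file1`, `file2`) are placeholders invented for illustration. The only point is that each record goes to the filehandle as soon as it is produced, so nothing accumulates in memory.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helpers; the element names are placeholders.

# Starts the output document and returns the writer object.
sub open_results {
    my ($fh) = @_;
    my $writer = XML::Writer->new(OUTPUT => $fh);
    $writer->xmlDecl('UTF-8');
    $writer->startTag('results');
    return $writer;
}

# Emits one record immediately; text is escaped by XML::Writer.
sub write_pair {
    my ($writer, $text1, $text2) = @_;
    $writer->startTag('pair');
    $writer->dataElement(file1 => $text1);
    $writer->dataElement(file2 => $text2);
    $writer->endTag('pair');
}

# Closes the root element and lets the writer check the document is complete.
sub close_results {
    my ($writer) = @_;
    $writer->endTag('results');
    $writer->end();
}
```

Something like `process_data` could then call `write_pair` once per record, with `open_results` and `close_results` wrapped around the main loop.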
Re^2: Processing Two XML Files in Parallel by tedv (Pilgrim) on Jul 22, 2011 at 13:20 UTC
I concede that the difficulty of processing these two files suggests to me that something has gone very wrong with the input specifications. And since XML is part of that, I naturally assume it's the fault of XML. I might try to get them to change the specification so that each line is itself well-formed XML and cannot contain any newlines. Then just do a standard double line reader. -Ted
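A "standard double line reader" for that one-record-per-line layout could look something like this. It is a minimal sketch: the name `read_in_lockstep` and the callback signature are assumptions made for illustration, and it treats differing line counts as a fatal error, which may not be what the real specification wants.

```perl
use strict;
use warnings;

# Reads two files in lockstep, one record per line, and calls $on_pair
# with (line_number, lineA, lineB) for each pair. Returns the pair count.
sub read_in_lockstep {
    my ($fileA, $fileB, $on_pair) = @_;
    open my $fhA, '<', $fileA or die "Cannot open $fileA: $!";
    open my $fhB, '<', $fileB or die "Cannot open $fileB: $!";

    my $count = 0;
    while (1) {
        my $lineA = <$fhA>;
        my $lineB = <$fhB>;
        last if !defined $lineA && !defined $lineB;   # both files exhausted
        die "$fileA and $fileB have different line counts\n"
            if !defined $lineA || !defined $lineB;
        chomp($lineA, $lineB);
        $on_pair->(++$count, $lineA, $lineB);
    }

    close $fhA;
    close $fhB;
    return $count;
}
```

Only one line per file is held at a time, so memory use stays flat regardless of file size. Each line could then be handed to whatever small XML parser is convenient.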

