http://qs321.pair.com?node_id=915997

tedv has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that needs to process two XML files in parallel. It needs to take element #1 from file A and element #1 from file B, and output a new element into file C. Then it performs the same operation on element #2, and so on. As a very simple example (here the output repeats the text of each element in file A by the count given in the corresponding element of file B):


Input file A:
<doc> <elem>A</elem> <elem>B</elem> <elem>C</elem> </doc>
Input file B:
<doc> <elem>1</elem> <elem>5</elem> <elem>10</elem> </doc>
Output file C:
<doc> <elem>A</elem> <elem>BBBBB</elem> <elem>CCCCCCCCCC</elem> </doc>

The catch is that the files are very large, so you cannot parse them both into memory at once. And sadly the XML::Parser interface seems to require parsing the entire first file, handling all of its callbacks, before you can start parsing the second file.

Now if these were just simple text files, the code would be pretty simple. It would look something like this:

# Open both input files
open A, "<$file_a" or die "Unable to open $file_a: $!\n";
open B, "<$file_b" or die "Unable to open $file_b: $!\n";

# Process the files in parallel
while (1) {
    # Read the lines
    my $a = <A>;
    my $b = <B>;

    # Good coders would check and warn if one entry was defined and the other
    # was not, but this is just an example, so you should be happy you even
    # get comments.
    last if !defined $a || !defined $b;

    # Process the output
    print data_transform($a, $b);
}

close A;
close B;

But because it's XML, everything is more painful. Does anyone know what might work? Someone suggested XML::Twig, but I'm still reading the documentation to make sure the internal implementation doesn't prohibit this from working.


-Ted

Replies are listed 'Best First'.
Re: Processing Two XML Files in Parallel
by ikegami (Patriarch) on Jul 21, 2011 at 22:19 UTC
    use strict;
    use warnings;

    use XML::LibXML qw( );

    die "Usage" if @ARGV != 3;

    my $parser = XML::LibXML->new();

    my @counts;
    {
        my $doc  = $parser->parse_file($ARGV[1]);
        my $root = $doc->documentElement();
        @counts = map $_->textContent, $root->findnodes('elem');
    }

    {
        my $doc  = $parser->parse_file($ARGV[0]);
        my $root = $doc->documentElement();
        for my $node ($root->findnodes('elem')) {
            die "Not enough counts" if !@counts;
            $node->appendText( $node->textContent() x (shift(@counts) - 1) );
        }
        print $doc->toFile($ARGV[2]);
    }

    die "Too many counts" if @counts;

    Or if counts of 0 are acceptable:

    my $new_text = $node->textContent() x shift(@counts);
    $node->removeChild($_) for $node->findnodes('text()');
    $node->appendText($new_text);

    Tested.

Re: Processing Two XML Files in Parallel
by Tanktalus (Canon) on Jul 21, 2011 at 23:42 UTC

    The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

    I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.
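
    For the record, the flushing idiom looks like this; a minimal sketch, where the record element name and the process() call are placeholders for whatever your actual record and comparison turn out to be:

        use XML::Twig;

        # Sketch only: 'record' and process() are hypothetical names,
        # standing in for your real element and your real handling.
        my $twig = XML::Twig->new(
            twig_handlers => {
                record => sub {
                    my ($t, $elem) = @_;
                    process($elem->text);  # handle one record
                    $t->purge;             # free everything parsed so far
                },
            },
        );
        $twig->parsefile('big.xml');

    flush is the same idea, but writes the parsed chunk out before discarding it, which is what you'd use on the side that produces file C.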

    As I said in CB, I'd consider turning XML::Twig on its head with Coro. It looks like you should be able to turn XML::Parser on its head, too. But, either way, you'll likely have to turn them on their heads. Warning, the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

    use Coro;
    use XML::Twig;

    sub twig_iterator {
        my $file = shift;
        my $cb   = Coro::rouse_cb;
        my $twig = XML::Twig->new(
            twig_handlers => {
                elem      => sub { $cb->(elem      => @_) },
                otherelem => sub { $cb->(otherelem => @_) },
            },
        );

        # $cb->() rouses with no parameters.
        async { $twig->parsefile($file); $cb->() };

        sub {
            Coro::rouse_wait($cb); # will return the parameters received by $cb above
        }
    }

    my $itA = twig_iterator($fileA);
    my $itB = twig_iterator($fileB);

    while (1) {
        # if the array has no items, it's done parsing; otherwise:
        #   [0]            == elem name (hardcoded above)
        #   [1 .. $#array] == items passed in by XML::Twig to the callback
        my @A = $itA->();
        my @B = $itB->();
        # compare?
    }
    I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).

        use strict;
        use warnings;

        use XML::LibXML::Reader qw( :types );

        sub new {
            my $class = shift;
            return bless({
                reader     => XML::LibXML::Reader->new(@_),
                elem_depth => 0,
                buf        => '',
            }, $class);
        }

        sub get_next {
            my ($self) = @_;
            my $reader = $self->{reader};
            for (;;) {
                return () if $reader->read() != 1;
                if ($reader->nodeType() == XML_READER_TYPE_TEXT) {
                    if ($self->{elem_depth} && $reader->depth() == $self->{elem_depth} + 1) {
                        $self->{buf} .= $reader->value();
                    }
                }
                elsif ($reader->nodeType() == XML_READER_TYPE_ELEMENT) {
                    if ($reader->name() eq 'elem') {
                        $self->{elem_depth} = $reader->depth();
                    }
                }
                elsif ($reader->nodeType() == XML_READER_TYPE_END_ELEMENT) {
                    if ($reader->name() eq 'elem') {
                        return substr($self->{buf}, 0, length($self->{buf}), '');
                    }
                }
            }
        }

        {
            my $reader1 = __PACKAGE__->new(location => "file1.xml");
            my $reader2 = __PACKAGE__->new(location => "file2.xml");

            for (;;) {
                my $text1 = $reader1->get_next();
                my $text2 = $reader2->get_next();
                last if !defined($text1) && !defined($text2);
                die if !defined($text1);
                die if !defined($text2);
                process_data($text1, $text2);
            }
        }

        Assumes all elem elements are "interesting" ones, not just the ones found under the root. Easy to change, though.

        Output left to the user. May I suggest XML::Writer since it keeps next to nothing in memory.
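
        For instance, a minimal sketch of the output side (the output file name and the element text are placeholders):

            use strict;
            use warnings;

            use XML::Writer;
            use IO::File;

            my $out    = IO::File->new('> file_C.xml') or die "Can't write: $!";
            my $writer = XML::Writer->new(OUTPUT => $out, DATA_MODE => 1);

            $writer->startTag('doc');
            # dataElement() emits one complete element immediately; nothing
            # accumulates, so memory stays flat however many you write.
            $writer->dataElement(elem => 'BBBBB');
            $writer->endTag('doc');
            $writer->end();
            $out->close();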

      I concede that the difficulty of processing these two files suggests to me that something has gone very wrong with the input specifications. And since XML is part of that, I naturally assume it's the fault of XML. I might try to get them to change the specification so that each record is well-formed XML on a single line, with no embedded newlines. Then I could just do a standard two-file line-by-line read.


      -Ted
Re: Processing Two XML Files in Parallel
by mirod (Canon) on Jul 22, 2011 at 07:08 UTC

    One way to do this is to use XML::Twig and Coro: have one thread parse the first input file and another parse the second. Pass control between the two threads after each elem has been parsed:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use Coro;
    use XML::Twig;

    use Test::More;
    use Perl6::Slurp;
    use autodie qw(open);

    my $INPUT_A  = "input_A.xml";  # input file A
    my $INPUT_B  = "input_B.xml";  # input file B
    my $OUTPUT   = "output.xml";
    my $EXPECTED = "expected.xml"; # output file C

    open( my $out, '>', $OUTPUT);

    my $times; # global, maybe Coro has a better way to pass it around but I don't know it

    my $t1= XML::Twig->new( twig_handlers => { elem => \&main_elem }, keep_spaces => 1);
    my $t2= XML::Twig->new( twig_handlers => { elem => \&get_times });

    # to get the numbers first, before the letters, t2 will be parsed in the main loop
    async { $t1->parsefile( $INPUT_A); };
    $t2->parsefile( $INPUT_B);

    print {$out} "\n"; # missing \n for some reason
    $t1->flush( $out);
    print {$out} "\n"; # missing \n for some reason

    close $out;

    is( slurp( $OUTPUT), slurp( $EXPECTED), 'the one test');
    done_testing();

    sub main_elem {
        my( $t, $elem)= @_;
        $elem->set_text( $elem->text x $times);
        $t->flush( $out);
        cede;
    }

    sub get_times {
        my( $t, $elem)= @_;
        $times= $elem->text;
        $t->purge;
        cede;
    }

    You will need to check that memory is indeed freed after each record. It should be OK, but I don't know exactly how Coro deals with memory; I had never used it before today.

    Thank you for asking this and making me look into the problem. And to whoever mentioned Coro yesterday in the CB. This is something I had wanted to do for a long time, but I had always deferred it since I did not really need it for work. Overall it was pretty painless though, the Coro intro is quite well written.

    update: also, I should have read Tanktalus's answer, above, since he obviously knows Coro a lot better than I do. I am still happy I answered, though; at least I learned something.

Re: Processing Two XML Files in Parallel
by Jenda (Abbot) on Jul 21, 2011 at 23:55 UTC

    Depends on the exact format of your XML files, but maybe XML::Records could help. It allows you to ask for the next "record" from the XML, so you can ask for the first from one file, then the first from the second, then again from the first and again from the second, and so forth.
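
    A rough sketch of that pull style (the file names and the 'elem' record type are this example's assumptions; check the module's documentation for the exact return value of get_record):

        use XML::Records;

        # Sketch only: file names and the 'elem' record type are
        # placeholders; see the XML::Records docs for the exact
        # shape of what get_record() returns.
        my $readerA = XML::Records->new('file_A.xml');
        my $readerB = XML::Records->new('file_B.xml');
        $readerA->set_records('elem');
        $readerB->set_records('elem');

        while (1) {
            my $recA = $readerA->get_record();
            my $recB = $readerB->get_record();
            last if !defined $recA || !defined $recB;
            # combine $recA and $recB into the output element here
        }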

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

Re: Processing Two XML Files in Parallel
by ambrus (Abbot) on Jul 24, 2011 at 20:39 UTC

    I agree with the previous replies in that running two XML parsers, each in its own Coro thread, seems to be a good way to do this. However, I'd like to show a solution that does not use Coro, just for the challenge of it.

    This solution uses the stream-parsing capability of XML::Parser (parse_start/parse_more), which XML::Twig inherits. Note that the XML::Twig documentation states that you probably should not use this interface with XML::Twig, and that it is untested.

    We read the input XML files in small chunks (20 bytes here for demonstration, but it should be much more than that in a real application). In each loop iteration, we read from the file that's behind the other, that is, the one from which we have read fewer items so far. This way, the files remain in sync even if the lengths of the items differ. Once the parsers have found an item in both files, we pair them and print an item with the two texts concatenated.

    The warnings I have commented out show that the files are indeed read in parallel. I also hope that chunks of the file we have already processed don't remain in memory, and that there are no other bugs, but you should of course verify this if you want to use this code in production.

    use warnings; use strict;
    use Encode;
    use XML::Twig;

    binmode STDERR, ":encoding(iso-8859-2)";

    our(@XMLH, @xmln, @tw, @pa, @eof, @it, $two, $roo);

    for my $n (0 .. 1) {
        $xmln[$n] = shift || ("a1.xml", "a2.xml")[$n];
        open $XMLH[$n], "<", $xmln[$n] or die "error open xml${n}: $!";
        $tw[$n] = XML::Twig->new;
        $tw[$n]->setTwigHandler("item", sub {
            my($twt, $e) = @_;
            my $t = $e->text;
            #warn " "x(24+8*$n), "${n}g|$t|\n";
            push @{$it[$n]}, $t;
            $twt->purge;
        });
        $pa[$n] = $tw[$n]->parse_start;
        $it[$n] = [];
    }

    $two = XML::Twig->new(output_filter => "safe", pretty_print => "nice");
    $roo = XML::Twig::Elt->new("doc");
    $two->set_root($roo);

    while (1) {
        my $n = undef;
        my $itq = 1e9999;
        for my $j (0 .. 1) {
            if (!$eof[$j] && @{$it[$j]} <= $itq) {
                $n = $j;
                $itq = @{$it[$j]};
            }
        }
        if (!defined($n)) {
            last;
        }
        if (read $XMLH[$n], my $b, 20) {
            #my $bp = decode("iso-8859-2", $b); $bp =~ y/\r\n/./;
            #warn " "x(8+8*$n), "${n}r|$bp|\n";
            $pa[$n]->parse_more($b);
        } else {
            eof($XMLH[$n]) or die "error reading xml${n}";
            $pa[$n]->parse_done;
            $eof[$n]++;
        }
        my $eo;
        while (@{$it[0]} && @{$it[1]}) {
            my $i0 = shift @{$it[0]};
            my $i1 = shift @{$it[1]};
            $eo = XML::Twig::Elt->new("item", "$i0 $i1");
            $eo->paste_last_child($roo);
            #warn "p|$i0 $i1|\n";
        }
        if (defined($eo)) {
            $two->flush_up_to($eo);
        }
    }

    for my $n (0 .. 1) {
        if (my $c = @{$it[$n]}) {
            warn "warning: xml${n} has $c additional items";
        }
    }
    $two->flush;
    #warn "all done";

    __END__

    Update 2013-04-23: RFC: Simulating Ruby's "yield" and "blocks" in Perl may be related.

      Fiddling with Coro or reading in blocks when all you need is a pull-style parser seems a bit silly, even though it is a nice exercise.

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

Re: Processing Two XML Files in Parallel
by Anonymous Monk on Jul 22, 2011 at 11:33 UTC
    I keep wondering if you could use an SQLite database (file...) here. Kind of like a tied hash, only better. I do not know how many elements in these massive files actually change from one run to the next, nor how many are in common. But maybe you could capture data from first one file and then the other into an SQLite table, which, since it is just a file, requires no server setup. Determine what differences actually exist, then use these to update or rebuild file C. The overall strategy of using two massive XML files needs to be reviewed carefully, either by you or by your managers or both.
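
    A sketch of that staging idea with DBI and DBD::SQLite (the table layout and the way elements get extracted are assumptions for illustration, not part of the suggestion above): load each file's elements into a table keyed by position, then join on position to drive the output.

        use strict;
        use warnings;
        use DBI;

        # Sketch only: the table layout and the extraction step are
        # illustrative assumptions.
        my $dbh = DBI->connect('dbi:SQLite:dbname=elems.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE IF NOT EXISTS elems
                  (src TEXT, pos INTEGER, val TEXT, PRIMARY KEY (src, pos))');

        # Stream each XML file (e.g. with a pull parser) and insert each
        # element's text together with its position:
        my $ins = $dbh->prepare('INSERT INTO elems (src, pos, val) VALUES (?, ?, ?)');
        # ... $ins->execute('A', $pos, $text) for each element of file A,
        # ... $ins->execute('B', $pos, $text) for each element of file B.
        $dbh->commit;

        # Pair the two files by position; only one row pair is in memory at a time.
        my $sth = $dbh->prepare(
            'SELECT a.val, b.val FROM elems a JOIN elems b
               ON a.pos = b.pos WHERE a.src = ? AND b.src = ?');
        $sth->execute('A', 'B');
        while (my ($va, $vb) = $sth->fetchrow_array) {
            # emit the combined element for file C here
        }
        $dbh->disconnect;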
Re: Processing Two XML Files in Parallel
by GrandFather (Saint) on Jul 28, 2011 at 08:42 UTC

    You might like to try XML::TreePuller. If your sample is a fair representation of your problem then something like the following ought to do the trick for you:

    use strict;
    use warnings;

    use XML::TreePuller;
    use XML::Writer;

    my $fileA = <<XML;
    <doc>
    <elem>A</elem>
    <elem>B</elem>
    <elem>C</elem>
    </doc>
    XML

    my $fileB = <<XML;
    <doc>
    <elem>1</elem>
    <elem>5</elem>
    <elem>10</elem>
    </doc>
    XML

    # Open both input files
    open my $inA, "<", \$fileA;
    open my $inB, "<", \$fileB;

    my $readerA = XML::TreePuller->new (IO => $inA);
    my $readerB = XML::TreePuller->new (IO => $inB);
    my $writer  = XML::Writer->new (DATA_MODE => 1);

    # Process the files in parallel
    $readerA->iterate_at('/doc/elem' => 'short');
    $readerB->iterate_at('/doc/elem' => 'short');

    $writer->startTag ('doc');

    while ((my $elmtA = $readerA->next ()) && (my $elmtB = $readerB->next ())) {
        my $nameA = $elmtA->name ();
        my $nameB = $elmtB->name ();

        next if $nameA ne 'elem';
        die "Element mismatch: $nameA ne $nameB\n" if $nameA ne $nameB;

        $writer->dataElement ($nameA, $elmtA->text () x $elmtB->text ());
    }

    $writer->endTag();

    close $inA;
    close $inB;

    Prints:

    <doc>
    <elem>A</elem>
    <elem>BBBBB</elem>
    <elem>CCCCCCCCCC</elem>
    </doc>
    True laziness is hard work
Re: Processing Two XML Files in Parallel
by Logicus (Initiate) on Jul 21, 2011 at 21:29 UTC

    Does each element have a line to itself, or is the data multiline? As in, if we read, say, line 123 from file A, will line 123 in file B be the correct line to do the processing with?

    If that is the case, then you could just read both files a line at a time and use a simple regex to get the value out of the <elem> wrapper:

    my ($value_a, $value_b);

    while (1) {
        my $a = <A>;
        my $b = <B>;
        last if !defined $a || !defined $b;

        ($value_a, $value_b) = (undef, undef);
        if ($a =~ m{<elem>(.*?)</elem>}) { $value_a = $1; }
        if ($b =~ m{<elem>(.*?)</elem>}) { $value_b = $1; }
        last if !defined $value_a || !defined $value_b;

        print data_transform($value_a, $value_b);
    }

    I'm sure better Perl adepts than me could write it better/faster, but I think that would work if the files have a line-for-line correspondence.

      So you like catch phrases, huh?
      Let me tell you something:
      About 97% of the time, parsing XML with regexes is the root of all evil. The remaining 3% is left for one-time, quick & dirty scripts and maybe some special cases (where you can be sure the XML will stay exactly like that).
      Let me tell you why:
      The creator of the XML you parse might change it. All elements might end up on one line. Maybe there will be some empty lines between the tags. Maybe the elem tags will get attributes in the future. In all these cases your script will suddenly stop working, although the actual content you want didn't change. And somebody has to fix it quickly. In the end it's more work than just doing it right from the beginning, and you have potentially annoyed a customer and your boss.
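
      To make that concrete, here is a tiny sketch (the added attribute is hypothetical) of how the regex from the post above silently stops matching the moment the producer changes the markup:

          my $line = '<elem id="42">A</elem>';  # the producer added an attribute

          # the regex from the post above no longer matches, so the
          # element is silently skipped instead of being processed
          if ($line =~ m{<elem>(.*?)</elem>}) {
              print "matched: $1\n";
          } else {
              print "no match: element silently dropped\n";
          }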

      That's how experienced programmers think, because they know that things like that happen.
      You not only posted a quick & dirty solution, you even bashed someone for posting a clean and correct solution. A quick & dirty solution is OK (although it would be nice to note that it depends on the exact XML format), and you actually got some ++ for it, but bashing someone else's correct solution is just infantile.
