PerlMonks  

Re: Processing Two XML Files in Parallel

by ambrus (Abbot)
on Jul 24, 2011 at 20:39 UTC ([id://916445])


in reply to Processing Two XML Files in Parallel

I agree with the previous replies that running two XML parsers, each in its own Coro thread, seems to be a good way to do this. However, I'd like to show a solution that does not use Coro, just for the challenge of it.

This solution uses the stream (incremental) parsing capability of XML::Parser. The XML::Twig documentation states that you probably should not use this feature with XML::Twig and that it is untested.

We read the input XML files in small chunks (20 bytes here for demonstration, but this should be much larger in a real application). In each loop iteration, we read from the file that is behind the other, that is, the one from which we have read fewer items so far. This way, the files remain in sync even if the lengths of the items differ. Once the XML parser has found an item in both files, we pair them and print an item with the two texts concatenated.
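The scheduling rule described above can be sketched without any XML at all. This is only an illustration under stated assumptions, not the code from this post: the two "files" are hypothetical in-memory strings with one item per line, `feed` is a made-up stand-in for the parser's chunk interface, and the chunk size of 5 is arbitrary.

```perl
use strict;
use warnings;

# Hypothetical inputs: newline-terminated items of different lengths.
my @src = ("a\nbb\nccc\n", "longer-one\nlonger-two\nlonger-three\n");
my @pos = (0, 0);      # read offset into each source
my @buf = ("", "");    # partial item carried over between chunks
my @eof = (0, 0);      # set once a source is exhausted
my @it  = ([], []);    # complete items not yet paired
my @pairs;

# Stand-in for the parser: split off complete lines, keep the remainder.
sub feed {
    my ($n, $chunk) = @_;
    $buf[$n] .= $chunk;
    while ($buf[$n] =~ s/\A([^\n]*)\n//) {
        push @{ $it[$n] }, $1;
    }
}

while (1) {
    # Pick the source that is not at EOF and has the fewest queued items.
    my $n;
    for my $j (0 .. 1) {
        next if $eof[$j];
        $n = $j if !defined($n) || @{ $it[$j] } <= @{ $it[$n] };
    }
    last if !defined $n;

    # Read one small chunk from the source that is behind.
    my $chunk = substr($src[$n], $pos[$n], 5);
    if (length $chunk) {
        $pos[$n] += length $chunk;
        feed($n, $chunk);
    } else {
        $eof[$n] = 1;
    }

    # Pair up whatever both queues can supply right now.
    while (@{ $it[0] } && @{ $it[1] }) {
        push @pairs, shift(@{ $it[0] }) . " " . shift(@{ $it[1] });
    }
}

print "$_\n" for @pairs;
```

With these inputs the sketch prints the three lines "a longer-one", "bb longer-two" and "ccc longer-three". Because the source with fewer queued items is always read next, neither queue grows far ahead of the other, even though the second source needs many more chunks per item.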

The warnings I have commented out, when enabled, show that the files are indeed read in parallel. I also hope that chunks of the file we have already processed don't remain in memory and that there are no other bugs, but you should of course verify this yourself if you want to use this code in production.

use warnings; use strict;
use Encode;
use XML::Twig;
binmode STDERR, ":encoding(iso-8859-2)";
our(@XMLH, @xmln, @tw, @pa, @eof, @it, $two, $roo);
for my $n (0 .. 1) {
    $xmln[$n] = shift || ("a1.xml", "a2.xml")[$n];
    open $XMLH[$n], "<", $xmln[$n] or die "error open xml${n}: $!";
    $tw[$n] = XML::Twig->new;
    $tw[$n]->setTwigHandler("item", sub {
        my($twt, $e) = @_;
        my $t = $e->text;
        #warn " "x(24+8*$n), "${n}g|$t|\n";
        push @{$it[$n]}, $t;
        $twt->purge;
    });
    $pa[$n] = $tw[$n]->parse_start;
    $it[$n] = [];
}
$two = XML::Twig->new(output_filter => "safe", pretty_print => "nice");
$roo = XML::Twig::Elt->new("doc");
$two->set_root($roo);
while (1) {
    my $n = undef;
    my $itq = 1e9999;
    for my $j (0 .. 1) {
        if (!$eof[$j] && @{$it[$j]} <= $itq) {
            $n = $j;
            $itq = @{$it[$j]};
        }
    }
    if (!defined($n)) {
        last;
    }
    if (read $XMLH[$n], my $b, 20) {
        #my $bp = decode("iso-8859-2", $b); $bp =~ y/\r\n/./;
        #warn " "x(8+8*$n), "${n}r|$bp|\n";
        $pa[$n]->parse_more($b);
    } else {
        eof($XMLH[$n]) or die "error reading xml${n}";
        $pa[$n]->parse_done;
        $eof[$n]++;
    }
    my $eo;
    while (@{$it[0]} && @{$it[1]}) {
        my $i0 = shift @{$it[0]};
        my $i1 = shift @{$it[1]};
        $eo = XML::Twig::Elt->new("item", "$i0 $i1");
        $eo->paste_last_child($roo);
        #warn "p|$i0 $i1|\n";
    }
    if (defined($eo)) {
        $two->flush_up_to($eo);
    }
}
for my $n (0 .. 1) {
    if (my $c = @{$it[$n]}) {
        warn "warning: xml${n} has $c additional items";
    }
}
$two->flush;
#warn "all done";
__END__

Update 2013-04-23: RFC: Simulating Ruby's "yield" and "blocks" in Perl may be related.

Replies are listed 'Best First'.
Re^2: Processing Two XML Files in Parallel
by Jenda (Abbot) on Jul 25, 2011 at 22:00 UTC

    Fiddling with Coro or reading in blocks when all you need is a pull-style parser seems a bit silly, even though it is a nice exercise.

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.
