I agree with the previous replies that running two XML parsers, each in its own Coro thread, seems to be a good way to do this. However, I'd like to show a solution that does not use Coro, just for the challenge of it.

This solution uses the stream parsing capability of XML::Parser (parse_start, parse_more and parse_done). The XML::Twig documentation states that you probably should not use this with XML::Twig and that it is untested.
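To illustrate the stream-parsing interface on its own, here is a minimal sketch that uses XML::Parser directly rather than XML::Twig. The sample document, the variable names and the 7-byte chunk size are assumptions made for the demonstration. It only shows that handlers fire while input is fed piecewise, and it is not the code from this post.

```perl
use strict;
use warnings;
use XML::Parser;

my @items;   # text of each completed <item>
my $text;    # text collected for the <item> currently open, if any

my $parser = XML::Parser->new(Handlers => {
    Start => sub { my ($expat, $tag) = @_; $text = "" if $tag eq "item" },
    # Character data may arrive in several pieces, so accumulate it.
    Char  => sub { my ($expat, $str) = @_; $text .= $str if defined $text },
    End   => sub {
        my ($expat, $tag) = @_;
        return unless $tag eq "item";
        push @items, $text;
        undef $text;
    },
});

# parse_start returns an XML::Parser::ExpatNB object for incremental parsing.
my $nb  = $parser->parse_start;
my $xml = "<doc><item>alpha</item><item>beta</item></doc>";
while (length $xml) {
    # Cut the next 7 bytes off the front of the input and feed them in.
    $nb->parse_more(substr $xml, 0, 7, "");
}
$nb->parse_done;

print "$_\n" for @items;
```

Because the input is split at arbitrary byte positions, tags and text are routinely cut in half between two parse_more calls. The parser buffers the incomplete part until the rest arrives.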

We read the input XML files in small chunks (20 bytes here for demonstration, but it should be much more than that in a real application). In each loop iteration, we read from the file that's behind the other, that is, the one from which we have read fewer items so far. This way, the files remain in sync even if the lengths of the items differ. Once the XML parser has found an item from both files, we pair these and print an item with the two texts concatenated.
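The scheduling rule can be looked at separately from the XML parsing. The following is a minimal sketch in core Perl only: the nested @chunks array is a hypothetical stand-in for the two parsers, where each inner list holds the items that one read happened to complete (possibly none). All names here are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsers: per source, the items each read yields.
my @chunks = (
    [ ["a"], [], ["b", "c"], ["d"] ],        # source 0
    [ [], ["1", "2"], ["3"], [], ["4"] ],    # source 1
);
my @pending = ([], []);   # items parsed but not yet paired
my @eof     = (0, 0);
my @pairs;

while (1) {
    # Choose the live source with the fewest pending items.
    # On a tie the later source wins, because of the <= comparison.
    my $n;
    for my $j (0 .. 1) {
        next if $eof[$j];
        $n = $j if !defined($n) || @{ $pending[$j] } <= @{ $pending[$n] };
    }
    last if !defined $n;   # both sources are exhausted

    if (@{ $chunks[$n] }) {
        # One "read": queue whatever items this chunk completed.
        push @{ $pending[$n] }, @{ shift @{ $chunks[$n] } };
    } else {
        $eof[$n] = 1;
    }

    # Pair items while both sources have one waiting.
    while (@{ $pending[0] } && @{ $pending[1] }) {
        push @pairs, shift(@{ $pending[0] }) . " " . shift(@{ $pending[1] });
    }
}

print "$_\n" for @pairs;
```

Since paired items are removed from both queues at the same time, comparing the queue lengths is the same as comparing how many items have been read from each source so far.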

The warnings I have commented out show that the files are indeed read in parallel. I also hope that the chunks of the files we have already processed don't remain in memory and that there are no other bugs, but you should of course verify this yourself if you want to use this code in production.

    use warnings; use strict;
    use Encode;
    use XML::Twig;
    binmode STDERR, ":encoding(iso-8859-2)";
    our(@XMLH, @xmln, @tw, @pa, @eof, @it, $two, $roo);
    for my $n (0 .. 1) {
        $xmln[$n] = shift || ("a1.xml", "a2.xml")[$n];
        open $XMLH[$n], "<", $xmln[$n] or die "error open xml${n}: $!";
        $tw[$n] = XML::Twig->new;
        $tw[$n]->setTwigHandler("item", sub {
            my($twt, $e) = @_;
            my $t = $e->text;
            #warn " "x(24+8*$n), "${n}g|$t|\n";
            push @{$it[$n]}, $t;
            $twt->purge;
        });
        $pa[$n] = $tw[$n]->parse_start;
        $it[$n] = [];
    }
    $two = XML::Twig->new(output_filter => "safe", pretty_print => "nice");
    $roo = XML::Twig::Elt->new("doc");
    $two->set_root($roo);
    while (1) {
        my $n = undef;
        my $itq = 1e9999;
        for my $j (0 .. 1) {
            if (!$eof[$j] && @{$it[$j]} <= $itq) {
                $n = $j;
                $itq = @{$it[$j]};
            }
        }
        if (!defined($n)) {
            last;
        }
        if (read $XMLH[$n], my $b, 20) {
            #my $bp = decode("iso-8859-2", $b); $bp =~ y/\r\n/./;
            #warn " "x(8+8*$n), "${n}r|$bp|\n";
            $pa[$n]->parse_more($b);
        } else {
            eof($XMLH[$n]) or die "error reading xml${n}";
            $pa[$n]->parse_done;
            $eof[$n]++;
        }
        my $eo;
        while (@{$it[0]} && @{$it[1]}) {
            my $i0 = shift @{$it[0]};
            my $i1 = shift @{$it[1]};
            $eo = XML::Twig::Elt->new("item", "$i0 $i1");
            $eo->paste_last_child($roo);
            #warn "p|$i0 $i1|\n";
        }
        if (defined($eo)) {
            $two->flush_up_to($eo);
        }
    }
    for my $n (0 .. 1) {
        if (my $c = @{$it[$n]}) {
            warn "warning: xml${n} has $c additional items";
        }
    }
    $two->flush;
    #warn "all done";
    __END__

Update 2013-04-23: RFC: Simulating Ruby's "yield" and "blocks" in Perl may be related.


In reply to Re: Processing Two XML Files in Parallel by ambrus
in thread Processing Two XML Files in Parallel by tedv
