I agree with the previous replies that running two XML parsers, each in its own Coro, seems to be a good way to do this. However, I'd like to show a solution that doesn't use Coro, just for the challenge of it.

This solution uses the stream parsing capability of XML::Parser (parse_start, parse_more, parse_done). The documentation of XML::Twig states that you probably should not use this with XML::Twig and that it is untested.
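
To show what stream parsing means here, this is a minimal sketch using plain XML::Parser rather than XML::Twig. The sample document, the 7-byte chunk size and the variable names are made up for illustration; only parse_start, parse_more and parse_done come from the code in this post.

```perl
use strict;
use warnings;
use XML::Parser;

my (@items, $buf);
my $in_item = 0;

# Collect the text of every <item> element as soon as it is complete.
my $parser = XML::Parser->new(Handlers => {
    Start => sub { my ($p, $el) = @_; if ($el eq "item") { $in_item = 1; $buf = ""; } },
    Char  => sub { my ($p, $s)  = @_; $buf .= $s if $in_item; },
    End   => sub { my ($p, $el) = @_; if ($el eq "item") { push @items, $buf; $in_item = 0; } },
});

# parse_start returns a non-blocking parser object that accepts input piecewise.
my $nb = $parser->parse_start;

my $xml = "<doc><item>alpha</item><item>beta</item></doc>";
while (length $xml) {
    # Feed the next 7 bytes; 4-argument substr removes them from $xml.
    $nb->parse_more(substr($xml, 0, 7, ""));
    print scalar(@items), " item(s) complete so far\n";
}
$nb->parse_done;
```

The handlers fire as soon as enough input has arrived, so the caller decides when each parser gets more data. That is what lets a single loop drive two parsers alternately.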

We read the input XML files in small chunks (20 bytes here for demonstration; a real application should use much larger chunks). In each loop iteration, we read from the file that is behind the other, that is, the one from which we have read fewer items so far. This way, the files stay in sync even if the lengths of the items differ. Once the XML parser has found an item in both files, we pair them and print an item with the two texts concatenated.

The warnings I have commented out show that the files are indeed read in parallel. I also hope that chunks of the file we have already processed don't remain in memory and that there are no other bugs, but you should of course verify this yourself if you want to use this code in production.

use warnings; use strict;
use Encode;
use XML::Twig;
binmode STDERR, ":encoding(iso-8859-2)";
our(@XMLH, @xmln, @tw, @pa, @eof, @it, $two, $roo);
for my $n (0 .. 1) {
    $xmln[$n] = shift || ("a1.xml", "a2.xml")[$n];
    open $XMLH[$n], "<", $xmln[$n] or die "error open xml${n}: $!";
    $tw[$n] = XML::Twig->new;
    $tw[$n]->setTwigHandler("item", sub {
        my($twt, $e) = @_;
        my $t = $e->text;
        #warn " "x(24+8*$n), "${n}g|$t|\n";
        push @{$it[$n]}, $t;
        $twt->purge;
    });
    $pa[$n] = $tw[$n]->parse_start;
    $it[$n] = [];
}
$two = XML::Twig->new(output_filter => "safe", pretty_print => "nice");
$roo = XML::Twig::Elt->new("doc");
$two->set_root($roo);
while (1) {
    my $n = undef; my $itq = 1e9999;
    for my $j (0 .. 1) {
        if (!$eof[$j] && @{$it[$j]} <= $itq) {
            $n = $j; $itq = @{$it[$j]};
        }
    }
    if (!defined($n)) {
        last;
    }
    if (read $XMLH[$n], my $b, 20) {
        #my $bp = decode("iso-8859-2", $b); $bp =~ y/\r\n/./;
        #warn " "x(8+8*$n), "${n}r|$bp|\n";
        $pa[$n]->parse_more($b);
    } else {
        eof($XMLH[$n]) or die "error reading xml${n}";
        $pa[$n]->parse_done;
        $eof[$n]++;
    }
    my $eo;
    while (@{$it[0]} && @{$it[1]}) {
        my $i0 = shift @{$it[0]};
        my $i1 = shift @{$it[1]};
        $eo = XML::Twig::Elt->new("item", "$i0 $i1");
        $eo->paste_last_child($roo);
        #warn "p|$i0 $i1|\n";
    }
    if (defined($eo)) {
        $two->flush_up_to($eo);
    }
}
for my $n (0 .. 1) {
    if (my $c = @{$it[$n]}) {
        warn "warning: xml${n} has $c additional items";
    }
}
$two->flush;
#warn "all done";
__END__
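
The step at the top of the main loop, choosing which file to read next, can be looked at in isolation. The sketch below is only an illustration: pick_lagging is a hypothetical helper that is not part of the program above, and it takes the EOF flags and queue lengths as plain array references. Comparing queue lengths is equivalent to comparing the number of items read so far, because paired items are always shifted off both queues together.

```perl
use strict;
use warnings;

# Hypothetical helper: return the index of the stream that has not yet
# reached EOF and has the fewest queued items, or undef if all are at EOF.
# Ties go to the later index, like the <= comparison in the loop above.
sub pick_lagging {
    my ($eof, $queued) = @_;
    my ($best, $min);
    for my $j (0 .. $#$queued) {
        next if $eof->[$j];
        if (!defined $min || $queued->[$j] <= $min) {
            ($best, $min) = ($j, $queued->[$j]);
        }
    }
    return $best;
}

my $next = pick_lagging([0, 0], [3, 1]);   # stream 1 is behind, so $next is 1
print "read from stream $next\n";
```

Once one stream is at EOF, the other one is always chosen until it ends too, and the loop stops when the helper returns undef.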

Update 2013-04-23: RFC: Simulating Ruby's "yield" and "blocks" in Perl may be related.

In reply to Re: Processing Two XML Files in Parallel by ambrus
in thread Processing Two XML Files in Parallel by tedv
