The easiest way is to just bring everything into memory and deal with it. In CB, you said that you don't have a TB of RAM, so I'm assuming these files are GB+ in size. At which point, I'm wondering WTF they're doing in XML :-)

I also don't quite follow how you want to do the comparison. Is it just the text of certain nodes? The text of all nodes? XML::Twig allows you to flush the in-memory representation, freeing up all the memory used thus far, but whether you can do that really depends on how you're thinking of doing the comparison. With line-record-based text, it's fairly obvious. With XML, the definition of "record" is much less clear in general - only you know the specifics.
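
To make the flushing idea concrete, here is a minimal sketch of record-at-a-time handling in a single file. The element name record and the helper record_texts are placeholders I made up; substitute whatever counts as a "record" in your data. For big files you would call parsefile on a filename rather than parse on a string.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: collect the text of every <record> element,
# discarding the parsed tree as we go so memory use stays flat.
sub record_texts {
    my ($xml) = @_;
    my @texts;
    my $twig = XML::Twig->new(
        twig_handlers => {
            record => sub {
                my ($t, $elt) = @_;      # XML::Twig passes (twig, element)
                push @texts, $elt->text; # copy out what we need first...
                $t->purge;               # ...then drop everything parsed so far
            },
        },
    );
    $twig->parse($xml);
    return @texts;
}
```

purge is the non-printing sibling of flush: both release what has been parsed so far, but flush also prints it. Anything you still need for the comparison has to be copied out before the call.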

As I said in CB, I'd consider turning XML::Twig on its head with Coro, so that the parser's callbacks feed an iterator you pull from instead of driving your code. It looks like you should be able to do the same with XML::Parser. Warning: the following code is COMPLETELY untested. Channels may be required instead of rouse_wait'ing all the time.

use Coro;
use XML::Twig;

sub twig_iterator
{
  my $file = shift;
  my $cb   = Coro::rouse_cb;
  my $twig = XML::Twig->new(
    twig_handlers => {
      # each handler forwards (name, $twig, $element) to the waiting iterator
      elem      => sub { $cb->(elem => @_) },
      otherelem => sub { $cb->(otherelem => @_) },
    },
  );

  # parse in a separate coroutine; the final $cb->() rouses with no parameters.
  async { shift->parsefile($file); $cb->() } $twig;

  sub {
    Coro::rouse_wait($cb); # will return the parameters received by $cb above
  }
}

my $itA = twig_iterator($fileA);
my $itB = twig_iterator($fileB);

while (1)
{
  # if array has no items, it's done parsing, otherwise:
  # [0] == elem name (hardcoded in above)
  # [1..$#array] == items passed in by XML::Twig to the callback
  my @A = $itA->();
  my @B = $itB->();
  last unless @A && @B; # stop once either file is exhausted

  # compare?
}

I'm not sure if this properly deals with end-of-files, but I think so. Like I said, UNTESTED. Be sure to have proper twig flushing (I think the [1] items will be the twig reference) so that you don't use all your RAM (if this isn't a problem, then don't use this at all - just suck the whole files in!).
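
For what it's worth, here is how I'd sketch the channel variant mentioned above. It is a sketch under assumptions, not something I have run against real data: as far as I know a rouse callback is meant to be waited on once, so a Coro::Channel with capacity 1 seems a better fit for a stream of elements. The names twig_channel_iterator and count_differences are made up for illustration, and comparing the serialized element (sprint) is only one possible definition of "same record".

```perl
use strict;
use warnings;
use Coro;
use Coro::Channel;
use XML::Twig;

# Hypothetical helper: returns an iterator yielding [name, serialized element]
# for each element in @names, and undef once the file is exhausted.
sub twig_channel_iterator {
    my ($file, @names) = @_;
    my $chan = Coro::Channel->new(1);   # capacity 1: the parser waits for the consumer

    my %handlers;
    for my $name (@names) {
        $handlers{$name} = sub {
            my ($twig, $elt) = @_;
            $chan->put([ $name, $elt->sprint ]);  # hand over a plain copy
            $twig->purge;                         # then free the parsed tree
        };
    }

    # Parse in its own coroutine; undef is the end-of-file sentinel.
    async {
        eval { XML::Twig->new(twig_handlers => \%handlers)->parsefile($file); 1 }
            or warn "parse of $file failed: $@";
        $chan->put(undef);
    };

    my $done;
    return sub {
        return if $done;                # never wait on a finished parser
        my $rec = $chan->get;
        $done = 1 unless defined $rec;
        return $rec;
    };
}

# Hypothetical driver: walk both files in lockstep and count mismatches.
sub count_differences {
    my ($fileA, $fileB, @names) = @_;
    my $itA = twig_channel_iterator($fileA, @names);
    my $itB = twig_channel_iterator($fileB, @names);
    my $diffs = 0;
    while (1) {
        my $recA = $itA->();
        my $recB = $itB->();
        last unless $recA || $recB;     # both files exhausted
        $diffs++ if !$recA || !$recB
                 || $recA->[0] ne $recB->[0]
                 || $recA->[1] ne $recB->[1];
    }
    return $diffs;
}

print count_differences(@ARGV[0, 1], 'elem'), " differing records\n" if @ARGV == 2;
```

The handlers copy the data out before purging, so the consumer never holds a reference into a tree the parser has already thrown away. Whether a lockstep comparison is right depends on your records lining up one-to-one in both files.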

In reply to Re: Processing Two XML Files in Parallel by Tanktalus
in thread Processing Two XML Files in Parallel by tedv