Hello, fellow monastics.

I've recently been working with XML::SAX::Machines (specifically XML::SAX::ByRecord), setting up a stream parser for large collections of XML documents.

Essentially, the stream looks like:

<Container>
<Doc>
  <content>stuff</content>
</Doc>
<Doc>
  <content>different stuff</content>
</Doc>
...
</Container>

I need to be able to grab everything between (and including) <Doc> and </Doc> as it comes through the stream, and treat it like its own "Document".

All of the examples for XML::SAX::ByRecord that I've looked at show how to write *filters* that process records as I've described. None that I have seen (probably a problem with my eyesight, not the documentation) explain how to work with each of these "Documents" in the stream as its own isolated chunk of content, so that it can be processed independently of the filter.

Below is the code that I've come up with to handle the task described above. I can't help thinking that it's a bit of a kludge. What is a more elegant way to deal with this type of stream processing?

Thanks in advance,

dug

#!/usr/bin/perl

use warnings;
use strict;
$|++;

use XML::SAX::Machines qw( :all );

my $output_handle; # global stream output container

##
# callback for end_document event.
my $write_hook = sub {
  my $self = shift;
  my $current_doc = $output_handle; # get contents of output buffer
  $output_handle = '';              # clear buffer for next doc
  ## process current doc
  process_doc( $current_doc );
};

my $filter = EndDocumentAction->new(end_hook => $write_hook);

my $machine = Pipeline(
  ByRecord( $filter ),
  \$output_handle,
);

$machine->parse_file( \*DATA );

sub process_doc {
  my $content = shift;
  # do something interesting
  print $content, "\n";
}

package EndDocumentAction;
use base qw( XML::SAX::Base );

sub new {
  my ($class, %args) = @_;
  my $self = {};
  $self->{End_Hook}    = $args{end_hook}; # install callback for end_document
  $self->{start_counter} = 0;
  bless $self, $class;
  return $self;
}

sub end_document {
  my $self = shift;
  my $callback = $self->{End_Hook};
  $self->$callback();
}

1;

__END__
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>

In reply to XML stream processing by dug
