Hello, fellow monastics.
I've recently been working with XML::SAX::Machines (specifically XML::SAX::ByRecord), setting up a stream parser for large collections of XML documents.
Essentially, the stream looks like:
<Container>
<Doc>
<content>stuff</content>
</Doc>
<Doc>
<content>different stuff</content>
</Doc>
...
</Container>
I need to be able to grab everything between (and including)
<Doc> and </Doc> as it comes through the stream, and treat it like
its own "Document".
All of the examples for XML::SAX::ByRecord that I've looked at show how to write *filters* that process records in the way I've described. None
that I have seen (probably a problem with my eyesight, not the
documentation) explains how to work with each of these "Documents" in the
stream as its own isolated chunk of content, so that one can process it
independently of the filter.
Below is the code that I've come up with to handle the task explained above. I can't help thinking that it's a bit of a kludge. What is a more elegant way to deal with this type of stream processing?
Thanks in advance,
dug
#!/usr/bin/perl
use warnings;
use strict;
$|++;
use XML::SAX::Machines qw( :all );
my $output_handle; # global stream output container
##
# callback for end_document event.
my $write_hook = sub {
    my $self = shift;
    my $current_doc = $output_handle; # get contents of output buffer
    $output_handle = '';              # clear buffer for next doc
    ## process current doc
    process_doc( $current_doc );
};
my $filter = EndDocumentAction->new(end_hook => $write_hook);
my $machine = Pipeline(
    ByRecord( $filter ),
    \$output_handle,
);
$machine->parse_file( \*DATA );
sub process_doc {
    my $content = shift;
    # do something interesting
    print $content, "\n";
}
package EndDocumentAction;
use base qw( XML::SAX::Base );
sub new {
    my ($class, %args) = @_;
    my $self = {};
    $self->{End_Hook} = $args{end_hook}; # install callback for end_document
    $self->{start_counter} = 0;
    bless $self, $class;
    return $self;
}
sub end_document {
    my $self = shift;
    my $callback = $self->{End_Hook};
    $self->$callback();
}
1;
__END__
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>
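For comparison, here is a minimal sketch of the same per-record handling done outside the SAX machinery, using XML::Twig instead. This is a swapped-in technique, not an XML::SAX::ByRecord solution, and it assumes XML::Twig is installed. The name each_doc and its callback argument are hypothetical, made up for illustration. The handler fires once per completed <Doc> element, hands the serialized record to the callback, and then purges what has been parsed so far.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of any module): calls $callback once per
# <Doc> record, passing the record serialized as a string.
sub each_doc {
    my ( $source, $callback ) = @_;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Runs each time a <Doc> element has been completely parsed.
            Doc => sub {
                my ( $t, $doc ) = @_;
                $callback->( $doc->sprint );    # <Doc>...</Doc> as text
                $t->purge;                      # release what has been parsed so far
            },
        },
    );
    $twig->parse($source);    # a string here; an open filehandle also works
    return;
}

# Same sample data as in the listing above.
my $xml = <<'XML';
<Stream>
<Doc>
<foo>hey man</foo>
</Doc>
<Doc>
<bar>hey man, how's it goin'?</bar>
</Doc>
<Doc>
<baz>pretty right on.</baz>
</Doc>
</Stream>
XML

# Stand-in for the real per-record work.
each_doc( $xml, sub { print shift, "\n" } );
```

Each record reaches the callback as an isolated string, so no shared output buffer is needed. Whether this counts as more elegant than the ByRecord pipeline is a judgment call; it mainly trades the SAX filter chain for a single handler.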