msalerno has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a script that will pull rrd xport data from a webserver. I'm using LWP::UserAgent as well as XML::LibXML::SAX to retrieve and parse the incoming data. I am using an LWP::UserAgent callback to pass the XML to the SAX parser in chunks. Some of the xports can be larger than 100 MB, and it would be a waste of space to store all of the XML. The XML parser builds a data structure out of the XML data and returns it to main.

What I am having trouble with is finding the best way to pass variables and objects around to these different handlers. Whenever I write a sub, I keep it self-contained: all vars and objects worked on by that sub are explicitly passed to it and returned.

The first issue is passing the parser object to the LWP::UserAgent callback. The second issue is storing the data structure built by the SAX parser.

For the LWP::UserAgent callback, I don't see a way to pass additional vars. SAX is so new to me that I'm not sure where to begin.
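One common way to hand extra objects to a callback that only receives fixed arguments is a closure. The sketch below is a minimal illustration, not the poster's code: `ChunkCollector` and `make_chunk_callback` are hypothetical names, and `ChunkCollector` merely stands in for any parser object that offers a `parse_chunk` method. The chunks are fed by hand so the example runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the SAX parser: any object with parse_chunk().
package ChunkCollector;

sub new {
    my ($class) = @_;
    return bless { chunks => [] }, $class;
}

sub parse_chunk {
    my ($self, $chunk) = @_;
    push @{ $self->{chunks} }, $chunk;
    return 1;
}

package main;

# Return a callback that closes over $parser, so the parser never has
# to live in a file-scoped variable.
sub make_chunk_callback {
    my ($parser) = @_;
    return sub {
        # LWP::UserAgent calls a content callback with the data chunk,
        # the response object, and the protocol object.
        my ($chunk, $response, $protocol) = @_;
        $parser->parse_chunk($chunk);
        return 1;
    };
}

my $parser   = ChunkCollector->new;
my $callback = make_chunk_callback($parser);

# With LWP this would be passed as: $ua->request($request, $callback);
# Here two chunks are simulated by calling the closure directly.
$callback->($_) for '<xport><meta>', '</meta></xport>';
```

Because the closure carries its own reference to the parser, the sub that builds the request can stay self-contained and receive the parser as an ordinary argument.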

I'm open to all suggestions and critiques (especially with SAX).

Below is the working code (still pretty ugly) and a working xml example.

<xport>
  <meta>
    <start>1020611700</start>
    <step>300</step>
    <end>1020615600</end>
    <rows>14</rows>
    <columns>2</columns>
    <legend>
      <entry>out bytes</entry>
      <entry>in and out bits</entry>
    </legend>
  </meta>
  <data>
    <row><t>1020611700</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020612000</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020612300</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020612600</t><v>3.4113333333e+00</v><v>5.4581333333e+01</v></row>
    <row><t>1020612900</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020613200</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020613500</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020613800</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020614100</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020614400</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020614700</t><v>3.7333333333e+00</v><v>5.9733333333e+01</v></row>
    <row><t>1020615000</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020615300</t><v>3.4000000000e+00</v><v>5.4400000000e+01</v></row>
    <row><t>1020615600</t><v>NaN</v><v>NaN</v></row>
  </data>
</xport>

#!/usr/bin/perl -w
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;
use LWP::UserAgent;

my %data;
$data{values} = ();

my $factory = XML::SAX::ParserFactory->new;
$XML::SAX::ParserPackage = "XML::LibXML::SAX::Better";
$factory->require_feature('http://xml.org/sax/features/namespaces');

# now we do the way we want, sending chunks:
my $streamed_events;
my $handler = EventRecorder->new(\$streamed_events);
my $p = $factory->parser(Handler => $handler);

my $epoch = time;
my $url = 'http://localhost/rrd_compare.xml';
my $xml = httpgetxml($url);

sub httpgetxml {
    my $url = shift;
    my $ua = LWP::UserAgent->new;
    my $request = HTTP::Request->new(GET => $url);
    my $stuff = $ua->request($request, \&parseXenXMLchunk);
}

sub parseXenXMLchunk{
    my ($data, $res, $req) = @_;
    $p->parse_chunk($data);
    return 1;
}

print Dumper \%data;

package EventRecorder;
use strict;
use base qw(XML::SAX::Base);

sub new {
    my ($class, $outref) = @_;
    $$outref = "";
    return bless { outref => $outref, };
}

sub start_element {
    my ($self, $data) = @_;
}

sub characters {
    my $self = shift;
    my $text = shift;
    $self->{text} .= $text->{Data};
}

sub end_element{
    my $self = shift;
    my $data = shift;
    my $text = $self->get_text();

    # To be cleaned up later
    $text =~ s/\n//g;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /;

    my $local_name = $data->{LocalName};
    if ($local_name eq "step"){
        $data{$local_name} = $text;
    } elsif ($local_name eq "entry"){
        push @{$data{datasource}}, $text;
    } elsif ($local_name eq "t"){
        $data{lasttime} = $text;
    } elsif ($local_name eq "v"){
        push @{$data{values}{$data{lasttime}}}, $text;
    }
}

sub get_text {
    my $self = shift;
    my $text = '';
    if ( defined( $self->{text} ) ) {
        $text = $self->{text};
        $self->{text} = '';
    }
    return $text;
}

# XML::LibXML::SAX::Better an extended SAX handler by Djabberd project
package XML::LibXML::SAX::Better;
use strict;
use vars qw($VERSION @ISA);
$VERSION = '1.00';
use XML::LibXML;
use XML::SAX::Base;
use base qw(XML::SAX::Base);

sub new {
    my ($class, @params) = @_;
    my $inst = $class->SUPER::new(@params);

    my $libxml = XML::LibXML->new;
    $libxml->set_handler( $inst );
    $inst->{LibParser} = $libxml;

    # setup SAX. 1 means "with SAX"
    $libxml->_start_push(1);
    $libxml->init_push;

    return $inst;
}

sub parse_chunk {
    my ( $self, $chunk ) = @_;
    my $libxml = $self->{LibParser};
    my $rv = $libxml->push($chunk);
}

sub finish_push {
    my $self = shift;
    return 1 unless $self->{LibParser};
    my $parser = delete $self->{LibParser};
    return eval { $parser->finish_push };
}

1;

Thanks

Replies are listed 'Best First'.
Re: Handler semantics
by eff_i_g (Curate) on Dec 08, 2010 at 19:19 UTC
    Have you considered XML::Twig? It handles large documents well.
      Regardless of the XML interface, what is the best approach to pass vars to the handlers? A twig object would make life a little easier, but it wouldn't be the complete answer. Plus, SAX is already working.
        I just added the parser ref to the $ua object, and realized that my DOM parsers needed to be updated. The output is being sent back to main in $streamed_events.
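For readers following along, here is a rough sketch of how a handler might send its results back through the reference passed to its constructor, in the spirit of `EventRecorder->new(\$streamed_events)` from the question. It is an assumption about the approach, not the poster's final code: `OutrefRecorder` is a hypothetical name, it does not inherit from XML::SAX::Base, and the SAX events are fed by hand so the example runs with core Perl only. In real use the same methods would be called by the parser.

```perl
use strict;
use warnings;

package OutrefRecorder;    # hypothetical handler name

sub new {
    my ($class, $outref) = @_;
    # The caller's scalar becomes the home of the result structure.
    $$outref = { datasource => [], values => {} };
    return bless { outref => $outref, text => '' }, $class;
}

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};
}

sub end_element {
    my ($self, $el) = @_;

    # Take the accumulated text, trim it, and reset the buffer.
    (my $text = $self->{text}) =~ s/^\s+|\s+$//g;
    $self->{text} = '';

    my $out  = ${ $self->{outref} };
    my $name = $el->{LocalName};
    if    ($name eq 'step')  { $out->{step} = $text }
    elsif ($name eq 'entry') { push @{ $out->{datasource} }, $text }
    elsif ($name eq 't')     { $self->{lasttime} = $text }
    elsif ($name eq 'v')     { push @{ $out->{values}{ $self->{lasttime} } }, $text }
    return;
}

package main;

my $streamed_events;
my $handler = OutrefRecorder->new(\$streamed_events);

# Hand-fed events standing in for what a SAX parser would emit.
for my $event (
    [ step  => '300' ],
    [ entry => 'out bytes' ],
    [ t     => '1020611700' ],
    [ v     => '3.4000000000e+00' ],
    [ v     => '5.4400000000e+01' ],
) {
    my ($name, $text) = @$event;
    $handler->characters({ Data => $text });
    $handler->end_element({ LocalName => $name });
}
```

All parsing state lives in the handler object and the result lands in the caller's `$streamed_events`, so the handler package no longer needs to reach into a file-scoped `%data`.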