
SOLVED: XML Parsing from URL

by jshank (Initiate)
on Jun 26, 2015 at 15:02 UTC

jshank has asked for the wisdom of the Perl Monks concerning the following question:

SOLVED in the update at the end. I need to parse a continuous XML stream coming from a device over HTTP. The problem is that I can't find a way to convince PerlSAX to take a URL as a source. There's no good way to "download" the XML and hand it off to the XML parser, because the download is never going to finish (it is a continuous stream).

# This is the URL that streams alerts such as motion detection. Documentation at
my ($url) = "http://".$user.":".$password."@".$ip.":".$port."/ISAPI/Event/notification/alertStream";

# start an instance of SAX parser for each monitor
my $handler = SAXAlertStreamHandler->new();
my $parser = XML::Parser::PerlSAX->new( Handler => $handler );
my %parser_args = (Source => {SystemId => $url});
$parser->parse(%parser_args);
exit;

Couldn't open http://user:password@ +ion/alertStream: No such file or directory at /usr/local/share/perl/5.18.2/XML/Parser/P line 146.

Update: Based on the great information found here, I determined that preprocessing the data was required. LWP can pass the response content to a callback as it arrives. As a really nice side effect, LWP seems to call the callback once for each part of the multipart stream, so I don't need to set the read_size_hint. For even better multipart HTTP handling, see Alexander's module. Below is my working code:

#!/usr/bin/perl
use XML::Twig;
use warnings;
use strict;
use LWP;

my $url = 'http://username:password@';
my $browser = LWP::UserAgent->new();
my $twig = XML::Twig->new(
    twig_handlers => { EventNotificationAlert => \&AlertStreamHandler }
);

my $response = $browser->get(
    $url,
    ':content_cb'     => \&raw_handler,
    ':read_size_hint' => 1024,
);

# Called by LWP for each chunk of the response as it arrives
sub raw_handler {
    my ( $data, $response ) = @_;
    unless ( $data =~ /^--boundary/ ) {
        $twig->parse($data);
        #print $data;
    }
}

# Called by XML::Twig for each complete EventNotificationAlert element
sub AlertStreamHandler {
    my ( $twig, $eventAlert ) = @_;
    my $ip        = $eventAlert->first_child('ipAddress')->text;
    my $eventType = $eventAlert->first_child('eventType')->text;
    print "IP: " . $ip . "\n";
    print "Event: " . $eventType . "\n";
    $twig->purge;    # delete the twig so far. Not sure if this is needed.
}

Replies are listed 'Best First'.
Re: XML Parsing from URL
by dasgar (Priest) on Jun 26, 2015 at 16:03 UTC

    Searching for "xml stream", I found these two modules that might help you out: XML::Stream and XML::Atom::Stream. Perhaps one of them will work better for what you're trying to do, or will help you figure out how to handle and parse streaming XML data.

Re: XML Parsing from URL
by 1nickt (Canon) on Jun 26, 2015 at 15:42 UTC
    Looks like the parser_args Source hash should specify ByteStream; see the docs. Sorry, on a train ... can hardly type ....
Re: XML Parsing from URL
by Anonymous Monk on Jun 26, 2015 at 21:51 UTC

    That's an interesting question; unfortunately I don't have enough time to test it right now but I think that XML::Twig might be able to help you, since it processes documents piece by piece, and it's supposed to be able to read from an IO::Handle object. I just don't (yet) know of an HTTP client that provides one...

      I went with XML::Twig and got a little further (thanks!). Unfortunately, it's still upset that the XML isn't quite "well-formed". XML example:

      --boundary
      Content-Type: application/xml; charset="UTF-8"
      Content-Length: 478
      <EventNotificationAlert version="1.0" xmlns=" +ver10/XMLSchema">
      <ipAddress></ipAddress>
      <portNo>80</portNo>
      <protocol>HTTP</protocol>
      <macAddress>c4:2f:90:00:00:00</macAddress>
      <channelID>1</channelID>
      <dateTime>2015-06-24T19:37:22--8:00</dateTime>
      <activePostCount>0</activePostCount>
      <eventType>videoloss</eventType>
      <eventState>inactive</eventState>
      <eventDescription>videoloss alarm</eventDescription>
      </EventNotificationAlert>
      --boundary
      Content-Type: application/xml; charset="UTF-8"
      Content-Length: 514
      <EventNotificationAlert version="1.0" xmlns=" +ver10/XMLSchema">
      <ipAddress></ipAddress>
      <portNo>80</portNo>
      <protocol>HTTP</protocol>
      <macAddress>c4:2f:90:00:00:00</macAddress>
      <channelID>1</channelID>
      <dateTime>2015-06-24T19:37:22--8:00</dateTime>
      <activePostCount>1</activePostCount>
      <eventType>VMD</eventType>
      <eventState>active</eventState>
      <eventDescription>Motion alarm</eventDescription>
      <DetectionRegionList>
      </DetectionRegionList>
      </EventNotificationAlert>

        This "boundary" stuff and the two "Content"-lines look like HTTP multipart POST data (RFC2388) to me. On the other hand, HTTP POST data should also have a "Content-Disposition" header with a name attribute after each boundary.

        Is this real data or shortened? Where does the data come from?

        In an HTTP context, I would expect some library to parse the HTTP data and provide it in a more accessible form. For example, using the classic CGI module, each XML document would be available by its parameter name using the param() or upload() methods.


        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        It is a very interesting problem, but difficult to experiment with. Anyway, you can try to use twig_roots, or you can try to preprocess your input.
        In fact, I see a declared length in the header: it should be possible to read only the number of bytes declared in Content-Length and pass that chunk to XML::Twig to be processed.
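The Content-Length idea can be sketched in plain Perl, with no parser involved. This is only a rough sketch under assumptions: `extract_parts` is a hypothetical helper name, the boundary string is taken to be the literal `--boundary` seen in the sample, and each part is assumed to have a blank line between its headers and its body (the flattened sample in this thread does not show whether the device actually sends one). It keeps unfinished data in the buffer until the rest arrives.

```perl
use strict;
use warnings;

# Hypothetical helper: pull every complete part out of a growing buffer.
# Assumed layout per part: "--boundary", header lines, a blank line,
# then exactly Content-Length bytes of XML.
sub extract_parts {
    my ($buffer_ref) = @_;
    my @docs;
    while (1) {
        # Locate the next boundary line and the header block after it.
        last unless $$buffer_ref =~ /--boundary\r?\n(.*?)\r?\n\r?\n/s;
        my ( $headers, $body_start ) = ( $1, $+[0] );

        # Give up on this pass if the part declares no length.
        my ($length) = $headers =~ /^Content-Length:\s*(\d+)/mi or last;

        # Wait for more data if the body has not fully arrived yet.
        last if length($$buffer_ref) < $body_start + $length;

        push @docs, substr( $$buffer_ref, $body_start, $length );

        # Drop everything up to the end of this part.
        substr( $$buffer_ref, 0, $body_start + $length ) = '';
    }
    return @docs;
}
```

Called from a content callback, this would look roughly like `$buffer .= $data; $twig->parse($_) for extract_parts(\$buffer);`. Content-Length counts bytes, so the buffer should hold undecoded bytes rather than decoded characters.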

        Maybe you can elaborate a specific XML::Twig question as new SOPW, the author of the module lurks here sometimes..

        UPDATE: you can also read this interesting article

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: XML Parsing from URL
by marinersk (Priest) on Jun 26, 2015 at 15:41 UTC

    You say continuous stream, which suggests to me the XML is never finished and is thus always malformed, in the strictest interpretation.

    If that's the case, you may need to either roll your own parser (no, not my first choice either) or write a preprocessor that sits between the stream and the parsing module, "trims off" the parts that make the XML malformed, and feeds the resulting subset to the parser as XML that is well-formed by its standards.
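One way such a preprocessor might look, as a hedged sketch rather than a tested solution for this device: accumulate the raw chunks and hand back only complete root elements, throwing away whatever sits between them (boundary and header lines). `make_preprocessor` is a hypothetical name, and the sketch assumes the root element never nests inside itself.

```perl
use strict;
use warnings;

# Hypothetical preprocessor factory. Returns a closure that accepts raw
# chunks and returns the complete root elements found so far.
sub make_preprocessor {
    my ($root) = @_;    # root element name, e.g. 'EventNotificationAlert'
    my $buffer = '';
    return sub {
        my ($chunk) = @_;
        $buffer .= $chunk;
        my @docs;

        # Remove everything up to and including the first complete
        # element, keeping only the element itself. Incomplete data
        # stays in the buffer for the next call.
        while ( $buffer =~ s/\A.*?(<\Q$root\E\b.*?<\/\Q$root\E>)//s ) {
            push @docs, $1;
        }
        return @docs;
    };
}
```

Each returned string is a standalone document, so it could be passed to whichever parser is in use, for example `$twig->parse($doc)`. Matching on the element name instead of on Content-Length means the code does not depend on how the stream happens to be chunked.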

    Okay, earlier I couldn't type, now I can't read. Going to get more coffee. Sheesh.