Re^3: XML Parsing from URL

in reply to Re^2: XML Parsing from URL
in thread SOLVED: XML Parsing from URL

This "boundary" stuff and the two "Content"-lines look like HTTP multipart POST data (RFC2388) to me. On the other hand, HTTP POST data should also have a "Content-Disposition" header with a name attribute after each boundary.

Is this real data or shortened? Where does the data come from?

In a HTTP context, I would expect some library to parse the HTTP data and provide them in a more accessible form. For example, using the classic CGI module, each XML document would be available by its parameter name using the param() or upload() methods.

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Comment on Re^3: XML Parsing from URL Select or Download Code

Replies are listed 'Best First'.
Re^4: XML Parsing from URL by jshank (Initiate) on Jul 04, 2015 at 23:38 UTC
Thanks Alexander. This is real data coming from a Hikvision IP camera. There is a goo.gl shortlink in the original code if you want to look at the documentation for the interface. LWP was able to properly handle the multipart and then I parse out the XML portion.	[reply]
Re^5: XML Parsing from URL by afoken (Chancellor) on Jul 05, 2015 at 20:35 UTC
I had some free time to play with LWP and server push. In fact, it is quite easy to make LWP call a callback whenever a complete document from a multipart container is received. It does not matter if the part is received in one chunk or in many chunks, and it does not matter if one chunk of data contains one or more parts. The trick is to know that LWP uses HTTP::Response, which inherits from HTTP::Message. And `HTTP::Message` contains everything needed to handle multipart messages, both for server push and for multipart POST requests. This is my code. Feel free to use it. Make it a CPAN module if you like it. MultipartFilter.pm The module that sits on top of LWP::UserAgent and extracts the individual documents from multipart containers. Nothing special happens for single part documents. package MultipartFilter; use v5.12; use warnings; sub hookInto # $userAgent, onStart => sub { ... }, onDocument => sub { + ... }, onEnd => sub { ... } { my ($class,$ua,%args)=@_; my $onStart=$args{'onStart'}//sub {}; my $onDocument=$args{'onDocument'}//sub {}; my $onEnd=$args{'onEnd'}//sub {}; $ua->add_handler( response_header => sub { # my ($response,$ua,$h)=@_; my ($response,$ua)=@_; $onStart->($response,$ua); # Remember how many times we called the onDocument callbac +k $response->{'.multipartfilter'}=0; return; }, m_media_type => "multipart/" ); my $flushDocuments=sub { my ($parts,$response,$ua)=@_; # get rid of all the parts that we have already processed for (my $i=0; $i<$response->{'.multipartfilter'}; $i++) { shift @$parts; } # call the onDocument callback for all new parts for my $part (@$parts) { $onDocument->($part,$response,$ua); $response->{'.multipartfilter'}++; } }; $ua->add_handler( response_data => sub { # my ($response,$ua,$h,$data)=@_; my ($response,$ua)=@_; my @parts=$response->parts(); # The last part is special, it may be not yet completely t +ransmitted. # All other parts are complete, but may still need the onD +ocument # callback call: my $lastpart=pop @parts; $flushDocuments->(\@parts,$response,$ua); my $clen=$lastpart->header('Content-Length'); defined($clen) or die "Missing Content-Length header in mu +ltipart response"; my $cref=$lastpart->content_ref(); # don't copy possibly l +arge content around # If the content is shorter than announced, we have not ye +t # received all of it, and must not call the callback now. unless (length($$cref)<$clen) { $onDocument->($lastpart,$response,$ua); $response->{'.multipartfilter'}++; } return 1; }, m_media_type => "multipart/" ); $ua->add_handler( response_done => sub { # my ($response,$ua,$h)=@_; my ($response,$ua)=@_; my @parts=$response->parts(); # All parts are complete, including the last one. # Make sure the callback has been called for all of them. $flushDocuments->(\@parts,$response,$ua); $onEnd->($response,$ua); return; }, m_media_type => "multipart/" ); return; } 1; [download] Test script Uses LWP::UserAgent for a GET request to each of the two CGIs shown below, dumps the complete response, and each document found in a multipart container. #!/usr/bin/perl use v5.12; use warnings; use Data::Dumper; use LWP::UserAgent; use MultipartFilter; $\|=1; my $ua=LWP::UserAgent->new(); MultipartFilter->hookInto( $ua, onStart => sub { say "* Hey, I got a multi-part response!"; }, onDocument => sub { my ($part,$response,$ua)=@_; say " Hey, I got a new document!"; say "Headers are:"; say $part->headers->as_string(); say "Content is:"; say $part->content(); }, onEnd => sub { say " End of multipart response."; } ); say "* single-part response "; my $resp=$ua->get("http://localhost/~alex/files/server-push/hello.cgi" +); say $resp->as_string(); say " multi-part response *"; $resp=$ua->get("http://localhost/~alex/files/server-push/push-server.c +gi"); say $resp->as_string(); [download] Simple and stupid hello world CGI Available at `http://localhost/~alex/files/server-push/hello.cgi`, no server push. `#!/usr/bin/perl use v5.12; use warnings; $\|=1; print "Status: 200 OK\r\n", "Content-Type: text/plain\r\n", "\r\n", "Hello World!\r\n";` [download] Simple and stupid server push CGI Available at `http://localhost/~alex/files/server-push/push-server.cgi`, delivering three documents in a `multipart/mixed` container. There is an intentional delay of one second between each document. Another delay of one second happens while each document is written. The purpose of this delay is to split the document over two chunks in LWP::UserAgent, just to make it harder to get a complete document. #!/usr/bin/perl use v5.12; use warnings; $\|=1; my $boundary=join('-','cut','here',$$,time(),rand()); print "Status: 200 OK\r\n", "MIME-Version: 1.0\r\n", "Content-Type: multipart/mixed; boundary=\"$boundary\"\r\n", "\r\n"; for (1..3) { my $now=localtime; my $text1=<<__END_OF_HTML__; <!DOCTYPE html> <html> <head><title>Testing multipart/mixed</title></head> <body> <h1>Testing <code>multipart/mixed</code></h1> __END_OF_HTML__ my $text2=<<__END_OF_HTML__; <dl><dt>Counter</dt><dd>$_</dd><dt>Server time</dt><dd>$now</d +d></dl> </body> </html> __END_OF_HTML__ print "--$boundary\r\n", "Content-Type: text/html; charset=windows-1252\r\n", "Content-Length: ",length("$text1$text2"),"\r\n", "X-My-Counter: $_\r\n", "\r\n", $text1; sleep 1; print $text2; sleep 1; } print "--$boundary--\r\n"; [download] So, to properly handle documents, `use MultipartFilter;`, remove the `:content_cb` handler, delete `sub raw_handler`, and insert the following code before `my $response = $browser->get(...);`: `MultipartFilter->hookInto($browser, onDocument => sub { my $part=shift; $twig->parse($part->content()); } );` [download] Test enviroment: Apache 2.4.12 and Perl 5.18.1 on Slackware64 14.1 Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply] [d/l] [select]
Re^5: XML Parsing from URL by afoken (Chancellor) on Jul 05, 2015 at 12:28 UTC
OK, it's HTTP server push. I haven't seen that in the wild until now. That should be handled by LWP. Unfortunately, LWP is mostly designed to receive a single document from an HTTP request, not a (possible infinite) steam of documents wrapped in a `multipart/mixed` container. The `:content_cb` callback in your original code is a way to hook into LWP::UserAgent, but it looks a bit scary. LWP documents that the callback is called with "a chunk of data". That may be everything from a single byte to a large block containing several documents and headers. Your code treats that "chunk of data" first as exactly one document with HTTP and multipart headers (in `$data =~ /^--boundary/`), then as a full XML document (in `$twig->parse($data);`). This may "just work" because (a) the camera waits sufficiently long times between each part of the multipart container, (b) the XML document is very small, and (c) the `:read_size_hint` is sufficiently large to make `LWP::UserAgent` read the entire document, including headers, into one chunk. A less scary version would collect chunks somewhere (e.g. in a private attribute of the `$response` object) until at least one complete document with HTTP and multipart headers was collected. Then, it should extract headers and the raw document from there (e.g. into a new HTTP::Response object), and only then call `$twig->parse($document)`, perhaps as another callback. Maybe that could be made into a separate module that extends `LWP::UserAgent` to handle multipart document streams. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)	[reply] [d/l] [select]

In Section Seekers of Perl Wisdom