mirod has asked for the wisdom of the Perl Monks concerning the following question:

In my everlasting quest to improve XML::Twig I came across an interesting problem, in a field I am not familiar with anymore:

I want to be able to parse a document, given its URL. The problem is that the document could potentially be really big, so I don't want to load it into memory all at once, which seems to rule out LWP and the like.

So, grabbing bits here and there, from the Camel to Lincoln Stein's excellent Network Programming with Perl to posts here on PerlMonks, I managed to get a prototype sort of running, for http only (see code below).

Now my question to the network programmers around is: does what I wrote make sense or is there a better way to do this using modules I haven't heard of?

#!/bin/perl -w
# call with URLs to get

use strict;

foreach( @ARGV) { parseurl( $_); }

sub parseurl {
    my( $url)= @_;

    pipe( README, WRITEME) or die "cannot create connected pipes: $!";

    my $pid= fork;
    defined $pid or die "cannot fork: $!";

    if( $pid) {
        # parent code: parse the incoming file
        close WRITEME; # no need to write
        while( <README>) { print "parent: $_"; }
        close README;
        waitpid( $pid, 0); # reap the child
    }
    else {
        # child
        close README; # no need to read

        require IO::Socket::INET; # we'll use a simple socket
        require URI;              # to parse the url

        my $uri= URI->new( $url);
        print "url: ", $url, "\n";
        print "scheme: ", $uri->scheme, "\n";
        print "host: ", $uri->host, "\n";
        print "path: ", $uri->path, "\n";
        print "port: ", $uri->port, "\n";
        print "authority: ", $uri->authority, "\n";
        print "canonical: ", $uri->canonical, "\n";

        my $address = $uri->host;
        my $port = $uri->port;
        $address.= ":$port" unless( $address=~ /:$port$/);

        my $remote = IO::Socket::INET->new( $address)
            or die "Couldn't connect: $@";

        if( $uri->scheme eq 'http') {
            # an empty path must be sent as "/", and HTTP lines end in CRLF
            my $path= $uri->path_query || '/';
            print $remote "GET $path HTTP/1.0\r\n";
            print $remote "Host: " . $uri->host . "\r\n\r\n";
            $|=1;
            # skip until the end of the header
            while( <$remote>) {
                print "header (discarded): $_";
                last if m/^\s*$/;
            }
        }
        else {
            die "protocol ", $uri->scheme, " not supported";
        }

        while( <$remote>) { print WRITEME $_; }
        close WRITEME;
        exit; # the child must not go on to process the remaining URLs
    }
}

Re: loading a file from a URL with as little overhead as possible
by tachyon (Chancellor) on Aug 15, 2001 at 17:53 UTC

    Just write it directly to a file using LWP::UserAgent and then process the file line by line. According to the docs this has a small memory footprint.

    use LWP::UserAgent;

    my $ua       = LWP::UserAgent->new;
    my $file     = '/tmp/temp.txt';
    my $url      = 'http://foo.com/index.htm';
    my $request  = HTTP::Request->new( 'GET', $url );
    my $response = $ua->request( $request, $file );
    open FILE, $file or die $!;
    while (<FILE>) {
    .....

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: loading a file from a URL with as little overhead as possible
by shotgunefx (Parson) on Aug 15, 2001 at 17:12 UTC
    You can have LWP::UserAgent read a response into a file instead of memory. You can also set it up to use a user-defined sub to process the response, but it seems it would be tricky to parse it on the fly (across chunk boundaries).

    From LWP docs:
    The request() method can process the content of the response in one of three ways: in core, into a file, or into repeated calls of a subroutine. You choose which one by the kind of value passed as the second argument to request(). The in core variant simply returns the content in a scalar attribute called content() of the response object, and is suitable for small HTML replies that might need further parsing. This variant is used if the second argument is missing (or is undef).

    The filename variant requires a scalar containing a filename as the second argument to request(), and is suitable for large WWW objects which need to be written directly to the file, without requiring large amounts of memory. In this case the response object returned from request() will have empty content(). If the request fails, then the content() might not be empty, and the file will be untouched.

    The subroutine variant requires a reference to callback routine as the second argument to request() and it can also take an optional chunk size as third argument. This variant can be used to construct "pipe-lined" processing, where processing of received chunks can begin before the complete data has arrived. The callback function is called with 3 arguments: the data received this time, a reference to the response object and a reference to the protocol object. The response object returned from request() will have empty content(). If the request fails, then the callback routine will not have been called, and the response->content() might not be empty.
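
One way to cope with chunk boundaries in the subroutine variant is to buffer the trailing partial line between calls and only hand complete lines to the parser. The sketch below is only an illustration of that idea: make_line_feeder is a hypothetical helper, not part of LWP, and the chunks are simulated so it runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper (not an LWP function): returns two closures.
# $feed buffers incoming chunks and passes only complete lines to $on_line;
# $flush hands over whatever is left once the transfer is finished.
sub make_line_feeder {
    my ($on_line) = @_;
    my $pending = '';    # partial line carried over from the previous chunk

    my $feed = sub {
        my ($data) = @_; # LWP would also pass the response and protocol objects
        $pending .= $data;
        # Emit every complete line; text after the last newline stays buffered.
        while ( $pending =~ s/\A([^\n]*\n)// ) {
            $on_line->($1);
        }
    };

    my $flush = sub {
        $on_line->($pending) if length $pending;
        $pending = '';
    };

    return ( $feed, $flush );
}

# Simulated chunks standing in for the data the callback would receive.
my @lines;
my ( $feed, $flush ) = make_line_feeder( sub { push @lines, $_[0] } );
$feed->($_) for "<doc>\n<el", "t>text</elt>\n</d", "oc>";
$flush->();
print scalar(@lines), " lines\n";
```

With LWP, $feed would presumably be passed as the second argument to request(), and $flush called once request() has returned. Note that this only reassembles lines; markup that spans several lines still needs a parser that accepts input incrementally.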


    -Lee

    "To be civilized is to deny one's nature."
Re: loading a file from a URL with as little overhead as possible
by mirod (Canon) on Aug 15, 2001 at 21:31 UTC

    Thanks shotgunefx and tachyon, see below the result of using LWP::UserAgent. It's much cleaner. And it works under Windows too!

    Anything else I could do to improve it?

    #!/bin/perl -w
    use strict;

    my $BUFSIZE= 32768; # default buffer size for XML::Parser::Expat

    foreach( @ARGV) { parseurl( $_); }

    sub parseurl {
        my( $url)= @_;

        pipe( README, WRITEME) or die "cannot create connected pipes: $!";

        my $pid= fork;
        defined $pid or die "cannot fork: $!";

        if( $pid) {
            # parent code: parse the incoming file
            close WRITEME; # no need to write
            while( <README>) { print; }
            close README;
            waitpid( $pid, 0); # reap the child
        }
        else {
            # child
            close README; # no need to read
            require LWP::UserAgent;
            $|=1;
            my $agent   = LWP::UserAgent->new;
            my $request = HTTP::Request->new( GET => $url);

            # the callback is called once per chunk of roughly $BUFSIZE bytes
            my $response = $agent->request( $request,
                               sub { pass_url_content( \*WRITEME, @_); },
                               $BUFSIZE);
            $response->is_success or die "$url ", $response->message, "\n";
            close WRITEME;
            exit; # the child must not go on to process the remaining URLs
        }
    }

    sub pass_url_content {
        my( $fh, $data, $response, $protocol)= @_;
        print $fh $data;
    }