mirod has asked for the wisdom of the Perl Monks concerning the following question:
In my everlasting quest to improve XML::Twig I came across an interesting problem, in a field I am not familiar with anymore:
I want to be able to parse a document, given its URL. The problem is that the document could potentially be really big, so I don't want to load it into memory all at once, which seems to rule out LWP and the like.
So, grabbing stuff here and there, from the Camel to Lincoln Stein's excellent Network Programming with Perl to posts here on PerlMonks, I managed to get a prototype sort of running, for http only (see code below).
Now my question to the network programmers around is: does what I wrote make sense or is there a better way to do this using modules I haven't heard of?
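One candidate worth checking before ruling LWP out: LWP::UserAgent can pass the response body to a callback chunk by chunk instead of accumulating it in memory. The following is only a minimal sketch under that assumption, not something established in this thread; `stream_url` and the byte-counting handler are hypothetical names made up for illustration, and the 8192-byte read size is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: fetch $url and hand each chunk of the body to
# $handler as it arrives, so the whole document is never held in memory.
sub stream_url {
    my ($url, $handler) = @_;
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get(
        $url,
        ':content_cb'     => sub { my ($chunk) = @_; $handler->($chunk) },
        ':read_size_hint' => 8192,    # a hint only; chunks may differ in size
    );
    die "fetch failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response;
}

# Demo: count the bytes of each URL given on the command line.
foreach my $url (@ARGV) {
    my $bytes = 0;
    stream_url($url, sub { $bytes += length $_[0] });
    print "$url: received $bytes bytes\n";
}
```

In a parser setting, the handler would feed each chunk to an incremental parser rather than count bytes; whether that fits XML::Twig's interface is left open here.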
    #!/bin/perl -w
    # call with URLs to get
    use strict;

    $| = 1;    # unbuffer STDOUT so a forked child does not re-emit pending output

    foreach (@ARGV) { parseurl($_); }

    sub parseurl {
        my ($url) = @_;
        pipe(README, WRITEME) or die "cannot create connected pipes: $!";
        my $pid = fork;
        die "cannot fork: $!" unless defined $pid;
        if ($pid) {
            # parent code: parse the incoming file
            close WRITEME;    # no need to write
            while (<README>) { print "parent: $_"; }
            close README;
            waitpid($pid, 0);    # reap the child
        }
        else {
            # child code: fetch the document and feed it to the pipe
            close README;           # no need to read
            require IO::Socket;     # we'll use a simple socket
            require URI;            # to parse the url
            my $uri = URI->new($url);
            print "url: ",       $url,            "\n";
            print "scheme: ",    $uri->scheme,    "\n";
            print "host: ",      $uri->host,      "\n";
            print "path: ",      $uri->path,      "\n";
            print "port: ",      $uri->port,      "\n";
            print "authority: ", $uri->authority, "\n";
            print "canonical: ", $uri->canonical, "\n";
            my $address = $uri->host . ':' . $uri->port;
            my $remote  = IO::Socket::INET->new($address)
                or die "Couldn't connect: $@";
            if ($uri->scheme eq 'http') {
                # HTTP wants CRLF line endings; an empty path is sent as "/"
                my $path = $uri->path_query || '/';
                print $remote "GET $path HTTP/1.0\r\n";
                print $remote "Host: " . $uri->host . "\r\n\r\n";
                # skip until the end of the header
                while (<$remote>) {
                    print "header (discarded): $_";
                    last if m/^\s*$/;
                }
            }
            else {
                die "protocol ", $uri->scheme, " not supported";
            }
            while (<$remote>) { print WRITEME $_; }
            close WRITEME;
            exit 0;    # the child must not fall back into the foreach loop
        }
    }
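The header-skipping step in the prototype can be tried without a network connection. Here is a minimal sketch that reads a canned response from an in-memory filehandle; `copy_body` is a hypothetical helper name, and the canned response is made up for the demo rather than taken from a real server.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read an HTTP response from $in, discard the status
# line and headers, and copy the body to $out one line at a time.
sub copy_body {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        last if $line =~ /^\r?\n?$/;    # a blank line ends the header block
    }
    my $lines = 0;
    while (my $line = <$in>) {
        print {$out} $line;
        $lines++;
    }
    return $lines;
}

# Canned response standing in for the socket (an assumption for the demo).
my $canned = "HTTP/1.0 200 OK\r\nContent-Type: text/xml\r\n\r\n"
           . "<doc>\n<p>hi</p>\n</doc>\n";
open my $in, '<', \$canned or die "cannot open in-memory handle: $!";
my $body = '';
open my $out, '>', \$body or die "cannot open in-memory handle: $!";

my $count = copy_body($in, $out);
close $out;
print "copied $count body lines\n";
```

Only one line is held at a time, which is the property the pipe-based prototype relies on; a document with very long lines would need fixed-size reads instead.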
Re: loading a file from a URL with as little overhead as possible
by tachyon (Chancellor) on Aug 15, 2001 at 17:53 UTC

Re: loading a file from a URL with as little overhead as possible
by shotgunefx (Parson) on Aug 15, 2001 at 17:12 UTC

Re: loading a file from a URL with as little overhead as possible
by mirod (Canon) on Aug 15, 2001 at 21:31 UTC