no_germs has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script that looks for a specific string in the content at a URL. Currently I'm using LWP::Simple's get to fetch the entire content and then searching it with a regex. The content is very long, but the string I'm looking for is always near the start, so is there a way to get just part of the content of the URL? Thanks, noam

Replies are listed 'Best First'.
Re: How to read just part of a url's content (byte ranges)
by tye (Sage) on May 29, 2007 at 15:41 UTC

    Add a "Range: bytes=0-512" header to your request if you just want some of the bytes of the response (this works when the server supports byte ranges, which it advertises with an "Accept-Ranges: bytes" header in its responses). Google can tell you more.
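
    A minimal sketch of such a byte-range request with LWP::UserAgent. In HTTP the request header is Range (Content-Range is what the server sends back). The helper names range_value and fetch_prefix, the 512-byte size, and the command-line usage are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Build the value of a Range request header for $length bytes starting at
# $offset. Byte positions are inclusive, so 512 bytes from 0 is "0-511".
sub range_value {
    my ( $offset, $length ) = @_;
    return sprintf 'bytes=%d-%d', $offset, $offset + $length - 1;
}

# Hypothetical helper: fetch at most the first $length bytes of $url.
sub fetch_prefix {
    my ( $url, $length ) = @_;
    require LWP::UserAgent;    # loaded on demand; only needed for real requests
    my $ua       = LWP::UserAgent->new( timeout => 10 );
    my $response = $ua->get( $url, Range => range_value( 0, $length ) );
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;

    # A 206 status means the server honored the range. A plain 200 means it
    # ignored the header and sent everything, so trim locally either way.
    my $content = $response->content;
    return substr $content, 0, $length;
}

# Usage: perl prefix.pl <url> <string>
if ( @ARGV == 2 ) {
    my ( $url, $needle ) = @ARGV;
    my $head = fetch_prefix( $url, 512 );
    print index( $head, $needle ) >= 0 ? "found\n" : "not found\n";
}
```

    Servers are free to ignore the header, so the sketch treats the range as a hint and still truncates the body itself.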

    Update: Thanks to jettero for finding the right header. I've done this in the past but my quick look for the specifics ran into the wrong header at first.

    - tye        

Re: How to read just part of a url's content
by kyle (Abbot) on May 29, 2007 at 15:30 UTC

    If you use LWP::UserAgent, it looks like you can pass the :content_cb option to get and deal with the response a little at a time. If your callback calls die, the request is aborted, so you can quit reading once you've found what you want.

    Update with code and output:

    use LWP::UserAgent;

    sub little_bit {
        my ( $content, $response, $protocol ) = @_;
        printf "chunk length %d\n", length $content;
        if ( $content =~ /the/i ) {
            print "chunk with the 'the': $content\n";
            die;
        }
    }

    my $ua = LWP::UserAgent->new();
    my $response = $ua->get(
        'http://perlmonks.org/',
        ':content_cb'     => \&little_bit,
        ':read_size_hint' => 100
    );

    __END__
    chunk length 100
    chunk length 100
    chunk length 100
    chunk with the 'the': The Monastery Gates </title> <link rel="stylesheet" href="/css/common.css" type="text/

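
    One caveat with matching each chunk separately is that a string split across two chunks would be missed. A possible variation, sketched here, accumulates the chunks in a buffer and matches against the buffer instead. The name make_searcher and the %state hash are assumptions for illustration; the ':content_cb' and ':read_size_hint' options are the ones used above.

```perl
use strict;
use warnings;

# Returns a callback suitable for ':content_cb'. It appends every chunk to
# $state->{buffer} and dies once $pattern matches, which aborts the transfer.
sub make_searcher {
    my ( $pattern, $state ) = @_;
    $state->{buffer} = '';
    $state->{found}  = 0;
    return sub {
        my ($chunk) = @_;
        $state->{buffer} .= $chunk;
        if ( $state->{buffer} =~ $pattern ) {
            $state->{found} = 1;
            die "found\n";    # LWP catches this and stops reading
        }
        return;
    };
}

# Usage: perl search.pl <url>
if (@ARGV) {
    require LWP::UserAgent;    # loaded on demand; only needed for real requests
    my %state;
    my $ua = LWP::UserAgent->new( timeout => 10 );
    $ua->get(
        $ARGV[0],
        ':content_cb'     => make_searcher( qr/the/i, \%state ),
        ':read_size_hint' => 100,
    );
    printf "%s after %d bytes\n",
        ( $state{found} ? 'found' : 'not found' ), length $state{buffer};
}
```

    The buffer grows until a match is found, which is acceptable when the string is expected near the start. For a string that might never appear, capping the buffer size would be a sensible addition.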
Re: How to read just part of a url's content
by jettero (Monsignor) on May 29, 2007 at 15:17 UTC
    You mean just get part of the file located with the URL?

    There is almost certainly a way to do it, since wget (et al) can be instructed to continue.

    Personally, I can't find a single relevant option in LWP, LWP::UserAgent, HTTP::Request, HTTP::Headers, and more.

    I'm very curious to see how you would get a selected portion of a file with LWP. Someone will know.

    UPDATE #1: Further investigation has revealed that you can set a "range" header with
    $request_object->header( $field => $value ); but I haven't yet worked out the particulars of the header.
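
    As for the particulars: a server that honors a range request answers with status 206 and a response header of the form "Content-Range: bytes first-last/total". A rough sketch of setting the header on a request object and reading that reply follows. The helper parse_content_range and the 0-511 range are assumptions for illustration.

```perl
use strict;
use warnings;

# Parse a Content-Range response header such as "bytes 0-511/12345".
# Returns (first, last, total); total is undef when the server sends "*".
# Returns an empty list if the value is missing or not in that form.
sub parse_content_range {
    my ($value) = @_;
    return unless defined $value
        && $value =~ m{^bytes\s+(\d+)-(\d+)/(\d+|\*)$};
    my ( $first, $last, $total ) = ( $1, $2, $3 );
    return ( $first, $last, $total eq '*' ? undef : $total );
}

# Usage: perl range.pl <url>
if (@ARGV) {
    require LWP::UserAgent;    # loaded on demand; only needed for real requests
    require HTTP::Request;
    my $request = HTTP::Request->new( GET => $ARGV[0] );
    $request->header( Range => 'bytes=0-511' );
    my $response = LWP::UserAgent->new( timeout => 10 )->request($request);

    if ( $response->code == 206 ) {
        my ( $first, $last, $total ) =
            parse_content_range( scalar $response->header('Content-Range') );
        printf "got bytes %d-%d of %s\n", $first, $last,
            defined $total ? $total : 'unknown'
            if defined $first;
    }
    else {
        # Anything else means the range was not applied (or the request failed).
        print 'no partial content: ', $response->status_line, "\n";
    }
}
```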

    -Paul

      You mean just get part of the file located with the URL?

      No, "file" is a less accurate word. The content identified by a URI is not necessarily a file.

        That's a semantic argument. Is a file a space on a hard drive? Is it a stream of bytes? A sector on a tape? There's an application where I work that describes a file as 100 variably sized blocks. My point was that the URL just describes the location of something. You already have it.

        -Paul

Re: How to read just part of a url's content
by naikonta (Curate) on May 29, 2007 at 15:18 UTC
    looking for a specific string in a url
    use URI;
    get just part of the content of the url
    use HTML::TokeParser;

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!