pc0019 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks

How would I download the first X bytes of a file?

e.g. download the first 512 bytes of perlmonks.org/awsum.site

Preferably using WWW::Mechanize, but if this isn't the best option please say.

Part of the code I'm using to download a (whole) file is:
my $mech = WWW::Mechanize->new;
$mech->get($url);
$mech->get(
    $mech->find_link( text => $file )->url,
    ':content_file' => $local_file,
);

Thanks!

PC

Replies are listed 'Best First'.
Re: Downloading first X bytes of a file
by CountZero (Bishop) on Jun 08, 2008 at 20:57 UTC
    LWP::UserAgent can do that for you, but I cannot fathom why you would load only the first 512 bytes of a website: you are almost guaranteed to get an incomplete (and therefore incorrect) HTML file. Perhaps you want only the headers? Or you wish to check whether the site responds? Again, $ua->head( $url ) can do that for you.

    Just for the sake of showing the power of LWP::UserAgent, here is a script that loads the first 512 bytes:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua  = LWP::UserAgent->new();
    my $url = "http://www.perlmonks.org";
    my $buffer;

    $ua->get(
        $url,
        ':content_cb'     => \&first_chunk,
        ':read_size_hint' => 512,
    );
    print $buffer;

    sub first_chunk {
        my ( $chunk, $response_ref, $protocol_ref ) = @_;
        $buffer .= $chunk;
        die() if length($buffer) >= 512;
    }

    Please note that the "size hint" is just a hint. The server may or may not follow it, so you may end up with more than you want.
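    A way around the hint-only behaviour (a sketch of mine, not from the thread): an HTTP Range header asks the server itself to send only the bytes you want, so there is nothing extra to throw away. Servers that honor it reply "206 Partial Content"; others ignore it and send everything, so the length still has to be checked on arrival. The snippet below only builds and prints the raw request, with www.perlmonks.org assumed as the host:

```perl
use strict;
use warnings;

# A "Range" request asks the server to send only bytes 0-511.
# With LWP::UserAgent the same header is passed as an extra pair:
#   my $res = $ua->get( $url, 'Range' => 'bytes=0-511' );
#   # check $res->code == 206 before trusting the body is partial
my $host    = 'www.perlmonks.org';
my $request = join "\r\n",
    "GET / HTTP/1.1",
    "Host: $host",
    "Range: bytes=0-511",
    "Connection: close",
    '', '';    # blank line ends the request headers
print $request;
```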

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      but why you would load only the first 512 bytes of a website I cannot fathom

      I can only guess pc0019's intentions, but if I had to write a script that fetches the title of an HTML page, I'd go in that direction: download the first chunk and check if the title is in it. If not, check whether </head> has already occurred. If it has, there is no title; if not, get the next chunk.
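      That loop can be sketched offline, with the network chunks faked in a list (the sample HTML is mine, just for illustration):

```perl
use strict;
use warnings;

# Fake "chunks" standing in for successive :content_cb deliveries.
my @chunks = (
    '<html><head><tit',
    'le>Perl Monks</title>',
    '</head><body>...</body></html>',
);

my $buffer = '';
my $title;
for my $chunk (@chunks) {
    $buffer .= $chunk;
    if ( $buffer =~ m{<title[^>]*>(.*?)</title>}is ) {
        $title = $1;
        last;                          # found it: stop fetching
    }
    last if $buffer =~ m{</head>}i;    # head is over: there is no title
}
print defined $title ? $title : '(no title)', "\n";
```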

        That might be the case, but you could have 100K of javascript and/or embedded style-sheets first and nothing guarantees that </head> is not found somewhere in there. But as your HTML file is not complete, parsing the data for the content of <head> ... </head> might become a very hazardous operation.

        CountZero


      Thanks everyone for the replies.

      CountZero that code is just what I was looking for!

      I should have explained myself a bit more. I'm learning Perl and haven't been using it much, so at the moment I'm just playing around, seeing what it can do.

      If I download the first part of an MP3 file, chances are it will contain the tag (well, so far anyway), which I can read to see what the file contains - if the name isn't self-explanatory. So, an entirely useless script, but I wanted to see if it could be done ;-)
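      For what it's worth, that works because an ID3v2 tag sits at the very start of the file and declares its own size in its first ten bytes. A sketch of checking a downloaded chunk for one (the helper and the fake chunk are mine, not part of MP3::Tag):

```perl
use strict;
use warnings;

# Hypothetical helper: report the declared size of an ID3v2 tag from
# the first bytes of an MP3 file, or undef if there is none. The size
# is stored in four "synchsafe" bytes of 7 significant bits each.
sub id3v2_tag_size {
    my ($buffer) = @_;
    return undef unless substr( $buffer, 0, 3 ) eq 'ID3';
    my @size = unpack 'x6 C4', $buffer;    # skip "ID3", version, flags
    my $bytes = 0;
    $bytes = ( $bytes << 7 ) | ( $_ & 0x7F ) for @size;
    return $bytes;
}

# Fake first chunk: an ID3v2.3 header declaring a 257-byte tag body.
my $chunk = "ID3\x03\x00\x00\x00\x00\x02\x01" . ( "\x00" x 257 );
print id3v2_tag_size($chunk), "\n";
```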

      Here's the final script if anyone's interested. I think it needs a lot of cleaning up >.<

      Does the -w in the shebang line mean the same thing as 'use warnings'? And what are things like '-w' called (e.g. the -f or -d that test for a file or directory)? Google doesn't like searching for one-letter things, which makes it nearly impossible for me to find out!


      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use MP3::Tag;
      use LWP::UserAgent;

      my $dir = "./";
      my $ua  = LWP::UserAgent->new();
      my $url = "";    ## <<< url to get files from
      my $buffer;
      my @files;

      my $response = $ua->get($url);
      if ( $response->is_success ) {
          my $con = $response->content;
          while ( $con =~ /<a href="(.*?)\.mp3">/g ) {
              push @files, "$1.mp3";
          }
      }

      sub first_chunk {
          my ( $chunk, $response_ref, $protocol_ref ) = @_;
          $buffer .= $chunk;
          die() if length($buffer) >= 10240;
      }

      foreach my $file (@files) {
          print "Reading $file...";
          $ua->get(
              $url . $file,
              ':content_cb'     => \&first_chunk,
              ':read_size_hint' => 10240,
          );

          ## Clean filename
          $file =~ tr/+/ /;
          $file =~ s/%([a-fA-F0-9]{2})/chr(hex($1))/eg;
          $file =~ s/<!--(.|\n)*-->//g;

          ## Save file
          open FILE, ">$file" or die $!;
          print FILE $buffer;
          close FILE;
          print " Done!\n";

          ## Clear buffer
          $buffer = "";

          ## Find tags
          my $filename = "$dir$file";
          my $mp3 = MP3::Tag->new($filename);
          my ( $song, $track, $artist, $album ) = $mp3->autoinfo();

          ## Print tag: Artist - Album (if any) - Song
          if ($album) {
              $album = " - $album - ";
          }
          elsif ($artist) {
              $album = " - ";
          }
          print "$artist$album$song\n";
      }


      edit: hmm I don't think I put my post in the right place...

Re: Downloading first X bytes of a file
by moritz (Cardinal) on Jun 08, 2008 at 20:30 UTC
    ISTR that WWW::Mechanize is a subclass of LWP::UserAgent, which has many hooks to achieve stuff like that.

    I don't know the specifics, but I guess you could provide :read_size_hint => 512, and install a callback that handles the first chunk.

    I don't know what to do to stop it from downloading the rest (throw an exception perhaps? add a Range header?), but that direction seems worth investigating.

Re: Downloading first X bytes of a file
by misc (Friar) on Jun 09, 2008 at 11:51 UTC
    Another way:

    using IO::Socket::INET.
    This has the advantage of being more flexible.

    I don't know what you are looking for in the first 512 bytes, but dependent on what you get you could parse the response and fetch more data if needed.

    #!/usr/bin/perl -w
    use strict;
    use IO::Socket::INET;

    my $sock = IO::Socket::INET->new(
        PeerAddr => 'www.google.de',
        PeerPort => 'http(80)',
        Proto    => 'tcp',
    );
    die if ( !$sock );

    print $sock "GET http://www.google.de/index.html\n";

    my $len  = 0;
    my $file = '';
    while ( defined($sock) && ( my $c = $sock->getc() ) && ( $len < 512 ) ) {
        print $c, "\n\n";
        $len += length $c;
        $file .= $c;
    }
    print $file;

    michael
      print $sock "GET http://www.google.de/index.html\n";

      Since you don't send a Host header, you should at least send the HTTP/1.0 version string. And don't you need two newlines at the end, and perhaps a few carriage returns as well?

      Also note that there is more to getting web pages than sending one line of HTTP header. Your example won't follow redirects, for one thing, doesn't have error handling etc. There's a reason we use modules to abstract that stuff away.

      Besides, IMHO it's not very friendly to request a full page (no range header present) and then only read a part of the reply.

        Yes, you are right about the HTTP protocol.

        It's just a quick hack to show how it would work;
        after I tested the script successfully, I didn't bother to look up the HTTP references...
        Lazy as I am, I debug my CGI scripts with telnet this way from time to time.

        Although I seem to remember that if you don't send the HTTP version, the webserver has to assume you are an HTTP/0.9 client.

        There's a reason I suggested going this way, however:
        it's possible to parse the data you get and request more if needed.

        It's also hard to say which way I would go without knowing about the purpose of reading the first 512 bytes of a page.

        Although I believe it's very often better to write your own module:
        you'll learn something, and sometimes existing modules simply don't do what you expect due to their complexity.
        It's also easier to customize your own modules.

        I believe closing the socket SHOULDN'T have any effect on the remote server, since it's always possible the connection breaks.

        But I agree the script is not very friendly; I also would do some further work before using it.

      All you need is the relative part of the path. You also generally need the HTTP version string on the end or the server will complain.

      print $sock "GET / HTTP/1.0\n";
        Although I still haven't taken a closer look at the HTTP specs...

        Seems to be interesting:
        telnet www.google.de 80
        GET http://www.google.de/ HTTP/1.0[RET]
        Will return with headers.

        while
        GET http://www.google.de/ [RET]
        doesn't return any HTTP headers;
        instead it just returns the file.

        I shouldn't hit stop while posting...