in reply to Downloading first X bytes of a file

LWP::UserAgent can do that for you, but why you would want to load only the first 512 bytes of a website I cannot fathom: you are almost guaranteed to get an incomplete (and therefore invalid) HTML file. Perhaps you want only the headers? Or you wish to check whether the site responds? In either case, $ua->head( $url ) can do that for you.
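As a rough illustration of what a HEAD response gives you, here is a small sketch. The helper name describe_response is made up for this example, and the response object is built by hand with HTTP::Response so that it runs without a network connection; with a live server you would pass in the result of $ua->head( $url ) instead.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper (not part of LWP): summarise the status line and
# two common headers of any HTTP::Response object.
sub describe_response {
    my ($response) = @_;
    return sprintf '%s, type %s, length %s',
        $response->status_line,
        $response->header('Content-Type')   // 'unknown',
        $response->header('Content-Length') // 'unknown';
}

# Hand-built stand-in for: my $response = LWP::UserAgent->new->head($url);
my $fake = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html', 'Content-Length' => 1234 ],
);

print describe_response($fake), "\n";
```

A HEAD response carries no body, so this only costs the headers. Not every server reports Content-Length, hence the 'unknown' fallback.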

Just for the sake of showing the power of LWP::UserAgent, here is a script that loads the first 512 bytes:

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new();
my $url = "http://www.perlmonks.org";
my $buffer;

$ua->get($url, ':content_cb' => \&first_chunk, ':read_size_hint' => 512);
print $buffer;

sub first_chunk {
    my ($chunk, $response_ref, $protocol_ref) = @_;
    $buffer .= $chunk;
    die() if length($buffer) >= 512;
}

Please note that the "size hint" is just a hint: the chunks handed to the callback may be larger or smaller than the size you asked for, so you may end up with more than you want.
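If you need exactly 512 bytes and no more, one option is to trim the overshoot inside the callback. The following is only a sketch: make_collector is a name invented for this example, and the chunks are simulated in a loop so that it runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a callback that collects at most $limit
# bytes, plus a reference to the buffer it fills.
sub make_collector {
    my ($limit) = @_;
    my $buffer = '';
    my $callback = sub {
        my ($chunk) = @_;
        $buffer .= $chunk;
        if (length($buffer) >= $limit) {
            $buffer = substr($buffer, 0, $limit);   # drop any overshoot
            die "limit reached\n";                  # dying aborts the download
        }
        return;
    };
    return ($callback, \$buffer);
}

# Simulate three 300-byte chunks arriving one after the other.
my ($cb, $buf_ref) = make_collector(512);
for my $chunk ('a' x 300, 'b' x 300, 'c' x 300) {
    last unless eval { $cb->($chunk); 1 };
}
print length($$buf_ref), "\n";
```

In a real script you would pass $cb as the ':content_cb' argument to $ua->get and read the result from $$buf_ref afterwards.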

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: Downloading first X bytes of a file
by moritz (Cardinal) on Jun 08, 2008 at 21:12 UTC
    but why you would load only the first 512 bytes of a website I cannot fathom

    I can only guess pc0019's intentions, but if I had to write a script that fetches the title of an HTML page, I'd go in that direction. Download the first chunk and check if the title is in it. If not, check if </head> has already occurred. If yes, there is no title. If not, get the next chunk.
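The chunk-by-chunk check described above might look roughly like this. It is a sketch, not a robust HTML parser: make_title_scanner and its 'found' / 'none' / 'more' return values are invented for this example, and the chunks are hard-coded so it runs without a network connection.

```perl
use strict;
use warnings;

# Hypothetical helper: feed the returned closure one chunk at a time.
# It returns ('found', $title) once a complete <title> has been seen,
# ('none') once </head> has been seen without a title,
# and ('more') if another chunk is needed.
sub make_title_scanner {
    my $seen = '';
    return sub {
        my ($chunk) = @_;
        $seen .= $chunk;
        if ($seen =~ m{<title[^>]*>(.*?)</title>}is) {
            (my $title = $1) =~ s/\s+/ /g;   # collapse whitespace runs
            $title =~ s/^ | $//g;            # trim both ends
            return ('found', $title);
        }
        return ('none') if $seen =~ m{</head>}i;
        return ('more');
    };
}

# The title tag is deliberately split across two chunks.
my $scan   = make_title_scanner();
my @chunks = ("<html><head><meta charset='utf-8'><ti", "tle>Perl\n Monks</title></head>");
my @result;
for my $chunk (@chunks) {
    @result = $scan->($chunk);
    last if $result[0] ne 'more';
}
print "@result\n";
```

With LWP you would call the scanner from a ':content_cb' callback and die() as soon as the answer is no longer 'more'. Adding a cap on the total number of bytes read would be a sensible extra safeguard.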

      That might be the case, but you could have 100K of javascript and/or embedded style-sheets first and nothing guarantees that </head> is not found somewhere in there. But as your HTML file is not complete, parsing the data for the content of <head> ... </head> might become a very hazardous operation.

      CountZero


        Well, it all depends on your application. If my script were actually an IRC bot, I'd go with a simple regex-based approach as described above. In that application it's important not to download 100K at all (risk of DoS-attacks), even if the header is that long.

        I know it's evil to parse HTML with regexes, but sometimes it's simple and convenient, especially if you are happy with a solution that works in 95% to 99% of all cases. Note that all markup, even comments, is disallowed in <title>...</title> tags, which simplifies the matter.

        Of course things are different for more serious matters - if you want an application that extracts the title of all valid HTML pages (and most invalid ones as well) with an accuracy matching that of the W3C markup validator, you'll have to download it all.

Re^2: Downloading first X bytes of a file
by pc0019 (Acolyte) on Jun 19, 2008 at 15:06 UTC
    Thanks everyone for the replies.

    CountZero that code is just what I was looking for!

    I should have explained myself a bit more. I'm learning Perl and haven't been using it much, so at the moment I'm just playing around, seeing what it can do.

    If I download the first part of an mp3 file chances are it will have the tag in (well, so far anyway) which I can read and see what the file contains - if the name isn't self-explanatory. So, an entirely useless script but I wanted to see if it could be done ;-)
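For what it's worth, whether the first part of the file holds the whole tag can be checked from its first ten bytes, assuming the file carries an ID3v2 tag at the start (an ID3v1 tag sits at the end of the file instead). The sketch below uses an invented helper name, id3v2_tag_length, and a hand-built header rather than a real download; it also ignores the optional 10-byte footer some ID3v2.4 tags have.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the total ID3v2 tag length in bytes
# (10-byte header included), or 0 if the buffer does not start with
# an ID3v2 header.
sub id3v2_tag_length {
    my ($buffer) = @_;
    return 0 if length($buffer) < 10 || substr($buffer, 0, 3) ne 'ID3';

    # Bytes 6..9 hold the tag size as a "syncsafe" integer:
    # four bytes with only the low 7 bits of each byte used.
    my @size = unpack 'C4', substr($buffer, 6, 4);
    my $body = 0;
    $body = ($body << 7) | ($_ & 0x7F) for @size;
    return 10 + $body;
}

# Hand-built header: "ID3", version 3.0, no flags, size bytes 0 0 2 1.
my $header = 'ID3' . pack('C7', 3, 0, 0, 0, 0, 0x02, 0x01);
print id3v2_tag_length($header), "\n";
```

If the returned length is larger than what has been downloaded so far, the tag is not complete yet and more data is needed.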

    Here's the final script if anyone's interested. I think it needs a lot of cleaning up >.<

    Does the -w in the shebang line and the 'use warnings' mean the same thing? And what are things like the '-w' called (eg the -f or -d to specify file or directory)? Google doesn't like searching for one-letter things which makes it nearly impossible for me to find out!


    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use MP3::Tag;
    use LWP::UserAgent;

    my $dir = "./";
    my $ua  = LWP::UserAgent->new();
    my $url = "";   ## <<< url to get files from
    my $buffer = "";
    my @files;

    ## Collect the .mp3 links from the index page
    my $response = $ua->get($url);
    if ($response->is_success) {
        my $con = $response->content;
        while ($con =~ /<a href=\"(.*?)\.mp3\">/g) {
            push @files, "$1.mp3";
        }
    }

    foreach my $file (@files) {
        print "Reading $file...";
        $ua->get($url.$file, ':content_cb' => \&first_chunk, ':read_size_hint' => 10240);

        ## Clean filename
        $file =~ tr/+/ /;
        $file =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
        $file =~ s/<!--(.|\n)*-->//g;

        ## Save file (binary data, so use binmode)
        my $filename = "$dir$file";
        open my $fh, '>', $filename or die $!;
        binmode $fh;
        print $fh $buffer;
        close $fh;
        print " Done!\n";

        ## Clear buffer
        $buffer = "";

        ## Find tags
        my $mp3 = MP3::Tag->new($filename);
        my ($song, $track, $artist, $album) = $mp3->autoinfo();

        ## Print tag: Artist - Album (if any) - Song
        unless ($album) {
            if ($artist) { $album = " - "; }
        }
        else {
            $album = " - $album - ";
        }
        print "$artist$album$song\n";
    }

    ## Callback: collect data and abort the download after 10240 bytes
    sub first_chunk {
        my ($chunk, $response_ref, $protocol_ref) = @_;
        $buffer .= $chunk;
        die() if length($buffer) >= 10240;
    }


    edit: hmm I don't think I put my post in the right place...