Scott203 has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a script to download a file from the web to my local server. The problem is that the remote server is incorrectly reporting the content-length of the file as about half of what it really is. LWP will only download the portion of the file specified by the content-length. BUT, that is not the only problem. The remote file has a ".gz" suffix even though it is a text database file. (I assume) this is what causes the (partial) file to be corrupt when it gets to my server.

So far I have tried some of LWP's functions and they are giving me the above problems.

The remote file is at:

http://cpaweb01.planetarion.com/botfiles/galaxy_listing.gz

Thanks in advance, I hope someone can solve this...

Re: Retrieving a file on the web (NOT so easy)
by tadman (Prior) on Jul 11, 2001 at 08:16 UTC
    I hate to break it to you, but that file is compressed with gzip. Netscape, IE, and even Lynx, it seems, will ungzip it for you automatically:
    % lynx --source 'http://cpaweb01.planetarion.com/botfiles/galaxy_listing.gz' | less
    So it looks like a text file, but it is in fact compressed, hence the .gz extension, which is usually a dead giveaway. This means LWP::Simple is going to cause trouble, because it hands you the raw gzip data. What I did was:
    use LWP::Simple;
    $content = get("http://cpaweb01.planetarion.com/botfiles/galaxy_listing.gz");
    print $content;
    You can do something like:
    % perl test_get | gunzip -dc | less
    A better solution would be to decompress inside Perl with a gzip/zlib library module, or to run gunzip externally through backticks or a piped open.
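    A minimal sketch of the in-Perl route, assuming the core module IO::Uncompress::Gunzip is available (it is not named in this thread). The helper name maybe_gunzip is hypothetical. It checks for the gzip magic bytes first, since some LWP setups may already hand back decoded content, and passes anything that is not gzip through unchanged.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper (not from the thread): returns plain text whether or
# not the input is still gzip-compressed.
sub maybe_gunzip {
    my ($data) = @_;

    # gzip streams start with the two magic bytes 0x1f 0x8b; anything else
    # is returned untouched.
    return $data
        unless defined $data && substr($data, 0, 2) eq "\x1f\x8b";

    # Inflate from one in-memory scalar into another.
    my $plain;
    gunzip(\$data => \$plain)
        or die "gunzip failed: $GunzipError\n";
    return $plain;
}

# Possible usage with the script above (left commented out because it
# needs network access):
#   use LWP::Simple;
#   my $content = get("http://cpaweb01.planetarion.com/botfiles/galaxy_listing.gz");
#   print maybe_gunzip($content);
```

    This would replace the external "| gunzip" step in the pipeline above, at the cost of holding both the compressed and the inflated data in memory.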
Re: Retrieving a file on the web (NOT so easy)
by Abigail (Deacon) on Jul 11, 2001 at 11:06 UTC
    $ telnet cpaweb01.planetarion.com 80
    Trying 193.150.193.131...
    Connected to cpaweb01.planetarion.com.
    Escape character is '^]'.
    HEAD /botfiles/galaxy_listing.gz HTTP/1.1
    Host: cpaweb01.planetarion.com
    
    HTTP/1.1 200 OK
    Date: Wed, 11 Jul 2001 06:59:13 GMT
    Server: Apache/1.3.19 (Unix) mod_fastcgi/2.2.9-SNAP-Sep19-13.50
    Last-Modified: Wed, 11 Jul 2001 06:50:15 GMT
    ETag: "1aa66f-2a7a1-3b4bf727"
    Accept-Ranges: bytes
    Content-Length: 173985
    Connection: close
    Content-Type: text/plain
    Content-Encoding: x-gzip
    
    Connection closed by foreign host.
    $
    

    -- Abigail

Re: Retrieving a file on the web (NOT so easy)
by LD2 (Curate) on Jul 11, 2001 at 08:27 UTC
Re: Retrieving a file on the web (NOT so easy)
by voyager (Friar) on Jul 11, 2001 at 08:20 UTC
    When I point my browser (MS/IE5.5) to the URL, I see it fine; the first few lines are:
    Content: Planetarion galaxy listing (sorted by x, y)
    Author: vish@planetarion.com
    Version: 1.00 (2000-11-20 17:35:00 ISO)
    Date: 2001-07-11 05:07:01 ISO
    Separator: ' '
    Format: X Y GalaxyName Score
    1 1 [BLADE]NoPACrew[Xeno][Legion] 70995861
    1 2 -=The [CRUSADE]rs=- [TPB][AC] 62489813
    1 3 {gods unwanted children} 10634982
    1 4 Realm of Exile {Aesir} {BSun} 42432856
    1 5 [p5a] [wl] [VA]- Wolves Lair 77182991
    1 6 [Trash] [6thR] Fight Club 53804931
    When I run the following:
    #! /perl/bin/perl -w
    use strict;
    BEGIN {
        use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
    }
    use LWP::Simple;
    use CGI;
    $| = 1;
    my $q = CGI->new;
    my $URL = "http://cpaweb01.planetarion.com/botfiles/galaxy_listing.gz";
    my $html = get( $URL );
    print $q->header, $q->start_html;
    print $q->h1('Test');
    print $html;
    print $q->end_html;
    I see (first few lines):
    Test < ÎK; tý]?#Y·&?]GÿS]gFoWÕ>
    So my guess (.gz, after all) is that it's a compressed file, which IE can unzip properly but it's just junk to Perl. Perhaps some other monks can help out with a Perl module to deal with compressed files.

    Update Masem 2001-07-11 : Changed PRE to CODE

Re: Retrieving a file on the web (NOT so easy)
by bikeNomad (Priest) on Jul 11, 2001 at 08:58 UTC
    An easy way to deal with this is to use Compress::Zlib, and either its memGunzip() or its gzopen()/gzread() etc. functions.
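    A rough sketch of both Compress::Zlib routes mentioned above, not a definitive implementation. The helper names gunzip_in_memory and gunzip_file are hypothetical, and the 4096-byte chunk size is an arbitrary choice. memGunzip() suits data already held in a scalar (for example, what get() returned); gzopen()/gzread() suits a .gz file already saved to disk.

```perl
use strict;
use warnings;
use Compress::Zlib;    # exports memGunzip, gzopen and $gzerrno by default

# Inflate a gzip-format buffer that is already in memory.
sub gunzip_in_memory {
    my ($gz_bytes) = @_;    # work on a copy of the caller's data
    my $text = memGunzip($gz_bytes);
    die "memGunzip failed: $gzerrno\n" unless defined $text;
    return $text;
}

# Inflate a .gz file from disk, reading it in chunks with gzread().
sub gunzip_file {
    my ($path) = @_;
    my $gz = gzopen($path, 'rb')
        or die "gzopen($path) failed: $gzerrno\n";

    my $out = '';
    while (1) {
        my $chunk;
        my $n = $gz->gzread($chunk, 4096);    # bytes read, 0 at EOF, -1 on error
        die "gzread failed: $gzerrno\n" if $n < 0;
        last if $n == 0;
        $out .= $chunk;
    }
    $gz->gzclose();
    return $out;
}
```

    The in-memory form is the shorter of the two; the file form avoids holding the compressed data in a scalar and could be changed to process each chunk as it arrives instead of accumulating the whole text.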