Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have been trying to find a way to download files off the internet given the URL, and have run into two odd (at least to me) Perl quirks along the way. I was hoping that a monk might be able to enlighten me as to what is going on here.

First, I tried to use the Socket module to connect to the server manually and initiate an HTTP session. I then issued a GET command and read the response through a file handle. From there I put the file as a string into a scalar variable, and finally printed the string to a file through a second file handle.

The problem here was that I found extra characters in the downloaded file. They seem to be hex 0D characters, i.e. ASCII carriage returns. After doing some troubleshooting and not getting very far, I ran some code as an experiment to determine whether the problem was something I was not aware of in HTTP, or whether it was in my use of the file handles. The code is below:

open (INDAT, "<pic.jpg") || die "Error - unable to open input file $!";
open (OUTDAT, ">copypic.jpg") || die "Error - unable to open output file $!";
while (<INDAT>) {
    print OUTDAT $_;
}
close(INDAT);
close(OUTDAT);

In this code, I basically open a JPEG file, read it through one file handle, and write it to a second, just as I had done in my downloading code. This seems to introduce extra characters just as my downloading code did. So, what is going on? Does it have anything to do with the non-text characters in the JPEG? I also tried slurping the file, but still had problems with extra characters in the output file. At this point, I am wondering if I really don't understand what is going on with reading and writing through file handles.

Second, after not having complete success with this approach, I went to the LWP package and the File::Fetch package. I had similar problems with LWP, and finally had success downloading with File::Fetch. The only problem with the File::Fetch solution is when I tried to retrieve a file from the root of a domain.

ex. http://www.somesite.com/

This should return the index.html page, but it does not: File::Fetch returns an error and will not get anything. To fool it, I added a space to the end of the string used to create the File::Fetch object. This actually works, but when I try to write out the file using the $ff->fetch method, I get a file in the desired directory whose name is just a dash and a number, and the code crashes when the File::Fetch object tries to rename it to a space (as it should). So I was wondering: is there a way to get the File::Fetch object to just get the default HTML file from the / directory of the domain without tricking it, or to have it write the file to a filename that I specify (and is not a space)? Or can I access the file as a string and write it to the HD myself? (This would also be useful so I could analyze the file without putting it on the HD at all.) I have already tried to specify a path that includes a filename, but this just creates a folder with the filename as its name and puts the file in it.
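A minimal sketch of that "file as a string" idea, using LWP::UserAgent rather than File::Fetch; the URL and output filename here are placeholders, not anything from the thread:

```perl
use strict;
use warnings;

# Hedged sketch: fetch a URL into a scalar, then write it out under a
# filename of our choosing. Uses LWP::UserAgent, assumed installed.
sub fetch_to_file {
    my ($url, $file) = @_;
    require LWP::UserAgent;
    my $res = LWP::UserAgent->new->get($url);
    die 'Download failed: ' . $res->status_line unless $res->is_success;

    my $content = $res->content;   # the body as a string, analyzable in memory
    open my $out, '>', $file or die "unable to open output file: $!";
    binmode $out;                  # safe even when the body is binary
    print {$out} $content;
    close $out or die "close failed: $!";
    return $content;
}
```

Called as, say, `fetch_to_file('http://www.somesite.com/', 'index.html')`, this sidesteps the dash-and-number temporary name entirely, since the filename comes from the caller.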

Thank you for any help that you can give.


Re: Fetching files (downloading) from the Internet (extra characters, file handles, file::fetch)
by mr_mischief (Monsignor) on Nov 24, 2008 at 22:36 UTC
    One of two things is happening here. One could be a very simple fix. I didn't read closely enough to tell which is your problem for certain, as you're going about this in a way which suggests you want to code and debug it yourself rather than using existing tools anyway.

    If you're on Windows or another OS with line-ending translation for text files, use binmode on non-text files. Writing binary files in text mode on OSes that discern between them is an easy and common mistake and it is easily fixed.
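The binmode fix can be sketched like this, using the same filenames as the code in the question (the chunked read is just one reasonable way to do it):

```perl
use strict;
use warnings;

# Hedged sketch: a binary-safe file copy. binmode() puts the handles in
# raw-byte mode so Windows does not translate \n <-> \r\n on the way through.
sub copy_binary {
    my ($src, $dst) = @_;
    open my $in,  '<', $src or die "unable to open input file: $!";
    open my $out, '>', $dst or die "unable to open output file: $!";
    binmode $in;
    binmode $out;
    local $/ = \65536;              # read fixed 64 KB chunks, not "lines"
    print {$out} $_ while <$in>;
    close $in;
    close $out or die "close failed: $!";
}

copy_binary('pic.jpg', 'copypic.jpg') if -e 'pic.jpg';
```

Without the two binmode calls, every 0A byte in the JPEG would come back out as 0D 0A on Windows, which matches the stray hex 0D characters described above.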

    Alternately, it could be your handling of the HTTP protocol. HTTP (like most other text-based application-level Internet protocols) specifies that the protocol elements themselves end in ASCII carriage-return/line-feed (CRLF) pairs. Some media types use CRLF line endings in the body as well, though that varies by media type. The authors of Perl modules that deal with these things, such as LWP, LWP::Simple, and WWW::Mechanize, know this and handle it in their code. See RFC 1945 section 2 paragraph 2. HTML is one media type that wants CR/LF.

    If you're going to roll your own solution for standardized protocols, you're going to have to do your own standards research besides your own coding and testing. If you still want to roll your own, that's great. If not, use what's provided. Either is a valid decision, but you should probably have a good reason for reinventing existing wheels. Just don't blame the tools because you didn't do the reading.

      Hey, thanks, I AM working on a Windows machine, and the binmode function fixed both my own implementation of an HTTP downloader and LWP. So it was a file handle problem. You are right about reinventing the wheel, but I wanted to spend some time doing that to learn. After about a day of doing it myself, I feel that I have a fair understanding of what is happening in HTTP, and I am moving on to LWP. I looked at WWW::Mechanize but could not find that exact module in the PPM window. Are there two different versions of that module (one for UNIX and one for Windows)?
        I don't think there are two different WWW::Mechanize packages for Unix and Windows. There may not be a PPM for it, though, for several reasons. Since LWP and LWP::Simple can do most of the same things, and there's also Win32::IE::Mechanize, which does the same things as WWW::Mechanize but using the IE engine, it may be a lower priority to put in the PPM repositories.

        ActivePerl may be able to load it through CPAN instead. It might be available in the newer repositories ActiveState just announced with more packages. Strawberry Perl may be able to use it from CPAN if ActiveState can't. There are passing test reports for WWW::Mechanize on Windows, so someone has it working on that platform in some fashion. Perhaps someone who does more work on the Windows platform could answer more authoritatively.