rkellerjr has asked for the wisdom of the Perl Monks concerning the following question:

I convert data for a living and have not dealt with browsers or the internet directly with Perl. However, recently a client asked us to download their data directly from their secure website. This was something new (and exciting!) that I had not done, so I went off, did some research, and wrote a program that has worked fairly well. Recently I ran the program to download the data and received a certificate error, which I had never seen before. OK, so I researched that, added code, and now it bypasses that. However, the other challenge I have not been able to resolve is this: the program no longer downloads the entire page of data. It gets maybe 90%-95% of the page, then stops and moves on to the next page of data. The only difference I can think of is that I upgraded from ActiveState 5.10 to 5.16; I wouldn't think that would make a difference, but it might. If I use the URL directly in my browser (any page of data), the entire page of data downloads just fine, so... I'm not sure what you guys might need to help out, but I need to be conscious of proprietary information.

Here is the major piece of code doing the work, with names changed to protect the innocent. :)
while ($more) {
    $page++;
    $url      = "https://[server name is here]/[path information here]/$element/HAY/?page=$page";
    $filepage = "0" x (3 - length($page)) . $page;

    # fetch the page twice: once into the temp file for check_tmp, once into the real output file
    $response = $browser->get($url, ':content_file' => $tempxml);
    $file     = "$output\\$element" . "_" . $filepage . ".xml";
    $response = $browser->get($url, ':content_file' => $file);
    die "Couldn't get $url\n" unless defined $response;

    $more = &check_tmp;
    unlink("temp.xml");
    print "Completed $element page \($page\) file \($filepage\) \($more\) ...\n";
}

Because there is more than one page of data and I do not know the last page, I download each page into a temp.xml file and then check that file to see whether it has data (that is the check_tmp call; a simplified sketch of the idea is at the end of this post). If it does, I copy the data to another location, delete temp.xml, grab the next page, and loop until no more page data is available. To get past the certificate issue I added this code...

use IO::Socket::SSL qw(SSL_VERIFY_NONE);

$browser = LWP::UserAgent->new(ssl_opts => { verify_hostname => 0, SSL_verify_mode => SSL_VERIFY_NONE });

I also have browser credentials, etc., and those work fine. So, any clue as to why I am no longer getting the entire page of XML data? And thanks for your time, folks!
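For anyone curious, the idea behind check_tmp is roughly the sketch below (simplified, not the actual routine, and with names changed again; the <record> test is just a placeholder for our real "has data" check):

# Returns 1 if temp.xml appears to contain data, 0 if the page came back empty.
sub check_tmp {
    my $tmp = "temp.xml";

    return 0 unless -e $tmp && -s $tmp;   # missing or zero-byte file => no more pages

    open my $fh, '<', $tmp or die "Can't read $tmp: $!";
    local $/;                             # slurp mode
    my $xml = <$fh>;
    close $fh;

    # Placeholder test; the real routine checks for our actual data elements.
    return $xml =~ /<record\b/ ? 1 : 0;
}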

Replies are listed 'Best First'.
Re: LWP Browser->Get Challenge
by RichardK (Parson) on Aug 27, 2013 at 15:47 UTC

    Well, if your loop is ending early then it must be check_tmp() that's the problem (though it's also worth checking whether the download itself got cut short; see the sketch at the end of this reply).

    BTW, if you want a number with leading zeros, I think it's easier to use sprintf, so something like this:

    my $filename = sprintf( "%s%03d.xml",$path,$count);
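
    You could also ask LWP whether the download itself was cut short. Something like this, dropped in right after the get in your loop (a rough sketch reusing your $browser/$url/$file; if I remember right, LWP adds X-Died / Client-Aborted headers to the response when saving via :content_file dies partway):

    my $response = $browser->get($url, ':content_file' => $file);

    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;

    # If the :content_file handler died mid-stream, LWP notes it on the response.
    if (my $died = $response->header('X-Died') || $response->header('Client-Aborted')) {
        warn "Download of $url aborted: $died\n";
    }

    # Compare what the server said it would send with what actually landed on disk.
    my $expected = $response->header('Content-Length');
    my $got      = -s $file;
    warn "Short read: expected $expected bytes, got $got bytes\n"
        if defined $expected && defined $got && $got < $expected;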
Re: LWP Browser->Get Challenge
by DanielSpaniel (Scribe) on Aug 27, 2013 at 16:56 UTC

    I actually had a very similar sort of problem a while back, where the files I was trying to download were quite large, but I never managed to resolve it using LWP.

    However, while this may not be what you want to do, I managed to resolve my own problem by using curl instead and setting a limit rate on the downloads. That is, I used the same Perl script for everything, except that for the downloads I issued a system call from the script to kick off curl with the options I provided.

    You sound like somebody who may already know about curl, in which case you'll already know there are a gazillion options for it. My downloads were mostly FTP, but it works equally well for http, or whatever.

    The option you'd want to use with curl is "--limit-rate".

    For example, one of my commands looks similar to this:

    $cmd="curl -u $uname:$pwd -O --limit-rate 50K ftp://$host$dir$file"
    ... where 50K is the limit-rate which I've set, but you can set it to whatever you like (e.g. 10K, 35K, etc).

    I'm sure there are also curl options for getting past the certificate; I believe -k (or --insecure) skips the verification, as in the sketch below.
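    The system call itself is nothing fancy. A rough sketch of what I mean (variable names here are placeholders, not my actual script):

    # Kick off curl from the Perl script; --limit-rate throttles the transfer,
    # -k (--insecure) skips certificate verification, -o names the output file.
    my $cmd = "curl -k -u $uname:$pwd -o $file --limit-rate 50K https://$host$path?page=$page";

    system($cmd) == 0
        or die "curl failed for page $page: exit code ", $? >> 8, "\n";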

Re: LWP Browser->Get Challenge
by Anonymous Monk on Aug 27, 2013 at 22:55 UTC