Gnuser has asked for the wisdom of the Perl Monks concerning the following question:

I subscribe to online books, and I want to fetch my books and save them to a file. I've been researching for weeks, reading posts and PODs, man pages and library docs.

I don't want to use Tk 'cause I'm not that savvy. I know what I need is in LWP::UserAgent, Crypt::SSLeay, HTTP::Request, Getopt::Long and more.

I'm new to programming and my code snippet is laughable and incomplete, but I am trying. I realize I need serious direction.

I'm stuck on the secure HTTPS authentication: I haven't been able to get past the ASP login form, even after installing and reading up on the secure socket module. Then I need to get to my book and fetch it into one file, recursively finding all the pages (which are links), or should I use an array? I don't know how to piece it together, and because there's more than one way to write Perl, I'm half delirious.

I'm open to any suggestions, advice, and criticism.
use strict;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common qw(GET POST);
# Crypt::SSLeay must be installed for https URLs; LWP picks it up automatically.

# ASP FORM SOURCE CODE
# <form name=login method=post action="mainin.asp">
#   <font size="2">E-Mail Address :</font>
#   <input type=text name="login" size=20 maxlength=60 value="" class=inputtext>
#   <font size="2">Password :</font>
#   <input type="password" name="passwd" size=20 maxlength=60 class=inputtext>
#   <input type="submit" name="action" value="Log In" class="inputbutton">
# </form>

# I NEED TO LOG IN (keep a cookie jar so the ASP session survives between requests)
my $ua = LWP::UserAgent->new;
$ua->cookie_jar(HTTP::Cookies->new);
my $res = $ua->request(POST 'https://domain/sec/mainin.asp',
    [ login => 'foo@bar.com', passwd => 'snafu', action => 'Log In' ]);
die "Login failed: ", $res->status_line, "\n" unless $res->is_success;

# PAGES ARE LINKS; COLLECT THEM IN AN ARRAY (HOW? SEE THE REPLIES BELOW)
my @pages = ();

# HOW ABOUT MORE THAN ONE BOOK - ANOTHER ARRAY
my @books = ();

# AFTER LOGIN SUCCESS, GET EACH PAGE AND APPEND IT TO FETCHEDBOOK.TXT
open my $out, '>>', 'FETCHEDBOOK.TXT' or die "Can't open FETCHEDBOOK.TXT: $!";
for my $book (@books) {
    for my $page (@pages) {
        my $res = $ua->request(GET "http://domain/main.asp?bookname=$book&page=$page");
        if ($res->is_success) {
            print $out $res->content;
        }
        else {
            print "Failed: ", $res->status_line, "\n";
        }
    }
}
close $out;
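If the POST comes back successful but the form still won't let you past, one quick sanity check (a minimal sketch; the session cookie's name is site-specific and unknown here) is to dump the cookie jar right after logging in:

print $ua->cookie_jar->as_string;  # no output means the site never set a session cookie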
Many thanks in advance.

Re: Fetch html to a file
by kappa (Chaplain) on May 31, 2002 at 17:09 UTC
    You can use HTML::SimpleLinkExtor to easily find all links in a document. But you'd better get through all the HTTPS authentication to the table of contents of your book first :)
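
    A minimal sketch of that suggestion, assuming you are already past the login (reusing the $ua with its cookie jar from the code above) and assuming a hypothetical TOC URL:

    use HTML::SimpleLinkExtor;
    use HTTP::Request::Common qw(GET);

    my $res = $ua->request(GET 'https://domain/sec/toc.asp');  # hypothetical TOC page
    die "TOC fetch failed: ", $res->status_line unless $res->is_success;

    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($res->content);

    my @pages = $extor->a;     # href targets of <a> tags only
    # $extor->links would also return img src, frame src, and so on
    print "$_\n" for @pages;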
Re: Fetch html to a file
by hacker (Priest) on May 31, 2002 at 21:13 UTC
    I wrote a meditation a while back on a very similar subject. I've been working on lots of spidering/gathering techniques for some time. Go take a look at the Plucker Perl spider in my CVS repository for some ideas. Note: I am not the "Pat" in the header; he's a good friend of mine who got the Perl spider project started with me a few years ago.

    This should give you plenty of ideas.
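
    In the spirit of that spider, the usual idiom for "recursively finding all pages" is not recursion at all but a work queue: an array of URLs still to fetch plus a hash of URLs already seen. A rough sketch, assuming a logged-in $ua and HTML::SimpleLinkExtor as in the reply above (start URL and domain filter hypothetical):

    use HTML::SimpleLinkExtor;
    use HTTP::Request::Common qw(GET);

    my @queue = ('https://domain/sec/toc.asp');  # hypothetical starting page
    my %seen;

    open my $out, '>>', 'FETCHEDBOOK.TXT' or die "Can't open FETCHEDBOOK.TXT: $!";
    while (my $url = shift @queue) {
        next if $seen{$url}++;               # skip pages we've already fetched
        my $res = $ua->request(GET $url);
        next unless $res->is_success;
        print $out $res->content;            # append this page to the book file

        my $extor = HTML::SimpleLinkExtor->new($url);  # base URL resolves relative links
        $extor->parse($res->content);
        # only follow links that stay on the book's site
        push @queue, grep { m{^https?://domain/} } $extor->a;
    }
    close $out;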