in reply to LWP::Simple on HTTPS sites
Hi,
Step one to fixing this: forget the program exists, and define your goals.
For example, mirror the title/alt/image of every xkcd comic, so that would be:
- get xkcd page
- extract info (id, title, alt text, image, next)
- save files
- repeat with next
Next up, tweak the goals a bit, and be nice:
- get page if it doesn't already exist
- extract info (id, title, alt text, image, next) and de-HTML-ify the text
- save files with safe filenames
- repeat with next
- wait and/or quit: when done, quit; when a limit is reached, wait or quit until next time; keep track of progress
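Two of those goals ("safe filenames" and "get page if not already exist") can be sketched with core Perl alone. The helper names `safe_filename` and `already_saved`, the allowed character set, and the `.txt` suffix are all my assumptions, not anything from your program:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build a filesystem-safe name from id and title.
sub safe_filename {
    my ( $id, $title ) = @_;
    my $name = "$id-$title";
    $name =~ s/[^A-Za-z0-9._-]+/_/g;    # anything unusual becomes "_"
    $name =~ s/^_+|_+$//g;              # trim leading/trailing "_"
    return $name;
}

# Hypothetical helper: true if this comic's text file is already in $outdir.
sub already_saved {
    my ( $outdir, $name ) = @_;
    return -e File::Spec->catfile( $outdir, "$name.txt" );
}

print safe_filename( 1234, 'xkcd: House of Pancakes' ), "\n";
# prints 1234-xkcd_House_of_Pancakes
```

Checking `already_saved` before the fetch is what lets a rerun skip pages it already has.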
Next, write (code) the program from the goals:
    save_xkcd( 'outdir', 'startingid' );

    sub save_xkcd {
        my( $outdir, $starting_id ) = @_;
        $starting_id ||= id_from_progress();   # resume where we left off
        my @ids = $starting_id;
        while( @ids ) {
            my $cid  = shift @ids;
            my $page = sprintf '...%s', $cid;
            $mech->get( $page );
            save_stuff( $mech, $cid );
            next_page( $mech, \@ids );         # pushes the next id, if any
            maybe_sleep();
        }
    }
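As one way to fill in two of the blanks, here is a minimal sketch of `id_from_progress` and `maybe_sleep`. The progress file name, the one-id-per-file format, the default of starting at id 1, the 2 second delay, and the extra `save_progress` helper are all assumptions for illustration:

```perl
use strict;
use warnings;

my $progress_file = 'xkcd.progress';    # assumed location

# Return the last recorded id, or 1 if nothing has been recorded yet.
sub id_from_progress {
    open my $fh, '<', $progress_file or return 1;
    my $line = <$fh>;
    close $fh;
    return 1 unless defined $line && $line =~ /^(\d+)/;
    return $1;
}

# Hypothetical companion: remember the id we just finished.
sub save_progress {
    my ( $id ) = @_;
    open my $fh, '>', $progress_file
        or die "Cannot write $progress_file: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $progress_file: $!";
}

# Pause between requests so we stay polite to the server.
sub maybe_sleep {
    my ( $seconds ) = @_;
    sleep( defined $seconds ? $seconds : 2 );
}
```

Calling `save_progress( $cid )` right after `save_stuff` would let an interrupted run pick up from the last finished comic.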
Now all you do is fill in the blanks.
No need for CGI in this equation; CGI doesn't like near-infinite loops anyway.
$mech->title gets you de-HTML'd text like "xkcd: House of Pancakes"
HTML::TreeBuilder::XPath gets you the alt/title text with xpath query of '//img/@title' and next link with a query of '//a[@rel="next"]'
Or $mech->find_link( text_regex => qr/next/i );
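The XPath queries can be tried offline against a string first. This is only a sketch: the HTML below is made up to stand in for a fetched page (the real markup may well differ), and with more than one img on the page you would want a narrower query:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up sample page; not the actual xkcd markup.
my $html = <<'HTML';
<html><head><title>xkcd: House of Pancakes</title></head>
<body>
<div id="comic"><img src="/comics/example.png" title="Example hover text" alt="House of Pancakes"></div>
<a rel="prev" href="/1/">Prev</a>
<a rel="next" href="/3/">Next</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );

my $hover = $tree->findvalue( '//img/@title' );            # hover text
my $alt   = $tree->findvalue( '//img/@alt' );              # alt text
my $src   = $tree->findvalue( '//img/@src' );              # image url
my $next  = $tree->findvalue( '//a[@rel="next"]/@href' );  # next page

$tree->delete;    # free the parse tree

print "hover: $hover\nalt: $alt\nsrc: $src\nnext: $next\n";
```

Inside the real loop you would build the tree from `$mech->content` instead of a literal string.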
Yes, you could fix up your program by replacing LWP::Simple with mech ... but that's not exactly fun now, is it :)