Hi,
Step one to fixing this: forget the program exists, and define your goals.
For example, mirror the title/alt/image of every xkcd comic. That would be
- get xkcd page
- extract info (id, title, alt text, image, next)
- save files
- repeat with next
Next up, tweak the goals a bit, and be nice to the server:
- get page, if it doesn't already exist locally
- extract info (id, title, alt text, image, next) and de-HTML the text
- save files with safe filenames
- repeat with next
- wait and/or quit: when done, quit; when a limit is reached, wait, or quit until next time
- keep track of progress
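For the "safe filenames" and "if not already exist" goals, here is one rough way it could look. The names safe_filename and already_saved, the id 123 and the ".txt" extension are all made up for illustration, so adjust to taste.

```perl
use strict;
use warnings;

# Hypothetical helper: build a filesystem-safe name from a comic id and title.
sub safe_filename {
    my( $id, $title ) = @_;
    my $name = lc $title;
    $name =~ s/[^a-z0-9]+/_/g;    # squash anything unusual into one underscore
    $name =~ s/^_+|_+$//g;        # trim leading/trailing underscores
    $name = 'untitled' unless length $name;
    return sprintf '%04d-%s', $id, $name;
}

# Hypothetical helper: true if this comic was already saved, so the
# download can be skipped. The ".txt" extension is an assumption.
sub already_saved {
    my( $outdir, $basename ) = @_;
    return -e "$outdir/$basename.txt";
}

# The id 123 is invented for the example.
print safe_filename( 123, 'xkcd: House of Pancakes' ), "\n";
# prints 0123-xkcd_house_of_pancakes
```

A whitelist (keep only a-z and 0-9) is easier to trust than trying to list every character that some filesystem dislikes.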
Next, write (code) the program from the goals:

    save_xkcd( 'outdir', 'startingid' );

    sub save_xkcd {
        my( $outdir, $starting_id ) = @_;
        $starting_id ||= id_from_progress();    # resume where we left off
        my @ids = $starting_id;
        while( @ids ) {
            my $cid  = shift @ids;
            my $page = sprintf '...%s', $cid;
            $mech->get( $page );
            save_stuff( $mech, $cid );
            next_page( $mech, \@ids );          # queues the next id, if any
            maybe_sleep();
        }
    }
Now all you do is fill in the blanks
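Two of the blanks, id_from_progress and maybe_sleep, could be filled in something like this. It is only a sketch: the progress.txt file name, the save_progress helper, starting at id 1, and the default limit and pause are all my assumptions.

```perl
use strict;
use warnings;

my $progress_file = 'progress.txt';    # assumed location
my $fetched       = 0;                 # pages fetched so far in this run

# Hypothetical helper: remember the last id that was handled.
sub save_progress {
    my( $id ) = @_;
    open my $fh, '>', $progress_file or die "Cannot write $progress_file: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $progress_file: $!";
}

# Return the saved id, or 1 (assumed first id) if there is nothing usable.
sub id_from_progress {
    return 1 unless -e $progress_file;
    open my $fh, '<', $progress_file or die "Cannot read $progress_file: $!";
    my $id = <$fh>;
    close $fh;
    $id = '' unless defined $id;
    chomp $id;
    return $id =~ /^\d+$/ ? $id : 1;
}

# Pause after every $limit fetches; returns 1 if it slept, 0 otherwise.
sub maybe_sleep {
    my( $limit, $pause ) = @_;
    $limit ||= 10;                     # assumed default: every 10 pages
    $pause = 5 unless defined $pause;  # assumed default: 5 seconds
    return 0 if ++$fetched % $limit;   # not at the limit yet
    sleep $pause;
    return 1;
}
```

Calling save_progress( $cid ) right after save_stuff in the loop would let a later run pick up where this one stopped.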
No need for CGI in this equation; CGI doesn't like near-infinite loops anyway.
$mech->title gets you de-htmld text like xkcd: House of Pancakes
HTML::TreeBuilder::XPath gets you the alt/title text with xpath query of '//img/@title' and next link with a query of '//a[@rel="next"]'
Or $mech->find_link( text_regex => qr/next/i );
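To show those XPath queries in action, here is a small sketch that runs them over a made-up HTML snippet. The markup is invented and the real page may well differ; in the real program the string would come from $mech->content instead.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Invented HTML standing in for a fetched page; the real markup may differ.
my $html = <<'HTML';
<html><head><title>xkcd: Example</title></head><body>
<div id="comic"><img src="/comics/example.png" title="hover text" alt="Example"></div>
<a rel="next" href="/2/">Next</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content( $html );

# findvalues returns one string per matching node; keep only the first.
my( $hover ) = $tree->findvalues( '//img/@title' );
my( $next )  = $tree->findvalues( '//a[@rel="next"]/@href' );

$tree->delete;    # free the parse tree when done

print "hover: $hover\n";
print "next:  $next\n";
```

Taking only the first match matters on a page with several images, since a plain findvalue would join the values of every matching node into one string.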
Yes, you could fix up your program by replacing LWP::Simple with mech ... but that's not exactly fun now, is it :)
In reply to Re: LWP::Simple on HTTPS sites ( WWW::Mechanize )
by Anonymous Monk
in thread LWP::Simple on HTTPS sites
by rbhyland