Hi,

Step one to fixing this: forget the program exists, and define your goals.

For example, mirror the title/alt text/image of every xkcd comic. That would be:

- get xkcd page
- extract info ( id title alt text image next )
- save files 
- repeat with next

Next up, tweak the goals a bit, and be nice:

- get the page if you don't already have it
- extract info ( id title alt text image next ) and de-HTML-ify the text
- save files with safe filenames
- repeat with next
- wait and/or quit: quit when done; when a limit is reached, wait or quit until next time; keep track of progress
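
Two of those "be nice" goals, safe filenames and skipping pages you already have, need nothing beyond core Perl. A rough sketch only: the helper names safe_filename, page_file and already_saved, and the one-file-per-id layout, are my own assumptions, not anything from your program.

```perl
use strict;
use warnings;
use File::Spec;

# Turn an arbitrary string (a title, an id) into something safe to use
# as a filename: only letters, digits, dot, dash and underscore survive.
sub safe_filename {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9._-]+/_/g;    # each run of unsafe characters becomes one "_"
    $name =~ s/^[._]+//;                # no leading dots/underscores (no hidden files, no "..")
    $name =~ s/_+$//;                   # trim trailing underscores
    return length $name ? $name : 'untitled';
}

# Hypothetical layout: the fetched page for comic <id> lives in <outdir>/<id>.html
sub page_file {
    my ( $outdir, $id ) = @_;
    return File::Spec->catfile( $outdir, safe_filename($id) . '.html' );
}

# True if this page was saved on an earlier run, so the fetch can be skipped.
sub already_saved {
    my ( $outdir, $id ) = @_;
    my $file = page_file( $outdir, $id );
    return -e $file ? 1 : 0;
}
```

In the loop you would check already_saved( $outdir, $cid ) before calling $mech->get, and run every title through safe_filename before using it in a path.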

Next, write (code) the program from the goals:

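use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
save_xkcd( 'outdir', 'startingid' );

sub save_xkcd {
    my( $outdir, $starting_id ) = @_;
    $starting_id ||= id_from_progress();   # resume where the last run stopped
    my @ids = $starting_id;
    while( @ids ) {
        my $cid  = shift @ids;
        my $page = sprintf '...%s', $cid;  # page url for this id
        $mech->get( $page );
        save_stuff( $mech, $cid );         # blank to fill in: save title/alt/image
        next_page( $mech, \@ids );         # blank to fill in: push the next id, if any
        maybe_sleep();                     # blank to fill in: pause between requests
    }
}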

Now all you do is fill in the blanks
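
To give an idea of how small those blanks can be, here is one possible shape for id_from_progress and maybe_sleep, core Perl only. Everything specific here is an assumption on my part: the progress file name xkcd.progress, the save_progress helper (which the skeleton doesn't have yet), starting from 1 on the first run, and the 2 second default delay.

```perl
use strict;
use warnings;

# Hypothetical progress file: holds the id of the next comic to fetch.
my $PROGRESS_FILE = 'xkcd.progress';

# Read the saved id; fall back to 1 if there is no usable progress file.
sub id_from_progress {
    my ($file) = @_;
    $file = $PROGRESS_FILE unless defined $file;
    open my $fh, '<', $file or return 1;
    my $id = <$fh>;
    close $fh;
    return 1 unless defined $id && $id =~ /^(\d+)/;
    return $1;
}

# Record where to resume next time.
sub save_progress {
    my ( $id, $file ) = @_;
    $file = $PROGRESS_FILE unless defined $file;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $file: $!";
    return;
}

# Be nice: pause between requests. The default delay is an arbitrary choice.
sub maybe_sleep {
    my ($seconds) = @_;
    $seconds = 2 unless defined $seconds;
    sleep $seconds if $seconds > 0;
    return;
}
```

Calling save_progress( $cid ) after each successful save_stuff means a run that quits early picks up at the same spot next time.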

No need for CGI in this equation; CGI doesn't like near-infinite loops anyway.

$mech->title gets you de-HTML'd text like xkcd: House of Pancakes

HTML::TreeBuilder::XPath gets you the alt/title text with xpath query of '//img/@title' and next link with a query of '//a[@rel="next"]'
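
A small sketch of those two queries, run against a made-up page so it works offline. The markup below is only a stand-in (the real page will have more images and links), and extract_info is a name I invented; in the real program you would pass in $mech->content.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up stand-in for a comic page; the real markup may differ.
my $html = <<'HTML';
<html><head><title>xkcd: Example</title></head>
<body>
<div id="comic"><img src="/comics/example.png" title="Hover text here" alt="Example"></div>
<a rel="prev" href="/1/">&lt; Prev</a>
<a rel="next" href="/3/">Next &gt;</a>
</body></html>
HTML

sub extract_info {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder::XPath->new_from_content($content);
    # findvalue joins the text of every match, so on a page with several
    # images the query needs narrowing to the one you want.
    my %info = (
        alt_text => $tree->findvalue('//img/@title'),
        image    => $tree->findvalue('//img/@src'),
        next     => $tree->findvalue('//a[@rel="next"]/@href'),
    );
    $tree->delete;    # free the parse tree
    return \%info;
}

my $info = extract_info($html);
print "$_: $info->{$_}\n" for sort keys %$info;
```

An empty next value would be the natural signal that you have reached the last page and the loop can stop.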

Or $mech->find_link( text_regex => qr/next/i );

Yes, you could fix up your program by replacing LWP::Simple with mech ... but that's not exactly fun, now is it :)

In reply to Re: LWP::Simple on HTTPS sites ( WWW::Mechanize ) by Anonymous Monk
in thread LWP::Simple on HTTPS sites by rbhyland
