Hi,

Step one in fixing this is to forget that the program exists and define your goals.

For example, mirror the title, alt text, and image of every xkcd comic. That would be:

- get xkcd page
- extract info (id, title, alt text, image, next)
- save files
- repeat with next

Next, tweak the goals a bit and be nice:

- get the page only if it has not already been saved
- extract info (id, title, alt text, image, next) and convert it from HTML to plain text
- save files with safe filenames
- repeat with next
- wait and/or quit: quit when done; when a limit is reached, wait or quit until next time; keep track of progress
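Two of those refinements (safe filenames, and skipping what you already have) need nothing beyond core Perl. A minimal sketch, assuming made-up helper names (safe_filename, info_path, already_saved), a made-up naming scheme, and a placeholder comic id of 42:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build a filesystem-safe name from a comic id and title.
# Every run of characters outside [A-Za-z0-9._-] becomes a single underscore.
sub safe_filename {
    my ( $id, $title, $ext ) = @_;
    ( my $clean = $title ) =~ s/[^A-Za-z0-9._-]+/_/g;
    $clean =~ s/^_+|_+$//g;                      # trim stray underscores
    $clean = 'untitled' unless length $clean;
    return sprintf '%05d-%s.%s', $id, $clean, $ext;
}

# Hypothetical helper: the info file is named by id alone, so its existence
# can be checked before the page is ever fetched.
sub info_path {
    my ( $outdir, $id ) = @_;
    return File::Spec->catfile( $outdir, sprintf( '%05d.txt', $id ) );
}

# True if this comic's info file is already on disk.
sub already_saved {
    my ( $outdir, $id ) = @_;
    return -e info_path( $outdir, $id ) ? 1 : 0;
}

# 42 is a placeholder id, not the real id of that comic.
print safe_filename( 42, 'xkcd: House of Pancakes', 'png' ), "\n";
# prints 00042-xkcd_House_of_Pancakes.png
```

Naming the info file by id only is a design choice for this sketch: the title is not known until the page has been fetched, so it cannot be part of the "do I already have this?" check.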

Next, write (code) the program from the goals:

    save_xkcd( 'outdir', 'startingid' );

    sub save_xkcd {
        my ( $outdir, $starting_id ) = @_;
        $starting_id ||= id_from_progress();    # resume where the last run stopped
        my @ids = ($starting_id);
        while (@ids) {
            my $cid  = shift @ids;
            my $page = sprintf '...%s', $cid;
            $mech->get($page);
            save_stuff( $mech, $cid );
            next_page( $mech, \@ids );          # queue the next id, if there is one
            maybe_sleep();
        }
    }

Now all you have to do is fill in the blanks.
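As one possible way to fill in two of those blanks, here is a rough sketch of id_from_progress and maybe_sleep using only core Perl. The progress file name, the fetch limit, the delay, and the extra save_progress helper are all assumptions made for illustration, not something the skeleton prescribes:

```perl
use strict;
use warnings;

# All of these values are assumptions for the sketch.
my $PROGRESS_FILE = 'xkcd.progress';    # hypothetical file holding the last id
my $LIMIT         = 50;                 # hypothetical: pages per run
my $DELAY         = 2;                  # hypothetical: seconds between pages
my $fetched       = 0;

# Return the id stored by the previous run, or 1 if there is none.
sub id_from_progress {
    open my $fh, '<', $PROGRESS_FILE or return 1;
    my $id = <$fh>;
    close $fh;
    return 1 unless defined $id;
    chomp $id;
    return $id =~ /^\d+$/ ? $id : 1;    # ignore a corrupt progress file
}

# Hypothetical companion: record the id just processed.
sub save_progress {
    my ($id) = @_;
    open my $fh, '>', $PROGRESS_FILE
        or die "Cannot write $PROGRESS_FILE: $!";
    print {$fh} "$id\n";
    close $fh or die "Cannot close $PROGRESS_FILE: $!";
    return;
}

# Quit once the per-run limit is reached, otherwise pause before the next page.
sub maybe_sleep {
    $fetched++;
    exit 0 if $fetched >= $LIMIT;
    sleep $DELAY;
    return;
}
```

In this sketch save_stuff would call save_progress( $cid ) after writing its files, so the next run picks up from there.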

There is no need for CGI in this equation; CGI doesn't like near-infinite loops anyway.

$mech->title gets you the de-HTML'd text, like: xkcd: House of Pancakes

HTML::TreeBuilder::XPath gets you the alt/title text with the XPath query '//img/@title', and the next link with the query '//a[@rel="next"]'.

Or use $mech->find_link( text_regex => qr/next/i );
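To show those XPath queries in action without hitting the network, here is a small sketch that parses an inline HTML snippet. The snippet is made up and only roughly shaped like a comic page (one img, a rel="next" link); the paths, ids, and hover text in it are placeholders, not real xkcd data:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up stand-in for a fetched page; in the real program this would be
# $mech->content.
my $html = <<'HTML';
<html><head><title>xkcd: House of Pancakes</title></head>
<body>
<div id="comic"><img src="/comics/example.png" title="Example hover text" alt="House of Pancakes"></div>
<a rel="prev" href="/1/">Prev</a>
<a rel="next" href="/3/">Next</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# findvalue returns the string value of whatever the query matches.
my $hover = $tree->findvalue('//img/@title');
my $alt   = $tree->findvalue('//img/@alt');
my $src   = $tree->findvalue('//img/@src');
my $next  = $tree->findvalue('//a[@rel="next"]/@href');

print "hover: $hover\nalt:   $alt\nimage: $src\nnext:  $next\n";

$tree->delete;    # free the parse tree
```

Note that findvalue concatenates the values of every matching node, so on a page with more than one img a narrower query (or findnodes) would be needed.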

Yes, you could fix up your program by replacing LWP::Simple with Mech ... but that's not exactly fun now, is it? :)


In reply to Re: LWP::Simple on HTTPS sites ( WWW::Mechanize ) by Anonymous Monk
in thread LWP::Simple on HTTPS sites by rbhyland
