espressoguy has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to get started with some screen-scraping using WWW::Mechanize. To debug my code I don't want to query the website repeatedly; I want to fetch the HTML once, save it as a text file, and then develop a WWW::Mechanize script that parses out the anchors I want. I don't understand how to read the saved HTML into Mechanize while debugging the script. Thanks in advance for any help ...

Replies are listed 'Best First'.
Re: WWW::Mechanize - offline debugging
by keszler (Priest) on Nov 07, 2009 at 22:36 UTC
    file:///some_file.html is a valid URI:
    use strict;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    $mech->get('file:///test.html');
    print $mech->content;

    Update:

    The file URI file:///some_file.html is a shortcut representation of file://localhost/some_file.html, and refers to the file some_file.html in the root directory.

    The file URI for /home/espressoguy/mechtest/site_a.html would be file:///home/espressoguy/mechtest/site_a.html.

    If you have spaces in the directory names, encode them as %20:
    file:///Documents%20and%20Settings/EspressoGuy/My%20Documents/Mech%20Test/site_a.html

      $ echo >1.html
      $ perl -MURI::file -le"print URI::file->new(shift)->abs(URI::file->cwd)" 1.html
      file:///D:/temp/1.html
      $ lwp-request file:1.html
      ECHO is on.
      $ lwp-request file:///D:/temp/1.html
      ECHO is on.
      $
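      A minimal sketch of putting URI::file to work with Mechanize (assumptions: a saved file named site_a.html in the current directory; URI::file ships with the URI distribution on CPAN):

      ```perl
      use strict;
      use warnings;
      use URI::file;        # from the URI distribution on CPAN
      use WWW::Mechanize;

      # Build a correct file:// URI from a local path; URI::file handles
      # platform quirks such as drive letters and spaces for you.
      my $uri = URI::file->new('site_a.html')->abs( URI::file->cwd );

      my $mech = WWW::Mechanize->new();
      $mech->get($uri);

      # Develop the anchor-parsing logic against the local copy.
      print $_->url, "\n" for $mech->find_all_links();
      ```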
Re: WWW::Mechanize - offline debugging
by Corion (Patriarch) on Nov 08, 2009 at 11:58 UTC

    You can save the page to a local file:

    $mech->save_content('myfile.html');
    ...             # time passes
    $mech->get('file:myfile.html');   # relative file: URI, resolved against the current directory

    and alternatively, you can load the content from disk manually, thus keeping the URLs as you want them:

    my $html = <<HTML;
    <html>...</html>
    HTML
    $mech->set_html($html);

    Look in the documentation of WWW::Mechanize and potentially look at the .t test files distributed with it.

      Thanks Corion; "set_html" doesn't work (or I wasn't able to get it - or update_html - to work), but save_content and "get('file://...')" do. Thanks a lot!!

        I'm sorry - the method is called ->update_html.
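        Putting the two pieces together, a hedged end-to-end sketch (the file names and URL are made up; save_content and update_html are the methods WWW::Mechanize documents for caching and replacing content):

        ```perl
        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new();

        # First run, online: fetch the page once and cache it to disk.
        # $mech->get('http://example.com/page.html');
        # $mech->save_content('cached.html');

        # Later runs, offline: slurp the cached copy and hand it to Mechanize.
        open my $fh, '<', 'cached.html' or die "cached.html: $!";
        my $html = do { local $/; <$fh> };
        close $fh;

        $mech->update_html($html);
        print $_->url, "\n" for $mech->find_all_links();
        ```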

Re: WWW::Mechanize - offline debugging
by spx2 (Deacon) on Nov 08, 2009 at 02:43 UTC

    maybe make an offline copy with httrack or wget:

    httrack --depth=3 --ext-depth=0 --stay-on-same-dir --stay-on-same-tld --stay-on-same-domain <site_url>

    there must be some switch to convert the links into relative links (wget's -k/--convert-links, for example); then you can put the copy under Apache, fire it up, and pretend you're accessing the live site.
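    For instance, wget can do the link conversion itself (flags per wget's manual; the URL is a placeholder):

    ```
    # Mirror up to 3 levels deep; --convert-links rewrites the links so
    # the copy browses correctly straight off the local filesystem.
    wget --recursive --level=3 --page-requisites --convert-links \
         --no-parent http://example.com/
    ```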