WWW::Mechanize is a wonderful tool. However, there are all sorts of little things that I like to add that extend beyond what it's intended to do. Some of these issues are "Perl only" (such as Detecting stringified references with WWW::Mechanize), and others are more generic, such as detecting when an HTML entity has been encoded twice (e.g., &amp;amp; -- we had methods that should not have been encoding their output, but they did, and then it got encoded again in Mason).

To deal with the double-encoding problem, I added a very simple line to my overloaded content() method (see above link):

if ($content =~ /(&amp;(?:[lg]t|amp|quot);)/) {
    carp "Possible double-encoded HTML entity ($1) found in results";
}
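
The same idea can be tried outside of a Mechanize subclass. The following is only a minimal sketch: double_encoded_entities is a hypothetical helper name, and the numeric-reference alternatives (#38, #x26) are an assumed extension of the pattern above, not something the original check covered.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize): returns every
# double-encoded entity found in $content, in order of appearance.
# Call it in list context to get the matches themselves.
sub double_encoded_entities {
    my ($content) = @_;

    # "&amp;" followed by the tail of another entity: a name (lt, gt,
    # amp, quot) or, as an assumed extension, a numeric reference.
    return $content =~ /(&amp;(?:[lg]t|amp|quot|#\d+|#x[0-9a-fA-F]+);)/g;
}

my @found = double_encoded_entities(
    '<p>Fish &amp;amp; chips &amp;lt;b&amp;gt;</p>'
);
print "Possible double-encoded HTML entity ($_) found\n" for @found;
```

Returning the full list rather than stopping at the first match makes it easier to judge whether a page is broken or is deliberately showing entity codes.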

I can think of several other potential issues I would like to check, though, and I would like each warning to be optional: the subclass would be instantiated with a hashref, and any warning I don't want could be suppressed (maybe a page is trying to display HTML entity codes and wants them double-encoded):

my $mech = WWW::Mechanize::Warnings->new(
    stringified_references => 1,
    valid_dtd              => 1,
    encoded_entities       => 0,
    valid_html             => 1,
);

I think the warnings should be on by default, so only the "encoded_entities" key/value pair would be strictly necessary in the hypothetical constructor above.
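
One way the "on by default" behaviour might be handled is sketched below. Nothing here exists yet: resolve_warnings and %DEFAULT_WARNINGS are hypothetical names, and the four keys are simply the ones from the constructor call above. In a real subclass, new() would presumably pull these keys out of its arguments before handing the rest to the parent constructor.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Assumed set of checks, taken from the hypothetical constructor call;
# every warning is enabled unless the caller says otherwise.
my %DEFAULT_WARNINGS = (
    stringified_references => 1,
    valid_dtd              => 1,
    encoded_entities       => 1,
    valid_html             => 1,
);

# Hypothetical helper: lay the caller's overrides on top of the
# defaults, rejecting unknown keys so a typo cannot silently leave a
# warning switched on.
sub resolve_warnings {
    my (%override) = @_;
    for my $key (keys %override) {
        croak "Unknown warning '$key'"
            unless exists $DEFAULT_WARNINGS{$key};
    }
    return { %DEFAULT_WARNINGS, %override };
}

# Only the warning being suppressed has to be mentioned.
my $enabled = resolve_warnings( encoded_entities => 0 );
print "$_ => $enabled->{$_}\n" for sort keys %$enabled;
```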

What else would you want to see in such a module? If it's anything tricky, implementation ideas would be welcome.

Cheers,
Ovid

New address of my CGI Course.

Re: Customizing WWW::Mechanize
by jonadab (Parson) on Jan 13, 2004 at 02:01 UTC

    There are a number of ways to extend WWW::Mechanize, and this is a good one. I can't think of any other specific warnings (unless you want to turn it into a validator), but I did run into a situation the other day wherein a different type of extension would have saved me time. I was writing a script that fetches usage statistics for some online databases that the library subscribes to, and at one point it needed to follow a link like this:

    <a href="javascript:doAdminHyperlink('/eadmin/Profiles/CustomizeServicesDatabasesForm.aspx','','')">

    As an added bonus, doAdminHyperlink was defined in an external .js file that was linked in. Now, browsers have no trouble with this, but I had to retrieve the script, find the function, and translate into Perl the important parts of what it was doing. (It changes a couple of the form's values and submits it.) I ended up doing $mech->set_fields( some stuff); $mech->click(); but I had to read a twenty-some-line function in a language I don't really know in order to find out that that was what I needed to do.
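
Short of a real script engine, the first step (finding out which function a link calls, and with what arguments) can at least be automated. This is only a rough sketch: parse_js_href is a hypothetical helper, it handles nothing beyond simple single-quoted literals without escapes, and what the function then does with those arguments still has to be read out of the .js file by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: split a "javascript:func('a','b')" href into the
# function name and its single-quoted string arguments. No JavaScript
# is executed. Returns an empty list if the href does not look like a
# simple function call.
sub parse_js_href {
    my ($href) = @_;

    my ($func, $args) =
        $href =~ /\Ajavascript:\s*([\w.\$]+)\s*\((.*)\)\s*;?\s*\z/s
        or return;

    # Collect each '...' literal, including empty ones.
    my @args = $args =~ /'([^']*)'/g;
    return ($func, @args);
}

my ($func, @args) = parse_js_href(
    "javascript:doAdminHyperlink('/eadmin/Profiles/CustomizeServicesDatabasesForm.aspx','','')"
);
print "function: $func\n";
print "arg: '$_'\n" for @args;
```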

    So, an ECMAScript and DOM extension for WWW::Mechanize would be nifty. It would have to be optional, of course. Note that I'm not asking you to add this to your warnings module :-) Just pointing out another example of what you're driving at: WWW::Mechanize is a really cool beginning, but it could be much more.


    $;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/
Re: Customizing WWW::Mechanize
by clscott (Friar) on Jan 13, 2004 at 03:26 UTC

    Wouldn't this be the domain of HTML::Lint? It is by the same author as WWW::Mechanize, so I'm sure there's a reason that they are separate.

    HTML::Lint - check for HTML errors in a string or file

    The HTML::Lint distribution comes with Test::HTML::Lint, a Test::More-style wrapper around HTML::Lint.

    --
    Clayton
Re: Customizing WWW::Mechanize
by revdiablo (Prior) on Jan 13, 2004 at 06:23 UTC
    when an HTML entity has been encoded twice (e.g., &amp;amp; ...)

    Kind of an off-topic reply, but please note that sometimes this is not an error. I ran into this very thing with my website. I had similar code in place to check for double-encoded entities, but it falsely triggered when I actually wanted &amp;amp;, for instance if I wanted to have &amp; rendered in the HTML output (without resorting to tricks like PM's <code> tags, or similar).

Re: Customizing WWW::Mechanize
by ViceRaid (Chaplain) on Jan 13, 2004 at 17:49 UTC

    Afternoon

    I'm a big fan of Mechanize as well, enough so that I started porting it to Ruby. I used ruby-htmltools (think HTML::Tree), REXML (XPath & XML trees) and HTTP Access2 (LWP). Building it using this set of libraries gave me a few ideas for W:M.

    At the moment, WWW::Mechanize doesn't try to do any general parse; it uses HTML::TokeParser to rip out links and forms, HTML::HeadParser to do the head, etc., but doesn't really care much about the rest of the doc. One thing I found in porting was that, by using a generic HTML parsing tool and then converting the result to an XML tree, I could very easily pick out bits of content from the page. REXML allows the use of XPath or XML::Twig/HTML::TreeBuilder-type scans of the parse tree, making it very easy to pick out, say, the title, or the number of <LI> elements with class 'green' in the <UL> tag with the id attribute 'foo'.

    Though I've never tried it, I'm sure it would be easy to use HTML::TreeBuilder to do something similar with WWW::Mechanize (IIRC it relies on HTML::Parser, too). Then, in your subclass of W:M, you could have lovely convenience methods like $mech->title(). Oh, actually W:M has that already ... hmm, how about $mech->keywords()? Or $mech->get_element_by_id('foo')? I dunno if this was the kind of thing you were thinking of, but I think it does extend the range of possibilities.

    cheers
    ViceRaid