Customizing WWW::Mechanize

WWW::Mechanize is a wonderful tool. However, there are all sorts of little things I like to add that go beyond what it's intended to do. Some of these issues are "Perl only" (such as the one in "Detecting stringified references with WWW::Mechanize"), and others are more generic, such as detecting when an HTML entity has been encoded twice (e.g., &amp;amp;). In our case, we had methods that shouldn't have been encoding their output, but they did, and then the output got encoded again in Mason.

To deal with the double-encoding problem, I added a very simple line to my overloaded content() method (see above link):

    if ($content =~ /(&amp;(?:[lg]t|amp|quot);)/) {
        carp "Possible double-encoded HTML entity ($1) found in results";
    }
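A minimal sketch of a broader variation on that check, assuming you want every suspicious entity reported rather than just the first, and that numeric entities matter too. The name `find_double_encoded` is hypothetical and is not part of WWW::Mechanize; the wider pattern is an assumption, not what the original one-liner does.

```perl
use strict;
use warnings;

# Hypothetical helper: return every entity that looks double-encoded.
sub find_double_encoded {
    my ($content) = @_;

    # "&amp;" followed directly by an entity body (named, decimal, or hex)
    # and a closing ";". A lone "&amp;" does not match.
    my @hits = $content =~ /(&amp;(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)/g;
    return @hits;
}

my $html = '<p>Fish &amp;amp; chips &amp;lt;b&amp;gt; &amp; &#169; &amp;#169;</p>';
print "Possible double-encoded entity: $_\n" for find_double_encoded($html);
```

Inside an overloaded content() method, each hit could be passed to carp just as in the one-line version above.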

However, I can think of several other potential issues I would like to check. I would make the warnings optional: you instantiate my subclass with key/value pairs and suppress the warnings you don't want (maybe a page is trying to display HTML entity codes and wants them double-encoded):

  my $mech = WWW::Mechanize::Warnings->new(
      stringified_references => 1,
      valid_dtd              => 1,
      encoded_entities       => 0,
      valid_html             => 1,
  );

I think the warnings should be on by default, so only the "encoded_entities" key/value pair would be strictly necessary in the hypothetical constructor above.
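One way the "on by default" behaviour could be sketched is to merge the caller's arguments over a table of defaults. The four check names come from the hypothetical constructor above; the helper name `warning_options` and the decision to croak on unknown keys are assumptions made for illustration.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Check names taken from the hypothetical constructor; all start enabled.
my %DEFAULTS = map { $_ => 1 } qw(
    stringified_references
    valid_dtd
    encoded_entities
    valid_html
);

# Hypothetical helper: overlay the caller's settings on the defaults.
sub warning_options {
    my (%args) = @_;
    for my $key (keys %args) {
        croak "Unknown warning option '$key'" unless exists $DEFAULTS{$key};
    }
    return { %DEFAULTS, %args };    # later pairs win, so %args overrides
}

my $opts = warning_options( encoded_entities => 0 );
for my $name (sort keys %$opts) {
    printf "%-24s %s\n", $name, $opts->{$name} ? 'on' : 'off';
}
```

A subclass constructor would presumably remove these keys from its argument list before handing the remainder to the parent constructor, so that the parent never sees options it does not know about.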

What else would you want to see in such a module? If it's anything tricky, implementation ideas would be welcome.

Cheers,
Ovid

New address of my CGI Course.


Replies are listed 'Best First'.
Re: Customizing WWW::Mechanize by jonadab (Parson) on Jan 13, 2004 at 02:01 UTC
There are a number of ways to extend WWW::Mechanize, and this is a good one. I can't think of any other specific warnings (unless you want to turn it into a validator), but the other day I ran into a situation where a different type of extension would have saved me time.

I was writing a script that fetches usage statistics for some online databases that the library subscribes to, and at one point it needed to follow a link like this:

`<a href="javascript:doAdminHyperlink('/eadmin/Profiles/CustomizeServicesDatabasesForm.aspx','','')">`

As an added bonus, doAdminHyperlink was defined in an external .js file that was linked in. Now, browsers have no trouble with this, but I had to retrieve the script, find the function, and translate the important parts of what it was doing into Perl. (It changes a couple of the form's values and submits it.) I ended up doing $mech->set_fields( some stuff); $mech->click(); but I had to read a twenty-some-line function in a language I don't really know in order to find out that that was what I needed to do.

So, an ECMAScript and DOM extension for WWW::Mechanize would be nifty. It would have to be optional, of course. Note that I'm not asking you to add this to your warnings module :-) Just pointing out another example of what you're driving at: WWW::Mechanize is a really cool beginning, but it could be much more.

`$;=sub{$/};@;=map{my($a,$b)=($_,$;);$;=sub{$a.$b->()}} split//,".rekcah lreP rehtona tsuJ";$\=$ ;->();print$/`
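Short of a real script engine, part of the manual work described in this reply can be reduced with a small parser for `javascript:` hrefs. The following is only a sketch: `parse_js_href` is a hypothetical helper, it assumes a single function call with single-quoted string arguments, and it does not execute any script. What the function does with its arguments still has to be read from the .js file.

```perl
use strict;
use warnings;

# Hypothetical helper: split "javascript:func('a','b')" into the function
# name and its single-quoted string arguments. Returns nothing on no match.
sub parse_js_href {
    my ($href) = @_;

    my ($func, $arglist) =
        $href =~ /\A\s*javascript:\s*([A-Za-z_]\w*)\s*\((.*)\)\s*;?\s*\z/is
        or return;

    # Collect each '...' argument, allowing backslash-escaped characters.
    my @args = $arglist =~ /'((?:[^'\\]|\\.)*)'/g;
    return ($func, @args);
}

my $href = q{javascript:doAdminHyperlink('/eadmin/Profiles/CustomizeServicesDatabasesForm.aspx','','')};
my ($func, @args) = parse_js_href($href);

print "function: $func\n";
print "arg $_: '$args[$_]'\n" for 0 .. $#args;
```

In a script like the one described, the first argument could then be used to decide which form fields to set before submitting.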
Re: Customizing WWW::Mechanize by clscott (Friar) on Jan 13, 2004 at 03:26 UTC
Wouldn't this be the domain of HTML::Lint? It is by the same author as WWW::Mechanize. I'm sure there's a reason that they are separate.

HTML::Lint - check for HTML errors in a string or file

The HTML::Lint distribution comes with Test::HTML::Lint, a Test::More-style wrapper around HTML::Lint.

-- Clayton
Re: Customizing WWW::Mechanize by revdiablo (Prior) on Jan 13, 2004 at 06:23 UTC
"when an HTML entity has been encoded twice (e.g., &amp;amp; ...)"

Kind of an off-topic reply, but please note that sometimes this is not an error. I ran into this very thing with my website. I had similar code in place to check for double-encoded entities, but it falsely triggered when I actually wanted `&amp;amp;`, for instance when I wanted to have `&amp;` rendered in the HTML output (without resorting to tricks like PM's <code> tags, or similar).
Re: Customizing WWW::Mechanize by ViceRaid (Chaplain) on Jan 13, 2004 at 17:49 UTC
Afternoon

I'm a big fan of Mechanize as well, enough so that I started porting it to Ruby. I used ruby-htmltools (think HTML::Tree), REXML (XPath & XML trees) and HTTP Access2 (LWP). Building it using this set of libraries gave me a few ideas for W:M.

At the moment, WWW::Mechanize doesn't try to do any general parse; it uses HTML::TokeParser to rip out links and forms, HTML::HeadParser to do the head, etc., but doesn't really care much about the rest of the doc. One thing I found in porting was that, by using a generic HTML parsing tool and then converting the result to an XML tree, I could very easily pick out bits of content from the page. REXML allows the use of XPath or XML::Twig/HTML::TreeBuilder-type scans of the parse tree, making it very easy to pick out, say, the title, or the number of <LI> elements with class 'green' in the <UL> tag with the `id` attribute 'foo'.

Though I've never tried it, I'm sure it would be easy to use HTML::TreeBuilder to do something similar with WWW::Mechanize (IIRC it relies on HTML::Parser, too). Then, in your subclass of W:M, you have lovely convenience methods like `$mech->title()`. Oh, actually W:M has that already .... hmm, how about `$mech->keywords()`? Or `$mech->get_element_by_id('foo')`?

I dunno if this was the kind of thing you were thinking of, but I think it does extend the range of possibilities.

cheers
ViceRaid
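A rough sketch of what the HTML::TreeBuilder idea might look like, assuming the HTML-Tree distribution from CPAN is installed. The sample markup is invented for illustration, and the two lookups are stand-ins for the hypothetical `$mech->get_element_by_id('foo')` and `$mech->keywords()` methods, which WWW::Mechanize does not provide.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module (HTML-Tree distribution), not core Perl

# Invented sample page; with Mechanize this would come from $mech->content.
my $html = <<'HTML';
<html><head>
<title>Demo</title>
<meta name="keywords" content="perl, mechanize, testing">
</head><body>
<ul id="foo">
  <li class="green">one</li>
  <li class="red">two</li>
  <li class="green">three</li>
</ul>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Stand-in for get_element_by_id('foo'): first element with that id.
my $list = $tree->look_down( id => 'foo' );

# Count the <li class="green"> elements inside that list.
my @green = $list ? $list->look_down( _tag => 'li', class => 'green' ) : ();
print scalar(@green), " green items\n";

# Stand-in for keywords(): split the meta tag's content on commas.
my $meta     = $tree->look_down( _tag => 'meta', name => 'keywords' );
my @keywords = $meta ? split( /\s*,\s*/, $meta->attr('content') ) : ();
print "keywords: @keywords\n";

$tree->delete;    # release the parse tree
```

In a subclass, the tree would probably be built lazily and cached per page, since parsing the whole document on every request is more work than the targeted parsing Mechanize does now.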