simon.proctor has asked for the wisdom of the Perl Monks concerning the following question:

A while ago I had to build a customised HTML validation and strip tool that would work to very custom requirements. Consequently HTML Tidy wasn't exactly what we wanted and so I built a tool based on HTML::TreeBuilder.

The code runs fine but the output stage for printing has become quite monolithic (or at least it feels that way).

The code operates by first building the node tree and then processing it via walking (tramping?) over the tree and inserting/deleting nodes, adding/removing attributes and converting values (like colours) into what we want them to be. The second stage is to re-walk this tree and print it.

It's this second stage that I'm posting about. The page output has to be very well formatted and clean, but the code that does this is quite complex. Currently, I have a recursive solution supported by lists that maintain a stack of ancestors of the current node.

So we call the routine with a node, pass through a handful of if statements (creating output), determine whether the node has children (recursing if it does), pass through a few more if statements (creating more output) and return. Needless to say, the if logic is getting increasingly complex and less and less customisable.
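For readers following along, the shape described above is roughly the following. This is only a minimal sketch, not the actual code: the nested hashes are a hypothetical stand-in for the parsed tree (with HTML::TreeBuilder the nodes would be HTML::Element objects, read via methods such as tag and content_list), and the sub name render is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parsed tree: each node is a hash with a tag
# name and a list of children; plain strings are text nodes.
my $tree = {
    tag      => 'div',
    children => [
        { tag => 'p', children => ['Hello'] },
        { tag => 'p', children => [ { tag => 'b', children => ['bold'] } ] },
    ],
};

# Recursive printer: output before the children, recurse, output after.
# @$ancestors is the stack of enclosing nodes; its size is the indent depth.
sub render {
    my ($node, $ancestors) = @_;
    my $indent = "\t" x @$ancestors;

    # Text node: emit it on its own line at the current depth.
    return "$indent$node\n" unless ref $node;

    my $out = "$indent<$node->{tag}>\n";
    push @$ancestors, $node;
    $out .= render($_, $ancestors) for @{ $node->{children} };
    pop @$ancestors;
    $out .= "$indent</$node->{tag}>\n";
    return $out;
}

my $html = render($tree, []);
print $html;
```

Every formatting special case ends up as another if statement before or after the recursion, which is where the complexity piles up.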

Frankly I'm stumped as to how to refactor this. In fact I'm prepared to re-write the output stage from scratch but thought it best to get some advice first.

So does anyone have any ideas? I can post some code if required but I'm not sure how useful it would be.

Thanks in advance,

SP

Re: Output of HTML tree built with TreeBuilder
by dash2 (Hermit) on Jun 20, 2003 at 11:36 UTC
    It depends how much futureproofness you want. I assume the code works as it is, but is becoming hard to change any further. I suggest that you write out a table of all the possible inputs into the decision as to how to write the node out.
    has parent || is marked comment || node tag || etc...
    On the right of the table, put all the possible results:
    indent with tabs || newline before || etc.
    Then, write out all the variations on the rows:
    has parent || is marked comment || node tag || indent with tabs || newline before || ...
    yes        || yes               || a        || yes              || ...
    yes        || yes               || b        || yes              || ...
    ...
    Once you've done this, you should be able to see what the main factors are which decide the differences in output, and refactor accordingly. (You can also use this technique to model what your future changes will do.)
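    Such a table can also live in the code as plain data. A minimal sketch, assuming made-up condition names (has_parent, tag) and made-up results (indent, newline_before); none of the rows come from the real tool:

```perl
use strict;
use warnings;

# Hypothetical rows: conditions on the left, formatting results on the right.
# A condition that is absent from a row means "don't care".
my @rules = (
    { when => { has_parent => 1, tag => 'td' }, then => { indent => 1, newline_before => 0 } },
    { when => { has_parent => 1 },              then => { indent => 1, newline_before => 1 } },
    { when => {},                               then => { indent => 0, newline_before => 0 } },
);

# Return the formatting of the first row whose conditions all match the facts.
sub formatting_for {
    my (%facts) = @_;
    ROW: for my $rule (@rules) {
        for my $key (keys %{ $rule->{when} }) {
            next ROW
                unless defined $facts{$key} && $facts{$key} eq $rule->{when}{$key};
        }
        return $rule->{then};
    }
    return;
}

my $fmt = formatting_for(has_parent => 1, tag => 'td');
print "indent=$fmt->{indent} newline_before=$fmt->{newline_before}\n";
```

    Rows are checked top to bottom, so the most specific ones go first and the empty row acts as the default. Changing the formatting then means editing a row rather than an if statement.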

    For example, if you find out that there are only 3 main output styles, then you can rewrite the subroutine to look at the inputs, and then call one of 3 subroutines (you could put them in a dispatch table in case you need more).
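    A dispatch table for that case might look like the sketch below. The three styles (block, inline, empty) and the tag lists are assumptions for illustration only; the real styles would fall out of your table.

```perl
use strict;
use warnings;

# Three hypothetical output styles; each takes a tag name and a depth.
my %writer_for = (
    block  => sub { my ($tag, $depth) = @_; "\n" . ("\t" x $depth) . "<$tag>\n" },
    inline => sub { my ($tag) = @_; "<$tag>" },
    empty  => sub { my ($tag, $depth) = @_; ("\t" x $depth) . "<$tag />\n" },
);

# Policy: look at the inputs and name a style. Assumed tag lists.
my %is_inline = map { $_ => 1 } qw(a b i em strong span);
my %is_empty  = map { $_ => 1 } qw(br hr img);

sub style_of {
    my ($tag) = @_;
    return 'empty'  if $is_empty{$tag};
    return 'inline' if $is_inline{$tag};
    return 'block';
}

# Mechanism: call whichever writer the policy picked.
sub open_tag {
    my ($tag, $depth) = @_;
    return $writer_for{ style_of($tag) }->($tag, $depth);
}

my $html = open_tag('b', 2) . open_tag('br', 1);
print $html;
```

    Adding a fourth style is then one new entry in %writer_for plus one line in style_of, with no change to the code that walks the tree.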

    Or, if you think the decision is more complex, you might want to create objects to decide how to output the code. For example, you could create NodeWriter::HasParent to write out nodes with parents. Maybe table cell nodes are handled slightly differently, so NodeWriter::HasParent::Td could inherit but override some methods. Then you can decide which object to create:

    sub prepareOutput {
        my $self = shift;
        my ($node) = @_;
        my $writer = $self->create_nodewriter($node);
        $self->[OUTPUT] .= $writer->write_output($node);
    }

    sub create_nodewriter {
        my $self = shift;
        my ($node) = @_;
        # e.g. NodeWriter::HasParent::Td for a <td> that has a parent
        my $subtype = $node->parent ? 'HasParent' : 'NoParent';
        my $tagtype = ucfirst $node->tag;
        my $class   = "NodeWriter::$subtype" . "::$tagtype";
        return $class->new();
    }
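    Note that the snippet above only works if a class exists for every tag. One way round that is to fall back to the parent class when no tag-specific one is defined. A self-contained sketch under that assumption: the write_output and indent signatures here are invented, and the factory takes a tag name and a flag instead of a real node so that it runs without HTML::TreeBuilder.

```perl
use strict;
use warnings;

# Base writer: default formatting for any node that has a parent.
package NodeWriter::HasParent;
sub new    { my ($class) = @_; return bless {}, $class }
sub indent { my ($self, $depth) = @_; return "\t" x $depth }
sub write_output {
    my ($self, $tag, $depth) = @_;
    return $self->indent($depth) . "<$tag>\n";
}

# Table cells inherit everything but override the indentation policy.
package NodeWriter::HasParent::Td;
our @ISA = ('NodeWriter::HasParent');
sub indent { return '' }

# Root nodes reuse the defaults in this sketch.
package NodeWriter::NoParent;
our @ISA = ('NodeWriter::HasParent');

package main;

# Pick the most specific class that exists, falling back to the subtype.
sub create_nodewriter {
    my ($tag, $has_parent) = @_;
    my $subtype = $has_parent ? 'HasParent' : 'NoParent';
    my $class   = "NodeWriter::$subtype";
    my $special = $class . '::' . ucfirst(lc $tag);
    $class = $special if $special->can('new');
    return $class->new;
}

my $out = create_nodewriter('td', 1)->write_output('td', 3)
        . create_nodewriter('p',  1)->write_output('p',  3);
print $out;
```

    Here the td writer prints its tag unindented, while p falls back to NodeWriter::HasParent and gets three tabs. The tree walker never needs to know which class it was handed.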

    In short what I am suggesting is: "separate policy from mechanism".

Output stage code (long)
by simon.proctor (Vicar) on Jun 20, 2003 at 10:42 UTC
    Here is the code. It's a 254-line function :).

    TAG - this is an object that contains the rules for controlling whether we can indent or not. It also provides one or two other convenience methods.

    $mod - if it isn't obvious, this controls our indentation level, so the more child tags we have, the more we indent (using tabs).