clinton has asked for the wisdom of the Perl Monks concerning the following question:

I've had a request to write a subclass of HTML::StripScripts so that it will return an XML::LibXML::DocumentFragment instead of straight HTML text.

HTML::StripScripts accepts tokens (e.g. from HTML::Parser), processes them and returns the HTML in string form. I'd like to be able to return XML::LibXML elements without libxml having to reparse the HTML.

Questions;

What I was thinking was something along these lines:

$self->{_HXS_parent} = $context eq 'Document'
    ? XML::LibXML::Document->new( $version, $encoding )
    : XML::LibXML::DocumentFragment->new();
and
sub output_element {
    my ( $self, $tag, $attributes, $content ) = @_;
    my $element = XML::LibXML::Element->new($tag);
    for my $key ( keys %$attributes ) {
        $element->setAttribute( $key, $attributes->{$key} );
    }
    # the DOM method is appendChild, not append_child
    $self->{_HXS_parent}->appendChild($element);
}
I know that this doesn't take into account the whole nesting thing, but what I'm really wanting to know is, would this be the fastest way to generate the document/fragment, or would it be faster to reparse the output in XS with $parser->parse_html_string? It would certainly be a whole lot simpler!
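One way to settle the speed question is to measure it. The following is only a minimal sketch, assuming a trivial one-element input; the sub names build_direct and reparse are made up for illustration, and real HTML::StripScripts output would be larger and could shift the balance either way.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use XML::LibXML;

my $parser = XML::LibXML->new;

# Option 1: build the nodes directly through the DOM API.
sub build_direct {
    my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $frag = $doc->createDocumentFragment;
    my $p    = $doc->createElement('p');
    $p->setAttribute( class => 'intro' );
    $p->appendText('Hello');
    $frag->appendChild($p);
    return $frag;
}

# Option 2: take the HTML string and let libxml2 reparse it.
sub reparse {
    return $parser->parse_html_string('<p class="intro">Hello</p>');
}

# Run each variant the same number of times and print the timings.
timethese( 5_000, { build_direct => \&build_direct, reparse => \&reparse } );
```

Note that parse_html_string returns a complete HTML document (with html and body wrappers), so the nodes of interest still have to be located afterwards, for example with findnodes('//p').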

thanks

Clint

Replies are listed 'Best First'.
Re: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts
by Anonymous Monk on Jun 27, 2007 at 06:40 UTC
    Hey Clint,

    The answer to your first question is that XML::LibXML will not parse tag soup in recover mode. Therefore, HTML::Parser is a better solution for dodgy HTML.

    The idea of XML::LibXML's recover mode is to recover malformed XML documents up to the point where non-XML code starts. This is usually a closing tag in a location where the parser would not expect it. In recover mode XML::LibXML will return the part that has been parsed successfully, instead of throwing an error. Anything that comes after that error is ignored.
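    A small sketch of that behaviour, assuming a made-up input string with a stray closing tag. The strict parser dies, while the recovering parser hands back a document; exactly how much of the input survives depends on the libxml2 version, so no output is claimed here.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up input: </oops> closes a tag that was never opened.
my $broken = '<root><a>kept</a></oops><c>after the error</c></root>';

# Default (strict) mode: parse_string throws an exception.
my $strict_doc = eval { XML::LibXML->new->parse_string($broken) };
my $strict_error = $@;
print "strict parser failed\n" if $strict_error;

# Recover mode: the parser returns what it could build before the error.
my $parser = XML::LibXML->new;
$parser->recover_silently(1);    # recover without printing warnings
my $recovered = $parser->parse_string($broken);
print $recovered->toString;
```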

    Your second question has several answers and is quite complex.

    DocumentFragments cannot exist without an owning document when using XML::LibXML.

    That means that you have to create a temporary document in order to create a document fragment. The following code line provides an example.

    my $tempdoc = XML::LibXML->createDocument($version,$encoding);
    my $docfrag = $tempdoc->createDocumentFragment();
    You should also use DOM functions to create element nodes, text nodes etc. instead of
    my $element = XML::LibXML::Element->new( $tag );
    The following line is the most generic way following the DOM paradigm.
    my $element = $self->{ _HXS_parent }->ownerDocument->createElement($tag);
    If $tempdoc is available, the statement can be simplified to the following line.
    my $element = $tempdoc->createElement($tag);
    As a result of this process either the result document or the document fragment will be returned.
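    Put together, the DOM approach might look like the sketch below. It is an illustration only: output_element here is a standalone helper with a made-up signature rather than the real HTML::StripScripts method, and the tag, attribute and text values are invented.

```perl
use strict;
use warnings;
use XML::LibXML;

# Temporary document that owns the fragment and every node created for it.
my $tempdoc = XML::LibXML::Document->new( '1.0', 'UTF-8' );
my $docfrag = $tempdoc->createDocumentFragment();

# Hypothetical helper: create an element through the owner document
# of $parent and append it to $parent.
sub output_element {
    my ( $parent, $tag, $attributes, $text ) = @_;
    my $element = $parent->ownerDocument->createElement($tag);
    for my $key ( sort keys %$attributes ) {
        $element->setAttribute( $key, $attributes->{$key} );
    }
    $element->appendText($text) if defined $text;
    $parent->appendChild($element);
    return $element;
}

output_element( $docfrag, 'a', { href => 'http://example.com/' }, 'example' );
print $docfrag->toString, "\n";
```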

    Although this is the answer to your second question, I would suggest a different approach. Instead of using DOM functions to create the result data, use a SAX pipeline to handle the HTML tokens. In that case all the tricky DOM building is left to the good old XML::LibXML::SAX::Builder which comes with XML::LibXML (or to any other SAX handler such as XML::SAX::Writer).

    The idea behind SAX is that any token is defined by two boundaries (which are the tags in XML). A SAX parser indicates these boundaries by sending "start_element" and "end_element" events to a SAX pipeline. When a SAX handler in the pipeline receives these events, the handler translates them into whatever is appropriate (e.g. create DOM nodes or write XML data to an IO stream). Although this sounds complex, for the parser side XML::SAX::Base does all the abstraction, so token handling can be reduced (more or less) to the following code:

    use XML::SAX::Base;
    use XML::LibXML::SAX::Builder;

    @MYCLASS::ISA = qw(XML::SAX::Base);

    sub static_handler {
        my ($self) = @_;
        $self->set_handler( XML::LibXML::SAX::Builder->new() );
    }

    sub result {
        my ($self) = @_;
        return $self->get_handler()->result();    # will return a DOM structure
    }

    sub found_token {
        my ( $self, $tag, $attributes ) = @_;
        $self->check_active_tokens();    # and end them if required
        my $e = _element($tag);
        for my $key ( keys %$attributes ) {
            _add_attrib( $e, $key, $attributes->{$key} );
        }
        $self->start_element($e);
        push @{ $self->{_TOKENS_} }, $tag;    # remember the tag so it can be ended later
    }

    sub check_active_tokens {
        my ($self) = @_;
        # this allows no nested tags.
        # tag soup logic should end up in this function
        foreach my $tag ( reverse @{ $self->{_TOKENS_} || [] } ) {
            $self->end_element( _element( $tag, 1 ) );
        }
        $self->{_TOKENS_} = [];
    }

    # the following functions I stole from Matt's code ;)
    # it simplifies the handling of SAX data structures for
    # those cases that don't require namespace handling.
    sub _element {
        my ( $name, $end ) = @_;
        return {
            Name      => $name,
            LocalName => $name,
            $end ? () : ( Attributes => {} ),
            NamespaceURI => '',
            Prefix       => '',
        };
    }

    sub _add_attrib {
        my ( $el, $name, $value ) = @_;
        $el->{Attributes}{"{}$name"} = {
            Name         => $name,
            LocalName    => $name,
            Prefix       => "",
            NamespaceURI => '',
            Value        => $value,
        };
        return $el;
    }
    Of course the application logic of the final code will be more complex. But this outline shows the principle.
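    For a version that runs end to end, here is a minimal sketch of the same idea with nesting handled by an explicit stack. The class name My::TokenDriver and its open_tag, text and close_tag methods are hypothetical; only the XML::SAX::Base event methods and XML::LibXML::SAX::Builder come from the modules themselves.

```perl
package My::TokenDriver;    # hypothetical class name
use strict;
use warnings;
use base 'XML::SAX::Base';

# Build a namespace-free SAX element structure, with optional attributes.
sub _element {
    my ( $name, $attributes ) = @_;
    my %sax_attrs;
    for my $key ( keys %{ $attributes || {} } ) {
        $sax_attrs{"{}$key"} = {
            Name         => $key,
            LocalName    => $key,
            Prefix       => '',
            NamespaceURI => '',
            Value        => $attributes->{$key},
        };
    }
    return {
        Name         => $name,
        LocalName    => $name,
        Prefix       => '',
        NamespaceURI => '',
        Attributes   => \%sax_attrs,
    };
}

sub open_tag {
    my ( $self, $tag, $attributes ) = @_;
    push @{ $self->{open_tags} }, $tag;    # remember what to close later
    $self->start_element( _element( $tag, $attributes ) );
}

sub text {
    my ( $self, $text ) = @_;
    $self->characters( { Data => $text } );
}

sub close_tag {
    my ($self) = @_;
    my $tag = pop @{ $self->{open_tags} };
    $self->end_element( _element($tag) );
}

package main;
use XML::LibXML;
use XML::LibXML::SAX::Builder;

my $builder = XML::LibXML::SAX::Builder->new;
my $driver  = My::TokenDriver->new( Handler => $builder );

$driver->start_document( {} );
$driver->open_tag( 'p', { class => 'demo' } );
$driver->open_tag('i');
$driver->text('Italics ');
$driver->open_tag('b');
$driver->text('plus bold');
$driver->close_tag for 1 .. 3;    # closes b, i, p in that order
$driver->end_document( {} );

my $dom = $builder->result;       # DOM built by the SAX handler
print $dom->toString;
```

    The driver never touches DOM nodes itself; swapping the Handler for another SAX handler would change the output format without changing the token handling.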

    The advantage of this approach is that you can avoid most recursions that are necessary for DOM building from semistructured data. For instance, I use this approach to parse WIKI code into DOM structures.

    Christian
    --
    PHISH @ CPAN

      Christian - many thanks for the detailed reply. That's brilliant - exactly what I was after. Building the tree in a stream would fit well with how HTML::StripScripts works at the moment.

      It's obvious that my XML::LibXML experience is pretty basic, and my XML::SAX experience even less, so I'm very grateful for the direction.

      Clint

      While trying to implement this with XML::SAX::Builder, I ran into a problem.

      Parsing a SAX stream looks like this:

      <p> <i> Italics <b> plus bold </b> </i> </p>

      start_document
      start_element : p
      start_element : i
      characters : Italics
      start_element : b
      characters : plus bold
      end_element : b
      end_element : i
      end_element : p
      end_document

      The HSS stream looks like this:

      start_document
      start_element : b
      content : plus Bold
      end_element : b
      start_element : i
      content : Italics
      content : <b>plus Bold</b>
      end_element : i
      start_element : p
      content : <i>Italics <b>plus Bold</b></i>
      end_element : p
      end_document

      The reason for that is my tag callbacks, which give you the ability to change a tag and its contents, delete the tag and/or its contents, etc. The callback needs access to the contents/child nodes.

      So the tree is built from the leaves up (children before their parents), rather than from the root down in the traditional manner.
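      A leaves-first build maps naturally onto plain DOM calls, because each element can be created once its children already exist. The sketch below is an illustration under that assumption and not the code from the RFC; the helper name wrap is made up.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML::Document->new( '1.0', 'UTF-8' );

# Hypothetical helper: wrap already-built children in a new element.
# Plain strings become text nodes, node objects are appended as they are.
sub wrap {
    my ( $tag, @children ) = @_;
    my $element = $doc->createElement($tag);
    for my $child (@children) {
        $element->appendChild( ref $child ? $child : $doc->createTextNode($child) );
    }
    return $element;
}

# Same order as the HSS stream: innermost element first.
my $bold    = wrap( 'b', 'plus Bold' );
my $italics = wrap( 'i', 'Italics ', $bold );
my $para    = wrap( 'p', $italics );

print $para->toString, "\n";
```

      A tag callback could inspect or replace the child nodes before wrap is called for the parent, which is the access the callbacks need.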

      I've implemented a solution, which I have posted here: RFC: HTML::StripScripts::LibXML

      Clint

Re: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts
by ForgotPasswordAgain (Vicar) on Jun 27, 2007 at 09:14 UTC

    I don't see why it matters which way is faster. Do you have complaints that it's too slow? It looks like HTML::StripScripts itself already adds a whole lot of overhead.

      Do you have complaints that it's too slow? It looks like HTML::StripScripts itself already adds a whole lot of overhead.

      No complaints yet, and I'm aware of the overhead required. But when adding a new feature, I'd like to add it in the most efficient way possible.

      Clint