Returning an XML::LibXML::DocumentFragment from HTML::StripScripts

clinton has asked for the wisdom of the Perl Monks concerning the following question:

I've had a request to write a subclass of HTML::StripScripts so that it will return an XML::LibXML::DocumentFragment instead of straight HTML text.

HTML::StripScripts accepts tokens (e.g. from HTML::Parser), processes them and returns the HTML in string form. I'd like to be able to return XML::LibXML elements without libxml having to reparse the HTML.

Questions:

Would parsing dodgy HTML be any more flexible with HTML::Parser than with XML::LibXML in recover mode?
Assuming I use HTML::Parser to tokenise the HTML, what is the best way to return XML::LibXML::Element objects without having to reparse the returned HTML?

What I was thinking was along these lines:


    $self->{_HXS_parent} = $context eq 'Document'
                        ? XML::LibXML::Document->new($version,$encoding)
                        : XML::LibXML::DocumentFragment->new();

and


    sub output_element {
        my ( $self, $tag, $attributes, $content ) = @_;
        my $element = XML::LibXML::Element->new( $tag );

        for my $key ( keys %$attributes ) {
            $element->setAttribute( $key, $attributes->{$key} );
        }
        # XML::LibXML uses DOM-style camelCase method names
        $self->{ _HXS_parent }->appendChild( $element );
    }

I know that this doesn't take into account the whole nesting thing, but what I'm really wanting to know is, would this be the fastest way to generate the document/fragment, or would it be faster to reparse the output in XS with $parser->parse_html_string? It would certainly be a whole lot simpler!
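
For comparison, the reparse route might look roughly like the sketch below. It assumes XML::LibXML is installed; `$clean_html` is a hypothetical stand-in for the string that HTML::StripScripts returns, hard-coded here so the snippet runs on its own.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the string HTML::StripScripts would return.
my $clean_html = '<p>Italics <b>plus bold</b></p>';

my $parser = XML::LibXML->new;
$parser->recover(1);    # tolerate markup errors instead of dying

# libxml2's HTML parser wraps the content in <html><body>...</body></html>.
my $doc = $parser->parse_html_string($clean_html);

# Move the children of <body> into a fragment owned by the same document.
my $frag = $doc->createDocumentFragment;
my ($body) = $doc->findnodes('/html/body');
$frag->appendChild($_) for $body->childNodes;

print $frag->toString, "\n";
```

This keeps all tree building inside libxml2, at the cost of serialising the HTML and parsing it a second time.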

thanks

Clint


Re: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts by Anonymous Monk on Jun 27, 2007 at 06:40 UTC
Hey Clint,

The answer to your first question is that XML::LibXML will not parse tag soup in recover mode. Therefore, HTML::Parser is a better solution for dodgy HTML. The idea of XML::LibXML's recover mode is to recover malformed XML documents up to the point where non-XML code starts. This is usually a closing tag in a location where the parser would not expect it. In recover mode XML::LibXML will return the content that has been parsed successfully, instead of throwing an error. Anything that comes after that error is ignored.

Your second question has several answers and is quite complex. DocumentFragments cannot exist without an owning document when using XML::LibXML. That means that you have to create a temporary document in order to create a document fragment. The following lines provide an example.

    my $tempdoc = XML::LibXML->createDocument($version,$encoding);
    my $docfrag = $tempdoc->createDocumentFragment();

You should also use DOM functions to create element nodes, text nodes etc. instead of

    my $element = XML::LibXML::Element->new( $tag );

The following line is the most generic way following the DOM paradigm.

    my $element = $self->{ _HXS_parent }->ownerDocument->createElement($tag);

If $tempdoc is available, the statement can be simplified to the following line.

    my $element = $tempdoc->createElement($tag);

As a result of this process either the result document or the document fragment will be returned.

Although this is the answer to your second question, I would suggest a different approach. Instead of using DOM functions to create the result data, use a SAX pipeline to handle the HTML tokens. In that case all the tricky DOM building is left to the good old XML::LibXML::SAX::Builder which comes with XML::LibXML (or to any other SAX handler such as XML::SAX::Writer).

The idea behind SAX is that any token is defined by two boundaries (which are the tags in XML). A SAX parser indicates these boundaries by sending "start_element" and "end_element" events to a SAX pipeline. When a SAX handler in the pipeline receives these events, the handler translates them into whatever is appropriate (e.g. create DOM nodes or write XML data to an IO stream). Although this sounds complex, for the parser side XML::SAX::Base does all the abstraction, so token handling can be reduced (more or less) to the following code:

    package MYCLASS;
    use XML::SAX::Base;
    use XML::LibXML::SAX::Builder;
    @MYCLASS::ISA = qw(XML::SAX::Base);

    sub static_handler {
        my ($self) = @_;
        $self->set_handler(XML::LibXML::SAX::Builder->new());
    }

    sub result {
        my ($self) = @_;
        $self->get_handler()->result();    # will return a DOM structure
    }

    sub found_token {
        my ($self, $tag, $attributes) = @_;

        $self->check_active_tokens();      # and end them if required

        my $e = _element($tag);
        for my $key ( keys %$attributes ) {
            _add_attrib($e, $key, $attributes->{$key} );
        }

        $self->start_element($e);
    }

    sub check_active_tokens {
        my ($self) = @_;
        # this allows no nested tags.
        # tag soup logic should end up in this function
        foreach my $tag ( @{$self->{_TOKENS_}} ) {
            $self->end_element( _element($tag, 1) );
        }
    }

    # the following functions I stole from Matt's code ;)
    # it simplifies the handling of SAX data structures for
    # those cases that don't require namespace handling.
    sub _element {
        my ($name, $end) = @_;
        return {
            Name         => $name,
            LocalName    => $name,
            $end ? () : (Attributes => {}),
            NamespaceURI => '',
            Prefix       => '',
        };
    }

    sub _add_attrib {
        my ($el, $name, $value) = @_;
        $el->{Attributes}{"{}$name"} = {
            Name         => $name,
            LocalName    => $name,
            Prefix       => "",
            NamespaceURI => '',
            Value        => $value,
        };
        return $el;
    }

Of course the application logic of the final code will be more complex. But this outline shows the principle.

The advantage of this approach is that you can avoid most of the recursion that is necessary for DOM building from semistructured data. For instance, I use this approach to parse WIKI code into DOM structures.

Christian -- PHISH @ CPAN
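
The event flow described above can be tried out by driving XML::LibXML::SAX::Builder by hand. This is a minimal sketch, not the final subclass: it assumes XML::LibXML (with its SAX support) is installed, reuses the two namespace-free helpers from the outline, and hard-codes a tiny event sequence where a real implementation would translate tokens from HTML::Parser.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Builder;

# Build a minimal, namespace-free SAX element structure.
sub _element {
    my ($name, $end) = @_;
    return {
        Name         => $name,
        LocalName    => $name,
        $end ? () : (Attributes => {}),
        NamespaceURI => '',
        Prefix       => '',
    };
}

# Attach one namespace-free attribute to a SAX element structure.
sub _add_attrib {
    my ($el, $name, $value) = @_;
    $el->{Attributes}{"{}$name"} = {
        Name         => $name,
        LocalName    => $name,
        Prefix       => '',
        NamespaceURI => '',
        Value        => $value,
    };
    return $el;
}

my $builder = XML::LibXML::SAX::Builder->new;

# Hand-written events for: <p class="intro">Italics <b>plus bold</b></p>
$builder->start_document({});
$builder->start_element( _add_attrib( _element('p'), class => 'intro' ) );
$builder->characters( { Data => 'Italics ' } );
$builder->start_element( _element('b') );
$builder->characters( { Data => 'plus bold' } );
$builder->end_element( _element( 'b', 1 ) );
$builder->end_element( _element( 'p', 1 ) );
$builder->end_document({});

my $doc = $builder->result;    # the DOM built from the events
print $doc->documentElement->toString, "\n";
```

Each start_element must be matched by an end_element in strict nesting order, which is why any tag-soup repair has to happen before the events reach the builder.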
Re^2: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts by clinton (Priest) on Jun 27, 2007 at 07:32 UTC
Christian - many thanks for the detailed reply. That's brilliant - exactly what I was after. Building the tree in a stream would fit well with how HTML::StripScripts works at the moment.

It's obvious that my XML::LibXML experience is pretty basic, and my XML::SAX experience even less, so I'm very grateful for the direction.

Clint
Re^2: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts by clinton (Priest) on Jul 01, 2007 at 20:12 UTC
While trying to implement this with XML::SAX::Builder, I ran into a problem. Parsing a SAX stream looks like this:

    <p> <i> Italics <b> plus bold </b> </i> </p>

    start_document
    start_element : p
    start_element : i
    characters    : Italics
    start_element : b
    characters    : plus bold
    end_element   : b
    end_element   : i
    end_element   : p
    end_document

The HSS stream looks like this:

    start_document
    start_element : b
    content       : plus Bold
    end_element   : b
    start_element : i
    content       : Italics
    content       : <b>plus Bold</b>
    end_element   : i
    start_element : p
    content       : <i>Italics <b>plus Bold</b></i>
    end_element   : p
    end_document

The reason for that is my tag callbacks, which give you the ability to change a tag and its contents, delete the tag and/or its contents, and so on. The callback needs access to the contents/child nodes, so the tree is built from the leaves up, rather than in the traditional top-down manner.

I've implemented a solution, which I have posted here: RFC: HTML::StripScripts::LibXML

Clint
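
The leaves-first order described above maps more naturally onto plain DOM calls than onto a SAX stream. The following is only an illustrative sketch, assuming XML::LibXML is installed; it is not the code from HTML::StripScripts::LibXML, and `wrap` is a hypothetical helper invented for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML::Document->new( '1.0', 'UTF-8' );

# Hypothetical helper: wrap already-built children in a new element.
sub wrap {
    my ( $tag, @children ) = @_;
    my $el = $doc->createElement($tag);
    for my $child (@children) {
        # Plain strings become text nodes; existing nodes are appended as-is.
        if ( ref $child ) { $el->appendChild($child) }
        else              { $el->appendText($child) }
    }
    return $el;
}

# Innermost element first, mirroring the order of the HSS stream.
my $bold   = wrap( 'b', 'plus Bold' );
my $italic = wrap( 'i', 'Italics ', $bold );
my $para   = wrap( 'p', $italic );

my $frag = $doc->createDocumentFragment;
$frag->appendChild($para);
print $frag->toString, "\n";
```

Because each child already exists as a node when its parent is created, a tag callback could inspect or replace it before it is attached.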
Re: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts by ForgotPasswordAgain (Vicar) on Jun 27, 2007 at 09:14 UTC
I don't see why it matters which way is faster. Do you have complaints that it's too slow? It looks like HTML::StripScripts itself already adds a whole lot of overhead.
Re^2: Returning an XML::LibXML::DocumentFragment from HTML::StripScripts by clinton (Priest) on Jun 27, 2007 at 09:19 UTC
"Do you have complaints that it's too slow? It looks like HTML::StripScripts itself already adds a whole lot of overhead."

No complaints yet, and I'm aware of the overhead required. But when adding a new feature, I'd like to add it in the most efficient way possible.

Clint