As explained in this node, I was asked to add an interface to HTML::StripScripts so that it would return either an XML::LibXML::Document object or an XML::LibXML::DocumentFragment object (for a snippet of HTML)

I've done this by subclassing HTML::StripScripts::Parser and overriding the output callbacks in HTML::StripScripts. So the flow is as follows:

HTML::StripScripts::Parser --> uses HTML::Parser to tokenise the HTML --> uses callbacks in HTML::StripScripts to filter the XSS an +d tidy the HTML --> uses callbacks in HTML::StripScripts::LibXML to build + a DOM tree
The code works, it passes the tests for HTML::StripScripts::Parser (with a couple of modifications, eg <b></b> in the XML->toString output is represented as <b/>).

The questions

I don't do much with XML, so I would appreciate feedback from XML-ers. My questions are:

The code

I have posted the code below, but it requires a newer version of HTML::StripScripts than the one currently on CPAN which works with the current version of HTML::StripScripts on CPAN. Also, just uploaded this module to CPAN as HTML::StripScripts::LibXML - available as soon as your mirror syncs.
package HTML::StripScripts::LibXML; use strict; use vars qw($VERSION); $VERSION = '0.10'; =head1 NAME HTML::StripScripts::LibXML - XSS filter - outputs a LibXML Document o +r DocumentFragment =head1 SYNOPSIS use HTML::StripScripts::LibXML(); my $hss = HTML::StripScripts::LibXML->new( { Context => 'Document', ## HTML::StripScripts configur +ation Rules => { ... }, }, strict_comment => 1, ## HTML::Parser options strict_names => 1, ); $hss->parse_file("foo.html"); $xml_doc = $hss->filtered_document; OR $xml_doc = $hss->filter_html($html); =head1 DESCRIPTION This class provides an easy interface to C<HTML::StripScripts>, using C<HTML::Parser> to parse the HTML, and returns an XML::LibXML::Documen +t or XML::LibXML::DocumentFragment. See L<HTML::Parser> for details of how to customise how the raw HTML i +s parsed into tags, and L<HTML::StripScripts> for details of how to customise t +he way those tags are filtered. This module is a subclass of L<HTML::StripScripts::Parser>. =cut =head1 DIFFERENCES FROM HTML::StripScripts =over =item CONTEXT HTML::StripScripts::LibXML still allows you to specify the C<Context> +of the HTML (Document, Flow, Inline, NoTags). If C<Context> is C<Document>, t +hen it returns an C<XML::LibXML::Document> object, otherwise it returns an C<XML::LibXML::DocumentFragment> object. =item TAG CALLBACKS HTML::StripScripts allows you to use tag callbacks, for instance: $hss = HTML::StripScripts->new({ Rules => { a => \&a_callback } }); sub a_callback { my ($filter,$element) = @_; # where $element = { # tag => 'a', # attr => { href => '/index.html' }, # content => 'Go to <b>Home</b> page', # } return 1; } HTML::StripScripts::LibXML still gives you tag callbacks, but they lo +ok like this: sub a_callback { my ($filter,$element) = @_; # where $element = { # tag => 'a', # attr => { href => '/index.html' }, # children => [ # XML::LibXML::Text --> 'Go to ', # XML::LibXML::Element --> 'b' # with child Text --> 'Home', # XML::LibXML::Text --> ' page', # ], # } return 1; } =item SUBCLASSING The subs C<output>, C<output_start> and C<output_end> are not called. + Instead, this module uses C<output_stack_entry> which handles the tag callback, + (and depending on the result of the tag callback) creates an element and ad +ds its child nodes. Then it adds the element to the list of children for +the parent tag. =back =head1 CONSTRUCTORS =over =item new ( {CONFIG}, [PARSER_OPTIONS] ) Creates a new C<HTML::StripScripts::LibXML> object. See L<HTML::StripScripts::Parser> for details. =back =cut use base 'HTML::StripScripts::Parser'; use XML::LibXML(); use HTML::Entities(); #=================================== sub output_start_document { #=================================== my ($self) = @_; $self->{_hsxXML} = XML::LibXML::Document->new(); return; } #=================================== sub output_end_document { #=================================== my ($self) = @_; my $top = $self->{_hssStack}[0]; my $document = delete $self->{_hsxXML}; if ( $top->{CTX} ne 'Document' ) { $document = $document->createDocumentFragment(); } foreach my $child ( @{ $top->{CHILDREN} } ) { $document->addChild($child); } $top->{CONTENT} = $document; return; } #=================================== sub output_start { } *output_end = \&output_start; *output_declaration = \&output_start; *output_process = \&output_start; *output = \&output_start; #=================================== my $Entities = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', '#39' => "'", }; #=================================== sub output_text { #=================================== my ( $self, $text ) = @_; HTML::Entities::_decode_entities( $text, $Entities ); push @{ $self->{_hssStack}[0]{CHILDREN} }, $self->{_hsxXML}->createTextNode($text); return; } #=================================== sub output_comment { #=================================== my ( $self, $comment ) = @_; $comment =~ s/^\s*<!--//g; $comment =~ s/-->\s*$//g; push @{ $self->{_hssStack}[0]{CHILDREN} }, $self->{_hsxXML}->createComment($comment); return; } #=================================== sub output_stack_entry { #=================================== my ( $self, $tag ) = @_; my %entry; $tag->{CHILDREN} ||= []; @entry{qw(tag attr children)} = @{$tag}{qw(NAME ATTR CHILDREN)}; if ( my $tag_callback = $tag->{CALLBACK} ) { $tag_callback->( $self, \%entry ) or return; } if ( my $tagname = $entry{tag} ) { my $element = $self->{_hsxXML}->createElement($tagname); my $attrs = $entry{attr}; foreach my $name ( sort keys %$attrs ) { $element->setAttribute( $name => $attrs->{$name} ); } unless ( $tag->{CTX} eq 'EMPTY' ) { foreach my $children ( @{ $entry{children} } ) { $element->addChild($children); } } push @{ $self->{_hssStack}[0]{CHILDREN} }, $element; } else { push @{ $self->{_hssStack}[0]{CHILDREN} }, @{ $entry{children} + }; } $tag->{CHILDREN} = []; } =head1 BUGS AND LIMITATIONS =over =item API - BETA This is the first draft of this module, and currently there are no con +figuration options for the XML. I would welcome feedback from XML users as to how + I could improve the interface. For this reason, the API may change. =back =head1 SEE ALSO L<HTML::Parser>, L<HTML::StripScripts::Parser>, L<HTML::StripScripts::Regex> =head1 AUTHOR Clinton Gormley E<lt>clint@traveljury.comE<gt> =head1 COPYRIGHT Copyright (C) 2007 Clinton Gormley. All Rights Reserved. =head1 LICENSE This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut 1;
UPDATE: Uploaded this module to CPAN as HTML::StripScripts::LibXML, and removed the requirement for a new version of HTML::StripScripts.

In reply to RFC: HTML::StripScripts::LibXML by clinton

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.