comment on

As explained in this node, I was asked to add an interface to HTML::StripScripts so that it would return either an XML::LibXML::Document object or an XML::LibXML::DocumentFragment object (for a snippet of HTML)

I've done this by subclassing HTML::StripScripts::Parser and overriding the output callbacks in HTML::StripScripts. So the flow is as follows:


   HTML::StripScripts::Parser
     --> uses HTML::Parser to tokenise the HTML
         --> uses callbacks in HTML::StripScripts to filter the XSS an
+d tidy the HTML
             --> uses callbacks in HTML::StripScripts::LibXML to build
+ a DOM tree
[download]

The code works, it passes the tests for HTML::StripScripts::Parser (with a couple of modifications, eg  in the XML->toString output is represented as ).

The questions

I don't do much with XML, so I would appreciate feedback from XML-ers. My questions are:

I currently accept no options to pass to LibXML. What should I accept? What would be useful? Encoding?
See sub output_text in the code. This sub receives an HTML-entity encoded string. Currently I decode &, <, >, ", ' before calling $document->createTextNode($text). Everything else gets passed straight through. Should I be doing something with Encode here, for instance accepting a source character set and a destination character set, and applying a conversion both here and to attribute values?
For purposes of documentation, how do I refer to what is returned? A DOM tree? An XML document?
Should this module be called HTML::StripScripts::LibXML/XML/DOM?
As merlyn said in the CB, an empty  tag isn't valid because what is a paragraph without content. Should I be deleting tags that are empty but aren't defined as empty (eg   or <img />?

Tag callbacks (ie a sub that is called every time a particular tag is found) are different for this module.

In HTML::StripScripts:



   $hss = HTML::StripScripts->new({
       Rules => { a => \&a_callback }
   });

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          content  => 'Go to <b>Home</b> page',
        #       }
        return 1;
    }
[download]

In HTML::StripScripts::LibXML:


   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          children => [
        #                        XML::LibXML::Text -->    'Go to ',
        #                        XML::LibXML::Element --> 'b'
        #                           with child Text -->   'Home',
        #                        XML::LibXML::Text -->    ' page',
        #                      ],
        #       }
        return 1;
    }
[download]

The reason I don't create the element and add its children before sending it to the callback is that the callback has the option of deleting the tag, which would just remove this tag, but preserve its contents/children.

Would it make more sense to create the element regardless, set its attributes, add the children and then send it to the callback. And if the callback decides to delete the tag, then the children would just get added to the parent element?

The code

I have posted the code below, ~~but it requires a newer version of HTML::StripScripts than the one currently on CPAN~~ which works with the current version of HTML::StripScripts on CPAN. Also, just uploaded this module to CPAN as HTML::StripScripts::LibXML - available as soon as your mirror syncs.

package HTML::StripScripts::LibXML;
use strict;

use vars qw($VERSION);
$VERSION = '0.10';

=head1 NAME

HTML::StripScripts::LibXML - XSS filter -  outputs a LibXML Document o
+r DocumentFragment

=head1 SYNOPSIS

  use HTML::StripScripts::LibXML();

  my $hss = HTML::StripScripts::LibXML->new(

       {
           Context => 'Document',       ## HTML::StripScripts configur
+ation
           Rules   => { ... },
       },

       strict_comment => 1,             ## HTML::Parser options
       strict_names   => 1,

  );

  $hss->parse_file("foo.html");
  $xml_doc = $hss->filtered_document;

  OR

  $xml_doc = $hss->filter_html($html);

=head1 DESCRIPTION

This class provides an easy interface to C<HTML::StripScripts>, using
C<HTML::Parser> to parse the HTML, and returns an XML::LibXML::Documen
+t
or XML::LibXML::DocumentFragment.

See L<HTML::Parser> for details of how to customise how the raw HTML i
+s parsed
into tags, and L<HTML::StripScripts> for details of how to customise t
+he way
those tags are filtered. This module is a subclass of
L<HTML::StripScripts::Parser>.

=cut

=head1 DIFFERENCES FROM HTML::StripScripts

=over

=item CONTEXT

HTML::StripScripts::LibXML still allows you to specify the C<Context> 
+of the
HTML (Document, Flow, Inline, NoTags). If C<Context> is C<Document>, t
+hen it
returns an C<XML::LibXML::Document> object, otherwise it returns an
C<XML::LibXML::DocumentFragment> object.

=item TAG CALLBACKS

HTML::StripScripts allows you to use tag callbacks, for instance:

   $hss = HTML::StripScripts->new({
       Rules => { a => \&a_callback }
   });

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          content  => 'Go to <b>Home</b> page',
        #       }
        return 1;
    }

HTML::StripScripts::LibXML still gives you tag callbacks,  but they lo
+ok like
this:

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          children => [
        #                        XML::LibXML::Text -->    'Go to ',
        #                        XML::LibXML::Element --> 'b'
        #                           with child Text -->   'Home',
        #                        XML::LibXML::Text -->    ' page',
        #                      ],
        #       }
        return 1;
    }

=item SUBCLASSING

The subs C<output>, C<output_start> and C<output_end> are not called. 
+ Instead,
this module uses C<output_stack_entry> which handles the tag callback,
+ (and
depending on the result of the tag callback) creates an element and ad
+ds
its child nodes. Then it adds the element to the list of children for 
+the
parent tag.

=back

=head1 CONSTRUCTORS

=over

=item new ( {CONFIG}, [PARSER_OPTIONS]  )

Creates a new C<HTML::StripScripts::LibXML> object.

See L<HTML::StripScripts::Parser> for details.

=back

=cut

use base 'HTML::StripScripts::Parser';
use XML::LibXML();
use HTML::Entities();

#===================================
sub output_start_document {
#===================================
    my ($self) = @_;
    $self->{_hsxXML} = XML::LibXML::Document->new();
    return;
}

#===================================
sub output_end_document {
#===================================
    my ($self)   = @_;
    my $top      = $self->{_hssStack}[0];
    my $document = delete $self->{_hsxXML};

    if ( $top->{CTX} ne 'Document' ) {
        $document = $document->createDocumentFragment();
    }

    foreach my $child ( @{ $top->{CHILDREN} } ) {
        $document->addChild($child);
    }
    $top->{CONTENT} = $document;
    return;
}

#===================================
sub output_start { }
*output_end         = \&output_start;
*output_declaration = \&output_start;
*output_process     = \&output_start;
*output             = \&output_start;
#===================================

my $Entities = { 'amp'  => '&',
                 'lt'   => '<',
                 'gt'   => '>',
                 'quot' => '"',
                 '#39'  => "'",
};
#===================================
sub output_text {
#===================================
    my ( $self, $text ) = @_;
    HTML::Entities::_decode_entities( $text, $Entities );
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createTextNode($text);
    return;
}

#===================================
sub output_comment {
#===================================
    my ( $self, $comment ) = @_;
    $comment =~ s/^\s*<!--//g;
    $comment =~ s/-->\s*$//g;
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createComment($comment);
    return;
}

#===================================
sub output_stack_entry {
#===================================
    my ( $self, $tag ) = @_;

    my %entry;
    $tag->{CHILDREN} ||= [];
    @entry{qw(tag attr children)} = @{$tag}{qw(NAME ATTR CHILDREN)};

    if ( my $tag_callback = $tag->{CALLBACK} ) {
        $tag_callback->( $self, \%entry )
            or return;
    }

    if ( my $tagname = $entry{tag} ) {
        my $element = $self->{_hsxXML}->createElement($tagname);
        my $attrs   = $entry{attr};
        foreach my $name ( sort keys %$attrs ) {
            $element->setAttribute( $name => $attrs->{$name} );
        }
        unless ( $tag->{CTX} eq 'EMPTY' ) {
            foreach my $children ( @{ $entry{children} } ) {
                $element->addChild($children);
            }
        }
        push @{ $self->{_hssStack}[0]{CHILDREN} }, $element;
    }
    else {
        push @{ $self->{_hssStack}[0]{CHILDREN} }, @{ $entry{children}
+ };
    }
    $tag->{CHILDREN} = [];
}

=head1 BUGS AND LIMITATIONS

=over

=item API - BETA

This is the first draft of this module, and currently there are no con
+figuration
options for the XML. I would welcome feedback from XML users as to how
+ I could
improve the interface.

For this reason, the API may change.

=back

=head1 SEE ALSO

L<HTML::Parser>, L<HTML::StripScripts::Parser>,
L<HTML::StripScripts::Regex>

=head1 AUTHOR

Clinton Gormley E<lt>clint@traveljury.comE<gt>

=head1 COPYRIGHT

Copyright (C) 2007 Clinton Gormley.  All Rights Reserved.

=head1 LICENSE

This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1;
[download]

UPDATE: Uploaded this module to CPAN as HTML::StripScripts::LibXML, and removed the requirement for a new version of HTML::StripScripts.

In reply to RFC: HTML::StripScripts::LibXML by clinton

Posts are HTML formatted. Put   tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.