RFC: HTML::StripScripts::LibXML

As explained in this node, I was asked to add an interface to HTML::StripScripts so that it would return either an XML::LibXML::Document object or an XML::LibXML::DocumentFragment object (for a snippet of HTML)

I've done this by subclassing HTML::StripScripts::Parser and overriding the output callbacks in HTML::StripScripts. So the flow is as follows:


   HTML::StripScripts::Parser
     --> uses HTML::Parser to tokenise the HTML
         --> uses callbacks in HTML::StripScripts to filter the XSS an
+d tidy the HTML
             --> uses callbacks in HTML::StripScripts::LibXML to build
+ a DOM tree
[download]

The code works, it passes the tests for HTML::StripScripts::Parser (with a couple of modifications, eg  in the XML->toString output is represented as ).

The questions

I don't do much with XML, so I would appreciate feedback from XML-ers. My questions are:

I currently accept no options to pass to LibXML. What should I accept? What would be useful? Encoding?
See sub output_text in the code. This sub receives an HTML-entity encoded string. Currently I decode &, <, >, ", ' before calling $document->createTextNode($text). Everything else gets passed straight through. Should I be doing something with Encode here, for instance accepting a source character set and a destination character set, and applying a conversion both here and to attribute values?
For purposes of documentation, how do I refer to what is returned? A DOM tree? An XML document?
Should this module be called HTML::StripScripts::LibXML/XML/DOM?
As merlyn said in the CB, an empty  tag isn't valid because what is a paragraph without content. Should I be deleting tags that are empty but aren't defined as empty (eg   or <img />?

Tag callbacks (ie a sub that is called every time a particular tag is found) are different for this module.

In HTML::StripScripts:



   $hss = HTML::StripScripts->new({
       Rules => { a => \&a_callback }
   });

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          content  => 'Go to <b>Home</b> page',
        #       }
        return 1;
    }
[download]

In HTML::StripScripts::LibXML:


   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          children => [
        #                        XML::LibXML::Text -->    'Go to ',
        #                        XML::LibXML::Element --> 'b'
        #                           with child Text -->   'Home',
        #                        XML::LibXML::Text -->    ' page',
        #                      ],
        #       }
        return 1;
    }
[download]

The reason I don't create the element and add its children before sending it to the callback is that the callback has the option of deleting the tag, which would just remove this tag, but preserve its contents/children.

Would it make more sense to create the element regardless, set its attributes, add the children and then send it to the callback. And if the callback decides to delete the tag, then the children would just get added to the parent element?

The code

I have posted the code below, ~~but it requires a newer version of HTML::StripScripts than the one currently on CPAN~~ which works with the current version of HTML::StripScripts on CPAN. Also, just uploaded this module to CPAN as HTML::StripScripts::LibXML - available as soon as your mirror syncs.

package HTML::StripScripts::LibXML;
use strict;

use vars qw($VERSION);
$VERSION = '0.10';

=head1 NAME

HTML::StripScripts::LibXML - XSS filter -  outputs a LibXML Document o
+r DocumentFragment

=head1 SYNOPSIS

  use HTML::StripScripts::LibXML();

  my $hss = HTML::StripScripts::LibXML->new(

       {
           Context => 'Document',       ## HTML::StripScripts configur
+ation
           Rules   => { ... },
       },

       strict_comment => 1,             ## HTML::Parser options
       strict_names   => 1,

  );

  $hss->parse_file("foo.html");
  $xml_doc = $hss->filtered_document;

  OR

  $xml_doc = $hss->filter_html($html);

=head1 DESCRIPTION

This class provides an easy interface to C<HTML::StripScripts>, using
C<HTML::Parser> to parse the HTML, and returns an XML::LibXML::Documen
+t
or XML::LibXML::DocumentFragment.

See L<HTML::Parser> for details of how to customise how the raw HTML i
+s parsed
into tags, and L<HTML::StripScripts> for details of how to customise t
+he way
those tags are filtered. This module is a subclass of
L<HTML::StripScripts::Parser>.

=cut

=head1 DIFFERENCES FROM HTML::StripScripts

=over

=item CONTEXT

HTML::StripScripts::LibXML still allows you to specify the C<Context> 
+of the
HTML (Document, Flow, Inline, NoTags). If C<Context> is C<Document>, t
+hen it
returns an C<XML::LibXML::Document> object, otherwise it returns an
C<XML::LibXML::DocumentFragment> object.

=item TAG CALLBACKS

HTML::StripScripts allows you to use tag callbacks, for instance:

   $hss = HTML::StripScripts->new({
       Rules => { a => \&a_callback }
   });

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          content  => 'Go to <b>Home</b> page',
        #       }
        return 1;
    }

HTML::StripScripts::LibXML still gives you tag callbacks,  but they lo
+ok like
this:

   sub a_callback {
        my ($filter,$element)  = @_;
        # where $element = {
        #          tag      => 'a',
        #          attr     => { href => '/index.html' },
        #          children => [
        #                        XML::LibXML::Text -->    'Go to ',
        #                        XML::LibXML::Element --> 'b'
        #                           with child Text -->   'Home',
        #                        XML::LibXML::Text -->    ' page',
        #                      ],
        #       }
        return 1;
    }

=item SUBCLASSING

The subs C<output>, C<output_start> and C<output_end> are not called. 
+ Instead,
this module uses C<output_stack_entry> which handles the tag callback,
+ (and
depending on the result of the tag callback) creates an element and ad
+ds
its child nodes. Then it adds the element to the list of children for 
+the
parent tag.

=back

=head1 CONSTRUCTORS

=over

=item new ( {CONFIG}, [PARSER_OPTIONS]  )

Creates a new C<HTML::StripScripts::LibXML> object.

See L<HTML::StripScripts::Parser> for details.

=back

=cut

use base 'HTML::StripScripts::Parser';
use XML::LibXML();
use HTML::Entities();

#===================================
sub output_start_document {
#===================================
    my ($self) = @_;
    $self->{_hsxXML} = XML::LibXML::Document->new();
    return;
}

#===================================
sub output_end_document {
#===================================
    my ($self)   = @_;
    my $top      = $self->{_hssStack}[0];
    my $document = delete $self->{_hsxXML};

    if ( $top->{CTX} ne 'Document' ) {
        $document = $document->createDocumentFragment();
    }

    foreach my $child ( @{ $top->{CHILDREN} } ) {
        $document->addChild($child);
    }
    $top->{CONTENT} = $document;
    return;
}

#===================================
sub output_start { }
*output_end         = \&output_start;
*output_declaration = \&output_start;
*output_process     = \&output_start;
*output             = \&output_start;
#===================================

my $Entities = { 'amp'  => '&',
                 'lt'   => '<',
                 'gt'   => '>',
                 'quot' => '"',
                 '#39'  => "'",
};
#===================================
sub output_text {
#===================================
    my ( $self, $text ) = @_;
    HTML::Entities::_decode_entities( $text, $Entities );
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createTextNode($text);
    return;
}

#===================================
sub output_comment {
#===================================
    my ( $self, $comment ) = @_;
    $comment =~ s/^\s*<!--//g;
    $comment =~ s/-->\s*$//g;
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createComment($comment);
    return;
}

#===================================
sub output_stack_entry {
#===================================
    my ( $self, $tag ) = @_;

    my %entry;
    $tag->{CHILDREN} ||= [];
    @entry{qw(tag attr children)} = @{$tag}{qw(NAME ATTR CHILDREN)};

    if ( my $tag_callback = $tag->{CALLBACK} ) {
        $tag_callback->( $self, \%entry )
            or return;
    }

    if ( my $tagname = $entry{tag} ) {
        my $element = $self->{_hsxXML}->createElement($tagname);
        my $attrs   = $entry{attr};
        foreach my $name ( sort keys %$attrs ) {
            $element->setAttribute( $name => $attrs->{$name} );
        }
        unless ( $tag->{CTX} eq 'EMPTY' ) {
            foreach my $children ( @{ $entry{children} } ) {
                $element->addChild($children);
            }
        }
        push @{ $self->{_hssStack}[0]{CHILDREN} }, $element;
    }
    else {
        push @{ $self->{_hssStack}[0]{CHILDREN} }, @{ $entry{children}
+ };
    }
    $tag->{CHILDREN} = [];
}

=head1 BUGS AND LIMITATIONS

=over

=item API - BETA

This is the first draft of this module, and currently there are no con
+figuration
options for the XML. I would welcome feedback from XML users as to how
+ I could
improve the interface.

For this reason, the API may change.

=back

=head1 SEE ALSO

L<HTML::Parser>, L<HTML::StripScripts::Parser>,
L<HTML::StripScripts::Regex>

=head1 AUTHOR

Clinton Gormley E<lt>clint@traveljury.comE<gt>

=head1 COPYRIGHT

Copyright (C) 2007 Clinton Gormley.  All Rights Reserved.

=head1 LICENSE

This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1;
[download]

UPDATE: Uploaded this module to CPAN as HTML::StripScripts::LibXML, and removed the requirement for a new version of HTML::StripScripts.

Comment on RFC: HTML::StripScripts::LibXML Select or Download Code

Replies are listed 'Best First'.
Re: RFC: HTML::StripScripts::LibXML by Moron (Curate) on Jul 04, 2007 at 09:32 UTC
The context of this reply is I happened to write an XML code generator recently. A1) XML is used for a wide variety of things so it's hard to say what options should be included. And the options I tend to put in a code generator may not be very 'normal', but FWIW here they are based on just two market data applications I had to support with an XML code generator (amazing how detailed applications can get!) - the options I went for happen to answer some of your other questions, but my idea was to write a complete callback-driven parser and generator - different goal! - whether or not to generate opening tags for tags excluded by other options ('other options' including those handled in a different method or module) - whether or not to generate closing tags for tags excluded by 'other options' - whether or not to generate subtags (as opposed to relying on callback functionality to do it). - whether or not to generate values for non-nested tags (as opposed to letting callback functionality do it). - I use a separate trivial method for putting the XML version line - but that therefore functions as an option to do so or not. - I use $$_ as the default output buffer (by ref) but allow it to be optuionally overridden with another scalar reference - By default no filehandle to write to but one may be passed as an option. - I have optional tag introducer (by default empty but some XML standards need e.g. a '- ' in front of tags. - max depth of tag introducer. - two options for to setting min and max tag nesting depth at which code is generated by default (because callbacks may or may not be used to specifically generate code). - a callback all option, a code reference for execution for all tags. - callback by tag nesting depth options ('user' can specify same or different code references for different depths). Instead of just a tag callback by name I have: - callback before opening a tag by name - callback after opening a tag by name - callback before closing a tag by name - callback after closing a tag by name - optional unit tabbing string for output, default "\t" - I use the x operator to multiply the tabbing string by current depth - 1. A2) I can't see the functionality of output_text - I have in its place an outer method called puttag that uses a separate recursive tree traversing method rather than a manual stack. __________________________________________________________________________________ ^M Free your mind!	[reply]

Replies are listed 'Best First'.

Re: RFC: HTML::StripScripts::LibXML
by Moron (Curate) on Jul 04, 2007 at 09:32 UTC

A1) XML is used for a wide variety of things so it's hard to say what options should be included. And the options I tend to put in a code generator may not be very 'normal', but FWIW here they are based on just two market data applications I had to support with an XML code generator (amazing how detailed applications can get!) - the options I went for happen to answer some of your other questions, but my idea was to write a complete callback-driven parser and generator - different goal!

- whether or not to generate opening tags for tags excluded by other options ('other options' including those handled in a different method or module)

- whether or not to generate closing tags for tags excluded by 'other options'

- whether or not to generate subtags (as opposed to relying on callback functionality to do it).

- whether or not to generate values for non-nested tags (as opposed to letting callback functionality do it).

- I use a separate trivial method for putting the XML version line - but that therefore functions as an option to do so or not.

- I use $$_ as the default output buffer (by ref) but allow it to be optuionally overridden with another scalar reference

- By default no filehandle to write to but one may be passed as an option.

- I have optional tag introducer (by default empty but some XML standards need e.g. a '- ' in front of tags.

- max depth of tag introducer.

- two options for to setting min and max tag nesting depth at which code is generated by default (because callbacks may or may not be used to specifically generate code).

- a callback all option, a code reference for execution for all tags.

- callback by tag nesting depth options ('user' can specify same or different code references for different depths).

Instead of just a tag callback by name I have:

- callback before opening a tag by name

- callback after opening a tag by name

- callback before closing a tag by name

- callback after closing a tag by name

- optional unit tabbing string for output, default "\t" - I use the x operator to multiply the tabbing string by current depth - 1.

A2) I can't see the functionality of output_text - I have in its place an outer method called puttag that uses a separate recursive tree traversing method rather than a manual stack.

__________________________________________________________________________________

^M Free your mind!

[reply]