As explained in this node, I was asked to add an interface to HTML::StripScripts so that it would return either an XML::LibXML::Document object or, for a snippet of HTML, an XML::LibXML::DocumentFragment object.

I've done this by subclassing HTML::StripScripts::Parser and overriding the output callbacks in HTML::StripScripts. So the flow is as follows:

HTML::StripScripts::Parser
  --> uses HTML::Parser to tokenise the HTML
  --> uses callbacks in HTML::StripScripts to filter the XSS and tidy the HTML
  --> uses callbacks in HTML::StripScripts::LibXML to build a DOM tree

The code works and passes the tests for HTML::StripScripts::Parser, with a couple of modifications: for example, <b></b> is represented as <b/> in the XML->toString output.
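
The <b></b> versus <b/> difference comes from how XML::LibXML serialises an element that has no child nodes. A minimal sketch, assuming only that XML::LibXML is installed; the variable names are illustrative and not part of the module:

```perl
use strict;
use warnings;
use XML::LibXML ();

my $doc = XML::LibXML::Document->new();

# An element with no child nodes is serialised as a self-closing tag.
my $empty     = $doc->createElement('b');
my $empty_xml = $empty->toString;

# Once it has a text child, the usual open/close pair is emitted.
my $bold = $doc->createElement('b');
$bold->appendText('Home');
my $bold_xml = $bold->toString;

print "$empty_xml\n$bold_xml\n";
```

This is why tests that compare serialised output against HTML::StripScripts::Parser's string output need adjusting for empty elements.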

The questions

I don't do much with XML, so I would appreciate feedback from XML-ers. My questions are:

The code

I have posted the code below; it works with the current version of HTML::StripScripts on CPAN. I have also just uploaded this module to CPAN as HTML::StripScripts::LibXML, so it will be available as soon as your mirror syncs.
package HTML::StripScripts::LibXML;
use strict;

use vars qw($VERSION);
$VERSION = '0.10';

=head1 NAME

HTML::StripScripts::LibXML - XSS filter - outputs a LibXML Document or DocumentFragment

=head1 SYNOPSIS

  use HTML::StripScripts::LibXML();

  my $hss = HTML::StripScripts::LibXML->new(
      {
          Context => 'Document',       ## HTML::StripScripts configuration
          Rules   => { ... },
      },
      strict_comment => 1,             ## HTML::Parser options
      strict_names   => 1,
  );

  $hss->parse_file("foo.html");
  $xml_doc = $hss->filtered_document;

  OR

  $xml_doc = $hss->filter_html($html);

=head1 DESCRIPTION

This class provides an easy interface to C<HTML::StripScripts>, using
C<HTML::Parser> to parse the HTML, and returns an XML::LibXML::Document
or XML::LibXML::DocumentFragment.

See L<HTML::Parser> for details of how to customise how the raw HTML is
parsed into tags, and L<HTML::StripScripts> for details of how to
customise the way those tags are filtered. This module is a subclass of
L<HTML::StripScripts::Parser>.

=cut

=head1 DIFFERENCES FROM HTML::StripScripts

=over

=item CONTEXT

HTML::StripScripts::LibXML still allows you to specify the C<Context> of
the HTML (Document, Flow, Inline, NoTags). If C<Context> is C<Document>,
then it returns an C<XML::LibXML::Document> object, otherwise it returns
an C<XML::LibXML::DocumentFragment> object.

=item TAG CALLBACKS

HTML::StripScripts allows you to use tag callbacks, for instance:

  $hss = HTML::StripScripts->new({
      Rules => { a => \&a_callback }
  });

  sub a_callback {
      my ($filter,$element) = @_;
      # where $element = {
      #     tag     => 'a',
      #     attr    => { href => '/index.html' },
      #     content => 'Go to <b>Home</b> page',
      # }
      return 1;
  }

HTML::StripScripts::LibXML still gives you tag callbacks, but they look
like this:

  sub a_callback {
      my ($filter,$element) = @_;
      # where $element = {
      #     tag      => 'a',
      #     attr     => { href => '/index.html' },
      #     children => [
      #         XML::LibXML::Text    --> 'Go to ',
      #         XML::LibXML::Element --> 'b'
      #             with child Text  --> 'Home',
      #         XML::LibXML::Text    --> ' page',
      #     ],
      # }
      return 1;
  }

=item SUBCLASSING

The subs C<output>, C<output_start> and C<output_end> are not called.
Instead, this module uses C<output_stack_entry> which handles the tag
callback, (and depending on the result of the tag callback) creates an
element and adds its child nodes. Then it adds the element to the list
of children for the parent tag.

=back

=head1 CONSTRUCTORS

=over

=item new ( {CONFIG}, [PARSER_OPTIONS] )

Creates a new C<HTML::StripScripts::LibXML> object.

See L<HTML::StripScripts::Parser> for details.

=back

=cut

use base 'HTML::StripScripts::Parser';
use XML::LibXML();
use HTML::Entities();

#===================================
sub output_start_document {
#===================================
    my ($self) = @_;
    $self->{_hsxXML} = XML::LibXML::Document->new();
    return;
}

#===================================
sub output_end_document {
#===================================
    my ($self)   = @_;
    my $top      = $self->{_hssStack}[0];
    my $document = delete $self->{_hsxXML};

    if ( $top->{CTX} ne 'Document' ) {
        $document = $document->createDocumentFragment();
    }

    foreach my $child ( @{ $top->{CHILDREN} } ) {
        $document->addChild($child);
    }
    $top->{CONTENT} = $document;
    return;
}

#===================================
sub output_start { }
*output_end         = \&output_start;
*output_declaration = \&output_start;
*output_process     = \&output_start;
*output             = \&output_start;
#===================================

my $Entities = {
    'amp'  => '&',
    'lt'   => '<',
    'gt'   => '>',
    'quot' => '"',
    '#39'  => "'",
};

#===================================
sub output_text {
#===================================
    my ( $self, $text ) = @_;
    HTML::Entities::_decode_entities( $text, $Entities );
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createTextNode($text);
    return;
}

#===================================
sub output_comment {
#===================================
    my ( $self, $comment ) = @_;
    $comment =~ s/^\s*<!--//g;
    $comment =~ s/-->\s*$//g;
    push @{ $self->{_hssStack}[0]{CHILDREN} },
        $self->{_hsxXML}->createComment($comment);
    return;
}

#===================================
sub output_stack_entry {
#===================================
    my ( $self, $tag ) = @_;

    my %entry;
    $tag->{CHILDREN} ||= [];
    @entry{qw(tag attr children)} = @{$tag}{qw(NAME ATTR CHILDREN)};

    if ( my $tag_callback = $tag->{CALLBACK} ) {
        $tag_callback->( $self, \%entry )
            or return;
    }

    if ( my $tagname = $entry{tag} ) {
        my $element = $self->{_hsxXML}->createElement($tagname);
        my $attrs   = $entry{attr};
        foreach my $name ( sort keys %$attrs ) {
            $element->setAttribute( $name => $attrs->{$name} );
        }
        unless ( $tag->{CTX} eq 'EMPTY' ) {
            foreach my $children ( @{ $entry{children} } ) {
                $element->addChild($children);
            }
        }
        push @{ $self->{_hssStack}[0]{CHILDREN} }, $element;
    }
    else {
        push @{ $self->{_hssStack}[0]{CHILDREN} }, @{ $entry{children} };
    }
    $tag->{CHILDREN} = [];
}

=head1 BUGS AND LIMITATIONS

=over

=item API - BETA

This is the first draft of this module, and currently there are no
configuration options for the XML. I would welcome feedback from XML
users as to how I could improve the interface.

For this reason, the API may change.

=back

=head1 SEE ALSO

L<HTML::Parser>, L<HTML::StripScripts::Parser>,
L<HTML::StripScripts::Regex>

=head1 AUTHOR

Clinton Gormley E<lt>clint@traveljury.comE<gt>

=head1 COPYRIGHT

Copyright (C) 2007 Clinton Gormley. All Rights Reserved.

=head1 LICENSE

This module is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

=cut

1;

UPDATE: Uploaded this module to CPAN as HTML::StripScripts::LibXML, and removed the requirement for a new version of HTML::StripScripts.
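
To make the TAG CALLBACKS difference concrete, here is a rough sketch of a callback that works on the children list of XML::LibXML nodes rather than on a content string. The filtering rule (keep only site-relative links) and the @seen_links array are hypothetical, and the $element hash is built by hand to stand in for what the filter would pass; only standard XML::LibXML calls are used.

```perl
use strict;
use warnings;
use XML::LibXML ();

# Hypothetical callback: keep <a> only when its href is site-relative,
# and record the link text by reading the child nodes.
my @seen_links;

sub a_callback {
    my ( $filter, $element ) = @_;

    my $href = $element->{attr}{href};
    return 0 unless defined $href && $href =~ m{^/};

    # The children are XML::LibXML nodes, so textContent flattens nested tags.
    my $text = join '', map { $_->textContent } @{ $element->{children} };
    push @seen_links, $text;
    return 1;
}

# Hand-built stand-in for the structure described in the POD.
my $doc  = XML::LibXML::Document->new();
my $bold = $doc->createElement('b');
$bold->appendText('Home');

my %element = (
    tag      => 'a',
    attr     => { href => '/index.html' },
    children => [
        $doc->createTextNode('Go to '),
        $bold,
        $doc->createTextNode(' page'),
    ],
);

my $kept    = a_callback( undef, \%element );
my $dropped = a_callback( undef,
    { tag => 'a', attr => { href => 'javascript:alert(1)' }, children => [] } );

print "kept: $kept, link text: $seen_links[0]\n";
```

In output_stack_entry a false return value from the callback means nothing is emitted for that tag, so the second call above would cause the link to be left out of the tree.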

Re: RFC: HTML::StripScripts::LibXML
by Moron (Curate) on Jul 04, 2007 at 09:32 UTC
    For context: I happened to write an XML code generator recently.

    A1) XML is used for such a wide variety of things that it's hard to say which options should be included, and the options I tend to put in a code generator may not be very 'normal'. FWIW, here they are, based on just two market data applications I had to support with an XML code generator (amazing how detailed applications can get!). The options I went for happen to answer some of your other questions, but my idea was to write a complete callback-driven parser and generator, which is a different goal!

    - whether or not to generate opening tags for tags excluded by other options ('other options' including those handled in a different method or module)

    - whether or not to generate closing tags for tags excluded by 'other options'

    - whether or not to generate subtags (as opposed to relying on callback functionality to do it).

    - whether or not to generate values for non-nested tags (as opposed to letting callback functionality do it).

    - I use a separate trivial method for emitting the XML version line, so calling it or not effectively functions as an option.

    - I use $$_ as the default output buffer (by ref) but allow it to be optionally overridden with another scalar reference

    - By default no filehandle to write to but one may be passed as an option.

    - I have an optional tag introducer (empty by default, but some XML standards need e.g. a '- ' in front of tags).

    - max depth of tag introducer.

    - two options for setting the min and max tag nesting depth at which code is generated by default (because callbacks may or may not be used to specifically generate code).

    - a 'callback all' option: a code reference executed for all tags.

    - callback by tag nesting depth options ('user' can specify same or different code references for different depths).

    Instead of just a tag callback by name I have:

    - callback before opening a tag by name

    - callback after opening a tag by name

    - callback before closing a tag by name

    - callback after closing a tag by name

    - optional unit tabbing string for output, default "\t" - I use the x operator to multiply the tabbing string by current depth - 1.
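
    A rough sketch of how a few of these options could fit together: a scalar-ref output buffer, a tabbing string multiplied by depth - 1, and the four per-tag callbacks. This is not my actual generator; put_tag's signature, the option keys and the node layout are made up for illustration, and values are written without XML escaping.

```perl
use strict;
use warnings;

# Hypothetical recursive generator; option and hook names are illustrative.
sub put_tag {
    my ( $node, $depth, $opt ) = @_;
    my $out    = $opt->{out};    # scalar ref used as the output buffer
    my $indent = ( defined $opt->{tab} ? $opt->{tab} : "\t" ) x ( $depth - 1 );
    my $name   = $node->{tag};

    # Run the named hook for this tag, if one was registered.
    my $hook = sub {
        my ($when) = @_;
        my $cb = $opt->{$when}{$name} or return;
        $cb->( $node, $depth, $out );
    };

    $hook->('before_open');
    if ( my $kids = $node->{children} ) {
        $$out .= "$indent<$name>\n";
        $hook->('after_open');
        put_tag( $_, $depth + 1, $opt ) for @$kids;
        $hook->('before_close');
        $$out .= "$indent</$name>\n";
    }
    else {
        # Non-nested tag: emit its value inline (no escaping in this sketch).
        $$out .= "$indent<$name>";
        $hook->('after_open');
        $$out .= defined $node->{value} ? $node->{value} : '';
        $hook->('before_close');
        $$out .= "</$name>\n";
    }
    $hook->('after_close');
    return;
}

my $xml = '';
my @log;
my $tree = {
    tag      => 'quote',
    children => [
        { tag => 'symbol', value => 'ABC' },
        { tag => 'price',  value => '1.25' },
    ],
};

put_tag( $tree, 1, {
    out         => \$xml,
    tab         => "  ",
    before_open => { price => sub { push @log, 'before price' } },
    after_close => { quote => sub { push @log, 'after quote' } },
} );

print $xml;
```

    Because the traversal is recursive, the current depth is just an argument, so depth-based options such as min/max depth or per-depth callbacks would only need a check at the top of put_tag.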

    A2) I can't see the functionality of output_text - I have in its place an outer method called puttag that uses a separate recursive tree traversing method rather than a manual stack.

    __________________________________________________________________________________

    ^M Free your mind!