slaniel has asked for the wisdom of the Perl Monks concerning the following question:

I have a bunch of HTML files that have some custom XML tags in them, all of which use the namespace prefix 'mig:' (e.g., <mig:old_url>). When I run these documents through HTML::TreeBuilder, these tags either get dropped (because, presumably, they're not part of the HTML spec) or their contents get stuck into the body.

Is there any way to tell HTML::Element and HTML::TreeBuilder to parse invalid tags -- or at least certain invalid tags -- as they would parse any legit tags?

Replies are listed 'Best First'.
Re: Custom XML tags in HTML::TreeBuilder and HTML::Element
by eric256 (Parson) on Jul 24, 2006 at 16:47 UTC

    The docs for HTML::TreeBuilder seem to indicate that you can call $root->ignore_unknown(0); before parsing to tell it not to ignore those tags.
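A minimal sketch of that suggestion, assuming the tag name reaches the tree as 'mig:old_url' (HTML::Parser lowercases tag names outside XML mode). The sample markup and URL are made up for illustration. Note that the option has to be set on a fresh builder before any content is parsed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample document; only the mig:old_url tag name comes from the question.
my $html = '<html><body><p>Moved page</p>'
         . '<mig:old_url>http://example.com/old.html</mig:old_url>'
         . '</body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);    # keep unknown tags; set this before parsing
$tree->parse($html);
$tree->eof;

# Look the custom element up like any other element.
my $el      = $tree->look_down( _tag => 'mig:old_url' );
my $old_url = $el ? $el->as_text : undef;

print defined $old_url
    ? "old_url: $old_url\n"
    : "mig:old_url not found\n";

$tree->delete;               # free the tree when done
```

Once the element is in the tree, the usual HTML::Element methods (look_down, as_text, attr, and so on) should apply to it as they do to standard tags.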


    ___________
    Eric Hodges
      Ah, my mistake. I've been spending all my time reading the HTML::Element docs, and have not spent enough time with ::TreeBuilder. Thanks a bunch, Eric! P.S.: It'd be handy if we could set ::TreeBuilder to allow only certain unknown tags -- in my case, just 'mig:' tags. I'd certainly like to clear Microsoft's 'mso:' tags out, for instance.

        Glancing through the source, it doesn't look as if it'd be too hard to patch it so that ignore_unknown could be a coderef instead of a boolean value. Then you could set it to a predicate which looks at the tag name and returns whether or not to ignore it.

        ## ... circa line 152 of HTML/TreeBuilder.pm
        ## default predicate: ignore every unknown tag
        $self->{'_ignore_unknown'} = sub { 1 };

        ## ... circa line 660 in HTML/TreeBuilder.pm
        if ( $self->{'_ignore_unknown'}->($tag) ) {
            print $indent, " * Ignoring unknown tag \U$tag\E\n" if DEBUG;
            $self->warning("Skipping unknown tag $tag");
            return;
        }

        ## ... later in your code: ignore every unknown tag except mig:*
        $tree->ignore_unknown( sub { $_[0] !~ /^mig:/ } );

        As it stands, you'd have to make your own copy and/or edit the installed version instead of overriding in a subclass. But it's still easily doable (and if you get it working right, submit a patch :).
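        For anyone who would rather not patch the module, here is a sketch of a different route (not the coderef patch described above): parse with ignore_unknown(0) so every unknown tag is kept, then prune the unwanted ones from the finished tree. The 'mso:junk' tag and the sample markup are invented for illustration; look_down with a qr// value and delete are standard HTML::Element methods.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input mixing a wanted mig: tag with an unwanted mso: tag.
my $html = '<html><body>'
         . '<mig:old_url>http://example.com/old.html</mig:old_url>'
         . '<mso:junk>Word residue</mso:junk>'
         . '<p>Real content</p></body></html>';

my $tree = HTML::TreeBuilder->new;
$tree->ignore_unknown(0);    # keep all unknown tags for now
$tree->parse($html);
$tree->eof;

# Delete mso:* elements one at a time. Re-running look_down on each pass
# avoids holding references to elements that were already destroyed
# along with a deleted ancestor.
my $removed = 0;
while ( my $junk = $tree->look_down( _tag => qr/^mso:/ ) ) {
    $junk->delete;           # removes the element and its content
    $removed++;
}

# Whatever mig:* elements were parsed are still in the tree.
my @kept = map { $_->tag } $tree->look_down( _tag => qr/^mig:/ );
print "removed $removed element(s); kept: @kept\n";

$tree->delete;
```

        To keep the text inside an unwanted tag and drop only the tag itself, replace_with_content could be used in place of delete.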

Re: Custom XML tags in HTML::TreeBuilder and HTML::Element
by Ieronim (Friar) on Jul 24, 2006 at 17:39 UTC
    HTML::TreeBuilder inherits from HTML::Parser, and HTML::Parser has an "XML mode" (xml_mode), so you may be able to turn it on for TreeBuilder.

    merlyn has a column on parsing XML data with HTML::Parser; its content might help :)
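    A rough sketch of XML mode using plain HTML::Parser (version 3 API) rather than TreeBuilder, since whether TreeBuilder behaves well with xml_mode turned on is not established here. The handlers collect the text of every mig:* element; the sample markup and the %found hash are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical input; only the mig:old_url tag name comes from the question.
my $html = '<html><body><p>Moved page</p>'
         . '<mig:old_url>http://example.com/old.html</mig:old_url>'
         . '</body></html>';

my %found;       # tag name => accumulated text
my $current;     # name of the mig:* element we are inside, if any

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub { my ($tag) = @_;
                       $current = $tag if $tag =~ /^mig:/ }, 'tagname' ],
    text_h  => [ sub { my ($text) = @_;
                       $found{$current} .= $text if defined $current }, 'dtext' ],
    end_h   => [ sub { my ($tag) = @_;
                       undef $current
                           if defined $current && $tag eq $current }, 'tagname' ],
);
$p->xml_mode(1); # tag names keep their case; empty-element syntax is recognized
$p->parse($html);
$p->eof;

print "$_: $found{$_}\n" for sort keys %found;
```

    This yields a flat tag-to-text map rather than a tree, so it fits best when only the contents of the custom tags are needed.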


         s;;Just-me-not-h-Ni-m-P-Ni-lm-I-ar-O-Ni;;tr?IerONim-?HAcker ?d;print