Re: HTML Parsing

(Revised)

After some exploration with HTML::Parser, which is a really powerful HTML processor, I have discovered a way to use it to re-write HTML. The trick is to tell the Parser to give your handlers enough information to do their job, which is to limit the types of tags available in the HTML.

When you feed HTML::Parser your content, it can tell your handlers the offset and length of the tag, as it appeared in the original document. You can use this information to copy that section of code out of the original and into the 'safe' version using substr().

HTML Safe Content
PerlMonks is an application which restricts the use of HTML somewhat. It appears to allow only a list of tags in markup, and this list changes depending on the context of the data (i.e. Chatterbox vs writeup vs home node), but the principle is the same.

The first thing to do is construct an HTML limiting function which, given a list of allowed tags and a content scalar, will return a "safe" version of the content which can be applied as required. To remove all HTML from a content string, don't allow any tags (empty list, or no list supplied) and it should zap them all.

Additionally, you could sub-process the "text" content of the HTML by having a handler which URL-ifies the text. Since you must be correctly identifying the tags within the HTML fixer, you will be able to distinguish between HTML tags and their content, such as HREF="http://www.xyzco.com/", which should not be URL-ified.

HTML::Parser has several different handlers, but you should start with:

        my ($hp) = new HTML::Parser (
                            api_version => 3,
                            start_h     => [ $Tag,  'tagname,attr,offs
+et,length' ],
                            end_h       => [ $Tag,  'tagname,attr,offs
+et,length' ],
                            text_h      => [ $Text, 'text' ],
                        );
[download]

'start' and 'end' refer to start and end tags, respectively, and not the start and end of the HTML document. Ending tags are those which are identified by leading slashes, such as '</A>'.

Further, HTML::Parser is kind enough to return the $tagname in lower-case only, so you don't have to do any case checking before using the data.

A typical handler is declared prior to the HTML::Parser somewhat like this:

        my ($Tag) = sub
        {
                my ($tagname, $attr, $offset, $length) = @_;

                if ($tags->{$tagname})
                {
                        $safe_content .= substr($content,$offset,$leng
+th);
                }
        };
[download]

The reason for using the anonymous-sub handle is to allow the handler to modify a function-local $safe_content, as to the best of my knowledge you can't pass your own parameters to your handler on-top of what HTML::Parser gives you.

URL Identification and URL-ification
It will be a little tricky to identify valid URLs within the content of the page, but a quick regex that looks for appropriate domains should do okay. It will be important to look out for things that aren't domains but might look like them, such as "mysong.au" or "command.com". (Really old versions of Netscape accidentally tried to handle Australian domain names as .au sound files. Whoops!)

You can use URI::Heuristic to "guess" what a URL should correctly be listed as. This module is smart enough to recognize that ftp.netscape.com should be "ftp://ftp.netscape.com", and other similar tricks.

libwww-perl a.k.a. LWP is a very comprehensive set of libraries, but they are more suited towards building a HTTP agent or Perl-based browser than they are for this kind of low-level HTML work.

Putting it all together
Here's my first crack at it. YMMV.

use HTML::Parser;

my (%tags_1) = map { (lc($_), 1) } qw ( A B BR P );

sub HTMLSafeContent
{
             my ($content, $tags) = @_;
             my ($safe_content);

             my ($Tag) = sub
             {
                     my ($tagname, $attr, $offset, $length) = @_;

                     if ($tags->{$tagname})
                     {
                             $safe_content .= substr($content,$offset,
+$length);
                     }
             };

             my ($Text) = sub
             {
                     my ($text) = @_;

                     $safe_content .= $text;
                     # Or, perhaps:
                     #   $safe_content .= URLIfy($text);
             };

             my ($hp) = new HTML::Parser (
                                     api_version   => 3,
                                     start_h => [ $Tag,  'tagname,attr
+,offset,length' ],
                                     end_h   => [ $Tag,  'tagname,attr
+,offset,length' ],
                                     text_h  => [ $Text, 'dtext' ],
                             );

             $hp->parse($content);

             return $safe_content;
     }

     # Using it:

     # Load content into $some_content

     print HTMLSafeContent ($some_content, \%tags_1);
[download]

Comment on Re: HTML Parsing Select or Download Code