(Revised)
After some exploration with HTML::Parser, which is a
really powerful HTML processor, I have discovered a way
to use it to re-write HTML. The trick is to tell the parser
to give your handlers enough information to do their job,
which in this case is to limit the set of tags allowed in the HTML.
When you feed HTML::Parser your content, it can tell your
handlers the offset and length of each tag as it appeared
in the original document. You can use this information to
copy that section of markup out of the original and into the
'safe' version using substr().
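As a minimal illustration (using the same $content, $offset,
and $length that the handlers receive):

    # Lift the tag verbatim from the original document and append
    # it to the sanitized copy.
    my $tag_as_written = substr($content, $offset, $length);
    $safe_content .= $tag_as_written;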
HTML Safe Content
PerlMonks is an application which restricts the use of
HTML somewhat. It appears to allow only a fixed list of tags in
markup, and that list changes depending on the context of
the data (i.e. Chatterbox vs. writeup vs. home node), but the
principle is the same.
The first thing to do is construct an HTML-limiting function which,
given a list of allowed tags and a content scalar, will return a "safe"
version of the content which can then be used wherever it is needed.
To remove all HTML from a content string, don't allow
any tags (empty list, or no list supplied) and it should zap
them all.
Additionally, you could sub-process the "text" content of
the HTML by having a handler which URL-ifies the text.
Since the HTML fixer must already be correctly identifying the
tags, it can distinguish between HTML tags and their content;
attribute values such as HREF="http://www.xyzco.com/" live inside
the tags themselves, never reach the text handler, and so will
not be URL-ified.
HTML::Parser has several different handlers, but you should
start with:
    my ($hp) = new HTML::Parser (
        api_version => 3,
        start_h     => [ $Tag,  'tagname,attr,offset,length' ],
        end_h       => [ $Tag,  'tagname,attr,offset,length' ],
        text_h      => [ $Text, 'text' ],
    );
'start' and 'end' refer to start and end tags, respectively, and
not to the start and end of the HTML document. End tags are
those identified by a leading slash, such as '</A>'.
Further, HTML::Parser is kind enough to return $tagname
in lower-case only, so you don't have to do any case checking
before using the data.
A typical handler is declared before the HTML::Parser
constructor, somewhat like this:
    my ($Tag) = sub
    {
        my ($tagname, $attr, $offset, $length) = @_;
        if ($tags->{$tagname})
        {
            $safe_content .= substr($content, $offset, $length);
        }
    };
The reason for using an anonymous sub as the handler is to let
it close over the function-local $safe_content, since, to the
best of my knowledge, you can't pass your own parameters
to your handler on top of what HTML::Parser gives you.
URL Identification and URL-ification
It will be a little tricky to identify valid URLs within
the content of the page, but a quick regex that looks
for appropriate domains should do okay. It will be important
to look out for things that aren't domains but might look
like them, such as "mysong.au" or "command.com". (Really
old versions of Netscape accidentally tried to handle
Australian domain names as .au sound files. Whoops!)
You can use URI::Heuristic to "guess" what a URL should
correctly be listed as. This module is smart enough to
recognize that ftp.netscape.com should be "ftp://ftp.netscape.com",
and other similar tricks.
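For instance, a quick sketch of URI::Heuristic's uf_uristr()
at work (the hostnames are only examples):

    use URI::Heuristic qw(uf_uristr);

    # uf_uristr() guesses a complete URL from a partial one; a
    # leading "ftp" suggests an ftp:// scheme, anything else
    # defaults to http://.
    print uf_uristr('ftp.netscape.com'), "\n";   # ftp://ftp.netscape.com
    print uf_uristr('perl.com'), "\n";           # http://perl.com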
libwww-perl, a.k.a. LWP, is a very comprehensive set of
libraries, but it is more suited to building an HTTP agent
or a Perl-based browser than to this kind of low-level
HTML work.
Putting it all together
Here's my first crack at it. YMMV.
use HTML::Parser;

my (%tags_1) = map { (lc($_), 1) } qw ( A B BR P );

sub HTMLSafeContent
{
    my ($content, $tags) = @_;
    my ($safe_content) = '';

    # Copy allowed start/end tags verbatim from the original document.
    my ($Tag) = sub
    {
        my ($tagname, $attr, $offset, $length) = @_;
        if ($tags->{$tagname})
        {
            $safe_content .= substr($content, $offset, $length);
        }
    };

    # Pass the text between tags straight through.
    my ($Text) = sub
    {
        my ($text) = @_;
        $safe_content .= $text;
        # Or, perhaps:
        # $safe_content .= URLIfy($text);
    };

    my ($hp) = new HTML::Parser (
        api_version => 3,
        start_h     => [ $Tag,  'tagname,attr,offset,length' ],
        end_h       => [ $Tag,  'tagname,attr,offset,length' ],
        # 'text' keeps the original entity encoding; 'dtext' would
        # decode entities and could reintroduce markup.
        text_h      => [ $Text, 'text' ],
    );

    $hp->parse($content);
    $hp->eof;            # flush any buffered input
    return $safe_content;
}

# Using it:
# Load content into $some_content
print HTMLSafeContent($some_content, \%tags_1);
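To strip all HTML, pass an empty tag list:

    print HTMLSafeContent($some_content, {});   # no tags allowed

And for the URLIfy() mentioned in the comments, here is one
possible sketch (just a guess at a body, deliberately simplistic):

    # Hypothetical URLIfy(): wrap anything that looks like a bare
    # http/https/ftp URL in an anchor tag. A production version
    # would use the URI::Heuristic tricks discussed above.
    sub URLIfy
    {
        my ($text) = @_;
        $text =~ s{\b((?:https?|ftp)://\S+)}{<A HREF="$1">$1</A>}ig;
        return $text;
    }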
To find/remove/modify existing HTML tags, I would recommend
Parse::RecDescent, by Damian Conway. It does some deep
voodoo, but is very powerful, more powerful than any regex could ever
be.
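Here's a minimal, untested sketch of the sort of thing
Parse::RecDescent can do; a toy grammar that splits a document
into tags and text, dropping the tags (a real HTML grammar needs
far more care):

    use Parse::RecDescent;

    $Parse::RecDescent::skip = '';   # don't skip whitespace between tokens

    # Toy grammar: a document is a series of chunks; each chunk is
    # either a tag (dropped) or a run of text (kept).
    my $grammar = q{
        document : chunk(s)   { join '', @{ $item[1] } }
        chunk    : tag | text
        tag      : /<[^>]*>/  { '' }
        text     : /[^<]+/    { $item[1] }
    };

    my $parser = Parse::RecDescent->new($grammar);
    print $parser->document('<B>hello</B> world'), "\n";   # prints "hello world"

-- Brian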
This is what I've ended up with:
if ($field eq "comments") {
    # Remove any links (because they break URL to link conversion)
    $$field =~ s/<A.*?HRef.*?>//isg;
    $$field =~ s/<\/A>//isg;
    # Extract any image links and add them to an array for
    # safe-keeping, replace them with placeholders
    $image_database = 0;
    while ($$field =~ /<Img(.*?)>/) {
        $$field =~ s/(<Img(.*?)>)/\[My_Image=$image_database\]/iso;
        $images[$image_database] = $1;
        $image_database++;
    }
    # If HTML is not allowed, strip any remaining HTML
    if ($allow_html != 1) {
        $$field =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    }
    # Convert URL's and e-mail addresses to links (with regex)
    $$field =~ s/(((ht|f)tp):(\/\/)[a-z0-9%&_\-\+=:@~#\/.\?]+(\/|[a-z]))/<A HRef="$1" Target="_blank">$1<\/A>/isg;
    $$field =~ s/(^\W|\s)([a-z0-9_\-.]+\@[a-z0-9_\-]+\.[a-z]+)(.*?$)/$1<A HRef="mailto:$2">$2<\/A>$3/mig;
    # Replace the image placeholders with their corresponding images
    $image_database = 0;
    while ($$field =~ /\[My_Image=(\d*)\]/) {
        $img_src = $images[$1];
        $$field =~ s/\[My_Image=(\d*)\]/$img_src/iso;
        $image_database++;
    }
}
(Yes, I know I'm not using "strict" - this is a prototype only).
Anyone see any problems with this code?
In theory, there is no difference between theory and practise. But in practise, there is.
Jonathan M. Hollin
Digital-Word.com
Just realised that

    $$field =~ s/<A.*?HRef.*?>//isg; $$field =~ s/<\/A>//isg;

is going to screw up any <A Name...> tags... damn...
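One possible repair (an untested sketch) is to keep the match
inside a single tag, so that <A Name="..."> survives:

    # Only strip <A ...> tags that actually carry an HRef attribute;
    # [^>]* also stops the match straying past the closing '>'.
    $$field =~ s/<A\b[^>]*\bHRef\b[^>]*>//isg;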
In theory, there is no difference between theory and practise. But in practise, there is.
Jonathan M. Hollin
Digital-Word.com