HTML Parsing

by DarkBlue (Sexton)
on Feb 12, 2001 at 03:20 UTC

DarkBlue has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a web-based database application in Perl 5.6. All its control, input and output are via the web-browser...

When creating a new record (records contain a "document" field in which a large body of text is captured: press releases, bulletins, notices, etc.), I am currently trying to parse for various elements: filtering obscenities, stripping HTML, converting URLs and e-mail addresses to links, hard-coding carriage-returns (why do we still call them that?) and spaces, converting emoticons to icons, etc.

I am doing all of the above with regexes at the moment and, while these work most of the time, there are many conditions that cause them to fail. I would like to implement the following rules:

  • If HTML is not permitted in the database (a configurable option), then all HTML input by the creator of the record (i.e. not generated by the application itself) is to be removed. This should work even if the HTML is "broken" or runs over multiple lines.

  • Convert URLs to links. This should work for e-mail addresses as well as "http", "https" and "ftp" URLs, URLs without the "http://" prefix (www.mydomain.com), URLs without "www" (http://sub-domain.mydomain.com/), etc.

  • URLs that are within HTML tag parameters should not be converted to links: i.e. don't convert this URL -> <Img Src="http://www.mydomain.com/gfx/sexygirl.jpg">, as that would break the HTML (assuming that the user is permitting HTML).

  • URLs that are encapsulated in tags should be converted to links: i.e. do convert this URL -> <Center>http://www.mydomain.com/</Center>.

Now I know that with these demands I've stepped well outside quick and easy regexes (certainly with my knowledge of regexes, anyway) and that CPAN is my only solution.

I would like to use only those modules that are part of the standard Perl distribution (if at all possible), as I want to distribute this application when it's completed and want installation to be as simple as possible for both the novice user and those users who haven't got physical access to a server and are therefore relying on uncooperative ISPs for their CGI programs. However, I will use a "non-standard" module if that's the only way to achieve what I want.

Searching CPAN, I think the modules I need are URI::URL and libwww-perl (LWP).

Is this correct? Can these, between them, provide the mechanism to perform the tasks outlined above?

If so, could anyone post or e-mail me example code (bonus points for anyone supplying code relating to the tasks above)? I have become more and more confused by the documentation for both of the above modules.

Huge thanks in advance.

 

In theory, there is no difference between theory and practise.  But in practise, there is.
 
Jonathan M. Hollin
Digital-Word.com

Re: HTML Parsing
by tadman (Prior) on Feb 12, 2001 at 04:19 UTC
    (Revised)

    After some exploration with HTML::Parser, which is a really powerful HTML processor, I have discovered a way to use it to re-write HTML. The trick is to tell the parser to give your handlers enough information to do their job, which here is to limit the types of tags allowed in the HTML.

    When you feed HTML::Parser your content, it can tell your handlers the offset and length of the tag, as it appeared in the original document. You can use this information to copy that section of code out of the original and into the 'safe' version using substr().

    HTML Safe Content
    PerlMonks is an application which restricts the use of HTML somewhat. It appears to allow only a list of tags in markup, and this list changes depending on the context of the data (i.e. Chatterbox vs writeup vs home node), but the principle is the same.

    The first thing to do is construct an HTML limiting function which, given a list of allowed tags and a content scalar, will return a "safe" version of the content which can be applied as required. To remove all HTML from a content string, don't allow any tags (empty list, or no list supplied) and it should zap them all.

    Additionally, you could sub-process the "text" content of the HTML by having a handler which URL-ifies the text. Since the HTML fixer is already identifying the tags correctly, you can distinguish between HTML tags and their content, so attributes such as HREF="http://www.xyzco.com/" will not be URL-ified.

    HTML::Parser has several different handlers, but you should start with:
    my ($hp) = new HTML::Parser
        (
            api_version => 3,
            start_h     => [ $Tag,  'tagname,attr,offset,length' ],
            end_h       => [ $Tag,  'tagname,attr,offset,length' ],
            text_h      => [ $Text, 'text' ],
        );
    'start' and 'end' refer to start and end tags, respectively, and not the start and end of the HTML document. Ending tags are those which are identified by leading slashes, such as '</A>'.

    Further, HTML::Parser is kind enough to return the $tagname in lower-case only, so you don't have to do any case checking before using the data.

    A typical handler is declared prior to the HTML::Parser somewhat like this:
    my ($Tag) = sub {
        my ($tagname, $attr, $offset, $length) = @_;
        if ($tags->{$tagname})
        {
            $safe_content .= substr($content, $offset, $length);
        }
    };
    The reason for using the anonymous-sub handler is to allow the handler to modify a function-local $safe_content, since (to the best of my knowledge) you can't pass your own parameters to your handler on top of what HTML::Parser gives you.

    URL Identification and URL-ification
    It will be a little tricky to identify valid URLs within the content of the page, but a quick regex that looks for appropriate domains should do okay. It will be important to look out for things that aren't domains but might look like them, such as "mysong.au" or "command.com". (Really old versions of Netscape accidentally tried to handle Australian domain names as .au sound files. Whoops!)

    You can use URI::Heuristic to "guess" what a URL should correctly be listed as. This module is smart enough to recognize that ftp.netscape.com should be "ftp://ftp.netscape.com", and other similar tricks.
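    For illustration, here is a rough, untested sketch of what the URLIfy() helper hinted at above might look like, using URI::Heuristic's uf_urlstr() to expand bare host names. The regex is only a first approximation (it will happily grab trailing punctuation, for instance):

    use URI::Heuristic qw(uf_urlstr);

    # Hypothetical URLIfy() -- wraps things in plain text that look like
    # URLs or bare host names in anchor tags.  Rough sketch only.
    sub URLIfy
    {
        my ($text) = @_;

        $text =~ s{
            \b(
                (?:https?|ftp)://\S+            # explicit scheme
              | (?:www|ftp)\.[\w.-]+(?:/\S*)?   # bare www.* or ftp.* host
            )
        }{
            my $url = uf_urlstr($1);            # guess the full URL
            qq(<A HRef="$url" Target="_blank">$1</A>);
        }gexi;

        return $text;
    }

    Because this only ever runs on the text handler's input, URLs that live inside tag attributes are never touched.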

    libwww-perl, a.k.a. LWP, is a very comprehensive set of libraries, but it is better suited to building an HTTP agent or a Perl-based browser than to this kind of low-level HTML work.

    Putting it all together
    Here's my first crack at it. YMMV.
    use HTML::Parser;

    my (%tags_1) = map { (lc($_), 1) } qw ( A B BR P );

    sub HTMLSafeContent
    {
        my ($content, $tags) = @_;
        my ($safe_content);

        my ($Tag) = sub {
            my ($tagname, $attr, $offset, $length) = @_;
            if ($tags->{$tagname})
            {
                $safe_content .= substr($content, $offset, $length);
            }
        };

        my ($Text) = sub {
            my ($text) = @_;
            $safe_content .= $text;
            # Or, perhaps:
            # $safe_content .= URLIfy($text);
        };

        my ($hp) = new HTML::Parser
            (
                api_version => 3,
                start_h     => [ $Tag,  'tagname,attr,offset,length' ],
                end_h       => [ $Tag,  'tagname,attr,offset,length' ],
                text_h      => [ $Text, 'dtext' ],
            );

        $hp->parse($content);
        return $safe_content;
    }

    # Using it:
    # Load content into $some_content
    print HTMLSafeContent($some_content, \%tags_1);
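    And, as noted above, stripping all HTML is just the degenerate case where no tags are allowed at all, e.g.:

    # Remove every tag; only the text content survives.
    print HTMLSafeContent($some_content, {});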
Re: HTML Parsing
by IndyZ (Friar) on Feb 12, 2001 at 04:03 UTC
    To find/remove/modify existing HTML tags, I would recommend Parse::RecDescent, by Damian Conway. It does some deep voodoo, but is very powerful, more powerful than any regex could ever be.
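    For the curious, here is a toy sketch of the sort of grammar Parse::RecDescent expects (illustrative only; real HTML needs a far more complete grammar than this):

    use Parse::RecDescent;

    # Split input into HTML tags and runs of plain text.
    my $grammar = q{
        document : chunk(s)           { $return = $item[1] }
        chunk    : tag | text
        tag      : /<\/?\w+[^>]*>/    { $return = { tag  => $item[1] } }
        text     : /[^<]+/            { $return = { text => $item[1] } }
    };

    my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar";
    my $chunks = $parser->document('<B>Hello</B> world');
    # $chunks is now a reference to an array of tag/text hashes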

    --
    Brian

Re: HTML Parsing
by DarkBlue (Sexton) on Feb 12, 2001 at 05:18 UTC
    This is what I've ended up with:

    if ($field eq "comments") { # Remove any links (because they break URL to link conversion) $$field =~ s/<A.*?HRef.*?>//isg; $$field =~ s/<\/A>//isg; # Extract any image links and add them to an array for safe-ke +eping, replace them with placeholders $image_database = 0; while ($$field =~ /<Img(.*?)>/) { $$field =~ s/(<Img(.*?)>)/\[My_Image=$image_database\]/iso +; $images[$image_database] = $1; $image_database ++; } # If HTML is not allowed, strip any remaining HTML if ($allow_html != 1) { $$field =~ s/<(?:[^>'"]*|(['"]).*?\1)* +>//gs; } # Convert URL's and e-mail addresses to links (with regex) $$field =~ s/(((ht|f)tp):(\/\/)[a-z0-9%&_\-\+=:@~#\/.\?]+(\/|[ +a-z]))/<A HRef="$1" Target="_blank">$1<\/A>/isg; $$field =~ s/(^\W|\s)([a-z0-9_\-.]+\@[a-z0-9_\-]+\.[a-z]+)(.*? +$)/$1<A HRef="mailto:$2">$2<\/A>$3/mig; # Replace the image placeholders with their corresponding imag +es $image_database = 0; while ($$field =~ /\[My_Image=(\d*)\]/) { $img_src = $images[$1]; $$field =~ s/\[My_Image=(\d*)\]/$img_src/iso; $image_database ++; } }

    (Yes, I know I'm not using "strict" - this is a prototype only).

    Anyone see any problems with this code?

     

    In theory, there is no difference between theory and practise.  But in practise, there is.
     
    Jonathan M. Hollin
    Digital-Word.com
      Just realised that
      $$field =~ s/<A.*?HRef.*?>//isg;
      $$field =~ s/<\/A>//isg;
      is going to screw up any <A Name...> tags... damn...
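      One possible (untested) way around that would be to keep the match inside a single tag and require an HRef attribute, so plain named anchors are left alone:

      # Only remove <A ...> start tags that actually carry an HRef;
      # <A Name="..."> anchors survive (their </A> closing tags would
      # still need separate handling).
      $$field =~ s/<A\b[^>]*\bHRef\b[^>]*>//ig;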

       

      In theory, there is no difference between theory and practise.  But in practise, there is.
       
      Jonathan M. Hollin
      Digital-Word.com
