
I'm using HTML::Tagset and HTML::PullParser to attack this problem. HTML::Tagset contains a list of the valid tags for HTML 3.2 and, I believe, HTML 4 as well. HTML::PullParser is part of the HTML::Parser distribution, but instead of a callback-based API it hands us one token at a time.

My code is as follows:

use HTML::Tagset;
use HTML::PullParser;

sub check_bogus_html_tags {
  # Return a reason string if the body contains too many unknown
  # HTML tags, otherwise return the empty string.
  my ($body) = @_;
  my $reason = "";
  my $p = HTML::PullParser->new(
    doc   => \$body,
    start => '"S", tagname',
    end   => '"E", tagname',
  );
  my %seen;
  while (my $token = $p->get_token()) {
    # Each token is [ "S" or "E", tagname ].
    my ($type, $tag) = @$token;
    $seen{$tag}++
      unless $HTML::Tagset::isKnown{$tag};
  }
  # Only complain when more than 10 distinct unknown tags were seen.
  $reason = "Bogus tags " . join(" ", sort keys %seen) . "\n"
    if keys(%seen) > 10;
  return $reason;
}
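
To make the token loop concrete, here is a minimal standalone sketch of the same technique on a tiny document. The sample HTML and the tag names "xyzzy" and "frobnitz" are made up for illustration; the only assumption is that neither appears in %HTML::Tagset::isKnown.

```perl
use strict;
use warnings;
use HTML::Tagset;
use HTML::PullParser;

# Hypothetical sample input; "xyzzy" and "frobnitz" are invented tag names.
my $html = '<p>Hello <xyzzy>big</xyzzy> <b>sale</b> <frobnitz>now</frobnitz></p>';

my $p = HTML::PullParser->new(
    doc   => \$html,
    start => '"S", tagname',
    end   => '"E", tagname',
) or die "Cannot create parser";

my %unknown;
while (my $token = $p->get_token) {
    # Only start and end tags are reported, because only those
    # two argspecs were given; text is skipped entirely.
    my ($type, $tag) = @$token;
    $unknown{$tag}++ unless $HTML::Tagset::isKnown{$tag};
}

# Distinct tag names that HTML::Tagset does not know about.
my @unknown_tags = sort keys %unknown;
print "unknown: @unknown_tags\n";
```

Because both the start and the end token carry the tag name, each unknown element is counted twice in %unknown, which does not matter here since only the distinct names are used.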

Use it as follows:

  # decode the possibly encoded body, either
  # from a MIME multipart message or from the message body
  $body = unpack_mail_body($mail);

  # body is HTML
  # Check the HTML for bad DTDs etc.
  # (the list assignment counts the matches; a plain scalar
  # match would only yield true or false)
  $part_reason .= "wrong inline dtd\n"
    if (() = $body =~ m#<\s*!\s*[a-z]{1,5}\s*>#g) > 5;

  $part_reason .= check_bogus_html_tags($body);
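
The more-than-five threshold depends on actually counting the matches. A match with /g in scalar context reports only success or failure, so a count needs list context. A minimal sketch of the counting idiom, with a made-up $body for illustration:

```perl
use strict;
use warnings;

# Hypothetical body containing seven short "<!x>" pseudo-declarations.
my $body = join ' ', ('<!x>') x 7;

# Assigning the match list to an empty list, and that in turn to a
# scalar, yields the number of matches rather than a true/false value.
my $count = () = $body =~ m#<\s*!\s*[a-z]{1,5}\s*>#g;

my $part_reason = "";
$part_reason .= "wrong inline dtd\n" if $count > 5;

print "matches: $count\n";
print $part_reason;
```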

That should be all of it :-)

In reply to Re: regex help or pointer to module needed by Corion
in thread regex help or pointer to module needed by Xxaxx
