HTML::Lint and utf-8 document woes

GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

As part of a build process I have a script that checks various HTML documents using HTML::Lint. Some of the recent documents use utf-8 and, despite a content="text/html; charset=utf-8" attribute in the HTML head meta tag, HTML::Lint chokes on the utf-8 characters. Is there a workaround for this?

Sample code follows:

use strict;
use warnings;
use utf8;
use HTML::Lint;

my $lint = HTML::Lint->new (only_types => HTML::Lint::Error::STRUCTURE);
my $html = do {local $/; <DATA>};

$lint->parse ($html);
$lint->eof ();

my @lintErrsOrg = map {$_->as_string ()} $lint->errors ();
print join "\nError Lint org: ", @lintErrsOrg;
  
__DATA__
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>utf8 test</title>
</head>
<body>
<p>ç</p>
</body>
</html>

Prints:

 (8:1) Invalid character \xE7 should be written as &ccedil;

Update: ya, hai, that was also posted by me. :)

DWIM is Perl's answer to Gödel


Replies are listed 'Best First'.
Re: HTML::Lint and utf-8 document woes by graff (Chancellor) on Nov 01, 2006 at 03:57 UTC
The problem lies in this subroutine defined in Lint.pm:

    sub _text {
        my ($self,$text) = @_;

        while ( $text =~ /([^\x09\x0A\x0D -~])/g ) {
            my $bad = $1;
            $self->gripe(
                'text-use-entity',
                char => sprintf( '\x%02lX', ord($bad) ),
                entity => $char2entity{ $bad },
            );
        }
    }

Notice how anything/everything outside of the ASCII range is going to be considered as "bad", and the "operative theory" here is that you are supposed to use an entity for all such characters, no matter what the html header says.

Well, crap. I hate that sort of attitude in a module, but if you really want to toe that line, there's a not-too-offensive way to do that... Filter the html data so that all the wide characters are turned into entities:

    $html =~ s/([^\x09\x0A\x0D -~])/sprintf("&#%d;",ord($1))/eg;
    $lint->parse ($html);

There! That shut him up. Maybe you don't want to go to such lengths, and most likely the right solution would be to fix that function in Lint.pm ... Your choice.
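For anyone who wants to try the filter on its own, here is a minimal sketch that needs only core Perl. The helper name entitize_non_ascii is made up for illustration and is not part of HTML::Lint. The sketch also assumes the point that the string should hold decoded characters before filtering: applied to raw UTF-8 bytes, the same substitution emits one entity per byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of HTML::Lint): replace every character
# outside tab, LF, CR and printable ASCII with a numeric entity.
sub entitize_non_ascii {
    my ($html) = @_;
    $html =~ s/([^\x09\x0A\x0D -~])/sprintf("&#%d;", ord($1))/eg;
    return $html;
}

# The UTF-8 encoding of c-cedilla is the two bytes C3 A7.
my $bytes = "<p>\xC3\xA7</p>";
my $chars = decode('UTF-8', $bytes);   # one character, U+00E7

print entitize_non_ascii($chars), "\n";   # <p>&#231;</p>
print entitize_non_ascii($bytes), "\n";   # <p>&#195;&#167;</p> (one entity per byte)
```

The second line of output is the case to avoid: the browser would show two Latin-1 characters instead of one c-cedilla.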
Re^2: HTML::Lint and utf-8 document woes by GrandFather (Saint) on Nov 01, 2006 at 04:02 UTC
Excellent work! Actually in this case filtering the data is a good immediate workaround, but I'll update the bug report I've already submitted about the issue. ;)
Re^3: HTML::Lint and utf-8 document woes by graff (Chancellor) on Nov 01, 2006 at 04:17 UTC
I'd like to point out that the ease and speed with which I spotted the problem and the work-around should serve as high praise for the module author -- especially considering that I had never seen (let alone used) this module before (in fact, I installed it on my mac just to try out the OP code, and getting to the answer was a matter of minutes). The HTML::Lint code is very well put together, and my gripe about "attitude" was purely for amusement. (update: but I am still curious/mystified why the use of entity references for non-ASCII characters should count as a "STRUCTURAL" issue...)
Re^4: HTML::Lint and utf-8 document woes by rhesa (Vicar) on Nov 01, 2006 at 04:39 UTC
Re: HTML::Lint and utf-8 document woes by rhesa (Vicar) on Nov 01, 2006 at 03:26 UTC
\xE7 doesn't look like a valid utf8 character. It seems like a latin1 version instead. Since your c-cedille shows properly in this page, and the document charset is iso-8859-1, that's most likely the case. FWIW, I'm rather partial to HTML::Tidy myself...
Re^2: HTML::Lint and utf-8 document woes by GrandFather (Saint) on Nov 01, 2006 at 03:41 UTC
The bytes in the .pl file are actually `C3 A7`. It is possible that they have been rendered differently in the process of pasting the code into PerlMonks and then rendered inside code tags by PerlMonks.
Re^3: HTML::Lint and utf-8 document woes by rhesa (Vicar) on Nov 01, 2006 at 04:05 UTC
OK, that is indeed the correct utf8 encoding. I suppose that means HTML::Lint does do bad things. In fact, it looks to me like the _text() method in HTML::Lint::Parser gets the wrong encoding back from HTML::Parser.
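A quick way to see how the `C3 A7` bytes in the file and the \xE7 in the error message can both be right: decoding those two bytes as UTF-8 gives a single character whose code point is E7. The sketch below uses only the core Encode module. It assumes the script's input reaches HTML::Lint as decoded characters (which `use utf8` would be expected to cause for the DATA handle), and it borrows the sprintf format from the _text() code quoted in the other subthread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The two bytes reported to be in the .pl file.
my $bytes = "\xC3\xA7";

# Decoded as UTF-8 they form one character.
my $char = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($char);

# Same format string as the quoted _text() code uses for its message.
print sprintf( '\x%02lX', ord($char) ), "\n";   # \xE7
```

So the E7 in the message may simply be the code point of the decoded character, which happens to coincide with the latin1 byte for the same letter.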