Re: HTML::Lint and utf-8 document woes

The problem lies in this subroutine defined in Lint.pm:

sub _text {
    my ($self,$text) = @_;

    while ( $text =~ /([^\x09\x0A\x0D -~])/g ) {
        my $bad = $1;
        $self->gripe(
            'text-use-entity', 
                char => sprintf( '\x%02lX', ord($bad) ),
                entity => $char2entity{ $bad },
        );
    }
}

Notice how anything outside of tab, line feed, carriage return and printable ASCII (space through tilde) is going to be considered "bad". The "operative theory" here is that you are supposed to use an entity for every such character, no matter what encoding the html header declares.
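To see what that character class actually flags, here is a minimal sketch that runs the same pattern over a small sample. The sample string is made up for illustration, and it is assumed to be a decoded character string rather than raw bytes:

```perl
use strict;
use warnings;

# Same character class as in _text: tab, LF, CR, and space through tilde pass.
my $bad_char = qr/[^\x09\x0A\x0D -~]/;

# Hypothetical sample: "cafe" with an e-acute (U+00E9), a space, a snowman
# (U+2603) and a newline, written as a decoded character string.
my $sample = "caf\x{E9} \x{2603}\n";

# Collect the code point of every character the pattern would gripe about.
my @flagged = map { sprintf 'U+%04X', ord($_) } $sample =~ /($bad_char)/g;

print "@flagged\n";    # U+00E9 U+2603
```

The space and the newline go through untouched; only the two non-ASCII characters get reported.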

Well, crap. I hate that sort of attitude in a module, but if you really want to toe that line, there's a not-too-offensive way to do that... Filter the html data so that all the non-ASCII characters are turned into numeric entities:

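# Replace every character outside tab/LF/CR/printable ASCII with a numeric entity
$html =~ s/([^\x09\x0A\x0D -~])/sprintf("&#%d;",ord($1))/eg;

# Lint the filtered copy
$lint->parse($html);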

There! That shut him up. Maybe you don't want to go to such lengths, and most likely the right solution would be to fix that function in Lint.pm ... Your choice.
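One caveat with the filter: ord() works on whatever Perl thinks a "character" is, so if $html still holds raw utf-8 bytes you get one entity per byte instead of one per character. A rough sketch of the difference, assuming the input is utf-8 encoded bytes (entitize() is just a name I made up for the substitution, it is not part of HTML::Lint):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper wrapping the substitution: turn every character outside
# tab/LF/CR/printable ASCII into a numeric entity and return the new string.
sub entitize {
    my ($html) = @_;
    $html =~ s/([^\x09\x0A\x0D -~])/sprintf('&#%d;', ord($1))/eg;
    return $html;
}

# Assumption: the document arrived as raw utf-8 bytes (e-acute is C3 A9).
my $bytes = "<p>caf\xC3\xA9</p>";

my $from_bytes = entitize($bytes);                       # one entity per byte
my $from_chars = entitize( decode('UTF-8', $bytes) );    # one entity per character

print "$from_bytes\n";    # <p>caf&#195;&#169;</p>
print "$from_chars\n";    # <p>caf&#233;</p>
```

Only the second result names the character you meant, so decode first (or read the file through a :encoding(UTF-8) layer) and hand that filtered string to $lint->parse.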


Re^2: HTML::Lint and utf-8 document woes by GrandFather (Saint) on Nov 01, 2006 at 04:02 UTC
Excellent work! Actually in this case filtering the data is a good immediate workaround, but I'll update the bug report I've already submitted about the issue. ;)

DWIM is Perl's answer to Gödel
Re^3: HTML::Lint and utf-8 document woes by graff (Chancellor) on Nov 01, 2006 at 04:17 UTC
I'd like to point out that the ease and speed with which I spotted the problem and the work-around should serve as high praise for the module author -- especially considering that I had never seen (let alone used) this module before (in fact, I installed it on my Mac just to try out the OP code, and getting to the answer was a matter of minutes). The HTML::Lint code is very well put together, and my gripe about "attitude" was purely for amusement. (update: but I am still curious/mystified why the use of entity references for non-ASCII characters should count as a "STRUCTURAL" issue...)
Re^4: HTML::Lint and utf-8 document woes by rhesa (Vicar) on Nov 01, 2006 at 04:39 UTC
I'll second your comment about the quality of the code -- you beat me to the punch, and had a more well-rounded answer, but I did end up fairly quickly in the same spot (in my defense, I'm dealing with an annoying RAID sync at work ;)