
Due to carelessness on my part I ended up with a shed load of HTML containing suspect characters. The difficulty was that any one file could hold a mix of cp1252 characters in the range 0x80-0x9F (frowned on by the W3C), Unicode characters and HTML entities (including numeric entities).
The strategy I arrived at was to:

1. decode any entities present
2. convert 0x80-0x9F to their Unicode equivalents
3. encode 'unsafe' characters

This should, hopefully, give consistent HTML and prevent problems during any future processing.
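The order of the steps matters, and a small sketch may make the reason clearer. A numeric entity such as `&#147;` decodes to chr(0x93), so decoding first lets the second pass catch it together with the raw characters. The helper name `fix_html` and the four-entry table are made up for illustration; a real run would need the full cp1252 table.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities encode_entities);

# Illustrative subset only: the four curly quotes from cp1252.
my %cp1252_to_unicode = (
    0x91 => "\x{2018}", 0x92 => "\x{2019}",
    0x93 => "\x{201C}", 0x94 => "\x{201D}",
);

# fix_html is a hypothetical name, not part of HTML::Entities.
sub fix_html {
    my ($html) = @_;

    # 1. Decode entities first: &#147; turns into chr(0x93) here.
    #    Called in scalar context, so $html itself is left alone.
    my $text = decode_entities($html);

    # 2. Map 0x80-0x9F to Unicode. Characters missing from this
    #    small table are left as they are in this sketch.
    $text =~ s{([\x80-\x9f])}{ $cp1252_to_unicode{ ord $1 } // $1 }ge;

    # 3. Re-encode anything unsafe as entities.
    return encode_entities($text);
}

print fix_html('&#147;quoted&#148; and ' . chr(0x93) . 'raw' . chr(0x94)), "\n";
```

Both the numeric entities and the raw characters should come out as the same named entities, which is the consistency the three steps are after.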

What do you reckon?

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

my $lookup = get_cp1252_lookup();

my $str = join('',
  chr(0x93), 'double', chr(0x94),
  chr(0x201C), 'double', chr(0x201D),
  '&lsquo;single&rsquo;'
);

# "replaces HTML entities...
#  with the corresponding Unicode character"
decode_entities($str); 

# replaces x80-x9f with unicode equivalent
$str =~ s/([\x80-\x9f])/$lookup->{sprintf("%x", ord($1))}/eg;

# "replaces unsafe characters...
#  with their entity representation"
encode_entities($str);

print "$str\n";

sub get_cp1252_lookup{
  
  open my $fh, '<', 'cp1252_to_unicode.txt'
    or die "can't open input: $!";
  
  my $lookup;
  
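  while (<$fh>){
    next if /^#/ or !/\S/;  # skip comment and blank lines
    my ($cp1252, $utf8_str, $name) = split /\t/;
    # keys are lower-case hex without the 0x prefix, to match
    # sprintf("%x", ...) even if the file has e.g. 0x9F
    $cp1252 =~ s/^0x//i;
    $cp1252 = lc $cp1252;
    # a blank second column marks an undefined code point
    my $utf8 = $utf8_str =~ / /? '':chr(oct($utf8_str));
    $lookup->{$cp1252} = $utf8;
  }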
  return $lookup;
}

__END__
output:

&ldquo;double&rdquo;&ldquo;double&rdquo;&lsquo;single&rsquo;

extract from cp1252_to_unicode.txt:

0x91    0x2018    #LEFT SINGLE QUOTATION MARK
0x92    0x2019    #RIGHT SINGLE QUOTATION MARK
0x93    0x201C    #LEFT DOUBLE QUOTATION MARK
0x94    0x201D    #RIGHT DOUBLE QUOTATION MARK
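One possible variation on the lookup, sketched below: key the table by code point number rather than by hex string, so the case of the hex digits in the file stops mattering. The sample lines, the name `parse_cp1252_table` and the blank second column for an undefined entry are assumptions on my part; the real file may differ.

```perl
use strict;
use warnings;

# Made-up sample in the tab-separated layout of the extract above.
my $sample = join "\n",
    "# a comment line",
    "0x81\t      \t#UNDEFINED",
    "0x93\t0x201C\t#LEFT DOUBLE QUOTATION MARK",
    "0x9F\t0x0178\t#LATIN CAPITAL LETTER Y WITH DIAERESIS",
    "";

# parse_cp1252_table is a hypothetical helper name.
sub parse_cp1252_table {
    my ($text) = @_;
    my %lookup;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;   # skip comments and blank lines
        my ($cp1252, $unicode) = split /\t/, $line;
        next unless defined $unicode;
        # hex() reads "0x9F" and "0x9f" alike, so the key is just a number.
        # Anything that is not a hex value maps to the empty string.
        $lookup{ hex $cp1252 } =
            $unicode =~ /^0x[0-9A-Fa-f]+$/ ? chr(hex $unicode) : '';
    }
    return \%lookup;
}

my $lookup = parse_cp1252_table($sample);

my $str = chr(0x93) . 'x' . chr(0x9F) . chr(0x81);
$str =~ s{([\x80-\x9f])}{ $lookup->{ ord $1 } // '' }ge;

# Show the resulting code points in hex, joined by dots.
printf "%vX\n", $str;
```

With this shape the substitution looks up `ord $1` directly and the `sprintf("%x", ...)` step is no longer needed.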

Many thanks to all the monks who have helped.
John

In reply to Fixing suspect characters in HTML by wfsp
