Knowing what version of perl you are using would be more important than knowing what company you work for. ;) Also, if you really want "Unicode entities", I think this would refer to "numeric character entity references", like "&#1234;" or "&#x1AB;", where the numeric code-point value of the Unicode character is expressed in decimal or hexadecimal digits.
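Numeric references like these can be decoded without any lookup table. Here is a minimal sketch of that idea; the sub name decode_numeric_refs is made up for illustration, and it assumes every code point in the input is one that chr() accepts:

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): replace decimal (&#NNN;) and
# hexadecimal (&#xHHH;) numeric character references with their characters.
# A single pass is used so that decoded output is never decoded a second time.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9a-f]+)|([0-9]+));/defined $1 ? chr(hex($1)) : chr($2)/gie;
    return $text;
}

binmode STDOUT, ":utf8";    # the result may contain wide characters
print decode_numeric_refs("caf&#233; &#x263A;"), "\n";
```

Doing both forms in one substitution means that something like "&#x26;#65;" comes out as the literal text "&#65;" rather than being decoded twice.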
There are web sites that appear to offer listings of "standard" character entities, with the unicode code point values for each entity name -- I found one list of tables here, but you may need to look further to find others.
Given a reference like the one just cited, I might save each reference page to a file and use perl to convert it to a mapping table, like this:
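use strict;
use warnings;

my %entity;    # entity name => character
while (<>) {
    # match DTD declarations such as: <!ENTITY eacute "&#233;">
    if ( /<!ENTITY\s+(\S+)\s+"([^"]+)/ ) {
        my ( $name, $char ) = ( $1, $2 );
        $char =~ s/&#x([0-9a-f]+);/chr(hex($1))/gie;    # hex references
        $char =~ s/&#([0-9]+);/chr($1)/ge;              # decimal references
        $entity{$name} = $char;
    }
}
binmode STDOUT, ":utf8";
# one "name<TAB>character" line per entity, sorted by character
print "$_\t$entity{$_}\n" for ( sort { $entity{$a} cmp $entity{$b} } keys %entity );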
(update: added the binmode call -- very important for getting the output right, and equally important when reading the data back in from a file.)
Now I just need to save (redirect) the output of that process to a file, and use the file as a lookup table in any script that is going to convert character entity references to unicode characters. Just read that file into a hash (just like the %entity hash in the script above), and use the hash to filter data like this:
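open( ENTS, "<:utf8", $entfile ) or die "$entfile: $!";
# split on the tab only, so whitespace characters such as nbsp survive
my %entity = map { chomp; split /\t/, $_, 2 } <ENTS>;
close ENTS;
my $enames = join "|", map { quotemeta } keys %entity;
binmode STDOUT, ":utf8";    # output has wide characters; input is assumed to be ASCII or already decoded
while (<>) {
    s/&($enames);/$entity{$1}/g;
    print;
}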
(updated to include the part about opening and reading the entity list file produced by the previous script -- making sure to treat the file data as utf8)
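To check the lookup approach without creating any files first, something like the following sketch can be run on its own. The three table entries and the sub name decode_named are only illustrative; a real table would come from the tab-separated file described above.

```perl
use strict;
use warnings;

# Illustrative table only; a real one would be read from the entity list file.
my %entity = ( amp => "&", eacute => "\x{E9}", hellip => "\x{2026}" );

# quotemeta guards against regex metacharacters in entity names
my $enames = join "|", map { quotemeta } sort keys %entity;

# Hypothetical helper: replace known named entity references and
# leave anything that is not in the table untouched.
sub decode_named {
    my ($text) = @_;
    $text =~ s/&($enames);/$entity{$1}/g;
    return $text;
}

binmode STDOUT, ":utf8";
print decode_named("caf&eacute; &amp; tea&hellip; &unknown;"), "\n";
```

Names that are missing from the table, like &unknown; here, pass through unchanged, which makes it easy to spot entities the list did not cover.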