There are web sites that offer listings of the "standard" character entities, giving the Unicode code point for each entity name -- I found one list of tables here, but you may need to look further to find others.
Given a reference like the one just cited, I would save each reference page to a file and use Perl to convert it to a mapping table, like this:
(update: added the binmode call -- very important for getting the output right, and equally important when reading the data back in from a file.)

    use strict;
    use warnings;

    my %entity;
    while (<>) {
        if ( /<!ENTITY\s+(\S+)\s+"([^"]+)/ ) {
            my ( $name, $char ) = ( $1, $2 );
            $char =~ s/\&\#x([0-9a-f]+);/chr(hex($1))/ie;    # hexadecimal reference, e.g. &#xE9;
            $char =~ s/\&\#([0-9]+);/chr($1)/e;              # decimal reference, e.g. &#233;
            $entity{$name} = $char;
        }
    }
    binmode STDOUT, ":utf8";
    # one "name<TAB>character" line per entity, ordered by character
    print "$_\t$entity{$_}\n" for ( sort { $entity{$a} cmp $entity{$b} } keys %entity );
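To make the conversion step easier to try out, here is a minimal sketch that runs the same two substitutions on a few in-memory lines instead of a saved reference page. The helper name parse_entity_line and the sample declarations are made up for illustration; real entity files may format their declarations differently, so the pattern may need adjusting.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): turn one declaration
# line into a (name, character) pair, or an empty list if it does not match.
sub parse_entity_line {
    my ($line) = @_;
    return unless $line =~ /<!ENTITY\s+(\S+)\s+"([^"]+)/;
    my ( $name, $char ) = ( $1, $2 );
    $char =~ s/&#x([0-9a-f]+);/chr(hex($1))/ie;    # hexadecimal reference
    $char =~ s/&#([0-9]+);/chr($1)/e;              # decimal reference
    return ( $name, $char );
}

# Made-up sample input in the style of an entity declaration file
my @sample = (
    '<!ENTITY eacute "&#xE9;">',
    '<!ENTITY copy   "&#169;">',
    '<!-- a comment line, which is skipped -->',
);

my %entity = map { parse_entity_line($_) } @sample;

# Print code points rather than raw characters so the result is easy to check
printf "%s\tU+%04X\n", $_, ord $entity{$_} for sort keys %entity;
```

Lines that do not look like an entity declaration simply contribute nothing to the hash.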
Now I just need to save (redirect) the output of that process to a file, and use the file as a lookup table in any script that converts character entity references to Unicode characters. Read that file into a hash (just like the %entity hash in the script above), and use the hash to filter data like this:
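(updated to include the part about opening and reading the entity list file produced by the previous script -- making sure to treat the file data as utf8)

    use strict;
    use warnings;

    my $entfile = shift;    # assumed here: the table file is the first command-line argument
    open( my $ents, "<:utf8", $entfile ) or die "$entfile: $!";
    # split on the tab only, so entities whose character is itself whitespace survive
    my %entity = map { chomp; split /\t/, $_, 2 } <$ents>;
    close $ents;
    my $enames = join "|", keys %entity;
    binmode STDOUT, ":utf8";
    while (<>) {
        s/\&($enames);/$entity{$1}/g;
        print;
    }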
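As a quick check of the lookup step, the following sketch builds the hash from a few in-memory table lines rather than from a file and applies the same substitution to one string. The sample table, the function name decode_with_table and the test string are assumptions for illustration only.

```perl
use strict;
use warnings;

# Stand-in for the table file: one "name<TAB>character" line per entity
my @table_lines = ( "eacute\t\x{E9}\n", "copy\t\x{A9}\n", "amp\t&\n" );

my %entity;
for my $line (@table_lines) {
    chomp( my $copy = $line );
    my ( $name, $char ) = split /\t/, $copy, 2;
    $entity{$name} = $char;
}

# One alternation of all known names; quotemeta is only a precaution
my $enames = join "|", map { quotemeta($_) } sort keys %entity;

# Hypothetical helper: replace every known entity reference in a string
sub decode_with_table {
    my ($text) = @_;
    $text =~ s/&($enames);/$entity{$1}/g;
    return $text;
}

my $out = decode_with_table('caf&eacute; &copy; 2005 &unknown; R&amp;D');

binmode STDOUT, ":utf8";
print "$out\n";
```

References whose names are not in the table, such as &unknown; here, are left untouched, which makes it easy to spot entities the table does not cover.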
In reply to Re: ISO to UNICODE
by graff
in thread system status
by Devasundaram