Devasundaram has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I work at a BPO company. I need a tool that converts ISO entities to Unicode entities. For example, "&equals;" => "&#x003D;" (that is, the Unicode value written out as ampersand, hash, 'x', and 003D, not the keyboard character). Please help with step-by-step instructions. I tried the "Unicode::UTF8simple" package, but I did not get the result I wanted. Thanks in advance, Deva

Replies are listed 'Best First'.
Re: ISO to UNICODE
by ikegami (Patriarch) on Aug 28, 2007 at 06:46 UTC

    I was about to suggest decode_entities in HTML::Entities (which converts HTML/XHTML entities such as &eacute; to UNICODE characters), but while &equals; looks like an HTML/XHTML entity, it isn't.

    If you can't find a module that does what you want, you could model one based on HTML::Entities.
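
    As a rough illustration of the lookup-table idea, here is a minimal sketch that maps a few ISO entity names to hexadecimal numeric references. The three-entry table and the name iso_to_numeric are hypothetical, chosen only for this example; a real table would be built from the ISO entity set declarations, and this is not a description of how HTML::Entities works internally.

```perl
use strict;
use warnings;

# Hypothetical three-entry table: ISO entity name => Unicode code point.
# A real table would be generated from the ISO entity set declarations.
my %codepoint_for = (
    equals => 0x003D,
    plus   => 0x002B,
    commat => 0x0040,
);

# Replace each known named entity with a hexadecimal numeric character
# reference; unknown names are left untouched.
sub iso_to_numeric {
    my ($text) = @_;
    $text =~ s{&(\w+);}{
        exists $codepoint_for{$1}
            ? sprintf( '&#x%04X;', $codepoint_for{$1} )
            : "&$1;"
    }ge;
    return $text;
}

print iso_to_numeric('a &equals; b &plus; c &unknown;'), "\n";
# prints: a &#x003D; b &#x002B; c &unknown;
```

    Leaving unknown names alone keeps the conversion safe to run over text that mixes ISO entities with other markup.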

Re: ISO to UNICODE
by graff (Chancellor) on Aug 29, 2007 at 02:11 UTC
    Knowing what version of perl you are using would be more important than knowing what company you work for. ;) Also, if you really want "Unicode entities", I think this would refer to "numeric character entity references", like "&#1234;" or "&#x01AB;", where the numeric code-point value of the unicode character is expressed in decimal or hexadecimal digits.
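
    To make the two forms concrete, here is a small sketch that builds the decimal and hexadecimal reference for one character with ord and sprintf. The function names decimal_ref and hex_ref are made up for this example, and the four-digit zero padding is just one possible convention.

```perl
use strict;
use warnings;

# Decimal numeric character reference, e.g. "&#61;" for "=".
sub decimal_ref {
    my ($char) = @_;
    return sprintf '&#%d;', ord $char;
}

# Hexadecimal numeric character reference, zero-padded to at least
# four digits, e.g. "&#x003D;" for "=".
sub hex_ref {
    my ($char) = @_;
    return sprintf '&#x%04X;', ord $char;
}

my $equals = '=';
print decimal_ref($equals), "\n";    # &#61;
print hex_ref($equals),     "\n";    # &#x003D;
```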

    There are web sites that appear to offer listings of "standard" character entities, with the unicode code point values for each entity name -- I found one list of tables here, but you may need to look further to find others.

    Given a reference like the one just cited, I might save each reference page to a file and use perl to convert it to a mapping table, like this:

    my %entity;    # entity name => unicode character
    while (<>) {
        # pick the name and the quoted value out of each <!ENTITY ...> declaration
        if ( /<!ENTITY\s+(\S+)\s+"([^"]+)/ ) {
            my ( $name, $char ) = ( $1, $2 );
            $char =~ s/\&\#x([0-9a-f]{2,4});/chr(hex($1))/ei;    # hex reference
            $char =~ s/\&\#([0-9]+);/chr($1)/e;                  # decimal reference
            $entity{$name} = $char;
        }
    }
    binmode STDOUT, ":utf8";
    print "$_\t$entity{$_}\n"
        for ( sort { $entity{$a} cmp $entity{$b} } keys %entity );
    (update: added the binmode call -- very important for getting the output right, and equally important when reading the data back in from a file.)

    Now I just need to save (redirect) the output of that process to a file, and use the file as a lookup table in any script that is going to convert character entity references to unicode characters. Just read that file into a hash (just like the %entity hash in the script above), and use the hash to filter data like this:

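    open( ENTS, "<:utf8", $entfile ) or die "$entfile: $!";
    # each line is "name<TAB>character"; split on the tab only
    my %entity = map { chomp; split /\t/ } <ENTS>;
    close ENTS;
    my $enames = join "|", map { quotemeta($_) } keys %entity;
    binmode STDOUT, ":utf8";
    while (<>) {
        s/\&($enames);/$entity{$1}/g;
        print;
    }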
    (updated to include the part about opening and reading the entity list file produced by the previous script -- making sure to treat the file data as utf8)
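
    The point about treating the data as utf8 on both the write side and the read side can be checked with a small round trip. This is only a sketch: it uses an in-memory filehandle instead of a real file, and the variable names are invented for the example.

```perl
use strict;
use warnings;

my $char = chr(0xE9);    # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# Write the character through a :utf8 layer into an in-memory "file".
my $buffer = '';
open( my $out, '>:utf8', \$buffer ) or die "in-memory write: $!";
print {$out} $char;
close $out;

# The buffer now holds the UTF-8 encoding of U+00E9 (bytes C3 A9).
my $byte_count = length($buffer);

# Read it back with the :utf8 layer, so the bytes are decoded.
open( my $in_utf8, '<:utf8', \$buffer ) or die "in-memory read: $!";
my $decoded = <$in_utf8>;
close $in_utf8;

# Read it back without the layer, so perl sees the raw bytes.
open( my $in_raw, '<', \$buffer ) or die "in-memory read: $!";
my $raw = <$in_raw>;
close $in_raw;

printf "bytes written: %d, chars with :utf8: %d, chars without: %d\n",
    $byte_count, length($decoded), length($raw);
```

    If the read side skips the :utf8 layer, every non-ASCII entry in the lookup table comes back as several separate bytes rather than one character, and the substitution inserts those bytes instead of the intended character.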