Knowing which version of perl you are using would be more important than knowing which company you work for. ;) Also, if you really want "Unicode entities", I think that refers to "numeric character references", such as "&#1234;" or "&#x01AB;", where the code-point value of the Unicode character is written in decimal or hexadecimal digits.
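
To make the notation concrete, here is a minimal sketch that prints the decimal and hexadecimal reference for two characters. The helper names to_decimal_ref and to_hex_ref are made up for this illustration; they are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helpers: build the decimal and the hex reference for one character.
sub to_decimal_ref { return sprintf "&#%d;",    ord $_[0] }
sub to_hex_ref     { return sprintf "&#x%04X;", ord $_[0] }

binmode STDOUT, ":utf8";    # the characters printed below are outside ASCII
for my $char ( "\x{04D2}", "\x{01AB}" ) {
    print join( "\t", $char, to_decimal_ref($char), to_hex_ref($char) ), "\n";
}
```

Both forms name the same code point; only the number base differs.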

There are web sites that appear to offer listings of "standard" character entities, with the unicode code point values for each entity name -- I found one list of tables here, but you may need to look further to find others.

Given a reference like the one just cited, I might save each reference page to a file and use perl to convert it to a mapping table, like this:

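    use strict;
    use warnings;

    my %entity;    # entity name => literal character
    while (<>) {
        if ( /<!ENTITY\s+(\S+)\s+"([^"]+)/ ) {
            my ( $name, $char ) = ( $1, $2 );
            # turn hex, then decimal, numeric references into characters
            $char =~ s/\&\#x([0-9a-f]{2,4});/chr(hex($1))/ei;
            $char =~ s/\&\#([0-9]+);/chr($1)/e;
            $entity{$name} = $char;
        }
    }
    binmode STDOUT, ":utf8";
    # one "name<TAB>character" pair per line, sorted by character
    print "$_\t$entity{$_}\n"
        for ( sort { $entity{$a} cmp $entity{$b} } keys %entity );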
(update: added the binmode call -- very important for getting the output right, and equally important when reading the data back in from a file.)
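
As a quick check of the pattern, the sketch below runs the same match on two sample declarations held in an array instead of a file. The sample lines are assumptions written in the style of an XHTML entity set; the pages you save may be laid out differently, and the regex may need adjusting to suit.

```perl
use strict;
use warnings;

# Assumed sample input, not copied from any particular reference page.
my @lines = (
    '<!ENTITY eacute "&#233;">',
    '<!ENTITY euro   "&#x20AC;">',
);

my %entity;    # entity name => literal character
for (@lines) {
    next unless /<!ENTITY\s+(\S+)\s+"([^"]+)/;
    my ( $name, $char ) = ( $1, $2 );
    $char =~ s/\&\#x([0-9a-f]{2,4});/chr(hex($1))/ei;    # hex form
    $char =~ s/\&\#([0-9]+);/chr($1)/e;                  # decimal form
    $entity{$name} = $char;
}

# Print code points rather than raw characters so the result is easy to inspect.
printf "%s\tU+%04X\n", $_, ord $entity{$_} for sort keys %entity;
```

Printing the code points this way is also a handy sanity check on the real table before relying on it.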

Now I just need to save (redirect) the output of that process to a file, and use the file as a lookup table in any script that is going to convert character entity references to unicode characters. Just read that file into a hash (just like the %entity hash in the script above), and use the hash to filter data like this:

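    # $entfile holds the path of the table written by the previous script
    open( my $ents, "<:utf8", $entfile ) or die "$entfile: $!";
    # each line is "name<TAB>character"; split on the tab only, so that
    # whitespace-like characters (e.g. a no-break space) survive as values
    my %entity = map { chomp; split /\t/, $_, 2 } <$ents>;
    close $ents;
    my $enames = join "|", map { quotemeta } keys %entity;
    binmode STDOUT, ":utf8";
    while (<>) {
        s/\&($enames);/$entity{$1}/g;
        print;
    }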
(updated to include the part about opening and reading the entity list file produced by the previous script -- making sure to treat the file data as utf8)
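
For trying the substitution without a table file, here is a self-contained sketch under a few assumptions: the three-entry hash stands in for the generated list, and expand_named_entities is a hypothetical helper name chosen for this example.

```perl
use strict;
use warnings;

# Hand-made stand-in for the table file; real data would come from the
# generated "name<TAB>character" list.
my %entity = (
    eacute => "\x{E9}",
    euro   => "\x{20AC}",
    amp    => "&",
);

# The trailing ";" in the pattern anchors each match, so the order of the
# alternatives does not matter.
my $enames = join "|", map { quotemeta } sort keys %entity;

# Hypothetical helper: replace known named references, leave unknown ones alone.
sub expand_named_entities {
    my ($text) = @_;
    $text =~ s/\&($enames);/$entity{$1}/g;
    return $text;
}

binmode STDOUT, ":utf8";
print expand_named_entities("caf&eacute; costs 3 &euro; &unknown;"), "\n";
```

Names that are missing from the table pass through unchanged, which makes gaps in the list easy to spot in the output.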

In reply to Re: ISO to UNICODE by graff
in thread system status by Devasundaram
