Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am getting files from various sources in different European languages. The requirement for these files is that everything must be ASCII-compatible.

So for special characters in Danish and Finnish, the UTF-8 codes (UTF codes) are typed in directly. A line of text could therefore contain

"This line contains 0xC30x86n exotic character."
And this should be printed into HTML and PDF with the right fused AE character:
"This line contains Æn exotic character."

Changing the format of the files is not an option as it is easy for everyone to type the UTF codes for a character they do not even know.

My question is this: how should I read these text files and evaluate these special characters on the fly?

Some pointers would be much appreciated. Many thanks, Chandra

Replies are listed 'Best First'.
Re: Evaluating UTF codes in a file
by almut (Canon) on Nov 26, 2009 at 16:51 UTC
    #!/usr/bin/perl
    use Encode;

    my $s = "This line contains 0xC30x86n exotic character.";
    $s =~ s/0x([\da-fA-F]{2})/chr(hex($1))/ge;
    my $u = decode('UTF-8', $s);

    $u would then contain a Perl character (Unicode) string that you can encode to whatever format you need for the HTML or PDF output.
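The same substitution can be applied to a whole file, one line at a time. The following is only a minimal sketch and not part of the original reply: the sub names expand_utf_codes and convert_handle are made up for illustration, and the FB_CROAK check is an added assumption so that a stray or incomplete code such as a lone 0xC3 is reported instead of being silently replaced.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode FB_CROAK );

# Turn every "0xHH" token into the byte it names, then decode the
# resulting byte string as UTF-8. FB_CROAK makes decode() die on
# malformed UTF-8 instead of substituting U+FFFD.
sub expand_utf_codes {
    my ($line) = @_;
    $line =~ s/0x([0-9a-fA-F]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $line, FB_CROAK);
}

# Read lines from one handle, expand the codes, and print the decoded
# text to another handle (which should carry an encoding layer).
sub convert_handle {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} expand_utf_codes($line);
    }
    return;
}

# Demonstration with an in-memory "file" so the sketch runs as-is.
my $text = "This line contains 0xC30x86n exotic character.\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";
binmode(STDOUT, ':encoding(UTF-8)');
convert_handle($in, \*STDOUT);
close($in);
```

For real input, the in-memory handle would be replaced by a handle opened on the actual file, and STDOUT by a handle opened with the encoding layer that the HTML or PDF tool chain expects.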

Re: Evaluating UTF codes in a file
by ikegami (Patriarch) on Nov 26, 2009 at 16:53 UTC
    $_ = "This line contains 0xC30x86n exotic character.";
    s/0x([0-9a-fA-F]{2})/chr(hex($1))/eg;
    utf8::decode($_);

    You could create an HTML file from that string as follows:

    use HTML::Entities qw( encode_entities );

    open(my $fh, '>:encoding(UTF-8)', 'file.html')
        or die "Cannot open file.html: $!";
    print $fh qq{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n};
    print $fh qq{<title>Test</title>\n};
    print $fh encode_entities($_);  # Escapes &<>'" and, by default, non-ASCII characters too

    The file will contain

    3C 6D 65 74 61 20 68 74-74 70 2D 65 71 75 69 76  <meta http-equiv
    3D 22 43 6F 6E 74 65 6E-74 2D 54 79 70 65 22 20  ="Content-Type" 
    63 6F 6E 74 65 6E 74 3D-22 74 65 78 74 2F 68 74  content="text/ht
    6D 6C 3B 20 63 68 61 72-73 65 74 3D 55 54 46 2D  ml; charset=UTF-
    38 22 3E 0D 0A 3C 74 69-74 6C 65 3E 54 65 73 74  8">..<title>Test
    3C 2F 74 69 74 6C 65 3E-0D 0A 54 68 69 73 20 6C  </title>..This l
    69 6E 65 20 63 6F 6E 74-61 69 6E 73 20 26 41 45  ine contains &AE
    6C 69 67 3B 6E 20 65 78-6F 74 69 63 20 63 68 61  lig;n exotic cha
    72 61 63 74 65 72 2E                             racter.

    An HTML reader such as a browser will render it as

    This line contains Æn exotic character.
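If HTML::Entities is not installed, a similar pure-ASCII result can be approximated with core Perl by writing numeric character references instead of named ones. This is a rough sketch and not part of the original reply: the sub name ascii_html is hypothetical, it expects an already decoded character string, and it escapes only the four markup characters shown plus anything outside ASCII.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape markup characters by name and every non-ASCII character as a
# hexadecimal numeric character reference, so the output is pure ASCII.
sub ascii_html {
    my ($text) = @_;    # expects a decoded character string
    my %named = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    # Markup characters first, so the "&" of later references is untouched.
    $text =~ s/([&<>"])/$named{$1}/g;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# Prints: This line contains &#xC6;n exotic character.
print ascii_html("This line contains \x{C6}n exotic character."), "\n";
```

A browser renders &#xC6; the same way as &AElig;, so the visible result matches the output above while the file itself stays within ASCII.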

Re: Evaluating UTF codes in a file
by Anonymous Monk on Nov 26, 2009 at 20:54 UTC
    Thank you gentlemen! Exactly what I was looking for.