in reply to Re^3: XML:: DOM and Accented Characters
in thread XML:: DOM and Accented Characters

Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way.

I'm now using both TextPad and NotePad++ (with plugin) to view the hex codes of the output file (accentTestOutput.xml). I've also run it on both my work and home PCs, both running Windows.

After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD. It also looks as if the BOM is not there: I am unable to see the code EF BB BF at the start of the file (which is what I should see, right?).
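One way to check this without relying on an editor's hex view is to read the raw bytes from Perl. The following is only a sketch: `hex_bytes` and `has_utf8_bom` are hypothetical helper names, and the temporary file stands in for the real output file, which is not read here.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the raw bytes of a file as an uppercase hex string, e.g. "EF BB BF 3C".
sub hex_bytes {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                       # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# True if the file starts with the UTF-8 encoded BOM (EF BB BF).
sub has_utf8_bom {
    my ($path) = @_;
    return hex_bytes($path) =~ /^EF BB BF\b/ ? 1 : 0;
}

# Hypothetical sample file: a BOM followed by e-acute encoded as UTF-8,
# written as raw bytes so nothing is converted on the way out.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xEF\xBB\xBF", "\xC3\xA9";
close $tmp_fh;

print hex_bytes($tmp_path), "\n";
print has_utf8_bom($tmp_path) ? "BOM present\n" : "no BOM\n";
```

Pointing `hex_bytes` at the real output file instead of the temporary one should show whether EF BB BF and C3 A9 are actually on disk.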

Using the package UTF8BOM to insert the BOM, I can see the BOM is there in both cases (TextPad and NotePad++), since EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.

Incidentally, at no point have I been able to open the output file in Internet Explorer. It complains of an invalid character at the point of the e-acute.

Here's the output after trying to insert the BOM using

 print $fh "\x{feff}";

TextPad

 0: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
10: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
20: 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 3E 20 E9 F-8"?>..<TEST> é
30: 20 3C 2F 54 45 53 54 3E 0D 0A                    </TEST>..

NotePad++

3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef bf bd 20 3c 2f 54 45 53 54 3e 0d 0a

Here's the output after inserting the BOM with the UTF8BOM Perl package, using

UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

You can see the BOM code at the beginning of the file.

TextPad

 0: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E <?xml version
10: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
20: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 "UTF-8"?>..<TEST
30: 3E 20 E9 20 3C 2F 54 45 53 54 3E 0D 0A          > é </TEST>..

NotePad++

ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a

I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given. Any other ideas? If I've missed out some info that may be useful, let me know.

Re^5: XML:: DOM and Accented Characters
by Pickwick (Beadle) on Aug 07, 2010 at 15:26 UTC
    After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD.

    According to Wikipedia, E9 is the windows-1252 code for your character (it is the same byte in ISO-8859-1). That would mean the Perl I/O layer is converting your parsed UTF-8 string to a single-byte encoding and the >:utf8 is being ignored. Maybe you should post your complete code where you parse and save the XML.
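The effect of the output layer can be seen in isolation, without XML::DOM. This is a minimal sketch using only core modules; `bytes_written` is a hypothetical helper, and the temporary files stand in for the real output file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text to a temporary file through the given PerlIO layer and
# return the bytes that actually landed on disk as a hex string.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out;

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                       # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $e_acute = "\x{E9}";   # the character e-acute (U+00E9) as a Perl string

# Without an encoding layer the character goes out as the single byte E9.
print 'no encoding layer: ', bytes_written(':raw', $e_acute), "\n";

# With the UTF-8 layer the same character is written as C3 A9.
print ':encoding(UTF-8):  ', bytes_written(':encoding(UTF-8)', $e_acute), "\n";
```

If the same holds for the XML case, printing `$doc->toString` to a handle opened with `>:encoding(UTF-8)` would be the place to look, rather than a handle the library opens itself; that is an assumption about where the conversion happens, not something confirmed in this thread.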

      Pickwick, here's the code prior to trying any of the suggestions made. It's a simple test XML with nothing but a couple of spaces and an e-acute.

      I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

      <?xml version="1.0" encoding="UTF-8"?>
      <TEST> é </TEST>

      And below is the perl code that reads this in and saves it back out as accentTestOutPut.xml.

      use XML::DOM;

      my $parser = XML::DOM::Parser->new;
      my $doc    = $parser->parsefile("c:\\accentTest.xml");

      # Print doc to file
      $doc->printToFile("c:\\accentTestOutPut.xml");

      # Print to string
      print $doc->toString;

      # Cleanup
      $doc->dispose;

        Pickwick, here's the code prior to trying any of the suggestions made.

        We don't need the code prior to the suggestions, because we already know why that code can't work as expected. We need the code where you override the automatic encoding of the Perl I/O layer with >:utf8, because that code really should work.

        Give us your latest code; there is surely an error somewhere.

Re^5: XML:: DOM and Accented Characters
by Anonymous Monk on Aug 07, 2010 at 11:41 UTC
    not the C3 A9 I'm looking for

    Then you're not looking for UTF-8!!!!!

    $ perl -e"print qq!\x{C3A9}!"
    Wide character in print at -e line 1.
    ∞Ä⌐
    $ perl -Mopen=:std,:encoding(UTF-8) -e"print qq!\x{C3A9}!" |hexdump
    00000000: EC 8E A9 - | |
    00000003;
    $ perl -Mopen=:std,:encoding(UTF-16LE) -e"print qq!\x{C3A9}!" |hexdump
    00000000: A9 C3 - | |
    00000002;
    $ perl -Mopen=:std,:encoding(UTF-16BE) -e"print qq!\x{C3A9}!" |hexdump
    00000000: C3 A9 - | |
    00000002;
    $

    UTF-16BE shows C3 A9, and that is not UTF-8 as encoding="UTF-8" claims.

      I'm not sure what you mean, all the encoding tables I have looked at show the e-acute as hex C3 A9 under UTF8?
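The two readings can be told apart by separating the code point from its encoded bytes: e-acute is the code point U+00E9, while `\x{C3A9}` in the one-liners above names a different code point, U+C3A9. A small sketch with the core Encode module shows both; `encoded_hex` is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a single code point and return the resulting bytes as hex.
sub encoded_hex {
    my ($encoding, $code_point) = @_;
    my $bytes = encode($encoding, chr($code_point));
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# e-acute is U+00E9; its UTF-8 encoding is the two bytes C3 A9.
printf "U+00E9 as UTF-8:    %s\n", encoded_hex('UTF-8', 0x00E9);

# U+C3A9 is a different character; its UTF-8 encoding is EC 8E A9,
# and only its UTF-16BE encoding happens to be the bytes C3 A9.
printf "U+C3A9 as UTF-8:    %s\n", encoded_hex('UTF-8', 0xC3A9);
printf "U+C3A9 as UTF-16BE: %s\n", encoded_hex('UTF-16BE', 0xC3A9);
```

So seeing C3 A9 in a file declared as UTF-8 is consistent with the encoding tables for e-acute.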

        I'm not sure what you mean, all the encoding tables I have looked at show the e-acute as hex C3 A9 under UTF8?

        Sure they do, it's Perl that must be broken :)

        http://www.fileformat.info/info/unicode/char/c3a9/index.htm
        Encodings
        HTML Entity (decimal) &#50089;
        HTML Entity (hex) &#xc3a9;
        How to type in Microsoft Windows Alt +C3A9
        UTF-8 (hex) 0xEC 0x8E 0xA9 (ec8ea9)
        UTF-8 (binary) 11101100:10001110:10101001
        UTF-16 (hex) 0xC3A9 (c3a9)
        UTF-16 (decimal) 50,089
        UTF-32 (hex) 0x0000C3A9 (c3a9)
        UTF-32 (decimal) 50,089
        C/C++/Java source code "\uC3A9"
        Python source code u"\uC3A9"