in reply to Re^2: XML::DOM and Accented Characters
in thread XML::DOM and Accented Characters

I think almut hit the mark: MS-Windows apps like wordpad, notepad, etc, all depend on having a file-initial byte-order-mark, expressed as the 3-byte utf8 rendering of the code point "U+FEFF", to serve as a sort of "magic number" so that the app "knows" the file contains utf8 data.
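As a rough illustration of the byte-order mark described above, here is a minimal sketch that is not taken from the thread. It assumes only the core Encode module, and the variable names are made up for the example. It encodes the code point U+FEFF as UTF-8 and prints the resulting bytes in hex, which is the sequence a hex viewer should show at the start of a file carrying a UTF-8 BOM.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF is the byte-order mark; encoded as UTF-8 it becomes three bytes.
my $bom_bytes = encode('UTF-8', "\x{FEFF}");

# Render the bytes as upper-case hex pairs, the way a hex viewer shows them.
my $bom_hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bom_bytes;

print "$bom_hex\n";   # EF BB BF
```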

Re^4: XML::DOM and Accented Characters
by freeflyer (Novice) on Aug 07, 2010 at 10:09 UTC

    Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way.

    I'm now using both TextPad and NotePad++ (with plugin) to view the hex codes for the output file (accentTestOutput.xml). I've also run it on both my work and home PCs, both running Windows.

    After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD. It also looks as if the BOM is not there: I am unable to see the code EF BB BF at the start of the file (which is what I should see, right?).

    Using the package UTF8BOM to insert the BOM, I can see the BOM is there in both cases (TextPad and NotePad++), since EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.

    Incidentally, at no point have I been able to open the output file in Internet Explorer; it complains of an invalid character at the point of the e-acute.

    Here's the output after trying to insert the BOM using

     print $fh "\x{feff}";

    TextPad

    0: 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 <?xml version="1
    10: 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 .0" encoding="UT
    20: 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 3E 20 E9 F-8"?>..<TEST> é
    30: 20 3C 2F 54 45 53 54 3E 0D 0A </TEST>..

    NotePad++

    3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef bf bd 20 3c 2f 54 45 53 54 3e 0d 0a

    Here's the output after inserting the BOM with the UTF8BOM Perl package, using

    UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

    You can see the BOM code at the beginning of the file

    TextPad

    0: EF BB BF 3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E <?xml version
    10: 3D 22 31 2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D ="1.0" encoding=
    20: 22 55 54 46 2D 38 22 3F 3E 0D 0A 3C 54 45 53 54 "UTF-8"?>..<TEST
    30: 3E 20 E9 20 3C 2F 54 45 53 54 3E 0D 0A > é </TEST>..

    NotePad++

    ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a

    I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given; any other ideas? If I've missed out some info that may be useful, let me know.

      After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD.

      E9 for your character is windows-1252 according to Wikipedia, which would mean that the Perl I/O layer is converting your parsed UTF-8 string into windows-1252 and ignoring the >:utf8. Maybe you should post your complete code where you parse and save the XML.
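To make the point about I/O layers concrete, here is a small sketch that is not from the thread. The helper `bytes_written` is hypothetical and exists only for this illustration. It writes the same one-character string through two different open modes to an in-memory file and shows which bytes end up in the output. Without an encoding layer Perl writes U+00E9 as the single byte E9 (the same value it has in Latin-1 and windows-1252); with an explicit UTF-8 layer it writes C3 A9.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text through the given open mode to an
# in-memory file and return the bytes that were written, as hex pairs.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $text;
    close $fh or die "close failed: $!";
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $buffer;
}

my $e_acute = "\x{E9}";   # U+00E9 (e-acute) as a Perl character string

my $raw  = bytes_written('>', $e_acute);                   # no encoding layer
my $utf8 = bytes_written('>:encoding(UTF-8)', $e_acute);   # explicit UTF-8 layer

print "no layer:    $raw\n";    # E9
print "UTF-8 layer: $utf8\n";   # C3 A9
```

If the output file shows E9, one plausible reading is that the write happened through a filehandle without a UTF-8 layer, which is why seeing the complete code matters.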

        Picwick, here's the code prior to trying any of the suggestions made. It's a simple test XML with nothing but a couple of spaces and an e-acute.

        I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

        <?xml version="1.0" encoding="UTF-8"?>
        <TEST> é </TEST>

        And below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml.

        use XML::DOM;

        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml");

        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");

        # Print to string
        print $doc->toString;

        # cleanup
        $doc->dispose;

      not the C3 A9 I'm looking for

      Then you're not looking for UTF-8!!!!!

      $ perl -e"print qq!\x{C3A9}!
      Wide character in print at -e line 1.
      ∞Ä⌐
      $ perl -Mopen=:std,:encoding(UTF-8) -e"print qq!\x{C3A9}!" |hexdump
      00000000: EC 8E A9 - | |
      00000003;
      $ perl -Mopen=:std,:encoding(UTF-16LE) -e"print qq!\x{C3A9}!" |hexdump
      00000000: A9 C3 - | |
      00000002;
      $ perl -Mopen=:std,:encoding(UTF-16BE) -e"print qq!\x{C3A9}!" |hexdump
      00000000: C3 A9 - | |
      00000002;
      $

      UTF-16BE shows C3 A9, and that is not UTF-8 as encoding="UTF-8" claims

        I'm not sure what you mean, all the encoding tables I have looked at show the e-acute as hex C3 A9 under UTF8?
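One way to reconcile the two views is to keep the code point and its encoded bytes apart. The sketch below is an illustration only, not code from the thread: it assumes the core Encode module, and `hex_bytes` is a hypothetical helper. It encodes the e-acute code point U+00E9, and separately the code point U+C3A9 used in the one-liners above, which is a different character (it falls in the Hangul syllables range).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string and return its bytes as hex pairs.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    my $bytes = encode($encoding, $chars);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# The e-acute is code point U+00E9; its UTF-8 encoding is the two bytes C3 A9.
my $e_acute_utf8 = hex_bytes('UTF-8', "\x{E9}");

# "\x{C3A9}" is a single, different code point (U+C3A9), not the two bytes C3 and A9.
my $other_utf8    = hex_bytes('UTF-8',    "\x{C3A9}");
my $other_utf16be = hex_bytes('UTF-16BE', "\x{C3A9}");

print "U+00E9 as UTF-8:    $e_acute_utf8\n";    # C3 A9
print "U+C3A9 as UTF-8:    $other_utf8\n";      # EC 8E A9
print "U+C3A9 as UTF-16BE: $other_utf16be\n";   # C3 A9
```

Under these assumptions the encoding tables and the hexdump output agree: C3 A9 is the UTF-8 byte sequence for U+00E9, and it is also what UTF-16BE produces for the unrelated code point U+C3A9.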