Re^4: XML:: DOM and Accented Characters

Thanks for the help, but I'm still unable to get it to work even after adding the BOM, although I am learning along the way.

I'm now using both TextPad and NotePad++ (with plugin) to view the hex codes for the output file (accentTestOutput.xml). I've also run it on both my work and home PCs, both running Windows.

After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD. It also looks as if the BOM is not there: I am unable to see the code EF BB BF at the start of the file (which is what I should see, right?).

Using the package UTF8BOM to insert the BOM, I can see the BOM is there in both cases (TextPad and NotePad++), since EF BB BF appears at the start of the file. However, both programs now display E9 as the code for the e-acute, not the C3 A9 I'm looking for.

Incidentally, at no point have I been able to open the output file in Internet Explorer; it complains of an invalid character at the point of the e-acute.

Here's the output after trying to insert the BOM using

print $fh "\x{feff}";

TextPad

       
 0: 3C 3F 78 6D 6C 20 76 65  72 73 69 6F 6E 3D 22 31  <?xml version="1
10: 2E 30 22 20 65 6E 63 6F  64 69 6E 67 3D 22 55 54  .0" encoding="UT
20: 46 2D 38 22 3F 3E 0D 0A  3C 54 45 53 54 3E 20 E9  F-8"?>..<TEST> é
30: 20 3C 2F 54 45 53 54 3E  0D 0A                     </TEST>..

NotePad++

       
3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31
2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54
46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54 3e 20 ef
bf bd 20 3c 2f 54 45 53 54 3e 0d 0a

Here's the output after inserting the BOM with the UTF8BOM Perl package, using

UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');

You can see the BOM code at the beginning of the file

TextPad

       
0: EF BB BF 3C 3F 78 6D 6C  20 76 65 72 73 69 6F 6E  ï»¿<?xml version
10: 3D 22 31 2E 30 22 20 65  6E 63 6F 64 69 6E 67 3D  ="1.0" encoding=
20: 22 55 54 46 2D 38 22 3F  3E 0D 0A 3C 54 45 53 54  "UTF-8"?>..<TEST
30: 3E 20 E9 20 3C 2F 54 45  53 54 3E 0D 0A           > é </TEST>..

NotePad++

       
ef bb bf 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e
3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d
22 55 54 46 2d 38 22 3f 3e 0d 0a 3c 54 45 53 54
3e 20 e9 20 3c 2f 54 45 53 54 3e 0d 0a

I'm at the edge of what I know, so I don't really know where to go from here. I appreciate the help you've given. Any other ideas? If I've left out some info that may be useful, let me know.


Re^5: XML:: DOM and Accented Characters
by Pickwick (Beadle) on Aug 07, 2010 at 15:26 UTC

After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD.

E9 for your character is windows-1252 according to Wikipedia, which would mean that the Perl I/O layer is converting your parsed UTF-8 string into windows-1252 and ignoring the >:utf8. Maybe you should post your complete code where you parse and save the XML.
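
To make the E9 versus C3 A9 difference concrete, here is a minimal sketch (not taken from any code posted in this thread) that prints the character U+00E9 to an in-memory filehandle, once with no encoding layer and once with a UTF-8 layer, and shows the bytes that end up in the buffer. The helper name bytes_written is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory file opened with the
# given mode and return the bytes that were written, as a hex string.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open(my $fh, $mode, \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $e_acute = "\x{E9}";    # the character U+00E9, not a UTF-8 byte sequence

print bytes_written('>',                 $e_acute), "\n";   # no encoding layer
print bytes_written('>:encoding(UTF-8)', $e_acute), "\n";   # UTF-8 layer
```

Without a layer the character goes out as the single byte E9; with the UTF-8 layer it goes out as C3 A9. A handle that still produces E9 is therefore probably not the one the layer was applied to.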


Re^6: XML:: DOM and Accented Characters

by freeflyer (Novice) on Aug 07, 2010 at 18:07 UTC

Pickwick, here's the code prior to trying any of the suggestions made. It's a simple test XML with nothing but a couple of spaces and an e-acute.

I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

<?xml version="1.0" encoding="UTF-8"?>
<TEST> é </TEST>

And below is the Perl code that reads this in and saves it back out as accentTestOutPut.xml.

use strict;
use warnings;
use XML::DOM;

my $parser = XML::DOM::Parser->new;
my $doc    = $parser->parsefile("c:\\accentTest.xml");

# Write the document to a file
$doc->printToFile("c:\\accentTestOutPut.xml");

# Print the document to STDOUT as a string
print $doc->toString;

# Break circular references so the DOM tree can be freed
$doc->dispose;

Re^7: XML:: DOM and Accented Characters

by Pickwick (Beadle) on Aug 08, 2010 at 12:59 UTC

Pickwick, here's the code prior to trying any of the suggestions made.

We don't need the code from before the suggestions, because we already know why that code can't work as expected. We need the code where you override the automatic encoding of the Perl I/O layer with >:utf8, because that code really should work.

Give us your latest code; there is sure to be an error somewhere.
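
For reference, a minimal sketch of what such a script might do, with a way to check the resulting bytes from Perl itself instead of relying on an editor's hex view. This is an assumption about the approach, not almut's actual code: the helper names and the temporary file are made up, and the literal string stands in for $doc->toString.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $xml to $path as UTF-8, preceded by a BOM.
sub write_utf8_with_bom {
    my ($path, $xml) = @_;
    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    print {$out} "\x{FEFF}";    # the layer encodes U+FEFF as EF BB BF
    print {$out} $xml;
    close($out) or die "Cannot close $path: $!";
}

# Hypothetical helper: read the file back as raw bytes, return them as hex.
sub file_hex {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;                   # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

# The literal string stands in for $doc->toString in the real script.
write_utf8_with_bom($path, "<TEST> \x{E9} </TEST>");
print file_hex($path), "\n";
```

If the layer is in effect, the dump starts with EF BB BF and shows C3 A9 between the two spaces.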


Re^8: XML:: DOM and Accented Characters

by freeflyer (Novice) on Aug 09, 2010 at 08:59 UTC

Re^9: XML:: DOM and Accented Characters

by almut (Canon) on Aug 09, 2010 at 11:44 UTC


Re^5: XML:: DOM and Accented Characters
by Anonymous Monk on Aug 07, 2010 at 11:41 UTC

not the C3 A9 I'm looking for

Then you're not looking for UTF-8!!!!!

$ perl -e"print qq!\x{C3A9}!"
Wide character in print at -e line 1.
∞Ä⌐
$ perl -Mopen=:std,:encoding(UTF-8) -e"print qq!\x{C3A9}!" |hexdump
00000000: EC 8E A9                -                         |   |
00000003;

$ perl -Mopen=:std,:encoding(UTF-16LE) -e"print qq!\x{C3A9}!" |hexdump
00000000: A9 C3                   -                         |  |
00000002;

$ perl -Mopen=:std,:encoding(UTF-16BE) -e"print qq!\x{C3A9}!" |hexdump
00000000: C3 A9                   -                         |  |
00000002;

$

encoding="UTF-8"?
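
The one-liners above pass \x{C3A9} to Perl as a single code point, U+C3A9, which is a different character from the e-acute (U+00E9) whose UTF-8 encoding is the two bytes C3 A9. A minimal sketch with the core Encode module that puts the two side by side, together with the BOM; the helper name utf8_hex is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of a character string.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print utf8_hex("\x{E9}"),   "\n";   # U+00E9, e-acute
print utf8_hex("\x{C3A9}"), "\n";   # U+C3A9, a Hangul syllable
print utf8_hex("\x{FEFF}"), "\n";   # U+FEFF, the byte order mark
```

This prints C3 A9, EC 8E A9 and EF BB BF respectively, which matches the hexdump output above for the UTF-8 case.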


Re^6: XML:: DOM and Accented Characters

by freeflyer (Novice) on Aug 07, 2010 at 12:14 UTC

I'm not sure what you mean. All the encoding tables I have looked at show the e-acute as hex C3 A9 under UTF-8?


Re^7: XML:: DOM and Accented Characters

by Anonymous Monk on Aug 07, 2010 at 12:25 UTC

I'm not sure what you mean. All the encoding tables I have looked at show the e-acute as hex C3 A9 under UTF-8?

Sure they do, it's Perl that must be broken :)

http://www.fileformat.info/info/unicode/char/c3a9/index.htm

Encodings

HTML Entity (decimal): &#50089;
HTML Entity (hex): &#xc3a9;
How to type in Microsoft Windows: Alt +C3A9
UTF-8 (hex): 0xEC 0x8E 0xA9 (ec8ea9)
UTF-8 (binary): 11101100:10001110:10101001
UTF-16 (hex): 0xC3A9 (c3a9)
UTF-16 (decimal): 50,089
UTF-32 (hex): 0x0000C3A9 (c3a9)
UTF-32 (decimal): 50,089
C/C++/Java source code: "\uC3A9"
Python source code: u"\uC3A9"


Re^8: XML:: DOM and Accented Characters

by freeflyer (Novice) on Aug 07, 2010 at 12:55 UTC

Re^9: XML:: DOM and Accented Characters

by Anonymous Monk on Aug 07, 2010 at 13:02 UTC