srikrishnan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am trying to read an XML file, extract some required parts from it, and write them out as a new XML file. I can successfully collect the data using the following script:

use strict;
use warnings;
undef $/;
open OUT, ">:encoding(UTF-8)", "D:/wordpress/wordpress_categories.xml";
open (IN, "<:encoding(UTF-8)", "D:/wordpress/wordpress.2011-04-12.xml");
my $line = <IN>;
while ($line =~ /<title>(.*?)<\/title>\n\t\t<link>(.*?)<\/link>\n\t\t<pubDate>(.*?)<\/pubDate>\n\t\t<dc:creator>(.*?)<\/dc:creator>\n\t\t\n\t\t<category>(.*?)<\/category>/i) {
    $line =~ s/(<title>(.*?)<\/title>\n\t\t<link>(.*?)<\/link>\n\t\t<pubDate>(.*?)<\/pubDate>\n\t\t<dc\:creator>(.*?)<\/dc\:creator>\n\t\t\n\t\t<category>(.*?)<\/category>)//i;
    print OUT "$1\n\n";
}
close (IN);
close (OUT);

However, the output XML does not show the non-English characters correctly. Below is the wrong output:

<title>எழுத ‹வேண்டிய கட்டு‹ரை +ள்</title>
<link>http://naatkurippugal.wordpress.com/?p=501</link>
<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
<dc:creator><![CDATA[ஸ்ரீஹரி]]></dc:creator>
<category><![CDATA[கட்டு‹ரை]]></category>

Can anybody help me solve this problem?

Thanks in Advance,

srikrishnan

Replies are listed 'Best First'.
Re: How to write a utf-8 file
by ikegami (Patriarch) on Apr 13, 2011 at 04:20 UTC
    Your code is correct. Either the input file isn't UTF-8, or your viewer isn't treating the output as UTF-8.
      Your code is correct.

      Mostly. Since it adds literals to the string, it should also use utf8; to avoid mixing decoded and undecoded strings.

        He should add use utf8; if and only if his source code is UTF-8.
        He should omit use utf8; if and only if his source code is iso-8859-1.

        There's absolutely no indication as to which is correct here, so I don't see how you can suggest one over the other.
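
        To illustrate the distinction, here is a minimal sketch, assuming the script file itself is saved as UTF-8 (the variable names are illustrative only). It suggests that use utf8; changes only how Perl reads literals in the source code, not how files are read or written. For literals that are pure ASCII, as in the script above, the pragma makes no difference.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
# The utf8 pragma is lexically scoped, so each block below sees its own setting.
my $decoded = do { use utf8; "€" };   # literal is decoded into one character
my $raw     = do { no utf8;  "€" };   # literal stays as raw UTF-8 bytes

printf "with use utf8:    length %d\n", length $decoded;
printf "without use utf8: length %d\n", length $raw;
```

        Mixing the two kinds of string in one output is what produces garbled text, which is why the pragma has to match the actual encoding of the source file.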

Re: How to write a utf-8 file
by Nikhil Jain (Monk) on Apr 13, 2011 at 04:24 UTC

    General remarks:

    1. Use three-argument open with a lexical file handle and error handling, i.e., instead of writing this,

    open (IN, "<:encoding(UTF-8)", "D:/wordpress/wordpress.2011-04-12.xml");

    write it like this:

    open(my $fh, "<:encoding(UTF-8)", "filename") || die "can't open UTF-8 encoded filename: $!";

    2. You are trying to parse XML, so don't use regular expressions; it is better to use XML::Simple or XML::Twig.
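
    As a rough sketch of the second point, the following uses XML::Twig (a CPAN module that must be installed separately) to copy selected child elements. The sample document, the assumption that the wanted fields sit inside <item> elements, and the variable names are all hypothetical; the real export file may be structured differently.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the export file; the <item> wrapper is an assumption.
my $xml = <<'XML';
<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>First post</title>
      <link>http://example.com/?p=1</link>
      <pubDate>Wed, 13 Apr 2011 04:20:00 +0000</pubDate>
      <dc:creator><![CDATA[someone]]></dc:creator>
      <category><![CDATA[notes]]></category>
      <description>not wanted</description>
    </item>
  </channel>
</rss>
XML

# Tags to keep from each <item>.
my %wanted = map { $_ => 1 } qw(title link pubDate dc:creator category);
my @records;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each complete <item> element.
        item => sub {
            my ($t, $item) = @_;
            my @keep = grep { $wanted{ $_->tag } } $item->children;
            push @records, join "\n", map { $_->sprint } @keep;
            $t->purge;    # release memory for what has been processed
        },
    },
);
$twig->parse($xml);

# The strings hold decoded characters, so encode them on output.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n\n" for @records;
```

    Because the parser works on elements rather than on exact whitespace, this approach does not depend on the tabs and newlines between tags the way the regular expression does.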

      Thanks for your response.

      I really cannot understand how you are trying to help me.

      As I clearly mentioned in my message, I have no problem reading the XML;

      the problem is only with writing to an XML file.

      I want to confirm which way is correct: how can I write non-English text properly to the OUT XML file?

      Thanks

      srikrishnan

        As ikegami posted above, your code is correct.
        Try it with a different input file and open the output file with a different text editor.
        Occasionally, Notepad++ fails to show Unicode characters for me even when the file itself is OK. Closing and reopening the file usually fixes it.
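
        One way to take the editor out of the question is to check the bytes directly. The following is a minimal sketch using the core Encode module; is_valid_utf8 and slurp_raw are hypothetical helper names, and the temporary file stands in for the real input or output file.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);
use File::Temp qw(tempfile);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Hypothetical helper: read a whole file as raw bytes, with no decoding layer.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "can't open $path: $!";
    local $/;
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Demo: write a temporary file through the same layer the script uses.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':encoding(UTF-8)');
print {$tmp} "<title>\x{0B95}\x{0B9F}</title>\n";
close($tmp) or die "can't close $path: $!";

print "file is valid UTF-8: ",     is_valid_utf8(slurp_raw($path)), "\n";
print "Latin-1 bytes are valid: ", is_valid_utf8("caf\xE9 ol\xE9"), "\n";
```

        If the input file fails such a check, the problem lies in the input; if the output file passes it, the problem is more likely in the viewer.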

        Consider what happens if the file doesn't exist:

        In your original code, things will go wrong with no explanation.
        In the suggested alternative, the code will print "can't open UTF-8 encoded filename: No such file or directory" and then exit.

        Depending on the specific problem, $! could be file not found, permission denied, out of disk space, locked by another process, etc... whatever reason the OS gives. Extremely helpful!

        You're already using the 3-arg version of open, which is good. You can add lexical file handles ($inFH rather than just IN), checking the return value (the "||", or better yet, "or"), and printing $! when things do go wrong. These are all good habits to get into, as they will help you avoid debugging pain in the future.
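
        Put together, those habits might look like the sketch below. The temporary directory and file names are placeholders so that the example can run anywhere; they are not the paths from the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Placeholder paths in a temporary directory (not the original D:/wordpress files).
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/in.xml";
my $out_path = "$dir/out.xml";

# Create a small UTF-8 input file to work with.
open(my $seedFH, '>:encoding(UTF-8)', $in_path) or die "can't write $in_path: $!";
print {$seedFH} "<title>\x{0B95}\x{0B9F}</title>\n";
close($seedFH) or die "can't close $in_path: $!";

# Lexical handles, three-argument open, and a checked return value.
open(my $inFH,  '<:encoding(UTF-8)', $in_path)  or die "can't open $in_path: $!";
open(my $outFH, '>:encoding(UTF-8)', $out_path) or die "can't open $out_path: $!";

my $content = do { local $/; <$inFH> };    # slurp without changing $/ globally
print {$outFH} $content;

close($inFH);
close($outFH) or die "can't close $out_path: $!";

# A failed open now reports the reason given by the OS instead of failing silently.
my $ok = open(my $missingFH, '<:encoding(UTF-8)', "$dir/no-such-file.xml");
print "open failed: $!\n" unless $ok;
```

        Using local $/ inside a do block limits the slurp mode to that one read, whereas undef $/ at the top of a script changes it for everything that follows.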