moleary has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a Perl script that reads lines of Japanese text from a UTF-8 encoded text file and merges them into an XML output file that is also UTF-8 encoded.

There are about 160,000 lines of Japanese text, and in about 50 of them one Japanese character comes out in the output file as garbage because the leading one or two bytes of the character's UTF-8 encoding are never written to the file. In each case, the single-byte value of the garbage character corresponds to the second or third byte of the UTF-8 character that was supposed to be written there.
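For illustration, here is a minimal sketch (not part of my script) of what the bytes of one of these characters should look like, so a missing lead byte is easy to spot in a hex dump:

    use strict;
    use warnings;
    use Encode qw(encode);

    # U+8A9E encodes to the three bytes E8 AA 9E in UTF-8.
    my $bytes = encode('UTF-8', "\x{8A9E}");
    printf '%02X ', ord($_) for split //, $bytes;
    print "\n";    # prints "E8 AA 9E"; a dropped lead byte would leave just "AA 9E"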

I added some debugging code to write just the Japanese text, without XML tags, to a separate UTF-8 encoded file, and the character that is written incorrectly in the XML output file comes out correctly in the debug file.

Even odder, the script also reads an XML file that contains entries in other languages and writes those to the output XML file along with the entries created from the Japanese text, and if I remove any entries from the input XML file, the corruption in the output file no longer occurs where it used to. (I haven't yet been able to determine whether it goes away entirely or just moves somewhere else.)

I also added calls to utf8::is_utf8 and utf8::valid on the Japanese strings, and they return true for all of them. The function that writes out the Japanese text looks like this:

    sub print_japanese_synonyms {
        my ($current_term_id) = @_;
        my $term = $gJATermHash{$current_term_id};

        print $gOutputFileObj "\t\t<synonyms language=\"JA\">\n";
        if ($term) {
            my $pref_term = $term->get_preferred();
            my @synonyms  = $term->get_synonyms();
            print $gOutputFileObj "\t\t\t<synonym string=\"$pref_term\" preferred=\"YES\"/>\n";
            foreach my $syn (@synonyms) {
                print $gOutputFileObj "\t\t\t<synonym string=\"$syn\"/>\n";
            }
        }
        else {
            # If there is no Japanese term for this term id, write out the
            # Japanese "general term" string.
            print $gOutputFileObj "\t\t\t<synonym string=\"\x{4E00}\x{822c}\x{7528}\x{8A9E}\" preferred=\"YES\"/>\n";
        }
        print $gOutputFileObj "\t\t</synonyms>\n";
    }
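The utf8::is_utf8 and utf8::valid checks mentioned above amount to something like this sketch (the variable name here is made up; the real script runs the checks on each Japanese string):

    use utf8;
    my $jp = "\x{4E00}\x{822c}\x{7528}\x{8A9E}";    # the "general term" string
    warn "string not flagged as utf8\n"  unless utf8::is_utf8($jp);
    warn "string not well-formed utf8\n" unless utf8::valid($jp);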
I have "use utf8;" set and the function that opens the output file looks like this:
    sub open_output_file {
        $gOutputFileObj = new IO::File $gOutputFile, "w";
        die "Can't open output file $gOutputFile.\n",
            "Error in file: \"", __FILE__, "\", at line: ", __LINE__, "\n"
            unless $gOutputFileObj;
        binmode($gOutputFileObj, ":utf8");
    }
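As a further check (an assumption about what might help, not something the script already does), one can ask PerlIO which layers actually ended up on the handle:

    # PerlIO::get_layers has been in core since 5.8; no extra module needed.
    my @layers = PerlIO::get_layers($gOutputFileObj);
    print STDERR "output layers: @layers\n";    # expect something like "unix perlio utf8"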
I am using ActivePerl 5.8.3 Build 809. I hope I have supplied enough of my script to show what I am doing. I would appreciate any help in figuring out how to diagnose and/or solve this problem.

Re: utf-8 bytes dropped when printing to a file
by graff (Chancellor) on Jun 11, 2004 at 02:25 UTC
    I added some debugging code to write just the Japanese text, without XML tags, to a separate UTF-8 encoded file, and the character that is written incorrectly in the XML output file comes out correctly in the debug file.

    About the text written to the debug file: did it always come from the "$term->get_preferred()" and "$term->get_synonyms()" method calls, or did it just come directly from the input Japanese utf8 file? What is the nature of the "term" object that holds the Japanese data? (If the debug output didn't come from the object methods, you need to try it that way, but I assume you've already covered this.)

    Even odder, the script also reads an XML file that contains entries in other languages and writes those to the output XML file along with the entries created from the Japanese text, and if I remove any entries from the input XML file, the corruption in the output file no longer occurs where it used to.

    I guess that could make it hard to demonstrate the problem with minimal snippets of code and data. Still, if the problem really does depend on the input XML data in some way, you should study the initial, known-buggy output, and create a sample from the XML input file consisting of the entries adjacent to the problem, and limit the Japanese input to the entry or entries that are the problem. See if you can replicate the error with a minimal amount of data. (While you're at it, see if you can create a stripped-down version of the script, too -- just enough to produce the error. If the object pointed to by $term is big and hairy, that might be the place to start clipping.)

    If that doesn't clarify the problem for you, at least you'll have a specific example that can be posted. BTW, I don't see any problems with the code in the original post, except that the object method calls leave a lot to the imagination.

    (update: thinking about it a little more, the only sort of "typical" problem I can imagine that would create the symptom you describe is some improper mixing of output-writing methods -- e.g. using both "syswrite" and "print" on the same output file handle -- and that could be producing other corruption in the output that is less noticeable than what you're seeing.)
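    To make that failure mode concrete, here is a small self-contained sketch (not from the poster's script; the filename is made up) showing how mixing buffered and unbuffered writes scrambles byte order:

        use strict;
        use warnings;

        open my $fh, '>', 'mixed.txt' or die "open: $!";
        print    $fh 'buffered ';      # goes into the PerlIO buffer
        syswrite $fh, 'unbuffered ';   # bypasses the buffer, reaches the file first
        print    $fh 'more';
        close $fh;                     # the buffer flushes here, after the syswrite
        # mixed.txt now contains "unbuffered buffered more" -- bytes out of
        # order; on a :utf8 handle the same effect could split a multi-byte
        # character, leaving exactly the kind of stray trailing bytes seen here.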

      This turned out to be a false alarm. Perl and my script were doing the right thing; the corruption was caused by a bug in the text editor I was using to view the file. I reported the problem to the editor's developers, and they sent out a patched version that fixes it.
Re: utf-8 bytes dropped when printing to a file
by BrowserUk (Patriarch) on Jun 11, 2004 at 05:00 UTC

    This is only an "it might change something" suggestion. Change your binmode statement to

    binmode($gOutputFileObj, ":raw:utf8");

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail