moleary has asked for the wisdom of the Perl Monks concerning the following question:
There are about 160,000 lines of Japanese text, and in about 50 of the lines, one of the Japanese characters comes out in the output file as garbage characters because the lead one or two bytes of a utf-8 character are not output to the file. In each case, the single byte value of the garbage character corresponds to the second or third byte of the utf-8 character that was supposed to be written there.
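To match a stray byte in the output against the character it came from, it can help to list the UTF-8 bytes a given character should produce. The following is a minimal sketch, not part of the original script; the helper name `utf8_hex` is hypothetical, and it assumes the core `Encode` module is available.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a character string
# as space-separated, two-digit hex values.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);    # character string -> byte string
    return join ' ', map { sprintf('%02X', $_) } unpack('C*', $bytes);
}

# U+4E00 is the first character of the "general term" string in the script.
print utf8_hex("\x{4E00}"), "\n";    # prints "E4 B8 80"
```

If the garbage byte in the output file were B8 or 80 at that position, it would correspond to the second or third byte of this character, which is the pattern described above.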
I added some debugging code to write just the Japanese text, without XML tags, into a separate utf-8 encoded file, and the character that is written incorrectly in the XML output file comes out correctly in the debug file.
Even odder, the script reads an XML file that contains entries from other languages and writes those to the output XML file along with entries created from the Japanese text, and if I remove any entries from the input XML file, the corruption in the output file doesn't occur where it used to. (I haven't been able to find out yet if it goes away or just occurs in different places.)
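One way to check whether the corruption disappears or only moves is to scan the output file as raw bytes and report every line that is not well-formed UTF-8. This is a sketch under assumptions: the helper name `find_malformed_lines` is hypothetical, the file path comes from the command line, and it relies on `Encode::decode` croaking on malformed input when `Encode::FB_CROAK` is passed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the line numbers in $path whose bytes
# are not well-formed UTF-8 (for example, a continuation byte whose
# lead byte is missing).
sub find_malformed_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!\n";
    my @bad;
    while (my $line = <$fh>) {
        # With FB_CROAK, decode() dies on malformed input; eval traps that.
        my $ok = eval { decode('UTF-8', $line, Encode::FB_CROAK); 1 };
        push @bad, $. unless $ok;
    }
    close($fh);
    return @bad;
}

# Usage: perl scan.pl output.xml
if (@ARGV) {
    print "Malformed UTF-8 on line $_\n" for find_malformed_lines($ARGV[0]);
}
```

Running this before and after removing entries from the input XML file would show whether the set of damaged lines shrinks, moves, or stays the same.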
I also added calls to utf8::is_utf8 and utf8::valid on the Japanese strings, and they return true for all of them. The function that writes out the Japanese text looks like this:

    sub print_japanese_synonyms {
        my ($current_term_id) = @_;
        my $term = $gJATermHash{$current_term_id};
        print $gOutputFileObj "\t\t<synonyms language=\"JA\">\n";
        if ($term) {
            my $pref_term = $term->get_preferred();
            my @synonyms = $term->get_synonyms();
            print $gOutputFileObj "\t\t\t<synonym string=\"$pref_term\" preferred=\"YES\"/>\n";
            foreach my $syn (@synonyms) {
                print $gOutputFileObj "\t\t\t<synonym string=\"$syn\"/>\n";
            }
        }
        else {
            # If there is no Japanese term for this term id, write out the Japanese "general term" string.
            print $gOutputFileObj "\t\t\t\<synonym string=\"\x{4E00}\x{822c}\x{7528}\x{8A9E}\" preferred=\"YES\"/>\n";
        }
        print $gOutputFileObj "\t\t</synonyms>\n";
    }

I have "use utf8;" set and the function that opens the output file looks like this:

    sub open_output_file {
        $gOutputFileObj = new IO::File $gOutputFile, "w";
        die "Can't open output file $gOutputFile.\n",
            "Error in file: \"", __FILE__, "\", at line: ", __LINE__, "\n"
            unless $gOutputFileObj;
        binmode($gOutputFileObj, ":utf8");
    }

I am using Active Perl 5.8.3 Build 809. I hope I have supplied enough parts of my script to show what I am doing. I would appreciate any help in figuring out how to diagnose and/or solve this problem.
Re: utf-8 bytes dropped when printing to a file
by graff (Chancellor) on Jun 11, 2004 at 02:25 UTC
by moleary (Novice) on Jun 20, 2004 at 01:14 UTC

Re: utf-8 bytes dropped when printing to a file
by BrowserUk (Patriarch) on Jun 11, 2004 at 05:00 UTC