I wrote a Perl script that reads lines of Japanese text from a UTF-8 encoded text file and merges them into an XML output file that is also UTF-8 encoded.

There are about 160,000 lines of Japanese text. In about 50 of them, one Japanese character comes out in the output file as garbage because the lead one or two bytes of its UTF-8 sequence are not written to the file. In each case, the byte value of the garbage character matches the second or third byte of the UTF-8 character that should have been written there.
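To make the symptom concrete, here is a minimal sketch, assuming the Encode module (bundled with Perl 5.8). It shows the UTF-8 bytes of U+4E00, one of the characters the script writes; the helper name utf8_hex is made up for illustration and is not part of the script.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 encoding of a character string
# as space-separated hex bytes.
sub utf8_hex {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print utf8_hex("\x{4E00}"), "\n";   # prints "E4 B8 80"
```

If the lead byte E4 were dropped, only the continuation bytes B8 80 would reach the file, which is the kind of garbage described above.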

I added some debugging code that writes just the Japanese text, without XML tags, to a separate UTF-8 encoded file. The characters that are corrupted in the XML output file come out correctly in the debug file.

Even odder: the script also reads an XML file that contains entries in other languages and writes those to the output XML file along with the entries created from the Japanese text. If I remove any entries from the input XML file, the corruption in the output file no longer occurs where it used to. (I haven't yet been able to determine whether it goes away or just moves to different places.)
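One way to settle whether the corruption disappears or merely moves would be to scan each output file for malformed UTF-8 and compare the affected line numbers between runs. The following is only a sketch, assuming the Encode module is available; find_malformed_lines is a hypothetical name, not something from the script.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: return the 1-based numbers of the lines in $path
# whose raw bytes are not well-formed UTF-8.
sub find_malformed_lines {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Can't open $path: $!\n";
    my @bad;
    while (my $line = <$fh>) {
        # With FB_CROAK, decode() dies on the first malformed sequence.
        my $ok = eval { decode('UTF-8', $line, FB_CROAK); 1 };
        push @bad, $. unless $ok;
    }
    close $fh;
    return @bad;
}
```

Calling it on the output of two runs, for example `my @bad = find_malformed_lines($gOutputFile);`, would give two lists of line numbers to compare.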

I also added calls to utf8::is_utf8 and utf8::valid on the Japanese strings, and they return true for all of them. The function that writes out the Japanese text looks like this:

sub print_japanese_synonyms
{
 my ($current_term_id) = @_;
 my $term = $gJATermHash{$current_term_id};

 print $gOutputFileObj "\t\t<synonyms language=\"JA\">\n";
 if ($term) {
  my $pref_term = $term->get_preferred();
  my @synonyms = $term->get_synonyms();

  print $gOutputFileObj "\t\t\t<synonym string=\"$pref_term\" preferred=\"YES\"/>\n";
  foreach my $syn (@synonyms) {
   print $gOutputFileObj "\t\t\t<synonym string=\"$syn\"/>\n";
  }
 } else {
  # If there is no Japanese term for this term id, write out the Japanese "general term" string.
  print $gOutputFileObj "\t\t\t\<synonym string=\"\x{4E00}\x{822c}\x{7528}\x{8A9E}\" preferred=\"YES\"/>\n";
 }
 print $gOutputFileObj "\t\t</synonyms>\n";
}

I have "use utf8;" set and the function that opens the output file looks like this:

sub open_output_file
{
 $gOutputFileObj = new IO::File $gOutputFile, "w";
 die "Can't open output file $gOutputFile.\n",
  "Error in file: \"", __FILE__, "\", at line: ", __LINE__, "\n" unless $gOutputFileObj;
 binmode($gOutputFileObj, ":utf8");
}
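
For comparison while diagnosing, a variant that applies the encoding layer at open time could look like the sketch below. It swaps in the built-in three-argument open instead of IO::File plus binmode, takes the path as an argument, and returns the handle; the name open_output_file_with_layer is hypothetical. Whether this changes the behavior I am seeing is an untested assumption.

```perl
use strict;
use warnings;

# Hypothetical variant of open_output_file: the UTF-8 encoding layer is
# part of the open mode rather than added afterwards with binmode.
sub open_output_file_with_layer {
    my ($path) = @_;
    open(my $fh, '>:encoding(UTF-8)', $path)
        or die "Can't open output file $path: $!\n";
    return $fh;
}
```

The returned handle can be printed to in the same way as $gOutputFileObj, so the two approaches can be compared on the same input.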

I am using ActivePerl 5.8.3 Build 809. I hope I have supplied enough of my script to show what I am doing. I would appreciate any help in figuring out how to diagnose and/or solve this problem.

In reply to utf-8 bytes dropped when printing to a file by moleary
