The Unicode Bug with Transliteration or Substitution

choroba has asked for the wisdom of the Perl Monks concerning the following question:

Hi brethren and sestren.

On several machines at work, we run Perl 5.8.3 (yes, I know it's 10 years old; not my choice). We noticed a strange behaviour recently: we used

tr/ / /s;

to process some HTML files. If the files contained non-Latin characters (e.g. Chinese), on some machines the output was garbled. We tried replacing tr with a substitution

s/ +/ /g;

and suddenly, the output was correct.

Both input and output are marked with :encoding(utf-8). The files must be slurped in to trigger the bug; line-by-line processing produces the correct output.
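A minimal sketch of the comparison, assuming an in-memory string in place of the slurped HTML file (the sample text and variable names are illustrative and not taken from the original script). On a correctly working perl, both forms should give the same result for a decoded string:

```perl
use strict;
use warnings;
use utf8;                       # the source below contains literal non-ASCII text

# Hypothetical sample standing in for a slurped, decoded HTML file.
my $text = "日本語   の  テキスト and   some  ASCII";

(my $via_tr = $text) =~ tr/ / /s;    # squeeze runs of spaces with transliteration
(my $via_s  = $text) =~ s/ +/ /g;    # squeeze runs of spaces with substitution

binmode STDOUT, ':encoding(UTF-8)';
print $via_tr eq $via_s ? "same\n" : "different\n";
```

To try this against a real file, open it with the `:encoding(utf-8)` layer and set `local $/;` before reading, so that the whole file arrives as one string.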

Could this be one of the manifestations of The "Unicode Bug"? I have a gut feeling that the substitution might solve the problem for the given file, but the bug could reappear with the next different file. I also don't understand why the bug only appeared on some machines: the version of Perl is the same on all of them (but their Linux versions differ). Is any external library involved in transliteration, substitution, or Unicode handling?

BTW: I wasn't able to install 5.8.3 at home (errors during make) to test further. Update: I was able to install it with the help of Devel::PatchPerl. I wasn't able to replicate the problem, though.

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ


Re: The Unicode Bug with Transliteration or Substitution by moritz (Cardinal) on May 03, 2014 at 16:16 UTC
I tried a minimal example with perl 5.8.3, and tr/// didn't break non-ASCII (and non-Latin-1) characters in decoded strings. It also doesn't transliterate away UTF-8 bytes from a decoded string. So my minimal tests weren't able to reproduce your problem. Of course, your mileage varies, so more details might be of interest. Perl 6 - the future is here, just unevenly distributed
Re: The Unicode Bug with Transliteration or Substitution by Anonymous Monk on May 03, 2014 at 00:11 UTC
It could be :) Do you have sample data to play with? Have you tried utf8::upgrade($string)? Maybe you can try Unicode::Semantics.
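A small sketch of what utf8::upgrade does, assuming a Latin-1 sample string (the string itself is made up for illustration). It changes only the internal representation, not the characters, so the string compares equal before and after:

```perl
use strict;
use warnings;

# A Latin-1 string stored in perl's native byte-oriented representation.
my $string = "caf\xE9  au  lait";
print utf8::is_utf8($string) ? "upgraded\n" : "not upgraded\n";

my $before = $string;
utf8::upgrade($string);          # change the internal representation only
print utf8::is_utf8($string) ? "upgraded\n" : "not upgraded\n";

# Same characters, different storage: the comparison still succeeds.
print $string eq $before ? "equal\n" : "not equal\n";

$string =~ tr/ / /s;             # squeeze runs of spaces on the upgraded string
print length($string), "\n";     # length is counted in characters
```

If transliteration behaves differently before and after the upgrade, that would point towards a representation-dependent bug.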
Re^2: The Unicode Bug with Transliteration or Substitution by choroba (Cardinal) on May 03, 2014 at 20:07 UTC
You can use the Japanese Wikipedia Perl page. Perl 5.8.3 at work outputs different files for `tr/ / /s; tr/\t/ /s;` and `s/ +/ /g; s/\t+/ /g;`. I tested with `diff -w` against the original, i.e. ignoring whitespace. utf8::upgrade didn't change anything, before or after the substitution/transliteration. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: The Unicode Bug with Transliteration or Substitution by graff (Chancellor) on May 04, 2014 at 03:15 UTC
Regarding the different behavior of some machines (given that they all have Perl 5.8.3), I'm sorry that I don't have explicit details that would be relevant, but looking over some email from 2004 just now, I noticed that Dan Kogai was releasing updates to the Encode module independently of perl releases. I know there were some subtle but notable bugs in earlier Encode releases. I wonder if your various systems with 5.8.3 might have different versions of Encode.
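One way to compare the machines is to print the interpreter and Encode versions on each of them. This is only a sketch of such a check, not something taken from the original setup:

```perl
use strict;
use warnings;
use Encode ();

# Print the interpreter and Encode versions so they can be compared across machines.
printf "perl %s, Encode %s\n", $], Encode->VERSION;
```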
Re^2: The Unicode Bug with Transliteration or Substitution by choroba (Cardinal) on May 04, 2014 at 19:59 UTC
Thank you. Some of the machines indeed had a different version of the Encode module. There are still some, though, that have the same version, but produce different results. The only difference I can see is that one of them is 32 bit, while the other is 64 bit (but Perl is 32 bit). Update: I ran the process via strace on both machines. One of the many differences I noticed was the size of the read buffer: on the 32 bit machine, `read(3` is called with a buffer size of 32768, while on the 64 bit machine, the size is 65536. There might be a problem if a multibyte character is split between two subsequent buffers. It would also explain why the output is not different when the input is processed line by line (no line is longer than 32768 bytes). It still doesn't explain why substitution fixes the problem, though. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
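To illustrate the general hazard of a multibyte character being split between two buffers, here is a sketch that assumes hand-made chunks in place of real read buffers. It does not show that perl's :encoding layer behaves this way; it only compares decoding each chunk on its own with carrying the undecoded tail over to the next chunk, using Encode's FB_QUIET check:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical data: the UTF-8 bytes of a three-character Japanese string,
# split into two "read buffers" in the middle of a three-byte character.
my $expected = "\x{65E5}\x{672C}\x{8A9E}";
my $bytes    = encode('UTF-8', $expected);          # 9 bytes
my @chunks   = (substr($bytes, 0, 4), substr($bytes, 4));

# Naive: decode each chunk on its own; the split character is damaged.
my $naive = join '', map { my $copy = $_; decode('UTF-8', $copy) } @chunks;

# Careful: keep the undecoded tail and prepend it to the next chunk.
my ($pending, $careful) = ('', '');
for my $chunk (@chunks) {
    $pending .= $chunk;
    $careful .= decode('UTF-8', $pending, Encode::FB_QUIET);
    # $pending now holds only the bytes that could not be decoded yet
}

print "naive:   ", ($naive   eq $expected ? "intact" : "damaged"), "\n";
print "careful: ", ($careful eq $expected ? "intact" : "damaged"), "\n";
```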
Re^3: The Unicode Bug with Transliteration or Substitution by graff (Chancellor) on May 05, 2014 at 02:26 UTC
I would have expected that if a file larger than the low-level input buffer size has been slurped, then the contents of multiple, consecutive `read (3)` calls have simply been concatenated without further ado into a single string buffer, prior to whatever processing comes next in the script. Given that the perl version and Encode version are the same, differences in cpu "native word" size and read buffer size should have no impact. (Rather, if the word/buffer size had any impact, it should affect other behaviors on slurped files, not just `tr/ / /s` vs. `s/ +/ /g`.) So, when you compared your two machines that were both 32-bit 5.8.3 with the same version of Encode, but 32-bit vs. 64-bit cpus (smaller and larger read buffers), which one had the strange behavior with `tr/ / /s` going crazy? Did the differences in Encode versions on other machines show any relation to the strange behavior? (Were you able to look at the release notes of the later Encode version(s) to see if anything relevant was fixed?)
Re^4: The Unicode Bug with Transliteration or Substitution by choroba (Cardinal) on May 14, 2014 at 20:40 UTC
Re: The Unicode Bug with Transliteration or Substitution by Anonymous Monk on May 03, 2014 at 14:03 UTC
If you have some version of Perl running on your home machine, you can try installing Devel::PatchPerl. Then, after you expand your Perl 5.8.3 distribution but before you run `Configure`, cd to the expanded distribution and run `patchperl`. This might get you through the build, though it does not guarantee a clean test.
Re^2: The Unicode Bug with Transliteration or Substitution by choroba (Cardinal) on May 03, 2014 at 23:17 UTC
Thank you. I was able to install 5.8.3 after patching the sources. I even got clean tests. I wasn't able to reproduce the problem in the Perl 5.8.3 I installed, though. The output was the same for both transliteration and substitution. So, it seems, our historical Perl installation is buggy. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: The Unicode Bug with Transliteration or Substitution by soonix (Canon) on May 04, 2014 at 06:27 UTC
The link in your OP mentions a dependency on locales. Did you already compare locale settings for your machines (or better: for the relevant processes)? If they have the same settings, perhaps the locale files are different?
