Inconsistent transliteration for non-printing octets

diotalevi has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying out the transliteration operator with normally non-printing octet values, largely to understand how it works. On its face the behavior is non-intuitive, so I'm asking for help here. Both cases below use transliteration patterns of the same form: three search characters, three replacement characters, and the /c (complement) and /d (delete) modifiers. The first example behaves as I expect, while the second emits something odd before the expected characters.

So can someone explain what I'm doing wrong here, or where the bug is? I don't even know which should be the expected behavior. I get these results on OpenBSD perl 5.6.1, W2K ActiveState perl 5.6.1 build 633 and W2K Cygwin perl 5.8.0. I'm using this without Unicode, so I'm specifically referring to "character" as "octet" to avoid confusion.

$_ = "1234567890";
tr [357]
   [888]cd;
print $_, $/;
# prints 357

$_ = join '', map chr(), 0 .. 0x1f;
tr [\3\5\7]
   [\10\10\10]cd;
print join('', map sprintf("\\%o",ord), split //), $/;
# prints \10\10\10\3\5\7

__SIG__
use B;
printf "You are here %08x\n", unpack "L!", unpack "P4", pack
  "L!", B::svref_2object(sub{})->OUTSIDE;

Re: Inconsistent transliteration for non-printing octets by John M. Dlugosz (Monsignor) on Nov 25, 2002 at 22:11 UTC
Read the perlop manpage, under regex operators. Actually, if you look it up in perlfunc it will point you there.

If the /c modifier is specified, the SEARCHLIST character set is complemented. If the /d modifier is specified, any characters specified by SEARCHLIST not found in REPLACEMENTLIST are deleted.

So, /c is like the `[^...]` in a character class. That means the search list in your first example, '3', '5', '7', really means every character except those three. The complemented list runs in character-code order, so its first three entries are \0, \1, and \2, and those pair up with the three '8's in the replacement list. So, characters \0, \1, and \2 all map to '8'.

With the /d flag, any character in the (complemented) search list that has no corresponding entry in the replacement list is deleted. So, the three chars you specified are kept, and everything other than those 6 is deleted.

That matches your output. The first string contains none of \0, \1, \2, so only '3', '5', '7' survive. In the second, \0, \1, and \2 each become \10, then \3, \5, \7 are kept, and everything else is deleted.

-- John
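The reading given in this reply can be checked with a small sketch. The helper `emulate_tr_cd` below is a hypothetical name, not part of Perl. It spells out the steps by hand (complement the search list in code order, pair its leading entries with the replacement list, delete the rest) and compares the result with the real tr. It assumes byte strings only (octets 0 to 255) and literal characters in both lists, with no ranges.

```perl
use strict;
use warnings;

# Hypothetical helper: emulate tr/SEARCH/REPLACE/cd on a byte string.
# $search and $replace hold literal characters only (no a-z style ranges).
sub emulate_tr_cd {
    my ($string, $search, $replace) = @_;

    # /c: the effective search list is every octet NOT in $search,
    # taken in ascending code order.
    my %listed = map { $_ => 1 } split //, $search;
    my @complement = grep { !$listed{$_} } map { chr($_) } 0 .. 255;

    # Pair the leading entries of the complement with the replacement list.
    my @to = split //, $replace;
    my %map;
    for my $i (0 .. $#to) {
        last if $i > $#complement;
        $map{ $complement[$i] } = $to[$i];
    }

    # Characters listed in $search are not matched, so they pass through.
    # /d: complemented characters without a partner are deleted.
    return join '', map {
        $listed{$_}     ? $_
      : exists $map{$_} ? $map{$_}
      :                   ()
    } split //, $string;
}

# Compare against the real operator for both examples from the question.
(my $real_digits = "1234567890") =~ tr/357/888/cd;
print emulate_tr_cd("1234567890", "357", "888") eq $real_digits
    ? "digits: match\n" : "digits: differ\n";

my $low = join '', map { chr($_) } 0 .. 0x1f;
(my $real_low = $low) =~ tr/\3\5\7/\10\10\10/cd;
print emulate_tr_cd($low, "\3\5\7", "\10\10\10") eq $real_low
    ? "low octets: match\n" : "low octets: differ\n";
```

If the explanation above is right, both comparisons should report a match, which would mean the two examples in the question follow the same rule and differ only in whether \0, \1, and \2 are present in the input.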
Re: Inconsistent transliteration for non-printing octets by RMGir (Prior) on Nov 25, 2002 at 22:12 UTC
Cool! It seems to depend on whether chr(0), chr(1), and chr(2) are in your string. I'm guessing those are used as "magic" during the tr? In any case, try these tests:

$_ = join '', map chr(), 1 .. 0x1f;
tr [\3\5\7]
   [\10\10\10]cd;
print join('', map sprintf("\\%o",ord), split //), $/;
# prints \10\10\3\5\7

$_ = join '', map chr(), 2 .. 0x1f;
tr [\3\5\7]
   [\10\10\10]cd;
print join('', map sprintf("\\%o",ord), split //), $/;
# prints \10\3\5\7

$_ = join '', map chr(), 3 .. 0x1f;
tr [\3\5\7]
   [\10\10\10]cd;
print join('', map sprintf("\\%o",ord), split //), $/;
# prints \3\5\7

$_ = join '', reverse map chr(), 0,0,0,1,0,1,0,1,0 .. 0xA;
tr [\3\5\7]
   [\10\10\10]cd;
print join('', map sprintf("\\%o",ord), split //), $/;
# prints \7\5\3\10\10\10\10\10\10\10\10\10\10\10

It also appears that if you use more characters in the left hand side of the tr///, more of the "low byte values" become magical.

Sorry I can't tell you WHY it happens, but I hope this helps...

-- Mike
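The "magic" low bytes seen in these tests are consistent with the /c and /d rules quoted from perlop: the translated octets would simply be the first entries of the complemented search list, one per replacement character. A minimal sketch of that reading follows. The helper `translated_octets` is a hypothetical name, and it assumes byte strings and literal characters only (no ranges). Under this reading it is the length of the replacement list, not of the search list itself, that sets how many low octets get translated.

```perl
use strict;
use warnings;

# Hypothetical helper: list, as octal escapes, the octets that
# tr/SEARCH/REPLACE/cd would translate rather than delete or keep.
sub translated_octets {
    my ($search, $replace) = @_;

    # Complement of the search list, in ascending code order.
    my %listed = map { $_ => 1 } split //, $search;
    my @complement = grep { !$listed{ chr($_) } } 0 .. 255;

    # One complement entry is consumed per replacement character.
    my $count = length $replace;
    $count = @complement if $count > @complement;

    return map { sprintf "\\%o", $_ } @complement[0 .. $count - 1];
}

# Three replacement characters: the three lowest unlisted octets.
print join('', translated_octets("\3\5\7", "\10\10\10")), "\n";
# prints \0\1\2

# Four on each side: one more low octet is picked up.
print join('', translated_octets("\3\5\7\11", "\10\10\10\10")), "\n";
# prints \0\1\2\4
```

This only models the documented rule. It does not inspect how perl implements tr internally.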