in reply to JAPH with shifting
The one-liner runs from the command line in (at least) Windows and Linux, and copes with UTF8 or Latin-1 character sets.
For each chunk of code, I present the original code, then explain it in order of execution.
Making it run from the CLI in Windows and Linux is a problem because of the different quoting schemes. Windows requires double quotes, but if you give bash double quotes, then it will try to interpolate any variables starting with $.
So I had to store the program in a string in which I'd replaced all of the $ sigils with 'Z'. At startup, a regex restores the $ sigils, and the string is then eval'd.
So the actual program looks like this:

    map { s/Z/\x24/g; eval() } map { m/(.*)/ } q{THE PROGRAM};

    ## The statements in order of execution:

    q{THE PROGRAM}      # The string containing the program, all $ replaced with Z
    map { m/(.*)/ }     # Put the string in $_, otherwise the regex fails
                        # with "Can't modify constant item in substitution"
    map {               # Using the variable $_
        s/Z/\x24/g;     # Replace all 'Z' with '$' (to replace the missing sigils)
        eval()          # and do a string eval
    }
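The restore-and-eval step can be tried in isolation. This is only a minimal sketch, not the original one-liner: the program text and the names `$program`, `$restored` and `$result` are made up for illustration, and it assumes the program contains no literal 'Z' that needs to survive.

```perl
use strict;
use warnings;

# Hypothetical stand-in for THE PROGRAM: every $ sigil is written as Z,
# so neither Windows nor bash quoting has anything to interpolate.
my $program = q{my Zx = 6; my Zy = 7; Zx * Zy};

# Work on a copy; \x24 is the '$' character, written without a literal '$'.
(my $restored = $program) =~ s/Z/\x24/g;

# A string eval returns the value of the last statement it ran.
my $result = eval $restored;
die $@ if $@;

print "$restored\n";   # my $x = 6; my $y = 7; $x * $y
print "$result\n";     # 42
```

Copying into `$restored` plays the same role as the `map { m/(.*)/ }` step: it gives the substitution something it is allowed to modify.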
    map {
        $\.=chr(96+( $b | $_ >> 4 & 63 || 0xc0 ) & 255);
        $b =$_ <<8>>4 & 0xf0
    }
    map { ord }
    map { utf8::decode($_); m/([^Â])/g }
    qq(¡Q1@\0\cPàñ@\xc2\x80Q \cA\0Q À\0\xc2\x80\cP0°Q );
    $\.=qq[,\n], print''
The contents of the $\ ($OUTPUT_RECORD_SEPARATOR) variable are printed after the last argument to print. So by storing the string in $\, I can call print '' at the end of my program, and it prints an empty string, followed by the contents of $\.
I cheated by appending comma new-line to $\ just before printing.
$\.=qq[,\n], print ''
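The $\ trick is easy to see on its own. A small sketch, assuming an in-memory filehandle (`$out`, writing into `$captured`) purely so the result can be inspected; neither name appears in the original:

```perl
use strict;
use warnings;

# Capture the output in a scalar so the effect is easy to check.
my $captured = '';
open my $out, '>', \$captured or die "cannot open in-memory handle: $!";

{
    local $\ = 'just another perl hacker';  # build the message in $\
    $\ .= ",\n";                            # append the comma and newline
    print {$out} '';                        # prints '' followed by $\
}
close $out;

print $captured;   # just another perl hacker,
```

The `local` keeps the change to $\ from leaking into the final print.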
The idea was to take the ord of each letter in the string "just another perl hacker", and store the four high bits of each letter as the four low bits of the previous letter, and the four low bits as the four high bits of the current letter. For example:
    Char    Ord    Bits
    --------------------
    \0             0000 0000
    j       106    0110 1010
    u       117    0111 0101

    Discard the first 4 bits of \0, leaves:
        0000 0110  ->    6  -> \cF
        1010 0111  ->  167  -> \xA7
        0101 ....

See below the next point for the code which handles this.
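The numbers in that example can be reproduced with a few lines of Perl. This is a sketch of the nibble shift only, working on the raw ordinals as the table does; the names `@ords` and `@packed` are mine:

```perl
use strict;
use warnings;

# Raw (unreduced) ordinals, with the leading \0 from the table.
my @ords = map { ord } split //, "\0ju";

# Pair each value's low nibble with the next value's high nibble.
my @packed;
for my $i (0 .. $#ords - 1) {
    my $low_of_current = ($ords[$i] & 0x0f) << 4;   # low 4 bits move up
    my $high_of_next   = $ords[$i + 1] >> 4;        # high 4 bits move down
    push @packed, $low_of_current | $high_of_next;
}

printf "%08b -> %d\n", $_, $_ for @packed;
# 00000110 -> 6
# 10100111 -> 167
```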
I added to this by first changing the value of the ord of each letter. The range used was:

    space    32    00100000
    a        97    01100001
      to
    u       117    01110101
The letters can be stored in 5 bits, because 117-97 = 20, which is less than 32 (2**5). So the 3 high bits could be discarded. 'a' would become 1, 'b' -> 2, etc. The space would become zero. This was calculated using ord($char)-96 & 31. The & 31 removes all but the 5 lowest bits.
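A quick check of that calculation, as a sketch (the helper name `reduce` is hypothetical):

```perl
use strict;
use warnings;

# Reduce a character to its 5-bit value: space -> 0, 'a' -> 1, ... 'u' -> 21.
# Subtraction binds tighter than &, so this is (ord - 96) & 31.
sub reduce {
    my ($char) = @_;
    return ord($char) - 96 & 31;
}

for my $char (' ', 'a', 'j', 'u') {
    my $label = $char eq ' ' ? 'space' : $char;
    printf "%s -> %d\n", $label, reduce($char);
}
# space -> 0
# a -> 1
# j -> 10
# u -> 21
```

For the space, 32 - 96 is -64, and its five lowest bits are all zero, which is why it comes out as 0.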
The code which handles the last two points (in order of execution) is:
    (each separate char)     # The string split into a list of chars
    map { ord }              # The ordinal value of the char in $_
    map {                    # uses $b to buffer the 4 bits from the previous
                             # character
        $\ .=                # Append to $\...
          chr(               # the character with ordinal value of
            96+(             # the 96 ( 01100000 ) that was removed plus
              $b             # the buffered bits from the previous char
                             # stored in $b's 4 high bits (11110000)
              |              # Fill the four low bits with
              $_ >> 4 & 63   # the four high bits from the current character
              || 0xc0        # If zero (ie space), use 192 (11000000) instead
            ) &255           # for space, 192 + 96 = 288 (100100000), so remove
                             # the 9th bit which leaves 32 (00100000)
          );
        $b =                 # buffer the 4 low bits of $_ as the 4 high bits
                             # $b by:
          $_<<8>>4           # equiv of $_ << 4 (ie shift $_ left by 4)
          & 0xf0             # and zero all but bits 5-8
    }
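For anyone who wants to experiment, here is a long-hand round trip. The decoder is meant to do the same thing as the golfed map block; the encoder is my reconstruction from the description, not code from the original post. The names `japh_encode` and `japh_decode` are hypothetical, and the encoder assumes the text uses only space and a-u, and that the first letter's reduced value fits in 4 bits.

```perl
use strict;
use warnings;

# Encoder (reconstruction): reduce each char, then store the low nibble of
# each value together with the high nibble of the next value.
sub japh_encode {
    my ($text) = @_;
    my @values = map { ord($_) - 96 & 31 } split //, $text;
    push @values, 0;    # padding so the last letter has a "next" value
    my @bytes;
    for my $i (0 .. $#values - 1) {
        push @bytes, (($values[$i] & 0x0f) << 4) | ($values[$i + 1] >> 4);
    }
    return @bytes;
}

# Decoder: the golfed block written out step by step.
sub japh_decode {
    my @bytes = @_;
    my ($carry, $text) = (0, '');
    for my $byte (@bytes) {
        # $carry holds the nibble buffered from the previous byte (as bits 5-8);
        # the high nibble of this byte supplies the four low bits.
        my $value = $carry | ($byte >> 4);
        $value ||= 0xc0;                    # zero means space
        $text .= chr(96 + $value & 255);    # 0xc0 + 96 = 288, & 255 gives 32
        $carry = ($byte << 4) & 0xf0;       # buffer this byte's low nibble
    }
    return $text;
}

my @bytes = japh_encode('just another perl hacker');
print join(' ', map { sprintf '%02X', $_ } @bytes), "\n";
# A1 51 31 40 00 10 E0 F1 40 80 51 20 01 00 51 20 C0 00 80 10 30 B0 51 20
print japh_decode(@bytes), "\n";
# just another perl hacker
```

The hex dump lines up with the characters in the one-liner's string: A1 is ¡, 51 is Q, 31 is 1, and so on.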
Because I wanted to store the string as printable characters (where possible), I needed to take into account the fact that the character set in use could be UTF8 or non-UTF8.
Code points 0-255 of Unicode are the same as ASCII (0-127) followed by the Latin-1 Supplement (128-255). In UTF8, the ASCII characters are stored as the same single byte, but each character from 128 to 255 becomes two bytes: a lead byte (\xC2 or \xC3) followed by a continuation byte. For instance:

    Char    Perl/Latin-1    UTF-8
    ------------------------------
    é       E9              C3 A9
    °       B0              C2 B0

I could take a UTF8 string, run it through utf8::decode() (see utf8) and get the single byte values that I wanted. If the string wasn't UTF8 encoded (and thus was already in the byte form that I wanted), then the decode would fail silently.
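The two cases can be seen side by side in a short sketch. The variable names are mine, and the byte strings are written with \x escapes so the example does not depend on how the source file itself is encoded:

```perl
use strict;
use warnings;

my $utf8_bytes   = "\xC2\xB0";   # the two UTF-8 bytes for the degree sign
my $latin1_bytes = "\xB0";       # the single Latin-1 byte for the same character

# utf8::decode works in place and returns false on malformed input.
my $ok_utf8   = utf8::decode($utf8_bytes);
my $ok_latin1 = utf8::decode($latin1_bytes);

printf "UTF-8 input:   decoded=%s length=%d ord=0x%02X\n",
    ($ok_utf8 ? 'yes' : 'no'), length($utf8_bytes), ord($utf8_bytes);
printf "Latin-1 input: decoded=%s length=%d ord=0x%02X\n",
    ($ok_latin1 ? 'yes' : 'no'), length($latin1_bytes), ord($latin1_bytes);
```

Either way the string ends up as a single character with ordinal 0xB0, which is what the rest of the program needs.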
There was one problem character: \x80, which is not a printable character, so I had to represent it in byte form. But for the utf8::decode to succeed, I had to use the UTF8 representation: \xC2 \x80.
This left me with a problem when the string wasn't in UTF8, as I'd have this extra \xC2 'Â' character. So instead of using split '', $string to split the string into individual characters, I did this:
    map {
        utf8::decode($_);    # Try to decode the string from UTF8
                             # If it fails, it leaves Â (\xC2) characters
        m/([^Â])/g           # So match each character except for Â
    } qq(The string)
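Here is the same idea as a standalone sketch. The helper name `chars_of` and the short test strings are hypothetical, and the regex uses \xC2 rather than a literal Â so the example is independent of the file's encoding:

```perl
use strict;
use warnings;

# Split a string into characters, decoding UTF-8 when possible and
# dropping any stray \xC2 lead bytes otherwise.
sub chars_of {
    my ($string) = @_;
    utf8::decode($string);              # leaves the string alone if not valid UTF-8
    return $string =~ m/([^\xC2])/g;    # every character except \xC2
}

# The same four characters, once fully UTF-8 encoded and once as Latin-1
# bytes with only \x80 written in its UTF-8 form.
my @from_utf8   = map { ord } chars_of("\xC2\xA1Q\xC2\x80\xC2\xB0");
my @from_latin1 = map { ord } chars_of("\xA1Q\xC2\x80\xB0");

print "@from_utf8\n";     # 161 81 128 176
print "@from_latin1\n";   # 161 81 128 176
```

The obvious limitation is that a genuine Â in the data would be thrown away too; the JAPH string happens not to contain one.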
The string looks like this:

    qq(¡Q1@\0\cPàñ@\xc2\x80Q \cA\0Q À\0\xc2\x80\cP0°Q )

There are a number of literal characters, null bytes (\0) and control characters (\cA => chr(1)), plus the \xC2 \x80 discussed above.
Here endeth the lesson :)
Clint