Re: JAPH with shifting (explained)

As nobody has commented on this (not sure if that's because it's too ugly or just rubbish), I've added an explanation of the code below. It's nowhere near as clever as I wanted it to be. But the same thing could probably be said of me.

The one-liner runs from the command line on (at least) Windows and Linux, and copes with UTF8 or Latin-1 character sets.

For each chunk of code, I present the original code, then explain it in order of execution.

Making it work on the CLI of Windows and Linux

Making it run from the CLI in Windows and Linux is a problem because of the different quoting schemes. Windows requires double quotes, but if you give bash double quotes, then it will try to interpolate any variables starting with $.

So I had to store the program in a string, in which I'd replaced all of the $ sigils with 'Z'. At startup, a regex restores all of the $ sigils, and the string is then eval'ed:


    map { s/Z/\x24/g; eval() } map { m/(.*)/ } q{THE PROGRAM};

    ## The statements in order of execution:
    q{THE PROGRAM}  # The string containing the program, all $ replaced with Z

    map{m/(.*)/}    # Put the string in $_, otherwise the regex fails
                    # with "Can't modify constant item in substitution"

    map {           # Using the variable $_
        s/Z/\x24/g; # Replace all 'Z' with '$' (to replace the missing sigils)
        eval()      # and do a string eval
    }
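To see the restore-and-eval step in isolation, here is a minimal sketch. The stored program (`my Zx = 6; Zx * 7`) is a hypothetical stand-in, not the real JAPH, and the names `$stored`, `$program` and `$result` are mine. It copies the constant into a variable before substituting, which is the same problem the `m/(.*)/` map works around.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real program: every '$' is written as 'Z'.
my $stored = q{my Zx = 6; Zx * 7};

# Copy into a variable first, because s/// cannot modify a constant.
(my $program = $stored) =~ s/Z/\x24/g;

# A string eval returns the value of the last statement it evaluated.
my $result = eval $program;
die $@ if $@;

print "$program\n";
print "$result\n";
```

Any placeholder works as long as it never appears in the program for another reason; 'Z' was presumably free in the original.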

So the actual program looks like this:


    map {
        $\.=chr(96+( $b | $_ >> 4 & 63 || 0xc0 ) & 255);
        $b =$_ <<8>>4 & 0xf0 
    }
    map { ord }
    map { 
        utf8::decode($_);
        m/([^Â])/g
    }
    qq(¡Q1@\0\cPàñ@\xc2\x80Q \cA\0Q À\0\xc2\x80\cP0°Q );
    $\.=qq[,\n],
    print''

Storing the result in $\
The contents of the $\ ($OUTPUT_RECORD_SEPARATOR) variable are printed after the last argument to print. So by storing the string in $\, I can call print '' at the end of my program, and it prints an empty string, followed by the contents of $\.
I cheated by appending a comma and a newline to $\ just before printing.
```
   $\.=qq[,\n],
   print ''
```
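A small sketch of that behaviour, assuming nothing beyond core Perl. It prints to an in-memory filehandle so the result can be inspected; `$captured` and `$fh` are names I made up for the illustration.

```perl
use strict;
use warnings;

my $captured = '';
{
    # Print to an in-memory filehandle so the output can be checked afterwards.
    open my $fh, '>', \$captured or die "Cannot open in-memory handle: $!";

    local $\ = 'just another perl hacker';   # the output record separator
    $\ .= ",\n";                             # append to it, as the JAPH does
    print {$fh} '';                          # prints '' followed by $\
    close $fh;
}
print $captured;
```

Using `local` keeps the change to `$\` inside the block, so the final `print` is not followed by a second copy of the text.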
Shifting 4 bits
The idea was to take the ord of each letter in the string "just another perl hacker", and store the four high bits of each letter as the four low bits of the previous byte, and its four low bits as the four high bits of the current byte. For example:
```
    Char Ord   Bits
    --------------------
    \0     0   0000 0000
    j    106   0110 1010
    u    117   0111 0101

    Discard the first 4 bits of \0, which leaves:
    0000 0110   -> 6    -> \cF
    1010 0111   -> 167  -> \xA7
    0101 ....
```
The code that handles this appears after the next point.
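The worked example above can be checked with a short sketch. This is only the plain nibble shift on full 8-bit ords, before the 5-bit trick is added, and `shift_nibbles` is a hypothetical helper name, not part of the JAPH.

```perl
use strict;
use warnings;

# Shift a byte stream along by one nibble: each output byte takes its high
# nibble from the previous input byte's low nibble, and its low nibble from
# the current input byte's high nibble.
sub shift_nibbles {
    my @bytes    = @_;
    my $prev_low = 0;    # stands in for the leading \0
    my @out;
    for my $byte (@bytes) {
        push @out, ($prev_low << 4) | ($byte >> 4);
        $prev_low = $byte & 0x0f;
    }
    return @out;
}

my @shifted = shift_nibbles(map { ord } split //, "ju");
printf "%3d  %08b\n", $_, $_ for @shifted;
```

For "ju" this gives 6 and 167, matching \cF and \xA7 in the table; the low nibble of 'u' is left waiting for the next letter.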

Altering the ord of each character

I added to this by first changing the value of the ord of each letter. The range used was:

             Ord   Bits
    space     32   00100000
    a         97   01100001
    to u     117   01110101

The letters can be stored in 5 bits, because 117-97 = 20, which is less than 32 (2**5). So the 3 high bits could be discarded. 'a' would become 1, 'b' -> 2, etc., up to 'u' -> 21. The space would become zero. This was calculated using ord($char)-96 & 31. The & 31 removes all but the 5 lowest bits, which is also why the space works: 32-96 = -64, and the 5 lowest bits of -64 are all zero.
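A quick sketch of that mapping and its inverse. The sub names `to_code` and `from_code` are made up for the illustration; the expression itself is the one quoted above.

```perl
use strict;
use warnings;

# Map a character to its 5-bit code: space -> 0, 'a' -> 1, ... 'u' -> 21.
# Subtraction binds tighter than &, so this is (ord($char) - 96) & 31.
sub to_code {
    my ($char) = @_;
    return ord($char) - 96 & 31;
}

# Inverse mapping: add the 96 back, with 0 treated as the space.
sub from_code {
    my ($code) = @_;
    return $code ? chr(96 + $code) : ' ';
}

printf "'%s' -> %2d\n", $_, to_code($_) for ' ', 'a', 'j', 'u';
```

The inverse spells out the space case explicitly; the JAPH gets the same effect with the `|| 0xc0` and `& 255` steps.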

The code which handles the last two points (in order of execution) is:


    (each separate char)     # The string split into a list of chars

    map{ ord }               # The ordinal value of the char in $_

    map{                     # uses $b to buffer the 4 bits from the previous
                             # character

      $\ .=                  # Append to $\...
        chr(                 # the character with ordinal value of
          96+(               # the 96 ( 01100000 ) that was removed plus
            $b               # the buffered bits from the previous char
                             # stored in $b's 4 high bits (11110000)
              |              # Fill the four low bits with
                $_ >> 4 & 63 # the four high bits from the current character
            || 0xc0          # If zero (ie space), use 192 (11000000) instead
          )
          &255               # for space, 192 + 96 = 288 (100100000), so remove
                             # the 9th bit which leaves 32 (00100000)
      );
      $b =                   # buffer the 4 low bits of $_ as the 4 high bits
                             # of $b by:
           $_<<8>>4          # equiv of $_ << 4 (ie shift $_ left by 4)
            & 0xf0           # and zero all but bits 5-8
    }
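For comparison, here is a de-obfuscated sketch of the same loop. It is my own rewrite rather than the code in the JAPH: `decode_japh`, `$carry` and `$code` are made-up names, the `|| 0xc0` / `& 255` trick is replaced by an explicit test for the space, and the string is written with byte escapes throughout so that the file's encoding does not matter.

```perl
use strict;
use warnings;

sub decode_japh {
    my ($bytes) = @_;
    my $carry = 0;    # plays the role of $b: bits carried over from the previous byte
    my $text  = '';
    for my $byte (map { ord } split //, $bytes) {
        my $code = $carry | ($byte >> 4);    # 5-bit code: 0 = space, 1 = 'a', ...
        $text .= $code ? chr(96 + $code) : ' ';
        $carry = ($byte << 4) & 0xf0;        # low nibble becomes the next high nibble
    }
    return $text;
}

# The stored string from the JAPH, with every non-ASCII character as a \x escape.
my $stored = "\xa1Q1\@\0\cP\xe0\xf1\@\x80Q \cA\0Q \xc0\0\x80\cP0\xb0Q ";
print decode_japh($stored), "\n";
```

The trailing comma and newline are not part of the stored string; the JAPH appends them to `$\` separately.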

Handling UTF8 / Latin-1
Because I wanted to store the string as printable characters (where possible), I needed to take into account the fact that the character set in use could be UTF8 or non-UTF8.
The first 256 characters of Unicode are the same as ASCII (0-127) followed by the Latin-1 Supplement (128-255). In UTF8, the characters from 0 to 127 are still a single byte, but each character from 128 to 255 is stored as two bytes: a lead byte (\xC2 or \xC3) followed by a second byte. For instance:
```
    Char Perl/Latin-1    UTF-8
    --------------------------
    é    E9              C3 A9
    °    B0              C2 B0
```
I could take a UTF8 string, run it through utf8::decode() (see utf8) and get the single-byte values that I wanted. If the string wasn't UTF8 encoded (and thus was already in the byte form that I wanted), then the decode would fail silently and leave the string untouched.
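Both halves of that can be seen in a short sketch using the core utf8::encode and utf8::decode functions. The `hex_of` helper and the variable names are hypothetical, added only to show the bytes.

```perl
use strict;
use warnings;

# Hex dump of a string, one two-digit value per character or byte.
sub hex_of {
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $_[0];
}

my $e_acute = chr 0xE9;        # 'é' as a single Latin-1 character
my $encoded = $e_acute;
utf8::encode($encoded);        # in place: now the two UTF-8 bytes
print hex_of($e_acute), ' -> ', hex_of($encoded), "\n";    # E9 -> C3 A9

my $round_trip = $encoded;
my $ok = utf8::decode($round_trip);      # valid UTF-8: true, one character again

my $latin1 = "\xE9";                     # a lone \xE9 is not valid UTF-8
my $failed = !utf8::decode($latin1);     # false, and $latin1 is left unchanged
print "decoded: ", hex_of($round_trip), ", left alone: ", hex_of($latin1), "\n";
```

Ignoring the return value, as the JAPH does, is what makes the failure silent.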
There was one problem character: \x80, which is not a printable character, so I had to represent it in byte form. But for the utf8::decode to succeed, I had to use the UTF8 representation: \xC2 \x80.
This left me with a problem when the string wasn't in UTF8, as I'd have this extra \xC2 'Â' character. So instead of using split '', $string to split the string into individual characters, I did this:
```
    map{
        utf8::decode($_);   # Try to decode the string from UTF8
                            # If it fails, it leaves Â (\xC2) characters
        m/([^Â])/g          # So match each character except for Â
    }
    qq(The string)
```
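As a rough check that both encodings end up as the same list of characters, here is a sketch with a shortened, hypothetical sample (not the full JAPH string). `chars_of` is my own name, and the regex uses \xC2 rather than a literal Â so that the sketch itself does not depend on how the file is saved.

```perl
use strict;
use warnings;

# Decode if the bytes are valid UTF-8, then return every character except \xC2.
sub chars_of {
    my ($string) = @_;
    utf8::decode($string);               # leaves the string alone if it is not UTF-8
    return $string =~ m/([^\xC2])/g;     # in list context: one capture per match
}

my $as_latin1 = "\xA1Q\xC2\x80\xB0";            # Latin-1 bytes plus the \xC2\x80 pair
my $as_utf8   = "\xC2\xA1Q\xC2\x80\xC2\xB0";    # the same text saved as UTF-8

for my $sample ($as_latin1, $as_utf8) {
    print join(' ', map { sprintf('%02X', ord($_)) } chars_of($sample)), "\n";
}
```

Both samples print A1 51 80 B0. The approach only works because \xC2 itself never appears among the encoded bytes.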
The string itself
The string looks like this:
```
    ¡Q1@\0\cPàñ@\xc2\x80Q \cA\0Q À\0\xc2\x80\cP0°Q
```
There are a number of literal characters, null bytes (\0) and control characters (\cA => chr(1), \cP => chr(16)), plus the \xC2 \x80 discussed above.

Here endeth the lesson :)

Clint

Re^2: JAPH with shifting (explained) by goibhniu (Hermit) on Aug 21, 2007 at 17:16 UTC
I like it - I think the shifting stuff is cool. Could it be adapted to become a uuencoder (sans obfu) or something else cool? Did a real-world example inspire you to work on this? I'm too new to Perl to be able to de-obfuscate. I appreciate the explanation. (I'll write one up on my own soon). I humbly seek wisdom.
Re^3: JAPH with shifting (explained) by clinton (Priest) on Aug 21, 2007 at 17:27 UTC
I know very little about uuencoding and even less about pack/unpack, but I have a sneaking suspicion that `pack/unpack` would be the way to go with that. Could somebody give an example of this?

The "inspiration" came from using a hex editor to debug UTF8 issues, and originally, I wanted to use the bytes from UTF8 characters in a sentence to hold the Latin-1 bytes for "just another Perl hacker,". But of course the first 256 characters in UTF8 and Latin-1 are the same, so I couldn't do it without resorting to a different script within UTF8, which would mean that it wouldn't work on many machines. So I settled for making it unreadable instead :)

Clint