http://qs1969.pair.com?node_id=11109831

harangzsolt33 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write a perl script which can compress a perl script without breaking it. Its job is to remove comments, new line characters, and unnecessary whitespace.

I have gotten so far that I can remove most comments and spaces, but I don't know how to interpret lines which contain regex and quotes and double-quotes. Everybody knows that I can't just chop off everything that comes after the '#' character, because if it's within a print statement, then the script no longer works:

  print "\nHello World ###";
  return;
}

would become:

print"\nHello Worldreturn;}

So, I need to come up with a logic that can interpret a line of complex perl code and remove comments and spaces in such a way that it won't break the code.

I just started learning perl 3 years ago, so I am not even sure what kind of complicated code patterns may exist in a perl script. I always see something new and marvel "What in the world is this code!! I've never seen anything like it." Lol

Ok, here is what I have done so far:

#!/usr/bin/perl -w # # This perl script can compress itself by getting rid # of comments, spaces, tabs, new lines, and reducing # the perl code to its minimal size where it will # still function. # Written by Zsolt on Dec. 7, 2019. <zsnp@juno.com> # Tested on TinyPerl 5.8 (Windows XP). # # When $WRITE_FILE is true, it will save the output to # a file called OUTPUT##.pl where ## is a number. # ################################################## use strict; use warnings; my $WRITE_FILE = 0; my $SELF = ReadFile(__FILE__); my $SHRUNK = CompactPerl($SELF); my $ORIGINAL_SIZE = length($SELF); my $SHRUNK_SIZE = length($SHRUNK); my $DIFF = $ORIGINAL_SIZE - $SHRUNK_SIZE; print '>' x 78; print "\n", $SHRUNK; print '^' x 78; print "\n\nFILE NAME : ", __FILE__; print "\nFILE SIZE : $ORIGINAL_SIZE bytes"; print "\nSHRUNK SIZE : $SHRUNK_SIZE bytes"; WriteOutput($SHRUNK) if ($WRITE_FILE); if ($SHRUNK_SIZE < $ORIGINAL_SIZE) { print "\n\nYou could eliminate $DIFF bytes by compressing this scrip +t.\n"; } print <<'END_OF_MESSAGE'; Wow, this script is amazing! * * * * * * * * * * * * * * * * Yes, it really is! :D END_OF_MESSAGE exit; ################################################## # # Creates and overwrites a file in binary mode. # Returns 1 on success or 0 if something went wrong. # Usage: INTEGER = CreateFile(FILE_NAME, CONTENT) # sub CreateFile { defined $_[0] or return 0; my $F = $_[0]; $F =~ tr#\"\0*?|<>##d; # Remove special characters length($F) or return 0; local *FH; open(FH, ">$F") or return 0; binmode FH; if (defined $_[1] ? length($_[1]) : 0) { print FH $_[1]; } close FH or return 0; print "\n$F was saved.\n"; return 1; } ################################################## # # This function reads the entire contents of a file # in binary mode and returns it as a string. If an # errors occur, an empty string is returned silently. # A second argument will move the file pointer before # reading. And a third argument limits the number # of bytes to read. # Usage: STRING = ReadFile(FILENAME, [START, [LENGTH]]) # sub ReadFile { my $NAME = defined $_[0] ? $_[0] : ''; $NAME =~ tr/\"\0*?|<>//d; # Remove special characters -e $NAME or return ''; -f $NAME or return ''; my $SIZE = -s $NAME; $SIZE or return ''; my $LEN = defined $_[2] ? $_[2] : $SIZE; $LEN > 0 or return ''; local *FH; sysopen(FH, $NAME, 0) or return ''; binmode FH; my $POS = defined $_[1] ? $_[1] : 0; $POS < $SIZE or return ''; $POS < 1 or sysseek(FH, 0, $POS); # Move file ptr my $DATA = ''; sysread(FH, $DATA, $LEN); # Read file close FH; return $DATA; } ################################################## # v2019.8.25 # Removes whitespace from before and after STRING. # Whitespace is here defined as any byte whose # ASCII value is less than 33. That includes # tabs, esc, null, vertical tab, new lines, etc. # Usage: STRING = Trim(STRING) # sub Trim { defined $_[0] or return ''; my $L = length($_[0]); $L or return ''; my $P = 0; while ($P <= $L && vec($_[0], $P++, 8) < 33) {} for ($P--; $P <= $L && vec($_[0], $L--, 8) < 33;) {} substr($_[0], $P, $L - $P + 2) } ################################################## # # Saves the shrunk code in a file named output##.pl # where ## is a number. If output1.pl already # exists, then it will create output2.pl... # Usage: WriteOutput(CONTENT) # sub WriteOutput { defined $_[0] or return; length($_[0]) or return; my $OUTPUT_FILENAME; for (my $i = 1; $i < 100; $i++) { $OUTPUT_FILENAME = "compact$i.pl"; -e $OUTPUT_FILENAME or return CreateFile($OUTPUT_FILENAME, $_[0]); } } ################################################## # This function returns 1 if STRING ends with any # of the characters in LIST. # Usage: INTEGER = EndsWithChar(STRING, LIST) # sub EndsWithChar { (@_ == 2 && defined $_[0] && defined $_[1] && length($_[1]) && lengt +h($_[0])) ? (index($_[1], substr($_[0], length($_[0]) - 1)) < 0 ? 0 : + 1) : 0; } ################################################## # # This function receives one line of perl code # which contains one or more of the following: # quotes, double-quotes, a regex pattern, and # may also contain a comment at the end of the line. # The job of this function is to remove the # comments and spaces without breaking the code! # Usage: STRING = CompactLine(STRING) # sub CompactLine { defined $_[0] or return ''; length($_[0]) or return ''; my $OUTPUT = Trim($_[0]); $OUTPUT .= "\n"; # <<< THIS PART IS UNDER CONSTRUCTION >>> return $OUTPUT; } ################################################## # # This function trims whitespace from before and after # a list of perl operators... This function modifies # the first argument and does not return anything. # Usage: TrimOperators(STRING) # # Example: TrimOperators($LINE) # This will look for ' + ' in $LINE and replace it # with '+' It will then look for ' - ' and replace # it with '-' and so forth with all the characters # listed in $OP. # sub TrimOperators { defined $_[0] or return; length($_[0]) or return; # print "\nTrimOperators>", $_[0]; my $c; my $OP = '+-,.|&*/<>?:;!%^()[]{}='; for (my $i = 0; $i < length($OP); $i++) { $c = substr($OP, $i, 1); $_[0] =~ s/\s*\Q$c\E\s*/$c/g; } # print "\nTrimOperators>", $_[0], "\n"; } ################################################## # v2019.12.7 # This function takes regular perl source code # as an input string and outputs compact perl code # that has spaces and comments removed. # Usage: STRING = CompactPerl(STRING) # sub CompactPerl { defined $_[0] or return ''; my $CODE = $_[0]; $CODE =~ tr|\r||d; my @A = split(/\n/, $CODE); # Split lines my $OP = '+-,.|&*/<>?:;!%^()[]{}='; my $STRIP; # Part of a line that comes before '#' my $P; my $LINE; my $TRIMMED; my $END_MARKER = ''; my $MULTILINE_STRING = 0; my $STOP = 0; $A[0] .= "\n"; # Skip shebang for (my $i = 1; $i < @A; $i++) { $LINE = $A[$i]; # Remove everything after the '#' sign for easier processing $P = index($LINE, '#'); $STRIP = Trim( ($P < 0) ? $LINE : substr($LINE, 0, $P) ); # Skip here-documents if ($MULTILINE_STRING == 0 && $STRIP =~ m/\s*<<['"]+([_A-Za-z0-9]* +)['"]+/) { $END_MARKER = $1; $MULTILINE_STRING = 2; $A[$i] .= "\n"; n +ext; } if ($MULTILINE_STRING == 2) { index($LINE, $END_MARKER) < 0 or $MU +LTILINE_STRING = 0; $A[$i] .= "\n"; next; } # Skip processing qq{ } text blocks if ($MULTILINE_STRING == 0 && index($STRIP, 'qq{') >= 0) { $MULTIL +INE_STRING = 1; } if ($MULTILINE_STRING == 1) { if (index($LINE, '}') < 0) { $A[$i] +.= "\n"; } else { $MULTILINE_STRING = 0; } next; } # Remove everything after __END__ if ($STRIP eq '__END__') { $STOP = 1; $MULTILINE_STRING = -1; } if ($STOP == 1) { $A[$i] = ''; next; } # Skip __DATA__ section if ($MULTILINE_STRING == 0 && $STRIP eq '__DATA__') { $A[$i] = "\n +__DATA__"; $MULTILINE_STRING = 3; } if ($MULTILINE_STRING == 3) { $A[$i] .= "\n"; next; } # Remove comment lines if (length($STRIP) == 0) { $A[$i] = ''; next; } # Any line that contains the characters ' " or ~ # is going to be processed by CompactLine() instead if ($LINE =~ m/[~\"\']+/) { $A[$i] = CompactLine($LINE); next; } $TRIMMED = Trim($STRIP); $TRIMMED =~ s/([a-z])\s+\(/$1\(/g; # Reduce 'my ($A' --> 'my($ +A' $TRIMMED =~ s/([a-z])\s+\%/$1\%/g; # Reduce 'my %A' --> 'my%A +' $TRIMMED =~ s/([a-z])\s+\$/$1\$/g; # Reduce 'my $A' --> 'my$A +' $TRIMMED =~ s/([a-z])\s+\@/$1\@/g; # Reduce 'my @A' --> 'my@A +' TrimOperators($TRIMMED); # Reduce '$A + $B' --> '$A+$B' # If a line ends with a letter and the next line starts with # the word 'or' then we'll break the code if we join the two. # So, unless a line ends with an operator symbol, # we should add a space. EndsWithChar($TRIMMED, $OP) or $TRIMMED .= ' '; # Here is an example: $LINE or $LINE; $A[$i] = $TRIMMED; } return join('', @A); } ################################################## __DATA__ Everything after this point will not get compacted. It gets written .... AS IS. No change. __END__ # This is the end of file and whatever comes after this point gets discarded forever..... ............

Replies are listed 'Best First'.
Re: Perl script compressor
by haukex (Archbishop) on Dec 08, 2019 at 08:14 UTC
    I am not even sure what kind of complicated code patterns may exist in a perl script.

    Classic example of "only perl can parse Perl" (based heavily on tye's example here):

    BEGIN { eval( rand>0.5 ? 'sub foo () {}' : 'sub foo {}' ) } foo / # /; 42;

    That's perfectly valid Perl code even under strict. Run that though B::Deparse a few times, and sometimes you'll get

    foo(/ # /); '???';

    and sometimes

    &foo() / 42;

    In other words, sometimes it's a function call to which the result of a regex is being passed followed by a constant (that gets optimized away), and sometimes it's the return value of a function call, followed by a division operator, a comment, and the divisor. Which it is depends entirely on a random number that changes from run to run, and the result can't be known until the perl binary executes the first line of Perl code; doing the same with a static parse (without executing code) is impossible.

    I need to come up with a logic that can interpret a line of complex perl code and remove comments and spaces in such a way that it won't break the code.

    Perl Cannot Be Parsed: A Formal Proof

    The closest thing to a static Perl parser is PPI (and perhaps PPR; Update 2: and new: Guacamole), and marto already pointed you to one module based on it; see also Perl::Squish. Update: You could also look at Perl::Tidy. Update 3: I think RPerl also implements its own parser.

Re: Perl script compressor
by marto (Cardinal) on Dec 08, 2019 at 07:09 UTC

    Rather than reinvented the wheel look at something like Perl::Strip.

Re: Perl script compressor
by afoken (Chancellor) on Dec 08, 2019 at 17:33 UTC
    I am trying to write a perl script which can compress a perl script without breaking it. Its job is to remove comments, new line characters, and unnecessary whitespace.

    Why?

    Just for fun or education? That would be ok.

    But I see no other reason for doing so. 180 kByte floppy disks are gone since decades, commonly available mass storage is in the GByte or TByte range, so there is no shortage of disk space. Perl's compile phase won't be speed up significantly by stripping whitespace and comments. Any kind of electronic transmission can be significantly accelerated by applying state of the art compression (e.g. bzip2, lzma) before transmission. (Just for fun: Current CGI.pm has 123 KBytes, gzip compresses that to 36 KBytes, bzip2 and lzma even down to 32 KBytes, all without any loss of information!)

    Minimizing and transparent compression, as usual for jQuery and others, does not make sense for anything but web browsers. And running perl in a web browser is possible, but anything but common.

    So what's left? Creating a maintainance nightmare, just because you can?

    Or did I miss something?

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      Or did I miss something?

      Probably. Many Javascript blurbs are delivered either as foo.js or foo-minimal.js, so that might be some "State Of The Art" thing. Or code obfuscation for *cough* EULA reasons. Or some such.

      perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
        Many Javascript blurbs are delivered either as foo.js or foo-minimal.js

        True, but I'm pretty sure JS can be parsed with a static parse, and so a reduction like that is safer than with Perl.

      Yes, I was thinking about writing a cgi script to run on a server. If I remove spaces, the OS might be able to load the whole script with one disk read. But if it's bigger, then it might take two or three. So, the smaller the code, the faster it load and the more likely it is that it will remain in the cache. So, if it runs multiple times, the OS might not even have to load it from the disk. So, that was the whole purpose of me doing this. --harangzsolt33 (I'm currently not logged in)

        I suggest using strace while running your script. If your script uses strict, warnings, CGI.pm, or any other modules, they also get opened and read into memory. If your Perl is configured to use sitecustomize.pl, that will be opened and read in. And if it's been read in once, with an OS like Linux there's a chance the files are hot and ready in a cache anyway. But strace will demonstrate to you that the top level script you load is not the largest component that gets read in from a file.

        The bulk of startup time has little to do with just reading the program file in from (hopefully) an SSD. I created two Hello World scripts; one with 13368 lines, consuming 1.1 megabytes on disk, and one with seven lines, consuming 96 bytes on disk. They both start by printing "Hello world\n", and end by printing "Goodbye world\n", but in the first script there are 13361 80-column lines of comments between the two print statements. Perl must read the entire file before getting to the final Goodbye world. Here are the timings:

        $ time ./mytest.pl Hello world Goobye world real 0m0.022s user 0m0.014s sys 0m0.008s $ time ./mytest2.pl Hello world Goodbye world real 0m0.008s user 0m0.008s sys 0m0.001s

        A tremendous increase from eight milliseconds to twenty two. We go from running 45 times per second to 125 *if* the bloated script is 1 megabyte in size, and if all that bloat (including the parts that you bring in from CPAN and core-Perl libs) can be reduced to 96 bytes. What if the source script is 64kb? Let's try that:

        $ time ./mytest.pl Hello world Goodbye world real 0m0.009s user 0m0.004s sys 0m0.005s

        So now we're talking about 1 millisecond difference. Instead of 125 runs per second, we have 111 per second, for a much more typically-sized script.

        If startup time is a problem you won't solve it by minifying your Perl script. It's better solved by converting over to a daemon process that stays resident, or if that's really impossible, scaling out horizontally.


        Dave

        You want to take a step back. What measurement have you done to examine the compilation time for a script with comments and spaces vs the same code without? If you're worried about performance profile your code (Devel::NYTProf). You should also read How can I make my CGI script more efficient? Or just move to an approach which isn't CGI scripts, it was removed for good reason, and starting anything new with it is actively discouraged.

        If you're that concerned about disk reads, I'd just copy the file into shared memory space (eg: /dev/shm) on system or web server startup, then read the script from there instead.

        Or, use a system that only has to read the file once upon web server instantiation.

        This seems like premature optimization.

        Update: I thought some more about this. If you're unit testing your code (which you should be for sure!), you'd have to run the tests again on this automatically re-written code in case something is lost in translation.

        I'm all for doing things for education and learning purposes, but I don't think the risk is worth it if the sole objective is to use the code to try to make something a fraction of a nanosecond (obviously estimated) more efficient.

Re: Perl script compressor
by clueless newbie (Curate) on Dec 08, 2019 at 14:22 UTC
    perltidy --mangle --dac ... Here's your script after perltidy --mangle --dac
    #!/usr/bin/perl -w use strict; use warnings; my$WRITE_FILE=0; my$SELF=ReadFile(__FILE__); my$SHRUNK=CompactPerl($SELF); my$ORIGINAL_SIZE=length($SELF); my$SHRUNK_SIZE=length($SHRUNK); my$DIFF=$ORIGINAL_SIZE-$SHRUNK_SIZE; print '>' x 78; print"\n",$SHRUNK; print '^' x 78; print"\n\nFILE NAME : ",__FILE__; print"\nFILE SIZE : $ORIGINAL_SIZE bytes"; print"\nSHRUNK SIZE : $SHRUNK_SIZE bytes"; WriteOutput($SHRUNK)if($WRITE_FILE); if($SHRUNK_SIZE<$ORIGINAL_SIZE){print"\n\nYou could eliminate $DIFF by +tes by compressing this script.\n";}print <<'END_OF_MESSAGE'; Wow, this script is amazing! * * * * * * * * * * * * * * * * Yes, it really is! :D END_OF_MESSAGE exit; sub CreateFile{defined$_[0]or return 0; my$F=$_[0]; $F=~tr#\"\0*?|<>##d; length($F)or return 0; local*FH; open(FH,">$F")or return 0; binmode FH; if(defined$_[1]?length($_[1]):0){print FH$_[1];}close FH or return 0; print"\n$F was saved.\n"; return 1;}sub ReadFile{my$NAME=defined$_[0]?$_[0]:''; $NAME=~tr/\"\0*?|<>//d; -e$NAME or return ''; -f$NAME or return ''; my$SIZE=-s$NAME; $SIZE or return ''; my$LEN=defined$_[2]?$_[2]:$SIZE; $LEN>0 or return ''; local*FH; sysopen(FH,$NAME,0)or return ''; binmode FH; my$POS=defined$_[1]?$_[1]:0; $POS<$SIZE or return ''; $POS<1 or sysseek(FH,0,$POS); my$DATA=''; sysread(FH,$DATA,$LEN); close FH; return$DATA;}sub Trim{defined$_[0]or return ''; my$L=length($_[0]); $L or return ''; my$P=0; while($P<=$L&&vec($_[0],$P++,8)<33){}for($P--;$P<=$L&&vec($_[0],$L--,8 +)<33;){}substr($_[0],$P,$L-$P+2)}sub WriteOutput{defined$_[0]or retur +n; length($_[0])or return; my$OUTPUT_FILENAME; for(my$i=1;$i<100;$i++){$OUTPUT_FILENAME="compact$i.pl"; -e$OUTPUT_FILENAME or return CreateFile($OUTPUT_FILENAME,$_[0]);}}sub +EndsWithChar{(@_==2&&defined$_[0]&&defined$_[1]&&length($_[1])&&lengt +h($_[0]))?(index($_[1],substr($_[0],length($_[0])-1))<0?0:1):0;}sub C +ompactLine{defined$_[0]or return ''; length($_[0])or return ''; my$OUTPUT=Trim($_[0]); $OUTPUT.="\n"; return$OUTPUT;}sub TrimOperators{defined$_[0]or return; length($_[0])or return; my$c; my$OP='+-,.|&*/<>?:;!%^()[]{}='; for(my$i=0;$i<length($OP);$i++){$c=substr($OP,$i,1); $_[0]=~s/\s*\Q$c\E\s*/$c/g;}}sub CompactPerl{defined$_[0]or return ''; my$CODE=$_[0]; $CODE=~tr|\r||d; my@A=split(/\n/,$CODE); my$OP='+-,.|&*/<>?:;!%^()[]{}='; my$STRIP; my$P; my$LINE; my$TRIMMED; my$END_MARKER=''; my$MULTILINE_STRING=0; my$STOP=0; $A[0].="\n"; for(my$i=1;$i<@A;$i++){$LINE=$A[$i]; $P=index($LINE,'#'); $STRIP=Trim(($P<0)?$LINE:substr($LINE,0,$P)); if($MULTILINE_STRING==0&&$STRIP=~m/\s*<<['"]+([_A-Za-z0-9]*)['"]+/){$E +ND_MARKER=$1;$MULTILINE_STRING=2;$A[$i].="\n";next;}if($MULTILINE_STR +ING==2){index($LINE,$END_MARKER)<0 or$MULTILINE_STRING=0;$A[$i].="\n" +;next;}if($MULTILINE_STRING==0&&index($STRIP,'qq{')>=0){$MULTILINE_ST +RING=1;}if($MULTILINE_STRING==1){if(index($LINE,'}')<0){$A[$i].="\n"; +}else{$MULTILINE_STRING=0;}next;}if($STRIP eq '__END__'){$STOP=1;$MUL +TILINE_STRING=-1;}if($STOP==1){$A[$i]='';next;}if($MULTILINE_STRING== +0&&$STRIP eq '__DATA__'){$A[$i]="\n__DATA__";$MULTILINE_STRING=3;}if( +$MULTILINE_STRING==3){$A[$i].="\n";next;}if(length($STRIP)==0){$A[$i] +='';next;}if($LINE=~m/[~\"\']+/){$A[$i]=CompactLine($LINE); next;}$TRIMMED=Trim($STRIP); $TRIMMED=~s/([a-z])\s+\(/$1\(/g; $TRIMMED=~s/([a-z])\s+\%/$1\%/g; $TRIMMED=~s/([a-z])\s+\$/$1\$/g; $TRIMMED=~s/([a-z])\s+\@/$1\@/g; TrimOperators($TRIMMED); EndsWithChar($TRIMMED,$OP)or$TRIMMED.=' '; $LINE or$LINE; $A[$i]=$TRIMMED;}return join('',@A);} # sub CompactPerl __DATA__ Everything after this point will not get compacted. It gets written .... AS IS. No change. __END__ # This is the end of file and whatever comes after this point gets discarded forever..... ............
      Yes, but.... Have you tried to run it after perltidy? If you run it, it prints this:

      You could eliminate 261 bytes by compressing this script.

      In other words, my compressor script was able to eliminate another 261 bytes after perltidy did its job. So, I am not even finished with this script, yet it was already able to "outperform" perltidy by yielding a smaller code. :D

        If you run your program on itself ... you get
        FILE SIZE : 8688 bytes SHRUNK SIZE : 4021 bytes
        If you run perltidy on your program the result is 3,701 bytes.