When I'm trying to reply to someone's post, I often want to take their code and play with it. Unfortunately, if I cut and paste from my browser, I lose all formatting and the code all runs together in one big line. I find it a pain to try to break all of the lines of code apart by hand.

If I "view source", I have a bunch of funny HTML characters that have to be translated to their ASCII equivalents. I wrote this little program to deal with that. To use it, I "view source", copy the appropriate code section, and save it to a file. I then run this program with the path and filename of the saved Perlmonks code as the argument. The program converts the HTML character codes to their ASCII equivalents and also corrects for that annoying $some_var[0] problem that happens when people post with pre tags instead of code tags (yes, I wrote $some_var[0] incorrectly on purpose). It then writes the converted code back to the file it read it from.

Incidentally, I am not posting this to the CODE section as it's just a quick hack and not worthy of being there. Also, I have overcommented the program so new Monks can understand some of the weird bits.

#!/usr/bin/perl -w
use strict; # Don't you dare write code without this line and the -w switch

my %charcodes;
my $filename = shift @ARGV;

%charcodes = (
    "&#091;" => "[",
    "&#093;" => "]",
    "&#91;"  => "[",
    "&#93;"  => "]",
    "&quot;" => "\"",
    "&lt;"   => "<",
    "&gt;"   => ">",
    "&amp;"  => "&"
);

# Using '+<' to open the file in update mode
open (FILEHANDLE, "+< $filename") || die "Can't open $filename in update mode: $!\n";

# Reading the entire file into the array
my @program_line = <FILEHANDLE>;

foreach (@program_line){
    # The regex below might confuse some people new to perl,
    # so I'll do some explaining here.
    # You might think that I could use &.*; to match a hash value.
    # This fails for two reasons:
    # 1. We might have a sub which is identified with ampersand
    # 2. If there is more than one semicolon after the ampersand,
    #    the regex will be "greedy" and will include the
    #    rightmost semicolon.  We can use &.*?; to try to force
    #    the regex to be lazy, but this could involve a lot of
    #    backtracking and make the regex less efficient.
    # &[^;]{2,6}; is a good regex.  The negated character class guarantees
    # that we will only match 2 to 6 non-semicolons after the ampersand
    # (and we go out to six characters in case this script is upgraded
    # to translate things like &eacute; to é.)
    # The right side of this substitution uses the trinary operator
    # ($x = ($a > $b) ? $a : $c) to substitute the hash value of the
    # character code if such hash value exists, otherwise it substitutes $1
    # back to itself.  This is not the most efficient way of doing this as
    # we have a null substitution, but it works.
    # The /e modifier makes the trinary operator executable.
    # The /g modifier makes the regex global (i.e. we will modify every
    # character code on a single line).
    s/(&[^;]{2,6};)/(exists $charcodes{$1}) ? $charcodes{$1} : $1/eg;

    # The following code will correct for URL expansion of code like
    # $some_hash_var[0] which gets posted with <PRE> tags rather
    # than <CODE> tags.  Don't use it for other URL substitutions
    # because it relies on Perlmonks specific syntax.
    s/(<a href="[^"]+">(\d+)<\/a>)/[$2]/g;
}

# Go back to start of file
seek(FILEHANDLE, 0, 0) or die ("Seek failed on $filename: $!\n");
print FILEHANDLE @program_line or die ("Print failed on $filename: $!\n");
# truncate the file so we don't have excess garbage at the end
truncate(FILEHANDLE, tell(FILEHANDLE)) or die ("Truncate failed on $filename: $!\n");
close (FILEHANDLE) or die ("Close failed on $filename: $!\n");

I hope someone finds this helpful. Also, any suggestions for improvements would be most welcome.

Cheers,
Ovid

RE: Using code posted on PerlMonks
by reptile (Monk) on Jul 11, 2000 at 00:45 UTC

    Pretty good, although (and I haven't tested this or anything, so I could be wrong) I believe the last regex could pose a problem at times...

    s/(<a href="[^"]+">(\d+)<\/a>)/[$2]/g;

    What if the person wrote something like $array[$foo]? Since the regex only matches when the link text is a number, it won't work in this instance. Perhaps something like this:

    s/(<a href="[^"]+">(\$\w+|\d+)<\/a>)/[$2]/g;

    Which would accept all digits or any word preceded by a $. Of course it still doesn't predict all possibilities, like using a hash value or another array, or taking an array slice, etc. but that particular case would tend to crop up pretty commonly I imagine.
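
    A quick sketch of the difference between the two patterns. The sample lines and their href values are invented for illustration, and the second pattern is assumed to include the leading '<' and the closing '</a>' so that it replaces the whole link:

```perl
use strict;
use warnings;

# Hypothetical samples of how "[0]" and "[$foo]" might look in the page
# source when posted outside code tags. The href values are made up.
my @samples = (
    '$array<a href="/index.pl?node=0">0</a>',
    '$array<a href="/index.pl?node=%24foo">$foo</a>',
);

# Original pattern: only matches when the link text is all digits.
sub fix_digits {
    my ($line) = @_;
    $line =~ s/(<a href="[^"]+">(\d+)<\/a>)/[$2]/g;
    return $line;
}

# Suggested pattern: digits, or a word preceded by a '$'.
sub fix_digits_or_scalar {
    my ($line) = @_;
    $line =~ s/(<a href="[^"]+">(\$\w+|\d+)<\/a>)/[$2]/g;
    return $line;
}

for my $sample (@samples) {
    print "digits only:      ", fix_digits($sample), "\n";
    print "digits or scalar: ", fix_digits_or_scalar($sample), "\n";
}
```

    The digits-only version leaves the $foo link untouched, while the second version turns both samples back into ordinary subscripts.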

    All in all, a nice, useful script. Thanks for posting it.

    local $_ = "0A72656B636148206C72655020726568746F6E41207473754A"; while(s/..$//) { print chr(hex($&)) }

      That's a nice point. I threw this code together rather quickly, so I didn't think of all possibilities. That last regex could definitely use some beefing up and I think you've given it a good start. Thanks for the feedback!

      Just a quick (untested) stab:

      s/<a href="[^"]+">([^<]+)<\/a>/[$1]/g;
      This is pretty generic and I would only use it in this context because there is an assumption that the data will be more or less valid. I think it would catch most cases. Also, I dropped a set of parentheses. I have no frickin' idea why I had them in the first place.
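
      For anyone who wants to poke at it, here is a minimal sketch that runs the generic pattern over a few made-up lines. The helper name unlink_subscripts and the href values are hypothetical; only the substitution itself comes from the post above:

```perl
use strict;
use warnings;

# Replace any link with its link text wrapped in square brackets.
# This assumes every link on the line is really a mangled subscript.
sub unlink_subscripts {
    my ($line) = @_;
    $line =~ s/<a href="[^"]+">([^<]+)<\/a>/[$1]/g;
    return $line;
}

# Hypothetical page-source lines; the href values are invented.
my @samples = (
    'print $list<a href="?node=0">0</a>;',
    'print $list<a href="?node=%24i">$i</a>;',
    'print @list<a href="?node=1..3">1..3</a>;',
);

print unlink_subscripts($_), "\n" for @samples;
```

      Because the capture is "anything that is not a '<'", plain indexes, variables, and slices all come through, which is also why it should only be trusted on lines where every link is a mangled subscript.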

      Cheers,
      Ovid

      Update: I've now tested this regex modification and it appears to work fine. I also realized a potential bug in the code that would be rare, but difficult to circumvent: if someone names a sub "amp" (or another HTML character code name) and tries to call the sub with '&amp;', this code will cheerfully convert that to '&' and the resulting code will not function properly.

      Update #2: buzzcutbuddha pointed out that the "Bug" I mentioned wasn't behaving quite the way I posted. It will only occur if someone names a subroutine "amp" and then posts the code without code tags. Highly unlikely, but a possibility.
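
      A small sketch of the two cases, assuming (as discussed in this thread) that code posted with code tags has its ampersands escaped in the page source. The helper name decode_entities_once is made up, and the hash is a trimmed copy of %charcodes:

```perl
use strict;
use warnings;

# Trimmed copy of the %charcodes hash from the original script.
my %charcodes = (
    "&lt;"   => "<",
    "&gt;"   => ">",
    "&amp;"  => "&",
    "&quot;" => "\"",
);

# One pass of the same substitution the script uses. With /g the regex
# carries on after each replacement and never rescans the replaced text.
sub decode_entities_once {
    my ($line) = @_;
    $line =~ s/(&[^;]{2,6};)/(exists $charcodes{$1}) ? $charcodes{$1} : $1/eg;
    return $line;
}

# Assumed source for a call to sub "amp" posted WITH code tags:
# the ampersand is escaped, so the call survives decoding.
print decode_entities_once('my $x = &amp;amp;'), "\n";

# Assumed source for the same call posted WITHOUT code tags:
# the bare call looks like an entity and gets clobbered.
print decode_entities_once('my $x = &amp;'), "\n";
```

      The first line comes back as a working call to the sub, the second comes back as a lone '&', which is the rare failure described above.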

        in response to your update, wouldn't you actually get
        &amp;amp;
        in ASCII/escaped HTML if someone named a subroutine amp?

        To check for the above use
        /\&(amp;)\1/
        and you would catch it. Just a quick thought. :)
code-parser (Re: Using code posted on PerlMonks)
by ar0n (Priest) on Jul 11, 2000 at 00:13 UTC
    I like it.

    An idea occurred to me (rare occasion):
    vroom could have a link "view code" with any post
    containing the <code> and </code>-tags.

    The link would simply point to a CGI script (your script)
    that parses all the <code>-parts of the post and
    displays each <code> entry (separated by a line ('=' x 50 or so))
    in a plain text-file.
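
    A rough sketch of the extraction step only, not of any real PerlMonks feature. The function name extract_code_sections and the sample post are hypothetical, and a regex is used on the assumption that the tags are simple and well formed:

```perl
use strict;
use warnings;

# Pull the body of every <code>...</code> pair out of a chunk of HTML and
# join the pieces with a separator line. This is not a general HTML parser;
# it assumes plain, non-nested tags with no attributes.
sub extract_code_sections {
    my ($html) = @_;
    my @sections = $html =~ m{<code>(.*?)</code>}gis;
    return join("\n" . ('=' x 50) . "\n", @sections);
}

# Hypothetical post body with two code sections.
my $post = "Try this:<code>print 1;</code>or this:<CODE>print 2;</CODE>";
print extract_code_sections($post), "\n";
```

    The extracted sections would still contain HTML character codes, so each one would need the same decoding pass the original script performs.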

    Am I making sense?

    -- ar0n
      Well, I was thinking something along the lines of providing a code download link. However, I'm not sure of the bandwidth and processing considerations that would be involved. I understand what you're saying and it could provide a means of viewing all code sections in a thread on one page. I suppose it's another idea that will be kicked around.

      Cheers,
      Ovid

RE: Using code posted on PerlMonks
by Ovid (Cardinal) on Jul 11, 2000 at 19:54 UTC
    I've run some tests on the updated version of this code on "debug the error!!", a 500-line monstrosity posted without code tags, and "Slashdot Headline Grabber for *nix", which was posted with <CODE> tags, and it appears to run fine.

    For the program to properly parse code that was posted without code tags, it will require the following regex to be substituted for the last regex in the foreach loop.

    s/<a href="[^"]+">([^<]+)<\/a>/[$1]/g;
    Because I can't resist tinkering, I've decided to add the following enhancements (for a program that really has limited use!) in the form of a summary at the end of the script run:
    • Flag a possible code fragment (no shebang '#!').
    • Verify balanced parens, quotes, curlies, etc. This will require ensuring that I don't accidentally pull in escaped characters such as \".
    • Expand %charcodes hash to be more inclusive?
    Other suggestions would be welcome.
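
    A minimal sketch of what the first two checks might look like. The function name summarize and the warning texts are made up. It skips backslash-escaped characters but does not understand quotes, comments, or regex literals, so treat it as a rough first pass rather than a real parser:

```perl
use strict;
use warnings;

# Return a list of warning strings for a block of code:
#   - no shebang on the first line
#   - unbalanced (), [], or {}
# Backslash-escaped characters are dropped before counting.
sub summarize {
    my ($code) = @_;
    my @warnings;

    push @warnings, "Possible code fragment (no shebang '#!')"
        unless $code =~ /\A#!/;

    # Remove escaped characters such as \" or \( so they are not counted.
    (my $stripped = $code) =~ s/\\.//gs;

    my %closer_for = ('(' => ')', '[' => ']', '{' => '}');
    my %is_closer  = map { $_ => 1 } values %closer_for;
    my @stack;
    my $balanced = 1;

    for my $ch (split //, $stripped) {
        if (exists $closer_for{$ch}) {
            # Remember which closing character we expect next.
            push @stack, $closer_for{$ch};
        }
        elsif ($is_closer{$ch}) {
            my $expected = pop @stack;
            if (!defined $expected or $expected ne $ch) {
                $balanced = 0;
                last;
            }
        }
    }
    $balanced = 0 if @stack;    # something was opened and never closed

    push @warnings, "Unbalanced parens, brackets, or curlies"
        unless $balanced;

    return @warnings;
}

# Example: a fragment with no shebang and an unclosed paren.
print "$_\n" for summarize("print qq(hello;\n");
```

    Quote balancing is left out on purpose: it needs to know whether a quote sits inside a comment or a regex, which is more than a character scan like this can tell.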