deadnancy has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

From the world of knows-just-enough-to-be-dangerous: I have a form field on an HTML page where users can enter information. The problem is that they often cut and paste from programs that automatically create special characters like 'smart' quotes (“ & ”), true apostrophes (’) and em dashes (—).

Wisdom on security for user input seems to be that one doesn't block what might be bad; one only allows what is known to be good. So how do I detect these special characters? I'm at a loss! (And I want these special characters to be available. I hate seeing the shorthand for 12 inches (') used as an apostrophe (’).)

This sort of thing is easy enough:
    $_value =~ s/>/&gt;/g;
    $_value =~ s/</&lt;/g;
But these (for a true apostrophe) fail completely:
    $_value =~ s/’/&rsquo;/g;        # typed from keyboard
    $_value =~ s/%92/&rsquo;/g;      # uri encoding
    $_value =~ s/&rsquo;/&rsquo;/g;  # should never work
Documentation for CGI.pm says that "by default, all HTML that are emitted by the form-generating functions are passed through a function called escapeHTML()." I'm seeing nothing being escaped, tho, and I don't believe I've changed any defaults:
    our %_form;
    our $_value;
    our $_query = CGI->new();
    my @_field_names = $_query->param;
    foreach (@_field_names) {
        $_value = $_query->param($_);
        # convert nasty and/or special chars to html codes
        $_form{$_} = $_value;
    }
Same documentation also advises that "if you manually change the charset, either by calling the charset() method explicitly or by passing a -charset argument to header(), then all characters will be replaced by their numeric entities, since CGI.pm has no lookup table for all the possible encodings." (Emphasis is mine.)

That seems a little overkill.

Please advise! Thank you.

xox,
Dead Nancy

p.s.: Sorry if this is covered somewhere obvious, but many hours spent searching this site and the web in general have turned up little.


UPDATE

Ok; the problem with apostrophes (and the like) is solved. Thanks, ysth! Now the security: here's what I've got as the code for converting form data (likely pasted from Word) into something safe to be handled:
    foreach (@_field_names) {
        $_value = $_query->param($_);

        # convert special chars to html codes
        $_value =~ s/\x91/&lsquo;/g;   # smart quotes
        $_value =~ s/\x92/&rsquo;/g;
        $_value =~ s/\x93/&ldquo;/g;
        $_value =~ s/\x94/&rdquo;/g;
        $_value =~ s/\x96/&ndash;/g;   # dashes
        $_value =~ s/\x97/&mdash;/g;
        $_value =~ s/\x7C/&#124;/g;    # pipe
        $_value =~ s/</&lt;/g;         # brackets
        $_value =~ s/>/&gt;/g;
        $_value =~ s/{/&#123;/g;
        $_value =~ s/}/&#125;/g;

        # only allow the known good
        if ($_value =~ /([\w\s\.\@\&\ \!\'\"\-\,\/\#\:\;\(\)]+)/) {
            $_value = $1;
        }
        else {
            die("(Friendly error message)");
        }
        $_form{$_} = $_value;
    }
Is this decent code? Is there some way to compress all those s// statements? (There are many of them, but input here is an artist's statement, and we can be creative.) Am I overlooking something horribly obvious? The if->then statement drops everything after a character it doesn't like, but that's handy for finding the point of trouble. Still, seems like it could be handled better.
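One way to compress the s/// list is a lookup table driven by a single substitution. The following is a minimal sketch, not a vetted sanitizer: the names %entity_for, encode_special and is_known_good are hypothetical, and it assumes the input arrives as Windows-1252 bytes, as the \x91-\x97 codes above imply. The whitelist check is anchored with \A and \z, so it rejects a value containing any disallowed character instead of silently keeping only the first run of allowed ones.

```perl
use strict;
use warnings;

# Character => HTML entity. The entries mirror the s/// list above.
my %entity_for = (
    "\x91" => '&lsquo;', "\x92" => '&rsquo;',   # smart single quotes
    "\x93" => '&ldquo;', "\x94" => '&rdquo;',   # smart double quotes
    "\x96" => '&ndash;', "\x97" => '&mdash;',   # dashes
    '|'    => '&#124;',                         # pipe
    '<'    => '&lt;',    '>'    => '&gt;',      # brackets
    '{'    => '&#123;',  '}'    => '&#125;',
);

# Build one character class from the table's keys.
my $class   = join '', map { quotemeta } keys %entity_for;
my $special = qr/[$class]/;

# Replace every character found in the table with its entity.
sub encode_special {
    my ($value) = @_;
    $value =~ s/($special)/$entity_for{$1}/g;
    return $value;
}

# True only if the WHOLE value consists of allowed characters.
sub is_known_good {
    my ($value) = @_;
    return $value =~ /\A[\w\s.\@&!'"\-,\/#:;()]+\z/ ? 1 : 0;
}

my $clean = encode_special("It\x92s <b>");
print "$clean\n";                    # It&rsquo;s &lt;b&gt;
print is_known_good($clean), "\n";   # 1
```

Adding a new character then means adding one line to the table. Note that the whitelist has to keep allowing &, # and ; because the entities themselves contain them.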

Thanks again for the help! Other than the intimidation factor for n00bs, this place is an amazing resource.

DN


UPDATE

Ovid reminds us: “Here’s a good rule to remember: Always trust your users. Never trust their input.” (Thanks, rW.) I learned the hard way tonight while trying things out. I thought I’d covered all bases, but despite a very large allowance for special characters, the very first statement I tried with the code above showed that I missed the dollar sign, the asterisk and the ellipsis (‘…’, a single character, different from three periods), all used legitimately. I also found that, within the same document, quotes and apostrophes were used both correctly and incorrectly in different places. Now I’m wondering if I should (attempt to) correct mistakes on the part of my users. Is there Typography::Correct?

DN

Re: apostrophes and security
by rinceWind (Monsignor) on Aug 15, 2005 at 12:14 UTC
    Wisdom on security for user input seems to be that one doesn't block what might be bad; one only allows what is known to be good.

    For a good introduction to the security issues and reasons why this is accepted wisdom, check out ovid's CGI course.

    One point is that the CGI form-passing mechanism escapes most of the "nasty" characters: they are turned into %xx, where xx is a hex number. CGI.pm turns these back into the original characters transparently when it parses the request, so param() returns the decoded value.
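To make the %xx step concrete, here is a simplified sketch of form decoding. It is not CGI.pm's actual code, and percent_decode is a hypothetical name. It shows why searching the value returned by param() for the text %92 finds nothing: by then the value holds the single byte 0x92.

```perl
use strict;
use warnings;

# Simplified form decoding: '+' means space, %xx is a hex-encoded byte.
sub percent_decode {
    my ($encoded) = @_;
    $encoded =~ tr/+/ /;
    $encoded =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $encoded;
}

my $decoded = percent_decode("It%92s+fine");
print $decoded eq "It\x92s fine" ? "decoded to byte 0x92\n" : "unexpected\n";
```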

    Same documentation also advises that "if you manually change the charset, either by calling the charset() method explicitly or by passing a -charset argument to header(), then all characters will be replaced by their numeric entities, since CGI.pm has no lookup table for all the possible encodings."

    This is a caveat for those using UTF-8 encoding. CGI.pm predates this functionality in perl.

    Hope this helps

    --

    Oh Lord, won’t you burn me a Knoppix CD ?
    My friends all rate Windows, I must disagree.
    Your powers of persuasion will set them all free,
    So oh Lord, won’t you burn me a Knoppix CD ?
    (Misquoting Janis Joplin)

Re: apostrophes and security
by ysth (Canon) on Aug 15, 2005 at 12:58 UTC
    But these (for a true apostrophe) fail completely:
        $_value =~ s/’/&rsquo;/g;        # typed from keyboard
        $_value =~ s/%92/&rsquo;/g;      # uri encoding
        $_value =~ s/&rsquo;/&rsquo;/g;  # should never work
    Try: $_value =~ s/\x92/&rsquo;/g; though I'm surprised your first didn't work.
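One possible explanation for the first attempt failing is an encoding mismatch between the script file and the submitted data. This is a guess, not something the thread confirms. The sketch below assumes the form delivered the Windows-1252 byte 0x92 while the script was saved as UTF-8, where a literal ’ is the three bytes E2 80 99. count_matches is a hypothetical helper.

```perl
use strict;
use warnings;

# What the form delivered (assumption): a single Windows-1252 byte 0x92.
my $submitted = "It\x92s";

# What a literal right single quote becomes in a UTF-8 script file
# (assumption): the three bytes E2 80 99.
my $utf8_literal = "\xE2\x80\x99";

# Count how often $needle occurs literally in $text.
sub count_matches {
    my ($text, $needle) = @_;
    my $count = () = $text =~ /\Q$needle\E/g;
    return $count;
}

printf "UTF-8 literal matches: %d\n", count_matches($submitted, $utf8_literal);  # 0
printf "\\x92 matches:          %d\n", count_matches($submitted, "\x92");         # 1
```

Writing the byte as \x92 sidesteps the question of how the editor saved the file.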
      YES!

      Exactly what I was looking for! It just never occurred to me to look into how perl deals with hex despite URI encoding being all about it...

      THANK YOU!

      DN
Re: apostrophes and security
by Roger (Parson) on Aug 15, 2005 at 12:23 UTC
Re: apostrophes and security
by trammell (Priest) on Aug 15, 2005 at 15:15 UTC
    One method of deleting all undesirable characters:
    $value =~ tr/A-Za-z0-9_//cd; # delete non-\w chars
      Thanks, trammell, but that regex kills ampersands and semicolons, both necessary for what I want to do.
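The same tr///cd idiom can keep additional characters by listing them in the search list. A small sketch, with keep_safe as a hypothetical name and the kept set (word characters, ampersand, semicolon, space) chosen only for illustration:

```perl
use strict;
use warnings;

# Delete every character that is NOT in the list:
# c = complement the list, d = delete what is matched.
sub keep_safe {
    my ($value) = @_;
    $value =~ tr/A-Za-z0-9_&; //cd;
    return $value;
}

print keep_safe("R&amp;D <b>rocks</b>!"), "\n";   # R&amp;D brocksb
```

As the output shows, this strips the angle brackets but leaves the tag names behind, so it deletes characters without rejecting the input.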
Re: apostrophes and security
by wfsp (Abbot) on Aug 18, 2005 at 11:42 UTC
    Sorry for coming in late to this.

    I had a not dissimilar task myself here.

    Note the points made about the character range x80-x9F.
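For context on that range: in ISO-8859-1 the bytes x80-x9F are control characters, while Windows-1252 puts printable characters such as smart quotes and dashes there. The following sketch assumes the form bytes really are Windows-1252 (an assumption about the browser and page charset) and uses the core Encode module to turn everything outside printable ASCII into numeric entities. cp1252_to_entities is a hypothetical name, and the sketch does not escape &, < or >.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Treat the raw bytes as Windows-1252, then replace every character
# outside printable ASCII (tab, CR and LF are kept) with a numeric entity.
sub cp1252_to_entities {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);
    $text =~ s/([^\x20-\x7E\n\r\t])/sprintf('&#%d;', ord $1)/ge;
    return $text;
}

print cp1252_to_entities("It\x92s \x93quoted\x94 \x97 done"), "\n";
# It&#8217;s &#8220;quoted&#8221; &#8212; done
```

This avoids maintaining a per-character list, at the cost of less readable entities in the stored text.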

    Hope this helps,

    John