Hello.

From the world of knows-just-enough-to-be-dangerous: I have a form field on an HTML page where users can enter information. The problem is that they often cut and paste from programs that automatically create special characters like 'smart' quotes (“ & ”), true apostrophes (’) and em dashes (—).

Wisdom on security for user input seems to be that one doesn't block what might be bad; one only allows what is known to be good. So how do I detect these special characters? I'm at a loss! (And I want these special characters to be available. I hate seeing the shorthand for 12 inches (') used as an apostrophe (’).)

This sort of thing is easy enough:
$_value =~ s/>/&gt;/g;
$_value =~ s/</&lt;/g;
But these (for a true apostrophe) fail completely:
$_value =~ s/’/&rsquo;/g;       # typed from keyboard
$_value =~ s/%92/&rsquo;/g;     # uri encoding
$_value =~ s/&rsquo;/&rsquo;/g; # should never work
Documentation for CGI.pm says that "by default, all HTML that are emitted by the form-generating functions are passed through a function called escapeHTML()." I'm seeing nothing being escaped, tho, and I don't believe I've changed any defaults:
our %_form;
our $_value;
our $_query = CGI->new();
my @_field_names = $_query->param;
foreach (@_field_names) {
    $_value = $_query->param($_);
    # convert nasty and/or special chars to html codes
    $_form{$_} = $_value;
}
Same documentation also advises that "if you manually change the charset, either by calling the charset() method explicitly or by passing a -charset argument to header(), then all characters will be replaced by their numeric entities, since CGI.pm has no lookup table for all the possible encodings." (Emphasis is mine.)

That seems a little overkill.

Please advise! Thank you.

xox,
Dead Nancy

p.s.: Sorry if this is covered somewhere obvious, but many hours spent searching this site and the web in general have turned up little.


UPDATE

Ok; the problem with apostrophes (and the like) is solved. Thanks, ysth! Now the security: here's what I've got as the code for converting form data (likely pasted from Word) into something safe to be handled:
foreach (@_field_names) {
    $_value = $_query->param($_);

    # convert special chars to html codes
    $_value =~ s/\x91/&lsquo;/g; # smart quotes
    $_value =~ s/\x92/&rsquo;/g;
    $_value =~ s/\x93/&ldquo;/g;
    $_value =~ s/\x94/&rdquo;/g;
    $_value =~ s/\x96/&ndash;/g; # dashes
    $_value =~ s/\x97/&mdash;/g;
    $_value =~ s/\x7C/&#124;/g;  # pipe
    $_value =~ s/</&lt;/g;       # brackets
    $_value =~ s/>/&gt;/g;
    $_value =~ s/{/&#123;/g;
    $_value =~ s/}/&#125;/g;

    # only allow the known good
    if ($_value =~ /([\w\s\.\@\&\ \!\'\"\-\,\/\#\:\;\(\)]+)/) {
        $_value = $1;
    } else {
        die("(Friendly error message)");
    }

    $_form{$_} = $_value;
}
Is this decent code? Is there some way to compress all those s// statements? (There are many of them, but input here is an artist's statement, and we can be creative.) Am I overlooking something horribly obvious? The if->then statement drops everything after a character it doesn't like, but that's handy for finding the point of trouble. Still, seems like it could be handled better.
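
One possible way to compress the s/// chain, offered as a sketch rather than a definitive answer: keep the byte-to-entity pairs in a hash and apply them with a single substitution. The names %entity_for and encode_special are hypothetical, and the table assumes the form data arrives as Windows-1252 bytes, as the \x91 to \x97 codes above suggest.

```perl
use strict;
use warnings;

# Hypothetical lookup table: the same bytes and entities as the
# s/// chain above (assumes Windows-1252 input).
my %entity_for = (
    "\x91" => '&lsquo;',   # smart quotes
    "\x92" => '&rsquo;',
    "\x93" => '&ldquo;',
    "\x94" => '&rdquo;',
    "\x96" => '&ndash;',   # dashes
    "\x97" => '&mdash;',
    '|'    => '&#124;',    # pipe
    '<'    => '&lt;',      # brackets
    '>'    => '&gt;',
    '{'    => '&#123;',
    '}'    => '&#125;',
);

# Build one character class from the table keys. quotemeta escapes
# anything that would otherwise be special inside the class.
my $class   = join '', map { quotemeta } keys %entity_for;
my $special = qr/[$class]/;

# Replace every listed character with its entity in a single pass.
sub encode_special {
    my ($text) = @_;
    $text =~ s/($special)/$entity_for{$1}/g;
    return $text;
}
```

Inside the loop this would become $_value = encode_special($_query->param($_)); and adding a new character means adding one line to the hash. Because the text is scanned only once, the ampersands in entities that were just inserted are never re-examined, which would matter if & itself were ever added to the table.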

Thanks again for the help! Other than the intimidation factor for n00bs, this place is an amazing resource.

DN


UPDATE

Ovid reminds us: “Here’s a good rule to remember: Always trust your users. Never trust their input.” (Thanks, rW.) I learned the hard way tonight while trying things out. I thought I’d covered all bases, but despite a very large allowance for special characters, the very first statement I tried with the code above showed that I missed the dollar sign, asterisk and ellipsis (‘…’, a single character distinct from three periods), all used legitimately. I also found that, within the same document, quotes and apostrophes were used both correctly and incorrectly in different places. Now I’m wondering if I should (attempt to) correct mistakes on the part of my users. Is there Typography::Correct?
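
To find missed characters all at once rather than one failed submission at a time, something like the following could help. It is only a sketch: the sub name rejected_chars is hypothetical, and the allowlist is the earlier one with the dollar sign and asterisk added as an assumption. The /a modifier (Perl 5.14 or later) keeps \w and \s to ASCII so that high bytes such as \x85 are always reported.

```perl
use strict;
use warnings;

# Assumed allowlist: the characters from the earlier check plus
# the dollar sign and asterisk. /a restricts \w and \s to ASCII.
my $allowed = qr/[\w\s.\@&!'"\-,\/#:;()*\$]/a;

# Return the distinct characters in $text that are not on the
# allowlist, in order of first appearance.
sub rejected_chars {
    my ($text) = @_;
    my (%seen, @bad);
    for my $char (split //, $text) {
        next if $char =~ /\A$allowed\z/;
        push @bad, $char unless $seen{$char}++;
    }
    return @bad;
}
```

Calling rejected_chars($_value) before the die would let the error message name every offending character, instead of silently keeping only the text up to the first one.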

DN

In reply to apostrophes and security by deadnancy
