Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a CGI script which has a text field. Users enter some text into the field, and the script writes it to a file. Plain and simple. However, one of my users apparently copied and pasted some text containing odd characters, which later caused problems for another script that reads the files. I ran "od -c <filename>" on the file and found it contained characters with ASCII code 225 (see http://www.lookuptables.com/ for an ASCII chart; 225 is the beta symbol in the extended table). I'd like to adjust my script to check for that character (or others), but without making it dependent on od being installed on the system. How can I check a block of text for a particular ASCII code like 225?

Replies are listed 'Best First'.
Re: Detecting Strange Characters in Text?
by jacques (Priest) on Jun 16, 2005 at 17:10 UTC
    I am probably over-reliant on this one regex, which converts the non-ASCII characters in Latin-1 data to numeric HTML entities:
    s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;
    You can adapt it to your needs. I think you can also try HTML::Entities.
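A minimal sketch of the substitution above wrapped in a function, assuming the input is a string of Latin-1 bytes. The name encode_non_ascii is hypothetical, and the character class is a variation on the one above: it leaves tab, LF and CR untouched and also encodes DEL (0x7F).

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable ASCII
# (other than tab, LF and CR) with a numeric character reference.
sub encode_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x09\x0A\x0D\x20-\x7E])/'&#' . ord($1) . ';'/ge;
    return $text;
}

print encode_non_ascii("caf\xE9 au lait\n");    # prints "caf&#233; au lait"
```

The /e modifier is what makes the replacement side run as Perl code; without it the literal text `'&#' . ord($1) . ';'` would be inserted.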
Re: Detecting Strange Characters in Text?
by Fletch (Bishop) on Jun 16, 2005 at 17:08 UTC

    Technically ASCII is 7-bit, so you can't have an ASCII character with a decimal value greater than 127 (DEL)</pedant>

    At any rate, you could always use tr/// to delete them all or convert them to a printable character, or perhaps an s///e if you wanted to get fancier and substitute, say, "0x##" instead. See perldoc perlop and/or perldoc perlretut.
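The three options mentioned above could look roughly like this. It is only a sketch: the function names are made up for illustration, and it assumes the text is a byte string in which anything from 0x80 to 0xFF counts as unwanted.

```perl
use strict;
use warnings;

# Delete every high-bit byte (tr with the /d modifier).
sub delete_high_bytes {
    my ($text) = @_;
    $text =~ tr/\x80-\xFF//d;
    return $text;
}

# Replace every high-bit byte with a printable placeholder.
# The single '?' is reused for the whole search range.
sub mask_high_bytes {
    my ($text) = @_;
    $text =~ tr/\x80-\xFF/?/;
    return $text;
}

# Replace every high-bit byte with its value written as "0x##".
sub hexify_high_bytes {
    my ($text) = @_;
    $text =~ s/([\x80-\xFF])/sprintf('0x%02X', ord($1))/ge;
    return $text;
}

my $sample = "na\xEFve";
print delete_high_bytes($sample), "\n";    # prints "nave"
print mask_high_bytes($sample),   "\n";    # prints "na?ve"
print hexify_high_bytes($sample), "\n";    # prints "na0xEFve"
```

tr/// is the cheaper choice when a fixed replacement is enough; s///e is only needed when the replacement has to be computed per character.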


      Correct :) However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit).

      Jason L. Froebe


        However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit).
        <pedantic_mode>

        Which industry?

        ASCII is 7 bit, as specified in ANSI X3.4-1986.

        There are a number of 8 bit character sets that are rather similar to ASCII in their first 128 characters, but there is no one official 'extended ASCII'. There are extended versions of ASCII, such as Latin-1, MacRoman, Windows-1252, etc, but not a single one of them is consistent with each other, and not a single one of them is ASCII.

        Calling Windows-1252 the 'industry ASCII extension', because it has all of the ASCII characters would be like calling Spanglish the 'standard English extension'. What about Australian? Chicano? Texan? Yes, they all have common roots, and many similarities, and if you knew some other dialect, you could probably figure out most of what the other person was saying, but there is no one that can claim to be the primary extension.

        </pedantic_mode>

        (This rant comes from years of dealing with e-mail support, and having to deal with people putting 'smiley face' characters in the subject line, which did bad things to an ANSI terminal or to modems with software flow control, and then having to deal with it all over again when Netscape and IE decided that '&#xxx;' was a good way to represent characters, never mind that Mac, Unix, and Windows machines all displayed different characters unless you stuck to specific ranges ... but MS Word can 'save as HTML' and you can keep your curly quotes! (So long as you're the one who looks at the page, so you'll never understand that other people aren't seeing the same thing displayed on their screen.))
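To make the point about incompatible 8-bit character sets concrete, here is a small sketch that decodes the byte from the original question (225, or 0xE1) under a few encodings using the core Encode module. The helper name code_point_for is hypothetical, and the list of encodings is just a sample.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = "\xE1";    # decimal 225

# Hypothetical helper: the Unicode code point a single byte maps to
# when interpreted in the given encoding.
sub code_point_for {
    my ($encoding, $octets) = @_;
    return ord(decode($encoding, $octets));
}

for my $encoding (qw(iso-8859-1 cp1252 cp437 MacRoman)) {
    printf "%-10s U+%04X\n", $encoding, code_point_for($encoding, $byte);
}
```

Latin-1 and Windows-1252 agree on a-acute (U+00E1) here, the old IBM PC code page 437 has sharp s (U+00DF, the glyph that many "extended ASCII" charts label as beta), and MacRoman has a middle dot (U+00B7). Which one a pasted byte "really" is depends entirely on where it came from.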

Re: Detecting Strange Characters in Text?
by Transient (Hermit) on Jun 16, 2005 at 17:01 UTC
    You probably want to decide what you want to keep, rather than what you want to throw away, as the latter will probably be huge.

    See perlre. You can use a regexp to pull out a range of acceptable ASCII values.
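One way to apply the "decide what to keep" idea to detection, without relying on od: a negated character class describes the whitelist, and ord reports the code of anything that falls outside it. This is a sketch only; find_disallowed is a hypothetical name, and the whitelist chosen here (tab, LF, CR, and printable ASCII) is an assumption to adjust as needed.

```perl
use strict;
use warnings;

# Hypothetical helper: return an [offset, code] pair for every character
# that is not on the whitelist (tab, LF, CR, printable ASCII 0x20-0x7E).
sub find_disallowed {
    my ($text) = @_;
    my @hits;
    while ($text =~ /([^\x09\x0A\x0D\x20-\x7E])/g) {
        # pos() is the offset just past the match, so step back by one.
        push @hits, [ pos($text) - 1, ord($1) ];
    }
    return @hits;
}

for my $hit (find_disallowed("be\xE1ta\n")) {
    printf "offset %d: code %d\n", @$hit;    # prints "offset 2: code 225"
}
```

A CGI script could reject or clean the submission whenever the returned list is non-empty, and log the offsets to see what users are actually pasting.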
      Actually I think in this case the allowable set will probably be larger. So to get rid of the 225, I can just run something like
      $text =~ s/\xE1//g;
      ?
        #!/usr/bin/perl
        use warnings;
        use strict;

        my $text = "ßeta";
        print $text, "\n";
        $text =~ s/\xDF//;
        print $text, "\n";

        __OUTPUT__
        ßeta
        eta
        Although it would be more robust to follow something similar to what jacques suggested.
Re: Detecting Strange Characters in Text?
by TedPride (Priest) on Jun 16, 2005 at 21:54 UTC
    The following list of codes is missing a few, and I've only included the ones for ASCII values 128+, but it should give you an idea of how to do this. Just add or remove codes as necessary. I generated my listing by parsing a random "HTML Character Codes" Google search result.
    use strict;
    use warnings;

    my %codes = (
        '128' => ['&#196;', '&auml;'],     '129' => ['&#197;', '&aring;'],
        '130' => ['&#199;', '&ccedil;'],   '131' => ['&#201;', '&eacute;'],
        '132' => ['&#209;', '&ntilde;'],   '133' => ['&#214;', '&ouml;'],
        '134' => ['&#220;', '&uuml;'],     '135' => ['&#225;', '&aacute;'],
        '136' => ['&#224;', '&agrave;'],   '137' => ['&#226;', '&acirc;'],
        '138' => ['&#228;', '&auml;'],     '139' => ['&#227;', '&atilde;'],
        '140' => ['&#229;', '&aring;'],    '141' => ['&#231;', '&ccedil;'],
        '142' => ['&#233;', '&eacute;'],   '143' => ['&#232;', '&egrave;'],
        '144' => ['&#234;', '&ecirc;'],    '145' => ['&#235;', '&euml;'],
        '146' => ['&#237;', '&iacute;'],   '147' => ['&#236;', '&igrave;'],
        '148' => ['&#238;', '&icirc;'],    '149' => ['&#239;', '&iuml;'],
        '150' => ['&#241;', '&ntilde;'],   '151' => ['&#243;', '&oacute;'],
        '152' => ['&#242;', '&ograve;'],   '153' => ['&#244;', '&ocirc;'],
        '154' => ['&#246;', '&ouml;'],     '155' => ['&#245;', '&otilde;'],
        '156' => ['&#250;', '&uacute;'],   '157' => ['&#249;', '&ugrave;'],
        '158' => ['&#251;', '&ucirc;'],    '159' => ['&#252;', '&uuml;'],
        '160' => ['&#8224;', '&dagger;'],  '161' => ['&#978;', '&upsih;'],
        '162' => ['&#8242;', '&prime;'],   '163' => ['&#163;', '&pound;'],
        '164' => ['&#167;', '&sect;'],     '165' => ['&#8226;', '&bull;'],
        '166' => ['&#182;', '&para;'],     '167' => ['&#9827;', '&clubs;'],
        '168' => ['&#9830;', '&diams;'],   '169' => ['&#9829;', '&hearts;'],
        '170' => ['&#9824;', '&spades;'],  '171' => ['&#8596;', '&harr;'],
        '172' => ['&#8592;', '&larr;'],    '173' => ['&#8800;', '&ne;'],
        '174' => ['&#8594;', '&rarr;'],    '175' => ['&#8595;', '&darr;'],
        '176' => ['&#8734;', '&infin;'],   '177' => ['&#177;', '&plusmn;'],
        '178' => ['&#8804;', '&le;'],      '179' => ['&#8805;', '&ge;'],
        '180' => ['&#215;', '&times;'],    '181' => ['&#8733;', '&prop;'],
        '182' => ['&#8706;', '&part;'],    '183' => ['&#8721;', '&sum;'],
        '184' => ['&#8719;', '&prod;'],    '185' => ['&#960;', '&pi;'],
        '186' => ['&#8801;', '&equiv;'],   '187' => ['&#170;', '&ordf;'],
        '188' => ['&#186;', '&ordm;'],     '189' => ['&#937;', '&omega;'],
        '190' => ['&#230;', '&aelig;'],    '191' => ['&#8629;', '&crarr;'],
        '192' => ['&#8501;', '&alefsym;'], '193' => ['&#8465;', '&image;'],
        '194' => ['&#8476;', '&real;'],    '195' => ['&#8730;', '&radic;'],
        '196' => ['&#8855;', '&otimes;'],  '197' => ['&#8853;', '&oplus;'],
        '198' => ['&#8709;', '&empty;'],   '199' => ['&#8745;', '&cap;'],
        '200' => ['&#8746;', '&cup;'],     '201' => ['&#8835;', '&sup;'],
        '202' => ['&#160;', '&nbsp;'],     '203' => ['&#8836;', '&nsub;'],
        '204' => ['&#8834;', '&sub;'],     '205' => ['&#8838;', '&sube;'],
        '206' => ['&#8712;', '&isin;'],    '207' => ['&#8713;', '&notin;'],
        '208' => ['&#8736;', '&ang;'],     '209' => ['&#8711;', '&nabla;'],
        '210' => ['&#8220;', '&ldquo;'],   '211' => ['&#8221;', '&rdquo;'],
        '212' => ['&#8216;', '&lsquo;'],   '213' => ['&#8217;', '&rsquo;'],
        '214' => ['&#247;', '&divide;'],   '215' => ['&#9674;', '&loz;'],
        '216' => ['&#255;', '&yuml;'],     '217' => ['&#8743;', '&and;'],
        '218' => ['&#8744;', '&or;'],      '219' => ['&#8660;', '&harr;'],
        '220' => ['&#8656;', '&larr;'],    '221' => ['&#8657;', '&uarr;'],
        '222' => ['&#8658;', '&rarr;'],    '223' => ['&#8659;', '&darr;'],
        '224' => ['&#8225;', '&dagger;'],  '225' => ['&#9001;', '&lang;'],
        '226' => ['&#8218;', '&sbquo;'],   '227' => ['&#8222;', '&bdquo;'],
        '228' => ['&#8240;', '&permil;'],  '229' => ['&#194;', '&acirc;'],
        '230' => ['&#202;', '&ecirc;'],    '231' => ['&#193;', '&aacute;'],
        '232' => ['&#203;', '&euml;'],     '233' => ['&#200;', '&egrave;'],
        '234' => ['&#205;', '&iacute;'],   '235' => ['&#206;', '&icirc;'],
        '236' => ['&#207;', '&iuml;'],     '237' => ['&#204;', '&igrave;'],
        '238' => ['&#211;', '&oacute;'],   '239' => ['&#212;', '&ocirc;'],
        '241' => ['&#9002;', '&rang;'],    '242' => ['&#218;', '&uacute;'],
        '243' => ['&#219;', '&ucirc;'],    '244' => ['&#217;', '&ugrave;'],
        '246' => ['&#710;', '&circ;'],     '247' => ['&#732;', '&tilde;'],
        '248' => ['&#175;', '&macr;'],     '252' => ['&#184;', '&cedil;'],
        '255' => ['&#353;', '&scaron;'],
    );

    my $text = '™£¢??§¶•ª';
    $text =~ s/([\x80-\xFF])/$codes{ord($1)}[1]/g;
    print $text;