Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a flat file that contains non-standard ASCII characters (by which I mean <32 or >127), and I want to strip them out. Is there a more elegant way than what I am doing now?
# Clean up funny escape characters
for $f (0..31) { $funnychar = chr($f); s/$funnychar//g; }
for $f (127..255) { $funnychar = chr($f); s/$funnychar//g; }

Replies are listed 'Best First'.
Re: Getting rid of non-standard ASCII characters
by Zaxo (Archbishop) on Dec 19, 2002 at 02:03 UTC

    Transliterate: tr/\0-\037\177-\377//d; Btw, all characters less than 128 are ASCII, you wanted to eliminate control characters - which includes newline, carriage return and tab. 'Printable ASCII' describes what you want to keep.

    Update: Ionizor, you're wrong. There are lots of extended 8-bit character sets and ms code pages, but none are ASCII, which is 7-bit.

    Update2: pg, it compiles for me, but I did err in placing leading zeros in the escaped octals, to make more than three octal digits. Repaired, and thanks.

    After Compline,
    Zaxo
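A minimal sketch of the transliteration approach, assuming you want to keep tab, newline and carriage return (as noted above, the plain control-character range deletes them too). The sub name `strip_unprintable` is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: delete control and 8-bit characters from a byte
# string, but keep tab (\x09), newline (\x0A) and carriage return (\x0D).
sub strip_unprintable {
    my ($text) = @_;
    # The replacement list is empty, so /d deletes every listed character.
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\xFF//d;
    return $text;
}

print strip_unprintable("na\xEFve\x07 text\twith tabs\r\n");
```

Since tr/// builds its character table at compile time, this should also be cheaper than running a separate substitution per character code.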

      Technically ASCII is 8 bit, so all characters less than 256 are ASCII. I believe 128 - 255 are all printable, so "Printable 7-bit ASCII" is probably more accurate.

      Nitpick, nitpick, nitpick, I know...

      Update: I'm wrong. These people are too. An explanation of ISO 646 I found here pretty much sums it up: "ASCII uses only 7 bits and allows the most significant eighth bit to be used as parity bit, highlight bit, end-of-string bit (all of which are considered bad practice nowadays) or to include additional characters for internationalization (i18n for which we need 8bit-clean programs that do none of afore-mentioned silly tricks) but ASCII defined no standard for this and many manufacturers invented their own proprietary codepages." Sorry.

      You said: "tr/\0-\037\0177-\0377//d;"


      Maybe you didn't test your solution thoroughly :-) A quick fix, including a tester, could be:
      for (0..255) { $s .= chr(); }
      $s =~ tr/\0-\37\177-\377//d; # fixed, no more leading zeroes
      print $s;
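Rather than eyeballing the printed result, the tester can compare against the characters that ought to survive. A rough sketch, assuming "printable ASCII" means space (32) through tilde (126); the variable names are illustrative.

```perl
use strict;
use warnings;

# Build a string holding every byte value from 0 to 255.
my $all = join '', map { chr($_) } 0 .. 255;

# Apply the corrected transliteration to a copy.
(my $kept = $all) =~ tr/\0-\37\177-\377//d;

# What should survive: space (32) through tilde (126).
my $expected = join '', map { chr($_) } 32 .. 126;

print $kept eq $expected ? "ok\n" : "mismatch\n";
printf "%d characters kept\n", length($kept);
```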
Re: Getting rid of non-standard ASCII characters
by BrowserUk (Patriarch) on Dec 19, 2002 at 02:04 UTC

    $s .= chr for 1..255;
    print $s;
    [raw output omitted: the control and high-bit characters came out as console glyphs that did not survive posting]
    $s =~ tr/\x20-\x7f//cd;
    print $s;
     !"#$%&'()*+,-./0123456789:;<=>?@
    ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
    abcdefghijklmnopqrstuvwxyz{|}~

    Output wrapped for posting.


    Examine what is said, not who speaks.

      Ah, nice. Something was bothering me about the previously offered transliteration, but I couldn't put my finger on it.

      Makeshifts last the longest.
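With /c (complement) and /d (delete) together, tr/// deletes everything that is not listed, so the list names what to keep rather than what to drop. A small sketch of that idea; note that a keep list ending at \x7f retains DEL (127), while one ending at \x7e drops it. Variable names are illustrative.

```perl
use strict;
use warnings;

my $all = join '', map { chr($_) } 0 .. 255;

# Keep list ending at \x7f: space through DEL survive.
(my $with_del = $all) =~ tr/\x20-\x7f//cd;

# Keep list ending at tilde: DEL is deleted as well.
(my $without_del = $all) =~ tr/\x20-\x7e//cd;

printf "%d vs %d characters kept\n", length($with_del), length($without_del);
```

Which end point is right depends on whether DEL counts as unwanted in your data; the original question's loop (127..255) suggests it does.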

Re: Getting rid of non-standard ASCII characters
by pg (Canon) on Dec 19, 2002 at 02:07 UTC
    @range = map {chr()} (0, 31, 128, 255);
    $s =~ s/[$range[0]-$range[1]$range[2]-$range[3]]//g; # no '|' between the ranges: inside [...] it is a literal pipe
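Inside a bracketed character class, `|` is an ordinary character rather than alternation, so ranges are simply listed side by side. A sketch of the same substitution written with hex escapes, which also avoids interpolating array elements into the pattern; `strip_with_regex` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: the same two ranges (0-31 and 128-255), written
# with hex escapes so nothing has to be interpolated.
sub strip_with_regex {
    my ($text) = @_;
    $text =~ s/[\x00-\x1F\x80-\xFF]//g;   # ranges sit side by side, no '|'
    return $text;
}

# For contrast: with a '|' inside the class, pipes are removed too.
(my $pipes_gone = "a|b\x01c") =~ s/[\x00-\x1F|\x80-\xFF]//g;

print strip_with_regex("a|b\x01c"), "\n";   # the pipe survives
print $pipes_gone, "\n";                    # the pipe is gone
```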
Re: Getting rid of non-standard ASCII characters
by skx (Parson) on Dec 19, 2002 at 10:59 UTC

    Without worrying about the ranges directly, why not just remove 'unprintable' characters like this:

    $text =~ s/[[:^print:]]//g
    Steve
    ---
    steve.org.uk
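      Usually because of efficiency. If you resort to a pattern, at least make it s/[[:^print:]]+//g so that each run of unwanted characters is removed in one substitution rather than one character at a time.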

      Makeshifts last the longest.
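What [[:print:]] matches can depend on whether the string is treated as bytes or as Unicode text. One way to pin it to ASCII, assuming Perl 5.14 or later, is the /a modifier. A hedged sketch; `keep_printable_ascii` is a made-up name.

```perl
use 5.014;
use warnings;

# Hypothetical helper: drop runs of anything outside printable ASCII.
# The /a flag restricts POSIX classes to the ASCII range, so
# [[:print:]] covers only space through tilde.
sub keep_printable_ascii {
    my ($text) = @_;
    $text =~ s/[[:^print:]]+//ga;
    return $text;
}

print keep_printable_ascii("caf\xE9\x00\x01 au lait\n"), "\n";
```

Note that tab and newline are not in [[:print:]], so this removes them as well.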