frankus has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to find a regex that cleans my data by stripping out the control characters.

I could list the characters I want included, but the file is 90K lines and could have valid characters I didn't anticipate.

I need a regex that targets the ASCII characters that corrupt the data. Has anyone got a regex to suit?

Brother Frankus.

Re: I need a Regex to get my Dirty Data Whiter.
by davorg (Chancellor) on Aug 03, 2000 at 14:59 UTC

    Given that most printable characters appear in consecutive runs in most popular character sets, it's easy to build a character class that includes only them. For example, for 7-bit ASCII you can write something like this to keep only the characters between the space (0x20) and the tilde (0x7E).

    while (<>) { s/[^ -~]//g; print; }
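
One thing to watch with a class like [^ -~]: newline (0x0A) and tab (0x09) both sit below the space, so they are stripped along with the other control characters and the line structure of the file goes with them. Here is a minimal sketch that keeps them, assuming line endings and tabs count as valid data. The helper name clean_line and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character that is not printable
# 7-bit ASCII (space through tilde), a tab, or a newline.
sub clean_line {
    my ($line) = @_;
    (my $clean = $line) =~ s/[^ -~\t\n]//g;
    return $clean;
}

# Made-up sample: BEL, NUL and ESC are dropped, tab and newline survive.
my $dirty = "abc\x07def\x00\tghi\x1B\n";
print clean_line($dirty);    # prints "abcdef", a tab, "ghi", then a newline
```

To run it as a filter, something like print clean_line($_) while <>; should do. If tabs are unwanted too, drop \t from the class.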
    --
    <http://www.dave.org.uk>

    European Perl Conference - Sept 22/24 2000, ICA, London
    <http://www.yapc.org/Europe/>
Re: I need a Regex to get my Dirty Data Whiter.
by frankus (Priest) on Aug 03, 2000 at 15:26 UTC
    This looks like a good idea, and the response was lightning fast. Thanks.

    Not wishing to split hairs with a luminary and abbot, but doesn't the script remove the characters within the range?

    Brother Frankus.

      The ^ at the start of the character class negates it. So Dave's regex says that every character which is not between a space and a tilde should be replaced with nothing.
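
      A quick way to convince yourself is to run the same substitution with and without the ^ on a small string. This is only an illustrative sketch; the sample string and variable names are made up.

```perl
use strict;
use warnings;

# Made-up sample: "ok", two control characters (0x01, 0x02), then "!"
my $sample = "ok\x01\x02!";

# Negated class: deletes everything that is NOT space..tilde,
# so only the printable characters remain.
(my $kept_printable = $sample) =~ s/[^ -~]//g;

# Plain class: deletes everything that IS space..tilde,
# so only the control characters remain.
(my $kept_control = $sample) =~ s/[ -~]//g;

print "$kept_printable\n";    # ok!

# Show what is left as hex codes, since control characters are invisible.
print join(' ', map { sprintf '%02X', ord } split //, $kept_control), "\n";    # 01 02
```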
Re: I need a Regex to get my Dirty Data Whiter.
by frankus (Priest) on Aug 03, 2000 at 16:06 UTC
    There! I knew there was a reason for keeping my mouth shut :-)

    Thanx for taking time to enlighten a muppet perler like me.

    Brother Frankus.