lazybowel has asked for the wisdom of the Perl Monks concerning the following question:

hi,

my question is regarding filtering out bad characters...

for example i have a text file that is full of invalid and unreadable characters, and i was wondering if there is a perl module that could go through this file and filter out everything except plain text. or is there another solution?

thanks,

Re: remove bad characters
by runrig (Abbot) on Jun 26, 2007 at 22:04 UTC
    See perlre and the various character classes. Maybe you want something like: s/[[:^print:]]//g or maybe s/[[:cntrl:]]//g
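
    As a rough illustration of how the POSIX character classes behave, here is a minimal sketch. The sample string and variable names are made up for the example. One thing to watch: tab and newline are not printable, so negating [:print:] strips them too.

```perl
use strict;
use warnings;

# Hypothetical sample: printable text, a tab, a BEL (\x07), a NUL and a newline.
my $input = "abc\tdef\x07ghi\x00\n";

# Remove everything that is not printable. Tab and newline are not
# printable either, so they are removed along with BEL and NUL.
(my $printable_only = $input) =~ s/[[:^print:]]//g;

# Remove control characters but keep whitespace such as tab and newline.
(my $keep_whitespace = $input) =~ s/[^[:print:][:space:]]//g;

print "printable only:  [$printable_only]\n";
print "keep whitespace: [$keep_whitespace]\n";
```

    Which of the two you want depends on whether line breaks count as "plain text" for your purposes.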
Re: remove bad characters
by chrism01 (Friar) on Jun 26, 2007 at 22:53 UTC

      Just a note: using binmode may be important for the sake of portability. If you only ever run on Unix, it is a no-op.
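
      A small sketch of what that looks like when reading a file as raw bytes before filtering. The temporary file and its contents are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small sample file containing CRLF line endings and a high byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                      # write the bytes exactly as given
print {$out} "line one\r\nline \xE9\r\n";
close $out or die "close failed: $!";

# Read it back as raw bytes. Without binmode, Perl on Windows would
# translate CRLF to LF while reading; on Unix the call changes nothing.
open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in;
my $bytes = do { local $/; <$in> };   # slurp the whole file
close $in;

printf "read %d bytes\n", length $bytes;
```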

Re: remove bad characters
by graff (Chancellor) on Jun 27, 2007 at 03:34 UTC
    If the previous replies didn't help, you'll need to be more clear about what you consider to be "bad characters", and what falls within your range of "plain text" (this can mean different things to different people).

    If you know anything at all about the character encoding(s) that you are dealing with (what do you have as input, what do you want as output), that information could be relevant for picking a good solution, too.

    Oh, and: What have you tried, and how did it fall short of what you really wanted? Show us some sample data and some code, so we can see where you are.

Re: remove bad characters
by clinton (Priest) on Jun 27, 2007 at 10:29 UTC
    Are you sure that you're not seeing "bad characters" because you're interpreting the file with the incorrect encoding? For instance, the text file may be encoded in UTF-8, but you're reading it as ISO-8859-1.
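
    To make the effect concrete, here is a minimal sketch using the core Encode module. The sample bytes are an assumption chosen for illustration: the word "café" stored as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of "cafe" with an e-acute: the accented letter
# is stored as the two bytes C3 A9.
my $bytes = "caf\xC3\xA9";

# Decoded with the right encoding, the two bytes become one character.
my $right = decode('UTF-8', $bytes);

# Decoded as ISO-8859-1, every byte becomes its own character, so the
# accented letter shows up as two unrelated characters.
my $wrong = decode('ISO-8859-1', $bytes);

printf "UTF-8: %d characters, ISO-8859-1: %d characters\n",
    length $right, length $wrong;
```

    In that situation nothing in the file is actually "bad"; stripping the odd-looking characters would throw away real text.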

    You may want to give Encode::Guess a try, to figure out what encoding it is.
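
    A hedged sketch of how Encode::Guess might be used, with made-up sample bytes. With no extra suspects it only considers ASCII, UTF-8 and BOM-marked UTF-16/32.

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding()

# Hypothetical sample: "cafe" with an e-acute, encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# guess_encoding() returns an encoding object on success and a plain
# error string on failure, so check with ref() before using it.
my $enc = guess_encoding($bytes);
ref $enc or die "Cannot guess encoding: $enc";

my $text = $enc->decode($bytes);
printf "guessed %s, %d characters\n", $enc->name, length $text;
```

    Adding single-byte encodings such as ISO-8859-1 to the suspect list tends to make the guess ambiguous, since almost any byte sequence is valid in them, so treat the result as a hint rather than proof.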

    Clint

      thanks for the help guys, this did the trick
      $string =~ s{ ( [^\x00-\x7E] ) }{}xmsg;
      i just have to start practicing using regex, i'm really bad at it
        That's going to leave many control characters in your data. You haven't specified what a "bad character" is, so that might be OK.
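
        To illustrate, a small sketch with an invented sample string. The second substitution is only one possible definition of "plain text" (tab, LF, CR and printable ASCII), not necessarily the one you need.

```perl
use strict;
use warnings;

# Hypothetical sample: a BEL control character, a Latin-1 byte and a newline.
my $input = "ok\x07 caf\xE9\n";

# The regex from the post above: drops DEL and everything above 0x7E,
# but BEL (\x07) and the other control characters survive.
(my $ascii_only = $input) =~ s{ ( [^\x00-\x7E] ) }{}xmsg;

# Stricter variant: keep only tab, LF, CR and printable ASCII.
(my $plain = $input) =~ s/[^\x09\x0A\x0D\x20-\x7E]//g;

printf "ascii_only still has BEL: %s\n", ($ascii_only =~ /\x07/ ? "yes" : "no");
printf "plain still has BEL: %s\n",      ($plain      =~ /\x07/ ? "yes" : "no");
```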