desertman has asked for the wisdom of the Perl Monks concerning the following question:

Hello Noble Monks,

I know chomp removes the newline, but how do I get rid of all the other control characters that look like little empty boxes in Notepad?

desertmonk

Re: noobie control char removal
by colwellj (Monk) on Nov 18, 2009 at 23:14 UTC
    I think you could use a regex.
    Do you know what the codes are? If it's only a few expected ones, you can just substitute them out.
    Or possibly something like this (untested):
    s/[:cntrl:]//g;
    Check perlre for more info.
      I tried that; it took out all the c, n, t, r characters.
        He meant
        s/[[:cntrl:]]//g;

        POSIX classes must be used inside a regex character class.

        Can you edit your post to add the data you are having trouble with?
        You can also try the Unicode version:
        s/\p{IsCntrl}//g;
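
    To see why the single-bracket version ate letters, here is a minimal sketch that runs the three substitutions from this sub-thread on a made-up string. The variable names and the sample data are only for illustration.

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample: some text, a tab, a BEL (\x07), then CR/LF.
    my $sample = "control\tchars\x07\r\n";

    # Bare [:cntrl:] is an ordinary bracketed class holding the characters
    # ':', 'c', 'n', 't', 'r' and 'l', so it deletes those letters instead.
    my $wrong = $sample;
    {
        no warnings 'regexp';    # Perl warns about exactly this mistake
        $wrong =~ s/[:cntrl:]//g;
    }

    # The POSIX class has to sit inside a bracketed class: [[:cntrl:]].
    (my $posix = $sample) =~ s/[[:cntrl:]]//g;

    # The Unicode property form quoted above.
    (my $unicode = $sample) =~ s/\p{IsCntrl}//g;

    print "posix:   $posix\n";      # prints "posix:   controlchars"
    print "unicode: $unicode\n";    # prints "unicode: controlchars"
    ```

    Note that both working forms also delete the tab, carriage return, and line feed, because those are control characters too. If you want to keep line breaks, you would need a narrower class.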
Re: noobie control char removal
by ikegami (Patriarch) on Nov 18, 2009 at 23:16 UTC

    I doubt they're control chars. Most likely, it's because you don't have a font installed that can handle that character. It could also represent some kind of error (e.g. a malformed character or the wrong encoding is being used by notepad).

    If you want to delete the characters for which you have no font support, you'll have to be more specific concerning what those characters are.

      any guidance on how to do that?

        No idea, short of printing out every character. There are over a million possible code points, though, so going through the list could take time.

        Isn't that kind of arbitrary? Why would you remove characters if you have no idea what those characters are? It would make more sense to find out what the character is and add support for it.

        You could do that as follows:

        open(my $fh, '<:encoding(UTF-8)', $ARGV[0])
            or die("Can't open input file \"$ARGV[0]\": $!\n");
        $_ = do { local $/; <$fh> };    # slurp the whole file

        # Show everything except line feeds and printable ASCII as <U+XXXX>.
        s/([^\x0A\x20-\x7E])/ sprintf '<U+%04X>', ord($1) /eg;
        print;

        The input

        My name is Éric.
        I don't speak 日本語.

        would show up as

        My name is <U+00C9>ric.
        I don't speak <U+65E5><U+672C><U+8A9E>.

        (Replace the encoding as appropriate.)

        Update: Added means of identifying characters.

Re: noobie control char removal
by ww (Archbishop) on Nov 19, 2009 at 03:54 UTC
    Since you're on Windows ("in notepad"), one possibility is that you're working with an MSWord .doc containing 'smart quotes' and the like (special attention to hyphens and apostrophes).

    If so, and if you've processed the document through a script which writes the result to a .txt file (or in certain other ways), you'll see "empty boxes" in the processed data in Notepad but the unprocessed document will render with the intended chars when opened in Word.
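
    If that turns out to be the cause, one possible fix is to decode the text and swap the typographic punctuation for plain ASCII. This is only a sketch: it assumes the data is in Windows-1252 (cp1252), and the sample bytes are made up for illustration.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical input: raw bytes as they might appear in a cp1252 file.
    # \x93 and \x94 are curly double quotes, \x92 is a curly apostrophe,
    # and \x96 is an en dash.
    my $bytes = "\x93It\x92s 9\x965\x94";

    # Turn the bytes into characters (assumes the data really is cp1252).
    my $text = decode('cp1252', $bytes);

    # Map curly quotes and dashes to their plain ASCII look-alikes.
    # The hyphens are escaped so tr does not read them as a range.
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}\x{2013}\x{2014}/''""\-\-/;

    print "$text\n";    # prints "It's 9-5" including the straight double quotes
    ```

    If the file is in some other encoding, the name passed to decode would need to change accordingly.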

Re: noobie control char removal
by graff (Chancellor) on Nov 19, 2009 at 22:47 UTC
    Since you are using notepad, it's likely that the file is just plain text (nothing freaky like ms-word doc, excel spreadsheet, or other hybrid binary/text thing), and as suggested above the "empty boxes" can be either "control" characters, or "real" characters that happen to be outside the range covered by whatever font notepad is using.

    To get a picture of the byte values in the file (to see what might be causing those empty boxes), you could just do this:

    #!/usr/bin/perl
    # Count how often each byte value occurs in the input.
    while (<>) {
        chomp;
        $c{$_}++ for (split //);
    }
    printf("%02x : %s : %d\n", ord($_), $_, $c{$_}) for (sort keys %c);
    If you run that script on your text file and save the output to some other file, like this:
    perl that_script < your_file.txt > char_list.txt
    you can then look at the "char_list.txt" file to see which hex byte values occur in the data and show up as empty boxes in notepad.

    If the file happens to be utf8 unicode, you might try this other tool, which I posted here a while back: unichist -- count/summarize characters in data

    Run it like this:

    perl unichist -x < your_file.txt > char_list.txt
    and look at that output with notepad. (Actually, you'll want to modify the "unichist" script so that it does print "\x{feff}\n"; before doing anything else -- this will put the "byte-order-mark" (BOM) character at the start of the output file, which will tell notepad to treat the file as utf8 data.)
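
    The BOM step could look roughly like this. It is a sketch rather than a patch for unichist itself: write_bom is a hypothetical helper, and the sample report line is invented.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: write U+FEFF to a handle that already has a
    # UTF-8 encoding layer, so it comes out as the bytes EF BB BF.
    sub write_bom {
        my ($fh) = @_;
        print {$fh} "\x{FEFF}";
    }

    # Encode everything printed to STDOUT as UTF-8, then lead with the BOM.
    binmode STDOUT, ':encoding(UTF-8)';
    write_bom(\*STDOUT);
    print "\n";
    print "00e9 : \x{E9} : 1\n";    # made-up report line with a non-ASCII character
    ```

    The BOM has to be the very first thing written, before any other output.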

    Once you know what byte/character values are causing the empty boxes, you'll be able to decide how to fix or remove them.
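
    For instance, if the culprits turn out to be stray ASCII control characters, a targeted delete might look like the sketch below. The helper name strip_controls and the choice to keep tab, line feed, and carriage return are assumptions; adjust the ranges to whatever your character list shows.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: delete the C0 control characters and DEL, but
    # keep tab (\x09), line feed (\x0A) and carriage return (\x0D).
    sub strip_controls {
        my ($text) = @_;
        $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F\x7F//d;
        return $text;
    }

    # The NUL and ESC characters are removed; the tab and newline survive.
    print strip_controls("ab\x00c\x1Bd\te\n");
    ```

    Because tr works on a fixed list of characters, nothing outside the listed ranges is touched, so accented letters and other non-ASCII text pass through unchanged.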