noobie control char removal

desertman has asked for the wisdom of the Perl Monks concerning the following question:

Re: noobie control char removal by colwellj (Monk) on Nov 18, 2009 at 23:14 UTC
I think you could use a regex. Do you know what the codes are? If it's only a few expected ones, you can just substitute them out, or possibly use something like this (untested): `s/[:cntrl:]//g;` Check perlre for more info.
Re^2: noobie control char removal by desertman (Acolyte) on Nov 18, 2009 at 23:41 UTC
I tried that; it took out all the c, n, t, r characters.
Re^3: noobie control char removal by ikegami (Patriarch) on Nov 18, 2009 at 23:52 UTC
He meant `s/[[:cntrl:]]//g;` POSIX classes such as `[:cntrl:]` must be used inside a regex character class, hence the doubled brackets.
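A minimal sketch of the difference between the two forms, run on a made-up sample string (the string and the `show` helper are illustrative assumptions, not part of the original posts):

```perl
use strict;
use warnings;

# Hypothetical helper: make control characters visible as \xNN.
sub show {
    my ($text) = @_;
    $text =~ s/([[:cntrl:]])/sprintf '\\x%02X', ord $1/ge;
    return $text;
}

# Sample text containing two control characters: BEL (\x07) and ESC (\x1B).
my $sample = "control\x07 chars\x1B here";

# Unbracketed form: Perl reads [:cntrl:] as an ordinary character class
# holding the literal characters ':', 'c', 'n', 't', 'r' and 'l'.
# Perl warns about this; the warning is silenced here on purpose.
my $wrong = $sample;
{
    no warnings 'regexp';
    $wrong =~ s/[:cntrl:]//g;
}

# Bracketed form: the POSIX class sits inside a character class.
my $right = $sample;
$right =~ s/[[:cntrl:]]//g;

print 'wrong: ', show($wrong), "\n";   # letters are gone, control chars remain
print 'right: ', show($right), "\n";   # right: control chars here
```

Running the unbracketed version with `use warnings` enabled (and without the `no warnings` block) should make Perl point out the misplaced POSIX class, which is a quick way to catch this mistake.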
Re^3: noobie control char removal by colwellj (Monk) on Nov 18, 2009 at 23:52 UTC
Can you edit your post to add the data you are having trouble with? You can also try the Unicode version: `s/\p{IsCntrl}//g;`
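One caveat with both the POSIX class and the Unicode property: newline, carriage return and tab are control characters too, so a blanket substitution joins every line of the file together. A hedged sketch that keeps those three, using a hypothetical helper name and a made-up sample string:

```perl
use strict;
use warnings;

# Hypothetical helper: remove control characters but keep tab, LF and CR.
sub strip_controls_keep_whitespace {
    my ($text) = @_;
    # [^\P{Cntrl}\t\n\r] reads as "not a non-control character and not
    # tab/LF/CR", i.e. any control character other than those three.
    $text =~ s/[^\P{Cntrl}\t\n\r]//g;
    return $text;
}

my $cleaned = strip_controls_keep_whitespace(
    "line one\x00\r\nline\x1B two\tend\x7F\n"
);
# $cleaned is now "line one\r\nline two\tend\n"
print $cleaned;
```

Which characters count as worth keeping is an assumption here; adjust the list inside the brackets to taste.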
Re: noobie control char removal by ikegami (Patriarch) on Nov 18, 2009 at 23:16 UTC
I doubt they're control chars. Most likely, it's because you don't have a font installed that can handle that character. It could also represent some kind of error (e.g. a malformed character or the wrong encoding is being used by Notepad). If you want to delete the characters for which you have no font support, you'll have to be more specific concerning what those characters are.
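To test the "malformed character or wrong encoding" possibility, one option is to check whether the raw bytes decode cleanly as UTF-8. This is only a sketch: the helper name is hypothetical, and it assumes UTF-8 is the encoding you expect the file to be in.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the byte string decodes as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC stops
    # decode() from modifying $bytes in place.
    my $ok = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

# "\xC3\x89" is the UTF-8 encoding of "É"; a lone "\xC9" is its Latin-1 byte.
print is_valid_utf8("\xC3\x89ric") ? "valid\n" : "malformed\n";   # valid
print is_valid_utf8("\xC9ric")     ? "valid\n" : "malformed\n";   # malformed
```

A file that fails this check is probably in some other encoding (or damaged), which would explain boxes that have nothing to do with fonts.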
Re^2: noobie control char removal by desertman (Acolyte) on Nov 18, 2009 at 23:45 UTC
Any guidance on how to do that?
Re^3: noobie control char removal by ikegami (Patriarch) on Nov 18, 2009 at 23:54 UTC
No idea, short of printing out every character. There are over a million possible code points, though, so going through the list could take time. Isn't that kind of arbitrary? Why would you remove characters if you have no idea what those characters are? It would make more sense to find out what the character is and add support for it. You could do that as follows:

    open(my $fh, '<:encoding(UTF-8)', $ARGV[0])
        or die("Can't open input file \"$ARGV[0]\": $!\n");
    $_ = do { local $/; <$fh> };   # slurp the whole file
    # Replace everything except newline and printable ASCII with <U+XXXX>.
    s/([^\x0A\x20-\x7E])/ sprintf '<U+%04X>', ord($1) /eg;
    print;

`My name is Éric. I don't speak 日本語.` would show up as `My name is <U+00C9>ric. I don't speak <U+65E5><U+672C><U+8A9E>.`

(Replace the encoding as appropriate.)

Update: Added means of identifying characters.
Re: noobie control char removal by ww (Archbishop) on Nov 19, 2009 at 03:54 UTC
Since you're on Windows ("in notepad"), one possibility is that you're working with an MS Word .doc containing 'smart quotes' and the like (pay special attention to hyphens and apostrophes). If so, and if you've processed the document through a script that writes the result to a .txt file (or in certain other ways), you'll see "empty boxes" in the processed data in Notepad, but the unprocessed document will render with the intended chars when opened in Word.
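If the boxes do turn out to be Word-style punctuation, a rough sketch of one way to flatten it to ASCII follows. It assumes the text is encoded as Windows-1252 (cp1252), which is a guess about the poster's file, and the helper name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode Windows-1252 bytes and replace the usual
# Word "smart" punctuation with plain ASCII equivalents.
sub ascii_punctuation {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;   # curly quotes
    $text =~ s/[\x{2013}\x{2014}]/-/g;                     # en and em dashes
    $text =~ s/\x{2026}/.../g;                             # ellipsis
    return $text;
}

# In cp1252, \x93 and \x94 are curly double quotes, \x92 is a curly
# apostrophe, \x97 is an em dash and \x85 is an ellipsis.
print ascii_punctuation("\x93It\x92s fine\x94 \x97 really\x85"), "\n";
# prints: "It's fine" - really...
```

If the file is really UTF-8 rather than cp1252, only the `decode` call needs to change; the substitutions work on decoded characters either way.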
Re: noobie control char removal by graff (Chancellor) on Nov 19, 2009 at 22:47 UTC
Since you are using Notepad, it's likely that the file is just plain text (nothing freaky like an MS Word doc, Excel spreadsheet, or other hybrid binary/text thing), and as suggested above, the "empty boxes" can be either "control" characters or "real" characters that happen to be outside the range covered by whatever font Notepad is using. To get a picture of the byte values in the file (to see what might be causing those empty boxes), you could just do this:

    #!/usr/bin/perl
    # Count every character on every input line (line endings are chomped).
    while (<>) {
        chomp;
        $c{$_}++ for (split //);
    }
    # One line per distinct character: hex value, the character, its count.
    printf("%02x : %s : %d\n", ord($_), $_, $c{$_}) for (sort keys %c);

If you run that script on your text file and save the output to some other file, like this: `perl that_script < your_file.txt > char_list.txt` you can then look at the "char_list.txt" file to see which hex byte values occur in the data and show up as empty boxes in Notepad.

If the file happens to be UTF-8 Unicode, you might try this other tool, which I posted here a while back: unichist -- count/summarize characters in data. Run it like this: `perl unichist -x < your_file.txt > char_list.txt` and look at that output with Notepad. (Actually, you'll want to modify the "unichist" script so that it does `print "\x{feff}\n";` before doing anything else. This will put the "byte-order-mark" (BOM) character at the start of the output file, which will tell Notepad to treat the file as UTF-8 data.)

Once you know what byte/character values are causing the empty boxes, you'll be able to decide how to fix or remove them.
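A variation on the byte-count script above, sketched under `use strict`. Instead of printing each byte as itself, it shows non-printable bytes as a dot, so the report stays readable in Notepad. The helper name and the sample data are assumptions for illustration; unlike the script above, it also counts line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: count each byte in a string and return report lines.
sub byte_report {
    my ($data) = @_;
    my %count;
    $count{$_}++ for split //, $data;

    my @lines;
    for my $byte (sort keys %count) {
        my $code = ord $byte;
        # Show printable ASCII as itself and everything else as a dot.
        my $shown = ($code >= 0x20 && $code <= 0x7E) ? $byte : '.';
        push @lines, sprintf('%02x : %s : %d', $code, $shown, $count{$byte});
    }
    return @lines;
}

print "$_\n" for byte_report("a\x00b\x00a");
# 00 : . : 2
# 61 : a : 2
# 62 : b : 1
```

To run it on a real file, read the contents in raw mode first (for example with `open my $fh, '<:raw', $path`) so that Windows line-ending translation does not hide any bytes.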