Detecting Strange Characters in Text?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Detecting Strange Characters in Text? by jacques (Priest) on Jun 16, 2005 at 17:10 UTC
I am probably over-reliant on this one regex, which turns the non-ASCII characters in Latin-1 data into numeric HTML entities: `s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;` You can modify it for your means. I think you can also try HTML::Entities.
Re: Detecting Strange Characters in Text? by Fletch (Bishop) on Jun 16, 2005 at 17:08 UTC
<pedant>Technically ASCII is 7-bit, so you can't have an ASCII character with a decimal value greater than 127 (`DEL`)</pedant> At any rate, you could always use `tr///` to convert them all to a printable character or to delete them; or perhaps an `s///e` if you wanted to get fancier and substitute, say, "`0x##`" instead. See `perldoc perlop` and/or `perldoc perlretut`.
-- We're looking for people in ATL
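A minimal sketch of the three options mentioned above (`tr///` to convert, `tr///` to delete, `s///e` to substitute "0x##"). The sub names are made up for illustration, and "printable" is assumed to mean 0x20 through 0x7E, so newlines and tabs are treated as strange too; widen the range if that is not what you want.

```perl
use strict;
use warnings;

# Hypothetical helper names; "printable" is assumed to be 0x20-0x7E.

# Convert every character outside the printable range to '?'.
sub mask_non_ascii {
    my ($s) = @_;
    (my $out = $s) =~ tr/\x20-\x7E/?/c;    # /c complements the search list
    return $out;
}

# Delete every character outside the printable range.
sub strip_non_ascii {
    my ($s) = @_;
    (my $out = $s) =~ tr/\x20-\x7E//cd;    # /d deletes the matched characters
    return $out;
}

# Substitute a visible "0x##" marker for each such character.
sub hex_escape_non_ascii {
    my ($s) = @_;
    (my $out = $s) =~ s/([^\x20-\x7E])/sprintf('0x%02X', ord($1))/ge;
    return $out;
}

my $sample = "caf\xE9 \xDFeta";
print mask_non_ascii($sample),       "\n";    # caf? ?eta
print strip_non_ascii($sample),      "\n";    # caf eta
print hex_escape_non_ascii($sample), "\n";    # caf0xE9 0xDFeta
```

The `tr///` versions are the cheapest; the `s///e` version is the one to reach for when you need to see which characters were there.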
Re^2: Detecting Strange Characters in Text? by jfroebe (Parson) on Jun 16, 2005 at 17:59 UTC
Correct :) However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit).
Jason L. Froebe
Team Sybase member
No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1
Re^3: Detecting Strange Characters in Text? by jhourcle (Prior) on Jun 17, 2005 at 03:36 UTC
"However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit)"

`<pedantic_mode>` Which industry? ASCII is 7-bit, as specified in ANSI X3.4-1986. There are a number of 8-bit character sets that are rather similar to ASCII in their first 128 characters, but there is no one official 'extended ASCII'. There are extended versions of ASCII, such as Latin-1, MacRoman, Windows-1252, etc., but no two of them are consistent with each other, and not a single one of them is ASCII.

Calling Windows-1252 the 'industry ASCII extension' because it has all of the ASCII characters would be like calling Spanglish the 'standard English extension'. What about Australian? Chicano? Texan? Yes, they all have common roots and many similarities, and if you knew some other dialect, you could probably figure out most of what the other person was saying, but none of them can claim to be the primary extension. `</pedantic_mode>`

(This rant comes from years of dealing with e-mail support, and having to deal with people putting 'smiley face' characters in the subject line, which did bad things to an ANSI terminal or to modems with software flow control, and then having to deal with it all over again when Netscape and IE decided that '&#xxx;' was a good way to represent characters, never mind that Mac, Unix, and Windows machines all displayed different characters unless you stuck to specific ranges ... but MS Word can 'save as HTML' and you can keep your curly quotes! (So long as you're the one who looks at the page, so you'll never understand that other people aren't seeing the same thing displayed on their screen.))
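To make the "no single extended ASCII" point concrete, here is a small sketch using the core Encode module to ask what one and the same byte means in three 8-bit character sets. The sub name is hypothetical, and the byte 0x93 is just a convenient example (it is the opening curly quote in Windows-1252).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: the Unicode code point a single byte maps to
# in the named 8-bit encoding.
sub code_point_for {
    my ($encoding, $byte) = @_;
    my $octets = chr($byte);
    return ord(decode($encoding, $octets));
}

# Same byte, three encodings, three different characters.
for my $encoding (qw(iso-8859-1 cp1252 MacRoman)) {
    printf "%-10s 0x93 -> U+%04X\n", $encoding, code_point_for($encoding, 0x93);
}

# Bytes below 0x80 agree everywhere, which is the only ASCII part.
for my $encoding (qw(iso-8859-1 cp1252 MacRoman)) {
    printf "%-10s 0x41 -> U+%04X\n", $encoding, code_point_for($encoding, 0x41);
}
```

If you do not know which of these encodings your text is in, no amount of regex cleverness will tell you what a byte above 127 was supposed to be.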
Re: Detecting Strange Characters in Text? by Transient (Hermit) on Jun 16, 2005 at 17:01 UTC
You probably want to decide what you want to keep, rather than what you want to throw away, as the latter will probably be huge. See perlre. You can use a regexp to pull out a range of acceptable ASCII values.
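A rough sketch of the whitelist approach, assuming the acceptable set is printable ASCII plus tab, newline, and carriage return (adjust to taste). The sub names are invented for the example. The first sub reports where the strange characters are; the second keeps only what is on the whitelist.

```perl
use strict;
use warnings;

# Assumed whitelist: printable ASCII plus tab, newline, carriage return.
# Anything that matches this pattern is "strange".
my $STRANGE = qr/[^\x20-\x7E\t\n\r]/;

# Hypothetical helper: list each strange character with its offset and code.
sub find_strange {
    my ($text) = @_;
    my @hits;
    while ($text =~ /($STRANGE)/g) {
        push @hits, { offset => pos($text) - 1, code => ord($1) };
    }
    return @hits;
}

# Hypothetical helper: keep only the whitelisted characters.
sub keep_allowed {
    my ($text) = @_;
    (my $out = $text) =~ s/$STRANGE//g;
    return $out;
}

my $text = "na\xEFve caf\xE9\n";
for my $hit (find_strange($text)) {
    printf "offset %d: character code %d (0x%02X)\n",
        $hit->{offset}, $hit->{code}, $hit->{code};
}
print keep_allowed($text);
```

Reporting before deleting is usually worth the extra step, since it tells you whether the allowable set needs to grow.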
Re^2: Detecting Strange Characters in Text? by Anonymous Monk on Jun 16, 2005 at 17:07 UTC
Actually I think in this case the allowable set will probably be larger. So to get rid of the 225, I can just run something like `$text =~ s/\xE1//g;`?
Re^3: Detecting Strange Characters in Text? by Transient (Hermit) on Jun 16, 2005 at 17:16 UTC
```
#!/usr/bin/perl
use warnings;
use strict;

# Assumes this script is saved as Latin-1, so "ß" is the single byte 0xDF.
my $text = "ßeta";
print $text, "\n";
$text =~ s/\xDF//;    # removes the first 0xDF; add /g to remove them all
print $text, "\n";

__END__
Output:
ßeta
eta
```

Although it would be more robust to follow something similar to what jacques suggested.
Re: Detecting Strange Characters in Text? by TedPride (Priest) on Jun 16, 2005 at 21:54 UTC
The following list of codes is missing a few, and I've only included the ones for values 128 and above, but it should give you an idea of how to do this. Just add or remove codes as necessary. I generated my listing by parsing a random "HTML Character Codes" Google search result.

```
use strict;
use warnings;

# Each entry is [character, HTML entity].
my %codes = (
    '128' => ['Ä', '&auml;'],    '129' => ['Å', '&aring;'],
    '130' => ['Ç', '&ccedil;'],  '131' => ['É', '&eacute;'],
    '132' => ['Ñ', '&ntilde;'],  '133' => ['Ö', '&ouml;'],
    '134' => ['Ü', '&uuml;'],    '135' => ['á', '&aacute;'],
    '136' => ['à', '&agrave;'],  '137' => ['â', '&acirc;'],
    '138' => ['ä', '&auml;'],    '139' => ['ã', '&atilde;'],
    '140' => ['å', '&aring;'],   '141' => ['ç', '&ccedil;'],
    '142' => ['é', '&eacute;'],  '143' => ['è', '&egrave;'],
    '144' => ['ê', '&ecirc;'],   '145' => ['ë', '&euml;'],
    '146' => ['í', '&iacute;'],  '147' => ['ì', '&igrave;'],
    '148' => ['î', '&icirc;'],   '149' => ['ï', '&iuml;'],
    '150' => ['ñ', '&ntilde;'],  '151' => ['ó', '&oacute;'],
    '152' => ['ò', '&ograve;'],  '153' => ['ô', '&ocirc;'],
    '154' => ['ö', '&ouml;'],    '155' => ['õ', '&otilde;'],
    '156' => ['ú', '&uacute;'],  '157' => ['ù', '&ugrave;'],
    '158' => ['û', '&ucirc;'],   '159' => ['ü', '&uuml;'],
    '160' => ['†', '&dagger;'],  '161' => ['ϒ', '&upsih;'],
    '162' => ['′', '&prime;'],   '163' => ['£', '&pound;'],
    '164' => ['§', '&sect;'],    '165' => ['•', '&bull;'],
    '166' => ['¶', '&para;'],    '167' => ['♣', '&clubs;'],
    '168' => ['♦', '&diams;'],   '169' => ['♥', '&hearts;'],
    '170' => ['♠', '&spades;'],  '171' => ['↔', '&harr;'],
    '172' => ['←', '&larr;'],    '173' => ['≠', '&ne;'],
    '174' => ['→', '&rarr;'],    '175' => ['↓', '&darr;'],
    '176' => ['∞', '&infin;'],   '177' => ['±', '&plusmn;'],
    '178' => ['≤', '&le;'],      '179' => ['≥', '&ge;'],
    '180' => ['×', '&times;'],   '181' => ['∝', '&prop;'],
    '182' => ['∂', '&part;'],    '183' => ['∑', '&sum;'],
    '184' => ['∏', '&prod;'],    '185' => ['π', '&pi;'],
    '186' => ['≡', '&equiv;'],   '187' => ['ª', '&ordf;'],
    '188' => ['º', '&ordm;'],    '189' => ['Ω', '&omega;'],
    '190' => ['æ', '&aelig;'],   '191' => ['↵', '&crarr;'],
    '192' => ['ℵ', '&alefsym;'], '193' => ['ℑ', '&image;'],
    '194' => ['ℜ', '&real;'],    '195' => ['√', '&radic;'],
    '196' => ['⊗', '&otimes;'],  '197' => ['⊕', '&oplus;'],
    '198' => ['∅', '&empty;'],   '199' => ['∩', '&cap;'],
    '200' => ['∪', '&cup;'],     '201' => ['⊃', '&sup;'],
    '202' => [' ', '&nbsp;'],    '203' => ['⊄', '&nsub;'],
    '204' => ['⊂', '&sub;'],     '205' => ['⊆', '&sube;'],
    '206' => ['∈', '&isin;'],    '207' => ['∉', '&notin;'],
    '208' => ['∠', '&ang;'],     '209' => ['∇', '&nabla;'],
    '210' => ['“', '&ldquo;'],   '211' => ['”', '&rdquo;'],
    '212' => ['‘', '&lsquo;'],   '213' => ['’', '&rsquo;'],
    '214' => ['÷', '&divide;'],  '215' => ['◊', '&loz;'],
    '216' => ['ÿ', '&yuml;'],    '217' => ['∧', '&and;'],
    '218' => ['∨', '&or;'],      '219' => ['⇔', '&harr;'],
    '220' => ['⇐', '&larr;'],    '221' => ['⇑', '&uarr;'],
    '222' => ['⇒', '&rarr;'],    '223' => ['⇓', '&darr;'],
    '224' => ['‡', '&dagger;'],  '225' => ['〈', '&lang;'],
    '226' => ['‚', '&sbquo;'],   '227' => ['„', '&bdquo;'],
    '228' => ['‰', '&permil;'],  '229' => ['Â', '&acirc;'],
    '230' => ['Ê', '&ecirc;'],   '231' => ['Á', '&aacute;'],
    '232' => ['Ë', '&euml;'],    '233' => ['È', '&egrave;'],
    '234' => ['Í', '&iacute;'],  '235' => ['Î', '&icirc;'],
    '236' => ['Ï', '&iuml;'],    '237' => ['Ì', '&igrave;'],
    '238' => ['Ó', '&oacute;'],  '239' => ['Ô', '&ocirc;'],
    '241' => ['〉', '&rang;'],    '242' => ['Ú', '&uacute;'],
    '243' => ['Û', '&ucirc;'],   '244' => ['Ù', '&ugrave;'],
    '246' => ['ˆ', '&circ;'],    '247' => ['˜', '&tilde;'],
    '248' => ['¯', '&macr;'],    '252' => ['¸', '&cedil;'],
    '255' => ['š', '&scaron;'],
);

my $text = '™£¢??§¶•ª';
# Replace each high byte with its entity; leave bytes that have no entry
# in %codes untouched instead of substituting undef.
$text =~ s/([\x80-\xFF])/exists $codes{ord($1)} ? $codes{ord($1)}[1] : $1/ge;
print $text;
```
Re^2: Detecting Strange Characters in Text? by Anonymous Monk on Jun 16, 2005 at 22:06 UTC
HTML::Entities
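For what it's worth, a minimal sketch of what using HTML::Entities might look like here. The module is not part of core Perl (it ships with the HTML-Parser distribution on CPAN), so the sketch loads it defensively; the sub name `to_entities` is made up for illustration. The second argument to `encode_entities` is the set of characters to encode, given as a character-class spec; here it is assumed that only characters outside printable ASCII should be touched.

```perl
use strict;
use warnings;

# HTML::Entities is a CPAN module, so check that it is installed.
my $have_entities = eval { require HTML::Entities; 1 };

# Hypothetical helper: encode everything outside printable ASCII as an
# HTML entity. Returns undef if HTML::Entities is not available.
sub to_entities {
    my ($text) = @_;
    return unless $have_entities;
    return HTML::Entities::encode_entities($text, '^\x20-\x7E');
}

my $encoded = to_entities("caf\xE9");
if (defined $encoded) {
    print "$encoded\n";    # caf&eacute;
}
else {
    print "HTML::Entities is not installed\n";
}
```

Unlike a hand-built table, the module takes Latin-1 or Unicode characters as its input, so text in MacRoman or Windows-1252 needs to be decoded first.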