in reply to Detecting Strange Characters in Text?

Technically ASCII is 7-bit, so you can't have an ASCII character with a decimal value greater than 127 (DEL)</pedant>

At any rate, you could always use tr/// to convert them all to a printable character or to delete them outright; or perhaps s///e if you wanted to get fancier and substitute, say, "0x##" instead. See perldoc perlop and/or perldoc perlretut.
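
A minimal sketch of both approaches, assuming "strange" means anything outside printable ASCII (0x20 to 0x7E) other than tab, LF and CR. The sub names and the "?" placeholder are made up for illustration; adjust the ranges to taste:

```perl
use strict;
use warnings;

# Replace every character outside tab/LF/CR/printable ASCII with '?'.
# The /c flag complements the search list; '?' is reused for every match.
sub mask_strange {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\x09\x0A\x0D\x20-\x7E/?/c;
    return $copy;
}

# Delete the same characters outright (/d with an empty replacement list).
sub strip_strange {
    my ($text) = @_;
    (my $copy = $text) =~ tr/\x09\x0A\x0D\x20-\x7E//cd;
    return $copy;
}

# Substitute a visible "0x##" marker; /e evaluates the replacement as code.
sub hexify_strange {
    my ($text) = @_;
    (my $copy = $text) =~ s/([^\x09\x0A\x0D\x20-\x7E])/sprintf("0x%02X", ord $1)/ge;
    return $copy;
}

my $sample = "caf\x{E9} \x{07}bell";
print mask_strange($sample),   "\n";   # caf? ?bell
print strip_strange($sample),  "\n";   # caf bell
print hexify_strange($sample), "\n";   # caf0xE9 0x07bell
```

tr/// is the cheaper option when a fixed replacement is enough; s///e is only worth it when the replacement depends on the character matched.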

--
We're looking for people in ATL

Re^2: Detecting Strange Characters in Text?
by jfroebe (Parson) on Jun 16, 2005 at 17:59 UTC

    Correct :) However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit).

    Jason L. Froebe

    Team Sybase member

    No one has seen what you have seen, and until that happens, we're all going to think that you're nuts. - Jack O'Neil, Stargate SG-1

      However, it was extended unofficially but consistently. See ASCII for both the ASCII standard (7-bit) and the industry ASCII extension (8-bit).
      <pedantic_mode>

      Which industry?

      ASCII is 7 bit, as specified in ANSI X3.4-1986.

      There are a number of 8-bit character sets that match ASCII in their first 128 characters, but there is no one official 'extended ASCII'. There are extended versions of ASCII, such as Latin-1, MacRoman, Windows-1252, etc., but no two of them are consistent with each other, and not a single one of them is ASCII.
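
      To make the inconsistency concrete, here is a small sketch using the core Encode module. The byte 0x93 and the sub name are just example choices; any byte above 0x7F shows the same kind of disagreement:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the Unicode code point that a single byte maps to
# under the named encoding. $byte is a private copy of the argument.
sub codepoint_for {
    my ($enc, $byte) = @_;
    return ord decode($enc, $byte);
}

# The same byte, read under three different "extended ASCII" encodings.
for my $enc (qw(iso-8859-1 cp1252 MacRoman)) {
    printf "%-10s 0x93 -> U+%04X\n", $enc, codepoint_for($enc, "\x93");
}
# iso-8859-1 0x93 -> U+0093   (a C1 control character)
# cp1252     0x93 -> U+201C   (left curly double quote)
# MacRoman   0x93 -> U+00EC   (i with grave accent)
```

      Bytes 0x00 to 0x7F come out identical under all three, which is exactly the 7-bit ASCII range.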

      Calling Windows-1252 the 'industry ASCII extension' because it has all of the ASCII characters would be like calling Spanglish the 'standard English extension'. What about Australian? Chicano? Texan? Yes, they all have common roots and many similarities, and if you knew some other dialect, you could probably figure out most of what the other person was saying, but no single one of them can claim to be the primary extension.

      </pedantic_mode>

      (This rant comes from years of dealing with e-mail support, and having to deal with people putting 'smiley face' characters in the subject line, which did bad things to ANSI terminals and to modems with software flow control, and then having to deal with it all over again when Netscape and IE decided that '&#xxx;' was a good way to represent characters, never mind that Mac, Unix, and Windows machines all displayed different characters unless you stuck to specific ranges ... but MS Word can 'save as HTML' and you can keep your curly quotes! (So long as you're the only one who looks at the page, so you'll never understand that other people aren't seeing the same thing displayed on their screen.))