in reply to regexing for non-standard characters...

Assuming you properly decoded your input,

how does one find out what this stupid thing is

printf("chr(%d)\n", ord($ch));                          # chr(8212)
printf("chr(0x%04X)\n", ord($ch));                      # chr(0x2014)
printf("\"\\x{%04X}\"\n", ord($ch));                    # "\x{2014}"
printf("\"\\N{U+%04X}\"\n", ord($ch));                  # "\N{U+2014}"
use charnames ();
printf("\"\\N{%s}\"\n", charnames::viacode(ord($ch)));  # "\N{EM DASH}"
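The "properly decoded" caveat matters because ord() only reports the real code point once the bytes have been turned into characters. Here is a minimal sketch of the difference, assuming the input arrives as UTF-8; the `$bytes` value is a made-up stand-in for whatever you read from your file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical input: the raw UTF-8 bytes of an em dash, as they would
# arrive from a filehandle that has no decoding layer applied.
my $bytes = "\xE2\x80\x94";

# Undecoded, Perl sees three separate bytes, so ord() only reports the first.
printf("before decode: length %d, first ord 0x%02X\n",
    length($bytes), ord($bytes));

# Decoded, the string holds a single character with the real code point.
my $ch = decode('UTF-8', $bytes);
printf("after decode:  length %d, U+%04X %s\n",
    length($ch), ord($ch), charnames::viacode(ord($ch)));
```

If the first line is what you see for your own data, the input probably still needs decoding (for example with an `:encoding(UTF-8)` layer on the filehandle) before any of the printf lines above will tell you anything useful.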

how to regex for it?

$word =~ /\x{2014}/
$word =~ /\N{U+2014}/
use charnames ':full'; $word =~ /\N{EM DASH}/
use utf8; $word =~ /—/ # Encoded as UTF-8 in the source
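To see that the escaped forms are interchangeable, here is a small self-contained sketch. The sample string in `$word` is invented for illustration and is written with an escape so the source file stays pure ASCII (which is why the `use utf8` variant is left out).

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical sample: two words joined by an em dash (U+2014).
my $word = "wait\x{2014}what";

# Try each spelling of the same character in a pattern.
my @hits = (
    $word =~ /\x{2014}/    ? 1 : 0,
    $word =~ /\N{U+2014}/  ? 1 : 0,
    $word =~ /\N{EM DASH}/ ? 1 : 0,
);
print "@hits\n";                   # 1 1 1

# The same pattern works anywhere a regex does, e.g. in split.
my @parts = split /\N{EM DASH}/, $word;
print join('|', @parts), "\n";     # wait|what
```

Which spelling to use is mostly a matter of taste; the named form tends to be the easiest to read back later.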

Update: Added crashtest's solution.

Re^2: regexing for non-standard characters...
by crashtest (Curate) on Apr 15, 2010 at 23:06 UTC

    An extra alternative:

    use charnames ':full';
    ...
    print (charnames::viacode( ord($ch))); # EM DASH
    Which is of course nothing you couldn't figure out using a Unicode lookup table (like "Character Map" on Windows) once you know the code point.