Re: regexing for non-standard characters...

Assuming you have properly decoded your input:

how does one find out what this stupid thing is

printf("chr(%d)\n",         ord($ch));   # chr(8212)
printf("chr(0x%04X)\n",     ord($ch));   # chr(0x2014)
printf("\"\\x{%04X}\"\n",   ord($ch));   # "\x{2014}"
printf("\"\\N{U+%04X}\"\n", ord($ch));   # "\N{U+2014}"

use charnames ();
printf("\"\\N{%s}\"\n",
    charnames::viacode(ord($ch)));       # "\N{EM DASH}"
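If the character is buried in a longer string rather than already isolated in $ch, the same calls can be applied to every non-ASCII character in turn. A minimal sketch, assuming the input arrives as UTF-8 bytes; the sample string and the helper name describe_non_ascii are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);
use charnames ();

# Hypothetical helper: returns one description per non-ASCII character.
sub describe_non_ascii {
    my ($text) = @_;
    my @lines;
    while ($text =~ /([^\x00-\x7F])/g) {
        my $cp   = ord($1);
        my $name = charnames::viacode($cp) // '(unnamed)';
        push @lines, sprintf("U+%04X %s", $cp, $name);
    }
    return @lines;
}

# Assumed sample input: "wait", then the UTF-8 bytes E2 80 94, then "what".
my $bytes = "wait\xE2\x80\x94what";
my $text  = decode('UTF-8', $bytes);   # bytes -> characters

print "$_\n" for describe_non_ascii($text);   # U+2014 EM DASH
```

Without the decode step the loop would see three separate bytes (0xE2, 0x80, 0x94) instead of one character, which is why the decoding assumption matters.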

how to regex for it?

$word =~ /\x{2014}/

$word =~ /\N{U+2014}/

use charnames ':full';
$word =~ /\N{EM DASH}/

use utf8;
$word =~ /—/  # Encoded as UTF-8 in the source
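The escape-based patterns are interchangeable, which a short script can confirm. This is only a sketch: the sample string is an assumption, and the literal-character form is left out so the script does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use charnames ':full';   # needed for \N{EM DASH} on Perls before 5.16

# Assumed sample data: an already-decoded string containing U+2014.
my $word = "wait\x{2014}what";

my %pattern = (
    'hex escape' => qr/\x{2014}/,
    'code point' => qr/\N{U+2014}/,
    'char name'  => qr/\N{EM DASH}/,
);

for my $label (sort keys %pattern) {
    printf "%-10s %s\n", $label,
        $word =~ $pattern{$label} ? 'matches' : 'does not match';
}

# The same pattern works anywhere a regex is accepted, e.g. in split.
my @parts = split /\N{U+2014}/, $word;   # ('wait', 'what')
print join(' | ', @parts), "\n";
```

All three labels should report a match, since each pattern compiles to the same single code point.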

Update: Added crashtest's solution.


Re^2: regexing for non-standard characters... by crashtest (Curate) on Apr 15, 2010 at 23:06 UTC
An extra alternative: `use charnames ':full'; ... print (charnames::viacode( ord($ch))); # EM DASH` Which is of course nothing you couldn't figure out using a Unicode lookup table (like "Character Map" on Windows) once you know the code point.