C:\>chcp 1252
Active code page: 1252

C:\>type Windows-1252.txt
Everyone seems to have leapt to the assumption that your "text file
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
in the Windows-1252 character encoding. The former is a multi-byte
encoding and the latter is a single-byte encoding. The difference is
fundamental. So, first, you need to know whether your text file is in
some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if m/—/" Windows-1252.txt
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if m/\x97/" Windows-1252.txt
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if /\x{2014}/" Windows-1252.txt

C:\>perl -ne "print if /\N{U+2014}/" Windows-1252.txt

C:\>perl -mcharnames=:full -ne "print if /\N{EM DASH}/" Windows-1252.txt

C:\>perl -ne "print if m/—/" UTF-8.txt

C:\>perl -ne "print if m/\x97/" UTF-8.txt

C:\>chcp 65001
Active code page: 65001

C:\>type UTF-8.txt
Everyone seems to have leapt to the assumption that your "text file
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
in the Windows-1252 character encoding. The former is a multi-byte
encoding and the latter is a single-byte encoding. The difference is
fundamental. So, first, you need to know whether your text file is in
some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252
character encoding—or even possibly in some other legacy encoding.

C:\>perl -CiO -ne "print if m/—/" UTF-8.txt

C:\>perl -CiO -ne "print if m/\x97/" UTF-8.txt

C:\>perl -CiO -ne "use utf8; print if m/—/" UTF-8.txt
Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at -e line 1.

C:\>perl -CiO -ne "print if m/\x{2014}/" UTF-8.txt
with some weird characters" in it—an em dash, which is not so weird

really—is in the Unicode coded character set. It may be, or it may be

character encoding—or even possibly in some other legacy encoding.


C:\>perl -CiO -ne "print if m/\N{U+2014}/" UTF-8.txt

C:\>perl -mcharnames=:full -CiO -ne "print if m/\N{EM DASH}/" UTF-8.txt
with some weird characters" in it—an em dash, which is not so weird

really—is in the Unicode coded character set. It may be, or it may be

character encoding—or even possibly in some other legacy encoding.


C:\>
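
The session above comes down to one rule: match the byte 0x97 when the file is Windows-1252, and match the code point U+2014 when the file has been decoded from UTF-8. As a rough sketch only, the same decision can be put in a small script. The helper names guess_encoding and has_em_dash are made up for illustration, and the guess is only a heuristic: a file that fails strict UTF-8 decoding is assumed here to be Windows-1252, although it could be some other legacy encoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify raw bytes as 'ascii', 'utf8', or 'other'.
# 'other' only means "not valid UTF-8"; treating it as Windows-1252 is an
# assumption made by this sketch, not something the bytes can prove.
sub guess_encoding {
    my ($bytes) = @_;
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
    # Strict decode: dies on malformed input; LEAVE_SRC keeps $bytes intact.
    my $ok = eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 'utf8' : 'other';
}

# Hypothetical helper: report whether the bytes contain an em dash,
# choosing the pattern according to the guessed encoding.
sub has_em_dash {
    my ($bytes) = @_;
    if (guess_encoding($bytes) eq 'utf8') {
        my $text = Encode::decode('UTF-8', $bytes,
                                  Encode::FB_CROAK | Encode::LEAVE_SRC);
        return $text =~ /\x{2014}/ ? 1 : 0;   # code point U+2014
    }
    return $bytes =~ /\x97/ ? 1 : 0;          # Windows-1252 byte for the em dash
}

# Command-line use: pass one or more file names.
if (@ARGV) {
    for my $path (@ARGV) {
        open my $fh, '<:raw', $path or die "Cannot open $path: $!";
        my $bytes = do { local $/; <$fh> };
        close $fh;
        $bytes = '' unless defined $bytes;    # empty file
        printf "%s: %s, em dash %s\n", $path, guess_encoding($bytes),
            has_em_dash($bytes) ? 'found' : 'not found';
    }
}
```

Saved under a name of your choosing (say, the hypothetical check_dash.pl), it could be run as perl check_dash.pl Windows-1252.txt UTF-8.txt. Reading with the :raw layer keeps the bytes untouched, so the result does not depend on the active code page or on -C switches.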