Re^3: regexing for non-standard characters...

Google "Unicode coded character set". Read the definition of Coded Character Set in the Unicode Consortium's Glossary of Unicode Terms. I suspect you're familiar with the term "coded character set" and that you're making some point about my usage of it here, but your point eludes me. I think it's correct usage to write "Unicode coded character set."

I've read and reread your node and I earnestly don't understand the esoteric points you seem to be making about my phraseology. I especially don't understand your seemingly blatantly incorrect assertion that "Unicode is the only character set understood by Perl…and its regex engine." I've reread my own node and I think it's both correct and potentially helpful to emmiesix, who might be working with Perl on a computer running Microsoft Windows.

You probably noticed my intentional use of several em dashes in my node. I'll use the text of my post to demonstrate…well…the sense of my post.

C:\>chcp 1252
Active code page: 1252

C:\>type Windows-1252.txt
Everyone seems to have lept to the assumption that your "text file
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
in the Windows-1252 character encoding. The former is a multi-byte
encoding and the latter is a single-byte encoding. The difference is
fundamental. So, first, you need to know whether your text file is in
some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if m/—/" Windows-1252.txt
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if m/\x97/" Windows-1252.txt
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
character encoding—or even possibly in some other legacy encoding.

C:\>perl -ne "print if /\x{2014}/" Windows-1252.txt

C:\>perl -ne "print if /\N{U+2014}/" Windows-1252.txt

C:\>perl -mcharnames=:full -ne "print if /\N{EM DASH}/" Windows-1252.txt

C:\>perl -ne "print if m/—/" UTF-8.txt

C:\>perl -ne "print if m/\x97/" UTF-8.txt

C:\>chcp 65001
Active code page: 65001

C:\>type UTF-8.txt
Everyone seems to have lept to the assumption that your "text file
with some weird characters" in it—an em dash, which is not so weird
really—is in the Unicode coded character set. It may be, or it may be
in the Windows-1252 character encoding. The former is a multi-byte
encoding and the latter is a single-byte encoding. The difference is
fundamental. So, first, you need to know whether your text file is in
some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252
character encoding—or even possibly in some other legacy encoding.

C:\>perl -CiO -ne "print if m/—/" UTF-8.txt

C:\>perl -CiO -ne "print if m/\x97/" UTF-8.txt

C:\>perl -CiO -ne "use utf8; print if m/—/" UTF-8.txt
Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at -e line 1.

C:\>perl -CiO -ne "print if m/\x{2014}/" UTF-8.txt
with some weird characters" in it—an em dash, which is not so weird

really—is in the Unicode coded character set. It may be, or it may be

character encoding—or even possibly in some other legacy encoding.


C:\>perl -CiO -ne "print if m/\N{U+2014}/" UTF-8.txt

C:\>perl -mcharnames=:full -CiO -ne "print if m/\N{EM DASH}/" UTF-8.txt
with some weird characters" in it—an em dash, which is not so weird

really—is in the Unicode coded character set. It may be, or it may be

character encoding—or even possibly in some other legacy encoding.


C:\>

No matter what I tried, I couldn't type a UTF-8 em dash into the Windows Command Prompt, regardless of the code page setting. (Code page 65001 is UTF-8.) This is some defect of Windows, no doubt. I also don't know why the UTF-8 output appears double-spaced or why m/\x{2014}/ matches but m/\N{U+2014}/ doesn't. Maybe you do.

The bottom line—and the point I was making to emmiesix—is that the "weird character" em dash could be one of several different things. What thing it is exactly (i.e., what byte or sequence of bytes) depends on the coded character set (CCS) and the character encoding form (CEF). One has to know if the text is Unicode (CCS) UTF-8 (CEF) or Windows-1252 (both CCS and CEF) or something else.
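To make "what byte or sequence of bytes" concrete, here is a minimal sketch, assuming the core Encode module is available. It encodes the one character U+2014 into each encoding and prints the resulting bytes in hex. The helper name hex_bytes is mine, for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The em dash as a single Unicode character (code point U+2014).
my $em_dash = "\x{2014}";

# Encode that one character into the bytes each encoding stores on disk.
my %bytes = (
    'cp1252' => encode('cp1252', $em_dash),
    'UTF-8'  => encode('UTF-8',  $em_dash),
);

# Illustrative helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

for my $name (sort keys %bytes) {
    printf "%-6s %s\n", $name, hex_bytes($bytes{$name});
}
# Prints:
# UTF-8  E2 80 94
# cp1252 97
```

This is why m/\x97/ finds the em dash in the raw Windows-1252 file but not in the UTF-8 file, where the same character is stored as three bytes.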

Re^4: regexing for non-standard characters...
by ikegami (Patriarch) on Apr 18, 2010 at 20:46 UTC

Read the definition of Coded Character Set in the Unicode Consortium's Glossary of Unicode Terms.

Ah, OK. So it means what most people mean by "character set". Adjusted accordingly.

your point eludes me.

You said we all made assumptions about the encoding and/or the character set, but I didn't (I stated $ch needed to be text), rethaew didn't (he assumed $string was text) and graff didn't (he's the only one whose solution dealt with the raw input, and he specifically mentioned how to handle encodings).

One has to know if the text is Unicode (CCS) UTF-8 (CEF) or Windows-1252 (both CCS and CEF) or something else.

Just the encoding. All the decoding functions produce characters in the Unicode character set, so you don't have to worry about any other CCS.

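# $CEF is the name of the file's encoding (e.g. "UTF-8" or "cp1252");
# $qfn is the path of the file to read.
open(my $fh, "<:encoding($CEF)", $qfn)
   or die("Can't open \"$qfn\": $!\n");
while (<$fh>) {
   print("Line $. contains EM DASH\n") if /\x{2014}/;
}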
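As a rough illustration of that point, the sketch below decodes the same phrase from Windows-1252 bytes and from UTF-8 bytes, then matches both against the single pattern /\x{2014}/. The helper decode_guess is hypothetical and not from this thread. It tries strict UTF-8 first and falls back to cp1252, which is only a heuristic for when the encoding really isn't known.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode raw bytes as strict UTF-8 when they are valid
# UTF-8, otherwise fall back to Windows-1252. A heuristic, not a
# general-purpose encoding detector.
sub decode_guess {
    my ($octets) = @_;
    my $text = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;
    return decode('cp1252', $octets);
}

# The same phrase as raw bytes in each encoding.
my $cp1252_bytes = "in it\x97an em dash";
my $utf8_bytes   = "in it\xE2\x80\x94an em dash";

for my $octets ($cp1252_bytes, $utf8_bytes) {
    my $text = decode_guess($octets);
    # After decoding, both strings hold the character U+2014.
    print "matched U+2014\n" if $text =~ /\x{2014}/;
}
```

If the encoding is known, naming it explicitly in the open layer is more reliable than guessing.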