regexing for non-standard characters...

emmiesix has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regexing for non-standard characters... by ikegami (Patriarch) on Apr 15, 2010 at 19:58 UTC
Assuming you properly decoded your input,

> how does one find out what this stupid thing is

`printf("chr(%d)\n", ord($ch));             # chr(8212)`
`printf("chr(0x%04X)\n", ord($ch));         # chr(0x2014)`
`printf("\"\\x{%04X}\"\n", ord($ch));       # "\x{2014}"`
`printf("\"\\N{U+%04X}\"\n", ord($ch));     # "\N{U+2014}"`
`use charnames ();`
`printf("\"\\N{%s}\"\n", charnames::viacode(ord($ch)));  # "\N{EM DASH}"`

> how to regex for it?

`$word =~ /\x{2014}/`
`$word =~ /\N{U+2014}/`
`use charnames ':full'; $word =~ /\N{EM DASH}/`
`use utf8; $word =~ /—/   # Encoded as UTF-8 in the source`

Update: Added crashtest's solution.
Re^2: regexing for non-standard characters... by crashtest (Curate) on Apr 15, 2010 at 23:06 UTC
An extra alternative:

`use charnames ':full';`
`...`
`print (charnames::viacode( ord($ch)));  # EM DASH`

Which is of course nothing you couldn't figure out using a Unicode lookup table (like "Character Map" on Windows) once you know the code point.
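For completeness, the lookup also works in the other direction. The following is a minimal sketch, assuming a decoded character whose code point is already known; the variable names are illustrative only and are not from the posts above.

```perl
use strict;
use warnings;
use charnames ':full';

# Start from the code point found with ord().
my $code_point = 0x2014;

# viacode maps a code point to its official Unicode name;
# vianame maps a full name back to its code point.
my $name       = charnames::viacode($code_point);
my $round_trip = charnames::vianame($name);

printf "U+%04X is %s\n", $round_trip, $name;
```

This is handy when you have a name from a lookup table and want the number to put into `\x{...}`.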
Re: regexing for non-standard characters... by AR (Friar) on Apr 15, 2010 at 19:49 UTC
Depending on your OS and tools available, you may be able to hexdump the file to see exactly what the character is. After that, you can name it in the regex by its hex value. You could also write a quick script to grab characters one by one (or `split //`) and print them out with their `ord` value.
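The "quick script" idea might look something like the sketch below. It assumes the text has already been decoded into a Perl character string; `describe_chars` and the sample string are hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: one "U+XXXX (decimal)" entry per character.
sub describe_chars {
    my ($text) = @_;
    return map { sprintf 'U+%04X (%d)', ord($_), ord($_) } split //, $text;
}

# Sample decoded string containing an em dash (U+2014).
my $sample = "a\x{2014}b";
print "$_\n" for describe_chars($sample);
```

Anything that prints with a code point above U+007F is a candidate for the "weird" character you are hunting.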
Re: regexing for non-standard characters... by Marshall (Canon) on Apr 16, 2010 at 03:31 UTC
This character is called an "em dash". This page shows its codes and how to insert it in various formats: en and em dash.
Re: regexing for non-standard characters... by graff (Chancellor) on Apr 16, 2010 at 17:51 UTC
> So how does one find out what this stupid thing is

Try using this script on your data: unichist -- count/summarize characters in data. It will show you a list of all the distinct code points and how many times each one occurs. It expects UTF-8 text by default, but if your data comes in some other encoding, you can specify that with a command-line option ("--enc=..."); the output will always be in terms of Unicode code points.
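To give a feel for what such a tally involves, here is a minimal sketch. It is not the actual unichist code; `codepoint_histogram` is a hypothetical helper, and it assumes the input is already decoded text (for a file, that would mean opening it with a layer such as `<:encoding(UTF-8)`).

```perl
use strict;
use warnings;

# Hypothetical helper (not the real unichist): count each code point.
sub codepoint_histogram {
    my ($text) = @_;
    my %count;
    $count{ ord($_) }++ for split //, $text;
    return \%count;
}

# Sample decoded text with two em dashes.
my $hist = codepoint_histogram("wait\x{2014}what\x{2014}now");

# Print code points in numeric order with their counts.
for my $cp (sort { $a <=> $b } keys %$hist) {
    printf "U+%04X  %d\n", $cp, $hist->{$cp};
}
```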
Re^2: regexing for non-standard characters... by Jim (Curate) on Apr 17, 2010 at 19:45 UTC
unichist++ and graff++

Is unichist up to date (Unicode 5.2.0)?

Jim
Re^3: regexing for non-standard characters... by graff (Chancellor) on Apr 18, 2010 at 04:08 UTC
> Is unichist up to date (Unicode 5.2.0)?

That would depend on which perl version you are using to run it. Check the perldelta man page that comes with your version of perl. The 5.10.0 that came with my Mac OS X 10.5 shows Unicode 5.0.0; I notice that 5.10.1 has Unicode 5.1.0. I haven't checked http://unicode.org, but that would be the place to look if you need to know what the Unicode version differences consist of.

(update:) Oh, wait... I remember that there's that section of the unichist code that "summarizes" the ranges of characters according to language/script "pages". I wouldn't expect Unicode updates to have any (significant) impact on that part of the script, but it's something I should check up on... Thanks for asking.

(another update: the POD in unichist says that the list of code page "classes" was based on Unicode 5.0)
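Besides reading perldelta, the running perl can be asked directly. A minimal sketch using the core `Unicode::UCD` module; the variable name is illustrative.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# UnicodeVersion() reports the version of the Unicode database
# that ships with the perl executing this script.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd supports Unicode %s\n", $^V, $unicode_version;
```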
Re^4: regexing for non-standard characters... by Jim (Curate) on Apr 19, 2010 at 02:59 UTC
Re^5: regexing for non-standard characters... by graff (Chancellor) on Jun 14, 2010 at 00:17 UTC
Re: regexing for non-standard characters... by rethaew (Sexton) on Apr 15, 2010 at 23:44 UTC
Not exactly on topic, but when I'm dealing with lots of old and weird files and data with characters that are killing my scripts or causing other odd behavior, I frequently just eliminate all the characters I do not need. It's faster than trying to pinpoint which character is causing the problem.

`$string =~ s/[^A-Za-z0-9]//g;`

If all I need is letters and numbers.
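One thing to watch with that pattern is that it also deletes spaces and newlines, so words run together. The sketch below contrasts it with a variant that keeps ASCII whitespace; `keep_alnum_and_space` and the sample string are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant: keep ASCII letters, digits, and ASCII whitespace.
sub keep_alnum_and_space {
    my ($string) = @_;
    $string =~ s/[^A-Za-z0-9 \t\r\n]//g;
    return $string;
}

my $dirty = "odd\x{2014}data 42";

# The pattern from the post: everything except letters and digits goes.
(my $stripped = $dirty) =~ s/[^A-Za-z0-9]//g;

print "$stripped\n";                        # odddata42
print keep_alnum_and_space($dirty), "\n";   # odddata 42
```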
Re: regexing for non-standard characters... by Jim (Curate) on Apr 17, 2010 at 20:15 UTC
Everyone seems to have leapt to the assumption that your "text file with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be in the Windows-1252 character encoding. The former is a multi-byte encoding and the latter is a single-byte encoding. The difference is fundamental.

So, first, you need to know whether your text file is in some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252 character encoding—or even possibly in some other legacy encoding.
Re^2: regexing for non-standard characters... by ikegami (Patriarch) on Apr 18, 2010 at 07:24 UTC
> Everyone seems to have leapt to the assumption that your "text file with some weird characters" in it is in the Unicode coded character set.

~~"Unicode coded character set" makes no sense.~~

~~If you simply meant "Unicode character set"~~ If you really meant "Unicode coded character set", then I don't see the problem. Unicode is the only character set understood by Perl builtins and its regex engine. (Well, maybe US-ASCII too depending on how you look at it.) It doesn't make any sense to talk about other character sets.

But then you mention "Windows-1252 character encoding" as a possible alternative to "Unicode coded character set". That would make "Unicode coded character set" some kind of encoding, but a character set is not an encoding. Perhaps you meant "UTF-8 encoding".

If you meant "UTF-8 encoding", then you're wrong about everyone assuming the input was encoded using UTF-8. I, for one, made no assumption whatsoever about the encoding of the input. (I did assume that `$word` contained text, but I stated that assumption.)

> The former is a multi-byte encoding and the latter is a single-byte encoding. The difference is fundamental.

Not at all. If you want to deal with text, you have to decode the input. It doesn't matter one bit whether it's encoded using a single-byte fixed-width (e.g. Windows-1252), a multiple-byte fixed-width (e.g. UCS-2le) or a variable-width encoding (e.g. UTF-8, UTF-16le).

> So, first, you need to know whether your text file is in some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252 character encoding—or even possibly in some other legacy encoding.

That should read: "First, you need to know the encoding of the text file (e.g. UTF-8, Windows-1252, etc)."

Most definitely. In order to have text, you need to decode the input, and you can't do that until you know what encoding was used to produce those bytes.
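The "decode first, then match" point can be sketched as follows. This is an illustration under assumptions, not code from the thread: `has_em_dash` is a hypothetical helper, the byte strings are made-up samples, and the encoding names are ones the core `Encode` module accepts.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw bytes, then look for EM DASH by code point.
sub has_em_dash {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    return $text =~ /\x{2014}/ ? 1 : 0;
}

my $cp1252_bytes = "in it\x97an em dash";           # em dash as Windows-1252
my $utf8_bytes   = "in it\xE2\x80\x94an em dash";   # em dash as UTF-8

# Once decoded with the right encoding, the same regex works for both.
print has_em_dash($cp1252_bytes, 'cp1252'), "\n";
print has_em_dash($utf8_bytes,   'UTF-8'),  "\n";

# Decoding with the wrong encoding yields different characters, so no match.
print has_em_dash($utf8_bytes, 'cp1252'), "\n";
```

The regex never changes; only the decoding step depends on how the file was written.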
Re^3: regexing for non-standard characters... by Jim (Curate) on Apr 18, 2010 at 17:37 UTC
Google "Unicode coded character set". Read the definition of Coded Character Set in the Unicode Consortium's Glossary of Unicode Terms.

I suspect you're familiar with the term "coded character set" and that you're making some point about my usage of it here, but your point eludes me. I think it's correct usage to write "Unicode coded character set."

I've read and reread your node and I earnestly don't understand the esoteric points you seem to be making about my phraseology. I especially don't understand your seemingly blatantly incorrect assertion that "Unicode is the only character set understood by Perl…and its regex engine."

I've reread my own node and I think it's both correct and potentially helpful to emmiesix, who might be working with Perl on a computer running Microsoft Windows.

You probably noticed my intentional use of several em dashes in my node. I'll use the text of my post to demonstrate…well…the sense of my post.

    C:\>chcp 1252
    Active code page: 1252
    C:\>type Windows-1252.txt
    Everyone seems to have lept to the assumption that your "text file with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be in the Windows-1252 character encoding. The former is a multi-byte encoding and the latter is a single-byte encoding. The difference is fundamental. So, first, you need to know whether your text file is in some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252 character encoding—or even possibly in some other legacy encoding.
    C:\>perl -ne "print if m/—/" Windows-1252.txt
    with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be character encoding—or even possibly in some other legacy encoding.
    C:\>perl -ne "print if m/\x97/" Windows-1252.txt
    with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be character encoding—or even possibly in some other legacy encoding.
    C:\>perl -ne "print if /\x{2014}/" Windows-1252.txt
    C:\>perl -ne "print if /\N{U+2014}/" Windows-1252.txt
    C:\>perl -mcharnames=:full -ne "print if /\N{EM DASH}/" Windows-1252.txt
    C:\>perl -ne "print if m/—/" UTF-8.txt
    C:\>perl -ne "print if m/\x97/" UTF-8.txt
    C:\>chcp 65001
    Active code page: 65001
    C:\>type UTF-8.txt
    Everyone seems to have lept to the assumption that your "text file with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be in the Windows-1252 character encoding. The former is a multi-byte encoding and the latter is a single-byte encoding. The difference is fundamental. So, first, you need to know whether your text file is in some encoding form of Unicode (e.g., UTF-8) or in the Windows-1252 character encoding—or even possibly in some other legacy encoding.
    C:\>perl -CiO -ne "print if m/—/" UTF-8.txt
    C:\>perl -CiO -ne "print if m/\x97/" UTF-8.txt
    C:\>perl -CiO -ne "use utf8; print if m/—/" UTF-8.txt
    Malformed UTF-8 character (unexpected continuation byte 0x97, with no preceding start byte) at -e line 1.
    C:\>perl -CiO -ne "print if m/\x{2014}/" UTF-8.txt
    with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be character encoding—or even possibly in some other legacy encoding.
    C:\>perl -CiO -ne "print if m/\N{U+2014}/" UTF-8.txt
    C:\>perl -mcharnames=:full -CiO -ne "print if m/\N{EM DASH}/" UTF-8.txt
    with some weird characters" in it—an em dash, which is not so weird really—is in the Unicode coded character set. It may be, or it may be character encoding—or even possibly in some other legacy encoding.
    C:\>

No matter what I did to try to type a UTF-8 em dash into the Windows Command Prompt, it seems it wouldn't let me regardless of the code page setting. (Code page 65001 is UTF-8.) This is some defect of Windows, no doubt. I also don't know why the UTF-8 output appears double-spaced or why m/\x{2014}/ matches but m/\N{U+2014}/ doesn't. Maybe you do.

The bottom line—and the point I was making to emmiesix—is that the "weird character" em dash could be one of several different things. What thing it is exactly (i.e., what byte or sequence of bytes) depends on the coded character set (CCS) and the character encoding form (CEF). One has to know if the text is Unicode (CCS) UTF-8 (CEF) or Windows-1252 (both CCS and CEF) or something else.
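The "one of several different things" point is easy to see by encoding the same character several ways. A minimal sketch, assuming the core `Encode` module; `encoded_hex` is a hypothetical helper and the three encodings are just examples.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex bytes of a character as written in $encoding.
sub encoded_hex {
    my ($char, $encoding) = @_;
    my $bytes = encode($encoding, $char);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $em_dash = "\x{2014}";

# One character, three different byte sequences on disk.
for my $encoding (qw(cp1252 UTF-8 UTF-16LE)) {
    printf "%-8s %s\n", $encoding, encoded_hex($em_dash, $encoding);
}
```

This prints `97` for cp1252, `E2 80 94` for UTF-8, and `14 20` for UTF-16LE, which is why a byte-level pattern such as `\x97` only finds the em dash in the Windows-1252 file.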
Re^4: regexing for non-standard characters... by ikegami (Patriarch) on Apr 18, 2010 at 20:46 UTC