in reply to Regular Expressions, ignore case and unicode
Use a new perl - I've tested it with perl-5.14.0-RC1, and it all works there.
Re^2: Regular Expressions, ignore case and unicode
by OlegG (Monk) on Apr 21, 2011 at 17:44 UTC | |
I'll test with the new perl version.
by moritz (Cardinal) on Apr 21, 2011 at 18:22 UTC | |
"Hmm, Do You think this is a perl bug?"

Yes. And it was fixed somewhere between 5.12.2 and 5.14.0-RC1.

Curious fact:

    binmode STDOUT, ':encoding(UTF-8)';
    print 'Бб' =~ /^([а-я]+)/i ? "regexp ok '$1'" : 'regexp fail', "\n";
    # PS: perlmonks doesn't really do Unicode, it's your second example
    # but without the $

Prints "regexp ok 'Бб'", so it matched the whole string, and only the $ failed.
by tchrist (Pilgrim) on Apr 21, 2011 at 19:26 UTC | |
"perlmonks doesn't really do Unicode"

Well, it does; it’s just insanely difficult to get right, and there are no directions for doing so anywhere they belong. If you look at the raw text of my answer (somehow!), you’ll see how to go about it.

Here is what the raw code looks like. It has to go in a <pre> tag, but you have to escape all kinds of silly stuff:

All that’s indeed necessary to get it to look right, which is like this:

    % head /tmp/t?
    ==> /tmp/t1 <==
    use utf8;
    print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail', "\n";

Isn’t that, um, extreme? It’s bad enough that they make you type in all that raw ʜᴛᴍʟ, but you have to go and escape stuff that even ʜᴛᴍʟ doesn’t require escaping. It’s bad enough that they go and escape your \P{ASCII}, but then they make it so what they’ve escaped no longer works! So their escaping breaks your postings. It doesn’t work because they don’t count it in a <code> block. Then when you make it a <pre>, they go gonzo on your square brackets. Sheesh! 😖

It is so ridiculously complicated to post here that I am sometimes amazed people do so. The rules aren’t explained: you have to figure them out on your own. And once you do, it is a typographical nightmare. The only reasonable solution, since neither <code> nor <pre> tags work even close to correctly, is to write a program to translate postings into something that the site doesn’t screw up. At which point, it becomes clear that the fix is being applied in the wrong place, eh?
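The kind of translation program described above could look roughly like the following. This is only a sketch under stated assumptions: the helper name `escape_for_pre` is hypothetical, it is not the author's script, and the exact set of characters the site needs escaped is not spelled out in the thread. It covers only what the post mentions (non-ASCII characters and square brackets) plus the usual HTML metacharacters.

```perl
use strict;
use warnings;
use utf8;    # the sample string below contains UTF-8 literals

# Hypothetical helper (an assumption, not the author's tool): make text
# safe to paste between <pre> tags by turning risky characters into
# decimal numeric character references.
sub escape_for_pre {
    my ($text) = @_;
    # HTML metacharacters, plus the square brackets the post says get mangled.
    $text =~ s/([&<>\[\]])/sprintf '&#%d;', ord $1/ge;
    # Every non-ASCII character becomes a numeric entity as well.
    $text =~ s/(\P{ASCII})/sprintf '&#%d;', ord $1/ge;
    return $text;
}

# The output is pure ASCII, so no output encoding layer is needed.
print escape_for_pre(q{print 'бБ' =~ /^[а-я]+$/i ? 'regexp ok' : 'regexp fail';}), "\n";
```

The result can be pasted as-is; a browser turns the entities back into the original characters when it renders the page.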
Unicode Emoticons

BTW, if you want the fancy Unicode 6 emotica to show up right, like these:

    👾 U+1F47E ALIEN MONSTER
    👽 U+1F47D EXTRATERRESTRIAL ALIEN
    🐼 U+1F43C PANDA FACE
    🐸 U+1F438 FROG FACE
    🐷 U+1F437 PIG FACE
    🐵 U+1F435 MONKEY FACE
    🐰 U+1F430 RABBIT FACE
    🐮 U+1F42E COW FACE
    🐭 U+1F42D MOUSE FACE
    😇 U+1F607 SMILING FACE WITH HALO
    😈 U+1F608 SMILING FACE WITH HORNS
    😋 U+1F60B FACE SAVOURING DELICIOUS FOOD
    😏 U+1F60F SMIRKING FACE
    😒 U+1F612 UNAMUSED FACE
    😓 U+1F613 FACE WITH COLD SWEAT
    😖 U+1F616 CONFOUNDED FACE
    😜 U+1F61C FACE WITH STUCK-OUT TONGUE AND WINKING EYE
    😞 U+1F61E DISAPPOINTED FACE
    😠 U+1F620 ANGRY FACE
    😡 U+1F621 POUTING FACE
    😤 U+1F624 FACE WITH LOOK OF TRIUMPH
    😥 U+1F625 DISAPPOINTED BUT RELIEVED FACE
    😭 U+1F62D LOUDLY CRYING FACE
    😱 U+1F631 FACE SCREAMING IN FEAR
    💆 U+1F486 FACE MASSAGE
    🙅 U+1F645 FACE WITH NO GOOD GESTURE
    🙆 U+1F646 FACE WITH OK GESTURE
    🙈 U+1F648 SEE-NO-EVIL MONKEY
    🙉 U+1F649 HEAR-NO-EVIL MONKEY
    🙊 U+1F64A SPEAK-NO-EVIL MONKEY
    🐵 U+1F435 MONKEY FACE
    😺 U+1F63A SMILING CAT FACE WITH OPEN MOUTH
    😻 U+1F63B SMILING CAT FACE WITH HEART-SHAPED EYES
    😼 U+1F63C CAT FACE WITH WRY SMILE
    😹 U+1F639 CAT FACE WITH TEARS OF JOY
    😾 U+1F63E POUTING CAT FACE

Then you should install George Douros’s Symbola font. It includes all the crazy stuff they added for Unicode 6. It also nicely covers all the mathematical letters up in the SMP. There are just scads of them:
The last shows you that there are 165 new \w code points in Unicode 6.0 in the BMP, and 817 of them if you include the astral planes. All of these results come from running some of the utilities from training.perl.com/scripts under Perl 5.14 (currently in RC1), which supports Unicode 6.0.
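For readers without those utilities, a rough way to count \w code points with nothing but core Perl is sketched below. It is not one of the training.perl.com scripts, and the helper name `count_word_chars` is hypothetical. It reports totals for whichever Unicode version the running perl supports. Getting the "new in 6.0" figures quoted above would additionally require comparing the output of two perls built against different Unicode versions.

```perl
use v5.14;                   # needed for the /u regex modifier
use strict;
use warnings;
no warnings 'utf8';          # the loop also visits surrogates and noncharacters
use Unicode::UCD ();

# Hypothetical helper: count the code points in [$first, $last] matching \w.
sub count_word_chars {
    my ($first, $last) = @_;
    my $count = 0;
    for my $cp ($first .. $last) {
        # /u forces Unicode rules even for code points below 256.
        $count++ if chr($cp) =~ /\w/u;
    }
    return $count;
}

my $bmp = count_word_chars(0x0000, 0xFFFF);
my $all = $bmp + count_word_chars(0x10000, 0x10FFFF);

printf "Unicode %s: %d \\w code points in the BMP, %d in all planes\n",
    Unicode::UCD::UnicodeVersion(), $bmp, $all;
```

The brute-force loop touches every code point once, which is slow but needs no knowledge of how the property tables are stored.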
I hope this posting has shown how you can do Unicode on perlmonks, and also how to find some fun ones to use. ☘
by OlegG (Monk) on Apr 21, 2011 at 18:51 UTC | |
Yes, I can confirm: "regexp ok" with a freshly compiled perl 5.14 RC1. Should I report this bug to the perl developers? Is there any chance it will be fixed in earlier perl versions?
by moritz (Cardinal) on Apr 21, 2011 at 21:30 UTC | |
by tchrist (Pilgrim) on Apr 21, 2011 at 18:40 UTC | |
"Hmm, Do You think this is a perl bug?"

Yes, it was a Perl bug. There were issues with how case-insensitive matching worked within bracketed character class ranges. Here is the demo that shows it used to not work, and now does.
The problem evaporates upon upgrading.
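A minimal way to check a given perl for this behaviour is sketched below. It is an illustration built from the pattern quoted in this thread, not the demo from the post, and the helper name `matches_ci_range` is hypothetical. Per the reports in this thread, 5.12.2 printed "regexp fail" for this pattern and 5.14.0-RC1 prints "regexp ok".

```perl
use strict;
use warnings;
use utf8;    # the character class below is written with UTF-8 literals

# Hypothetical helper: true if $str consists only of letters in the
# Cyrillic range а-я, compared case-insensitively.
sub matches_ci_range {
    my ($str) = @_;
    return $str =~ /^[а-я]+$/i ? 1 : 0;
}

# 'бБ' mixes lower and upper case, so the match succeeds only if /i is
# applied correctly to the bracketed range.
printf "perl %vd: %s\n", $^V,
    matches_ci_range('бБ') ? 'regexp ok' : 'regexp fail';
```

Running the same file under two installed perls gives a direct before-and-after comparison.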
You can get the uniquote script demo’d above, plus several other (mostly) Unicode-oriented tools, from training.perl.com/scripts. Most are in varying states of pre-release-ness, but all of them do get used nearly daily.