international case insensitive searched with Perl

mwhiting has asked for the wisdom of the Perl Monks concerning the following question:

Re: international case insensitive searched with Perl by CountZero (Bishop) on Oct 04, 2011 at 05:56 UTC
In my Strawberry Perl 5, version 12, subversion 1 (v5.12.1), it works, but only if I `use utf8;`. `use Modern::Perl; use utf8; my $string = 'LÉGER'; print $string =~ /léger/i ? 'matched' : 'no match';` Prints "matched". CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^2: international case insensitive searched with Perl by GrandFather (Saint) on Oct 04, 2011 at 06:02 UTC
In fact it works with 5.10 just using: `use strict; use warnings; use utf8; my $string = 'LÉGER'; print $string =~ /léger/i ? 'matched' : 'no match';` Prints: `matched` True laziness is hard work
Re: international case insensitive searched with Perl by Anonymous Monk on Oct 03, 2011 at 19:53 UTC
`#!perl use 5.014; use warnings; my $string = 'LÉGER'; print $string =~ /léger/i ? 'matched' : 'no match';` Prints 'matched' for me. Of course, if you're using an older version of perl, you need to put a little more work into enforcing a match under Unicode rules. TJD
Re^2: international case insensitive searched with Perl by Anonymous Monk on Oct 03, 2011 at 20:05 UTC
More info: `use feature 'unicode_strings'` is the magic that works in 5.14 (`use 5.014;` turns it on implicitly). Check out 'The "Unicode Bug"' in `perldoc perlunicode`. TJD
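For anyone wondering what the "Unicode Bug" looks like in practice, here is a minimal sketch (my own illustration, not copied from perlunicode). It assumes neither `unicode_strings` nor a locale is in effect, and the sub name `ci_match` is made up for the example. The same characters match or fail depending only on how perl happens to store the string internally:

```perl
use strict;
use warnings;

# Hypothetical helper: case-insensitive match against "léger".
# \xE9 is é, written as an escape so the script itself stays pure ASCII.
sub ci_match {
    my ($text) = @_;
    return $text =~ /l\xE9ger/i ? 'matched' : 'no match';
}

# É stored as a single native 8-bit character (no internal UTF-8 flag).
my $native = "L\xC9GER";

# The same characters, but stored internally as UTF-8.
my $upgraded = $native;
utf8::upgrade($upgraded);

print "native:   ", ci_match($native),   "\n";   # no match
print "upgraded: ", ci_match($upgraded), "\n";   # matched
```

With `use feature 'unicode_strings'` in scope, both calls should report a match, which is the difference described above.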
Re^2: international case insensitive searched with Perl by chrestomanci (Priest) on Oct 04, 2011 at 09:13 UTC
There have been a lot of Unicode-related improvements in Perl 5.14, so for this it is definitely worth trying to use the most recent perl you can. See perl5140delta#Full_functionality_for_use_feature_unicode_strings and the Perl 5.14 release announcement on PerlMonks. The Download Perl page provides docs on how to download and install the latest perl. In summary, if you are on Linux, Mac or Unix, use perlbrew to automate downloading and compiling perl. If you are on Windows, ActiveState now has a build of Perl 5.14. There is no Strawberry Perl release of 5.14 yet.
Re^2: international case insensitive searched with Perl by mwhiting (Beadle) on Oct 04, 2011 at 15:57 UTC
What kind of extra work is involved to make it match under Unicode rules? I'm guessing it's a bit, so I might end up doing a customer-specific workaround to deal with it for them.
Re^3: international case insensitive searched with Perl by Anonymous Monk on Oct 06, 2011 at 13:45 UTC
`use feature 'unicode_strings'` tells perl to apply Unicode rules to all string operations in its scope. `use utf8` tells perl that the current source file is encoded in UTF-8, so its string literals are treated as Unicode. Opening input files with an `:encoding` layer tells perl to decode what it reads, so the resulting strings are Unicode. `use Encode` provides subs that can be used to mark or convert strings from any other source as/to Unicode. Hope this helps. TJD
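A rough sketch of the last two options, assuming the data arrives as UTF-8 bytes. The sample bytes and variable names are invented for the example, and an in-memory filehandle stands in for a real input file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: the UTF-8 bytes for "LÉGER" plus a newline,
# e.g. as read from a form submission or a raw file.
my $bytes = "L\xC3\x89GER\n";

# Option 1: decode the bytes explicitly with Encode.
my $decoded = decode('UTF-8', $bytes);

# Option 2: let an :encoding layer do the decoding while reading.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;

# \xE9 is é, written as an escape so the script itself stays pure ASCII.
for my $text ($decoded, $line) {
    print $text =~ /l\xE9ger/i ? "matched\n" : "no match\n";
}
```

Both strings end up as decoded character strings, so the case-insensitive match no longer depends on `unicode_strings` being available.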
Re: international case insensitive searched with Perl by pvaldes (Chaplain) on Oct 03, 2011 at 19:47 UTC
A quick solution could be to use a character class ([]): `$search =~ /AnyPatt[éÉ]rn/i;`
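A sketch of how that workaround might look on an old perl where nothing else is available. It assumes you know whether the data is Latin-1 or undecoded UTF-8; the sample strings are made up, and escapes are used so the result does not depend on how the script file is saved:

```perl
use strict;
use warnings;

# Latin-1 data: É and é are the single bytes \xC9 and \xE9,
# so listing both in a character class is enough.
my $latin1 = "L\xC9GER";
print "latin1: ", ($latin1 =~ /l[\xE9\xC9]ger/i ? 'matched' : 'no match'), "\n";

# Undecoded UTF-8 data: each accented letter is two bytes, and a character
# class only matches one byte at a time, so spell out both byte sequences.
my $utf8_bytes = "L\xC3\x89GER";
print "utf8:   ", ($utf8_bytes =~ /l(?:\xC3\xA9|\xC3\x89)ger/i ? 'matched' : 'no match'), "\n";
```

This gets tedious quickly, since every accented letter in every pattern needs the same treatment, but it avoids any dependency on the perl version.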
Re: international case insensitive searched with Perl by Khen1950fx (Canon) on Oct 04, 2011 at 08:57 UTC
On my Linux system, I get "no match". Evidently, Perl doesn't consider them a match because they aren't a match. For example: `#!perl -sl use strict; use warnings; binmode STDOUT, ':encoding(utf8)'; print my $str1 = "L\311GER"; print my $str2 = "l\351ger";` That's on 5.8.8; however, on 5.14.x and above, Anonymous Monk's suggestion about `use feature 'unicode_strings'` worked for me. Update: fixed typo.
Re: international case insensitive searched with Perl by mwhiting (Beadle) on Oct 04, 2011 at 15:58 UTC
Thanks for all your suggestions. The problem I'm running into is that this is a client's ISP and their version of Perl, which must be older (see my response to Anonymous Monk above). I'm not sure what version it is, but the 'use UTF8' and other suggestions don't work; they just give errors. I have sent an email to ask if they would upgrade, but I'm not hopeful, knowing how ISPs are.
Re^2: international case insensitive searched with Perl by Anonymous Monk on Oct 06, 2011 at 13:32 UTC
If you're using the OS-installed perl, then upgrading it would be unwise. If the ISP has a second install of perl for customer use, great. Otherwise, does your customer have enough space in their account to install their own copy of perl? This could be a great solution. Just don't install it in a directory that is directly accessible by the web server and its clients. TJD
Re: international case insensitive searched with Perl by DrHyde (Prior) on Oct 06, 2011 at 09:42 UTC
Perhaps you could give an example of what "international characters" are, and tell us what is a non-international character.
Re^2: international case insensitive searched with Perl by afoken (Chancellor) on Oct 06, 2011 at 17:57 UTC
Usually, people call symbols outside the set of the ASCII characters 32 to 126 "international characters". In the pre-Unicode times, the narrow definition was all characters with a code between 128 and 255, in one or more of the several ASCII extensions (ISO Latin-XXX, machine-specific character sets). Now, the narrow definition is all Unicode characters except those also defined as ASCII 32 to 126. The wider definition is and was always "every character used somewhere at some point in time". Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re^3: international case insensitive searched with Perl by DrHyde (Prior) on Oct 07, 2011 at 09:54 UTC
Usually? No, they don't. They usually call them "non-ASCII characters" or "Unicode characters" - the latter being inaccurate because the ASCII characters are also in Unicode. And even if people did usually call non-ASCII characters "international characters", it's still inaccurate and therefore not helpful, because "d" is both an ASCII and an international character. You can see how international it is by referring to a French, German, English, Spanish, Vietnamese, Polish, etc. dictionary.