Re: international case insensitive searched with Perl
by CountZero (Bishop) on Oct 04, 2011 at 05:56 UTC
In my Strawberry Perl 5, version 12, subversion 1 (v5.12.1), it works, but only if I use utf8;
use Modern::Perl;
use utf8;
my $string = 'LÉGER';
print $string =~ /léger/i ? 'matched' : 'no match';
Prints "matched".
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
use strict;
use warnings;
use utf8;
my $string = 'LÉGER';
print $string =~ /léger/i ? 'matched' : 'no match';
Prints:
matched
True laziness is hard work
Re: international case insensitive searched with Perl
by Anonymous Monk on Oct 03, 2011 at 19:53 UTC
#!perl
use 5.014;
use warnings;
my $string = 'LÉGER';
print $string =~ /léger/i ? 'matched' : 'no match';
Prints 'matched' for me.
... of course if you're using an older version of perl, you need to put a little more work into enforcing a match under Unicode rules.
TJD
What kind of extra work is involved to make it match under Unicode rules? I'm guessing it's a bit, so I might end up doing a customer-specific workaround to deal with it for them.
use feature 'unicode_strings' tells perl to apply Unicode rules to all string operations in its lexical scope.
use utf8 tells perl that the current source file is encoded in UTF-8, so its string literals are treated as Unicode.
Opening input files with an :encoding layer tells perl to decode the data as it is read, so the resulting strings are Unicode.
use Encode provides subs that can be used to mark or convert strings from any other source as/to Unicode.
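As a rough illustration of the last point, here is a minimal sketch that should also run on an older perl such as 5.8, assuming the data arrives as UTF-8 encoded bytes and this file is itself saved as UTF-8. The helper name matches_ci is made up for this example; decode comes from the core Encode module.

```perl
use strict;
use warnings;
use utf8;                 # assumption: this source file is saved as UTF-8
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes into characters, then match.
# Once both the string and the pattern hold real characters, /i follows
# Unicode case rules.
sub matches_ci {
    my ($bytes, $regex) = @_;
    my $chars = decode('UTF-8', $bytes);
    return $chars =~ $regex ? 1 : 0;
}

my $raw = "L\xC3\x89GER";   # the UTF-8 bytes for LÉGER, e.g. read from a form or file
binmode STDOUT, ':encoding(UTF-8)';
print matches_ci($raw, qr/léger/i) ? "matched\n" : "no match\n";
```

The key design choice is to decode at the boundary (where bytes enter the program) and match only on decoded strings.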
Hope this helps
TJD
Re: international case insensitive searched with Perl
by pvaldes (Chaplain) on Oct 03, 2011 at 19:47 UTC
Re: international case insensitive searched with Perl
by Khen1950fx (Canon) on Oct 04, 2011 at 08:57 UTC
On my Linux system, I get "no match". Evidently, Perl doesn't
consider them a match because they aren't a match. For example:
#!perl -sl
use strict;
use warnings;
binmode STDOUT, ':encoding(utf8)';
print my $str1 = "L\311GER";
print my $str2 = "l\351ger";
That's on 5.8.8; however, on 5.14.x and above, Anonymous Monk's suggestion about use feature 'unicode_strings' worked for me.
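For reference, a minimal sketch of that variant, assuming Perl 5.14 or later and reusing the two Latin-1 escapes from the code above. The variable $result is only there for illustration.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: Perl 5.14+, where this also covers regex matching

my $str1 = "L\311GER";           # \311 is É in Latin-1
# With unicode_strings in effect, /i treats \311 and \351 as a case pair.
my $result = $str1 =~ /l\351ger/i ? 'matched' : 'no match';
print "$result\n";
```

Without the feature line, the same comparison on plain byte strings is expected to report 'no match', which is consistent with the 5.8.8 behaviour described above.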
Update: fixed typo.
Re: international case insensitive searched with Perl
by mwhiting (Beadle) on Oct 04, 2011 at 15:58 UTC
Thanks for all your suggestions. The problem I'm running into is that this is a client's ISP and their version of Perl, which must be older (see my response to Anonymous Monk above). I'm not sure what version it is, but the 'use UTF8' and other suggestions don't work; they just give errors. I have sent an email to ask if they would upgrade, but I'm not hopeful, knowing how ISPs are.
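One way to find out which version the host is running is a tiny script like the sketch below. It only uses the built-in variables $] and $^V; the variable names are illustrative.

```perl
use strict;
use warnings;

my $numeric = $];                   # numeric form, e.g. 5.008008 for Perl 5.8.8
my $dotted  = sprintf '%vd', $^V;   # dotted form, e.g. 5.8.8
print "Perl $dotted ($numeric)\n";
```

Knowing the exact version makes it easier to tell which of the suggestions in this thread can work at all on that host.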
If you're using the OS-installed perl, then an upgrade of it would be unwise. If the ISP has a second install of perl for customer use, great. Otherwise, does your customer have enough space in their account to install their own copy of perl? This could be a great solution. Just don't install it in a directory that is directly accessible by the web server and its clients.
TJD
Re: international case insensitive searched with Perl
by DrHyde (Prior) on Oct 06, 2011 at 09:42 UTC
Perhaps you could give an example of what "international characters" are, and tell us what is a non-international character.
Usually, people call symbols outside the set of the ASCII characters 32 to 126 "international characters". In the pre-Unicode times, the narrow definition was all characters with a code between 128 and 255, in one or more of the several ASCII extensions (ISO Latin-XXX, machine-specific character sets). Now, the narrow definition is all Unicode characters except those also defined as ASCII 32 to 126. The wider definition is and was always "every character used somewhere at some point in time".
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Usually? No they don't. They usually call them "non-ASCII characters" or "Unicode characters" - the latter being inaccurate because the ASCII characters are also in Unicode.
And even if people did usually call non-ASCII characters "international characters", it's still inaccurate and therefore not helpful, because "d" is both an ASCII and an international character. You can see how international it is by referring to a French, German, English, Spanish, Vietnamese, Polish, etc. dictionary.