Stuck in accent-land

Mur has asked for the wisdom of the Perl Monks concerning the following question:

I should be able to solve this myself, but so far I haven't come up with any wily approaches nor any magic combinations of keywords at Google:

How can I coerce Perl's \b pattern to recognize and respect accented characters?

my $land = 'Mexico';
my @words = split(/\b/,$land);

This does what you'd expect, and scalar(@words)==1. However, if you change the 'e' in Mexico to an accented e (México), then it splits before and after the accented 'e', and scalar(@words)==3.

use utf8; works as long as the literal appears in the text; reading in the text from outside the code doesn't work. E.g.,

$ perl -Mutf8 -e "print join(q{,},split(/\b/,\$ARGV[0])),qq{\n}" méxico
m,é,xico

$ perl -Mutf8 -e "print join(q{,},split(/\b/,q{méxico})),qq{\n}"
Wide character in print at -e line 1.
méxico

Jeff Boes

Database Engineer

Nexcerpt, Inc.

vox 269.226.9550 ext 24

fax 269.349.9076

http://www.nexcerpt.com

...Nexcerpt...Connecting People With Expertise



Re: Stuck in accent-land
by bart (Canon) on Dec 30, 2003 at 18:27 UTC

locale


Re: Re: Stuck in accent-land

by Mur (Pilgrim) on Dec 30, 2003 at 18:47 UTC

Perl, Unicode and i18N FAQ: http://rf.net/~james/perli18n.html


Re: Stuck in accent-land
by ysth (Canon) on Dec 30, 2003 at 21:04 UTC

If you are stuck on a platform that only has the minimum C support for locales (such as cygwin) you need to upgrade the data to utf8 instead (by appending and removing a wide character or by utf8::upgrade). Excerpt from utf8::upgrade pod:

* $num_octets = utf8::upgrade($string) Converts (in-place) internal representation of string to Perl's internal UTF-X form. Returns the number of octets necessary to represent the string as UTF-X. Can be used to make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as expected on strings containing characters in the range 0x80-0xFF (on ASCII and derivatives). Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that. Affected by the encoding pragma.
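
A minimal sketch of the two cases the pod excerpt distinguishes, under some assumptions: the unicode_strings feature and `use locale` are not in effect, the first string is Latin-1 (é is the single character 0xE9), and the second is UTF-8 encoded octets such as might arrive in @ARGV. The helper `pieces` is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count the pieces that split /\b/ yields for a string.
sub pieces {
    my ($s) = @_;
    my @parts = split /\b/, $s;
    return scalar @parts;
}

# Case 1: a Latin-1 string, UTF-8 flag off.
my $latin1  = "M\xE9xico";
my $n_bytes = pieces($latin1);      # 0xE9 is matched with byte semantics
utf8::upgrade($latin1);             # same characters, UTF-8 flag now on
my $n_upgraded = pieces($latin1);   # 0xE9 is matched with Unicode semantics

# Case 2: UTF-8 encoded octets from outside the program (assumed encoding).
my $octets    = "M\xC3\xA9xico";    # é as the two octets C3 A9
my $decoded   = decode('UTF-8', $octets);
my $n_decoded = pieces($decoded);

print "latin1: $n_bytes, upgraded: $n_upgraded, decoded: $n_decoded\n";
```

Under those assumptions the un-upgraded string should split into three pieces, while the upgraded and the decoded strings should each stay whole. When printing decoded text, an output layer such as binmode(STDOUT, ':encoding(UTF-8)') should avoid the "Wide character in print" warning.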


Re: Stuck in accent-land
by pg (Canon) on Dec 30, 2003 at 18:45 UTC

When I ran your second-to-last example, it gave me:

SCALAR,(,0x155ac5c,

I believe what you really wanted to say was:

perl -Mutf8 -e "print join(q{,},split(/\b/,$ARGV[0])),qq{\n}" méxico

There should be no \ in front of $ARGV.

Update:

Roy Johnson is right, and I was on win32.


Re: Re: Stuck in accent-land

by Roy Johnson (Monsignor) on Dec 30, 2003 at 19:10 UTC

The PerlMonk tr/// Advocate


Re: Stuck in accent-land
by dominix (Deacon) on Dec 31, 2003 at 11:36 UTC

from perlmonks ?

--
dominix
