Mur has asked for the wisdom of the Perl Monks concerning the following question:

I should be able to solve this myself, but so far I haven't come up with any wily approaches nor any magic combinations of keywords at Google:

How can I coerce Perl's \b pattern to recognize and respect accented characters?

my $land = 'Mexico';
my @words = split(/\b/,$land);
This does what you'd expect, and scalar(@words)==1. However, if you change the 'e' in Mexico to an accented e (México), then it splits before and after the accented 'e', and scalar(@words)==3.

use utf8; works as long as the literal appears in the source code; reading the text in from outside the code doesn't work. E.g.,

$ perl -Mutf8 -e "print join(q{,},split(/\b/,\$ARGV[0])),qq{\n}" méxico
m,é,xico
$ perl -Mutf8 -e "print join(q{,},split(/\b/,q{méxico})),qq{\n}"
Wide character in print at -e line 1.
méxico
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Replies are listed 'Best First'.
Re: Stuck in accent-land
by bart (Canon) on Dec 30, 2003 at 18:27 UTC
    You should look into locale. However, I don't know the details of how it works. For one, I don't know how customizable it is; it looks like it is not.
      Yep, "locale" it is. I conjured a few more keywords into my Google search and turned up this:

      Perl, Unicode and i18N FAQ: http://rf.net/~james/perli18n.html

Re: Stuck in accent-land
by ysth (Canon) on Dec 30, 2003 at 21:04 UTC
    Perl will do the right thing with data marked as utf8. The problem occurs when you have data not marked as utf8. In that case "use locale" may be the answer; it determines which characters count as word characters based on your locale environment variables.
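To illustrate "data marked as utf8" for the command-line case in the question, here is a minimal sketch. It assumes the argument arrives as UTF-8 encoded bytes (as from a UTF-8 terminal); the bytes are written out by hand so the example does not depend on the shell, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the input is UTF-8 encoded bytes, e.g. what $ARGV[0] holds
# when "méxico" is typed in a UTF-8 terminal. é is the two bytes C3 A9.
my $bytes = "m\xC3\xA9xico";

# As raw bytes, 0xC3 and 0xA9 are not word characters, so \b matches
# after the "m" and before the "x".
my @raw = split /\b/, $bytes;

# Decoding turns the bytes into characters and marks the string as utf8,
# so é (U+00E9) is now a word character.
my $chars   = decode('UTF-8', $bytes);
my @decoded = split /\b/, $chars;

# Encode the output explicitly to avoid "Wide character in print".
binmode STDOUT, ':encoding(UTF-8)';
print scalar(@raw), " pieces from the raw bytes\n";
print scalar(@decoded), " piece after decoding: @decoded\n";
```

In a real script the same decode call would be applied to $ARGV[0] instead of the hand-written bytes. Note that "use utf8" is not needed here, since it only affects literals in the source.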

    If you are stuck on a platform that only has the minimum C support for locales (such as cygwin) you need to upgrade the data to utf8 instead (by appending and removing a wide character or by utf8::upgrade). Excerpt from utf8::upgrade pod:

    * $num_octets = utf8::upgrade($string)
      Converts (in-place) internal representation of string to Perl's internal UTF-X form. Returns the number of octets necessary to represent the string as UTF-X. Can be used to make sure that the UTF-8 flag is on, so that "\w" or "lc()" work as expected on strings containing characters in the range 0x80-0xFF (on ASCII and derivatives). Note that this should not be used to convert a legacy byte encoding to Unicode: use Encode for that. Affected by the encoding pragma.
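A minimal sketch of the utf8::upgrade route described above. It assumes the data is a Latin-1 byte string (é as the single byte 0xE9) and that neither "use locale" nor the unicode_strings feature is in effect; with either of those enabled, the first split may already treat é as a word character.

```perl
use strict;
use warnings;

# Assumption: Latin-1 data, where é is the single byte 0xE9.
my $land = "M\xE9xico";

# Without the UTF-8 flag, 0xE9 is not treated as a word character,
# so \b matches on both sides of it.
my @before = split /\b/, $land;

# utf8::upgrade changes only the internal representation, in place;
# the string still compares equal to the original.
utf8::upgrade($land);
my @after = split /\b/, $land;

print scalar(@before), " pieces before upgrade\n";
print scalar(@after), " piece after upgrade\n";
```

As the pod excerpt says, this is not a way to convert UTF-8 encoded input; for that, decode with Encode instead.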
Re: Stuck in accent-land
by pg (Canon) on Dec 30, 2003 at 18:45 UTC

    When I ran your second-to-last example, it gave me:

    SCALAR,(,0x155ac5c,

    I believe what you really wanted to say was:

    perl -Mutf8 -e "print join(q{,},split(/\b/,$ARGV[0])),qq{\n}" méxico

    There should be no \ in front of $ARGV.

    Update:

    Roy Johnson is right, and I was on win32.

      That depends on whether your OS's shell interpolates variables in double quotes.

      The PerlMonk tr/// Advocate
Re: Stuck in accent-land
by dominix (Deacon) on Dec 31, 2003 at 11:36 UTC
    ...Nexcerpt...Connecting People With Expertise
    ................................. from perlmonks ?

    just kiddin' :-)
    --
    dominix